Copyright © 1999 by Precision Insight, Inc., Cedar Park, Texas. All Rights Reserved.
Permission is granted to make and distribute verbatim copies of this document provided the copyright notice and this permission notice are preserved on all copies.
OpenGL is a registered trademark of Silicon Graphics, Inc. Unix is a registered trademark of The Open Group. The `X' device and X Window System are trademarks of The Open Group. XFree86 is a trademark of The XFree86 Project. Linux is a registered trademark of Linus Torvalds. Intel is a registered trademark of Intel Corporation. All other trademarks mentioned are the property of their respective owners.
Although some cards may support some subset of simultaneous accesses, typical Intel-based PC graphics hardware (costing less than US$2000 in 1998) does not allow simultaneous, interleaved access to the frame buffer memory, the MMIO-based command FIFO, and the DMA-based command FIFO. For example, typical cards will allow frame buffer and command FIFO activity while processing updates to the MMIO registers that control the hardware cursor position. Some cards will atomically add commands sent via DMA to the command FIFO, permitting some kinds of simultaneous MMIO and DMA. However, other cards will intermingle MMIO and DMA commands in the command FIFO on a word-by-word basis, thereby completely corrupting the command FIFO. Other cards are not robust enough to permit frame buffer access while other operations are occurring.
Because of these limitations, typical hardware will require a single (per-device) lock that cooperatively restricts access to the hardware by the X server, the kernel, and 3D direct-rendering clients. This section briefly outlines how the hardware lock would typically be used by each of the three types of DRI entity. Some hardware may not require locking in all of the instances listed in this section.
When the hardware is ready for another DMA dispatch from the kernel's DMA buffer queue, the kernel will obtain the hardware lock. If the notification that another DMA dispatch is possible came via a hardware interrupt, then the kernel may have to reset the interrupt on the hardware. If the current GLXContext is different from the GLXContext required by the next DMA buffer to be dispatched, then the kernel will have to update the hardware graphics context, possibly via a callback to the X server (see below).
If the X server is performing hardware graphics context switches on behalf of the kernel, then the kernel will hold the hardware lock on behalf of the X server while the context is switched. The X server may perform the context switch by doing MMIO or by issuing a request to dispatch DMA immediately (without obtaining another lock), and will then issue an ioctl to tell the kernel that the context has been switched and it is now safe to proceed with DMA dispatches.
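As a concrete illustration, the kernel's dispatch path might take the following shape. This is a sketch only; all of the names (obtain_hardware_lock, hw_reset_interrupt, request_context_switch_from_server, and so on) are hypothetical and are not drawn from an actual driver:

    // Hypothetical sketch of the kernel DMA dispatch path
    void kernel_dispatch_next_buffer(device_t *dev)
    {
        dma_buffer_t *buf;

        obtain_hardware_lock(dev);
        if (dev->notified_by_interrupt)
            hw_reset_interrupt(dev);    // acknowledge the interrupt
        buf = next_queued_buffer(dev);
        if (buf->context != dev->current_context) {
            // Ask the X server to switch the hardware graphics
            // context; the lock is held on the server's behalf until
            // its "context switched" ioctl arrives
            request_context_switch_from_server(dev, buf->context);
            wait_for_context_switched_ioctl(dev);
            dev->current_context = buf->context;
        }
        hw_dispatch_dma(dev, buf);
        release_hardware_lock(dev);
    }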
When an operation (e.g., a DMA buffer dispatch) must be synchronized with the next vertical retrace, the kernel will obtain the lock and poll or wait for an interrupt before performing the operation.
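A minimal sketch of such a retrace-synchronized dispatch, again with hypothetical names, follows:

    // Hypothetical sketch: dispatch synchronized with vertical retrace
    void kernel_dispatch_at_vblank(device_t *dev, dma_buffer_t *buf)
    {
        obtain_hardware_lock(dev);
        wait_for_vblank(dev);    // poll, or sleep until the interrupt
        hw_dispatch_dma(dev, buf);
        release_hardware_lock(dev);
    }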
The 3D client library will obtain the hardware lock when performing software fallbacks that require direct hardware access (e.g., access to the frame buffer). During this time, the client may issue high-priority blocking DMA requests that bypass the normal kernel DMA buffer queues (and that do not require the kernel to obtain the lock).
The 3D client library assumes that the kernel DMA buffer queue for the current GLXContext has been flushed, that other kernel DMA buffer queues are halted (implied by taking the lock), that the hardware is quiescent, and that the current full graphics context is available (including textures and display lists). The initial DRI implementation will assume that this context is stored on the hardware, not in the SAREA.
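The following sketch shows the shape of a client-side fallback under these assumptions; the entry points are hypothetical and are not part of the DRI interface:

    // Hypothetical sketch of a client software fallback
    void client_software_fallback(context_t *ctx)
    {
        obtain_hardware_lock(ctx->dev);
        // Hardware is quiescent and the full graphics context for
        // ctx is resident on the hardware
        draw_spans_to_framebuffer(ctx);    // direct frame buffer access
        issue_blocking_dma(ctx);           // bypasses the kernel queues
        release_hardware_lock(ctx->dev);
    }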
Depending on the requirements of the graphics hardware, the lock may be needed to move the hardware cursor. (At the time of this writing, we have not identified any hardware that requires locking for hardware cursor movement, but we note the possibility so that future driver implementors can check for it.)
2D rendering by the X server will require the lock, and may require quiescent hardware (depending on the operation being performed and the hardware being used).
Region modifications (e.g., moving windows) require the lock and quiescent hardware. Further, all of the kernel DMA buffer queues must be flushed (it may be possible to compute an optimized set of DMA queues that must be flushed, based on the window to be moved and the GLXContexts touched by that window). After the region modification, the window-changed ID will be updated (this update may not have to be locked).
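A hedged sketch of a window move under these rules follows; all names are hypothetical, and the placement of the window-changed ID in the SAREA is an assumption of this sketch:

    // Hypothetical sketch of a locked region modification
    void server_move_window(device_t *dev, window_t *win, int dx, int dy)
    {
        obtain_hardware_lock(dev);
        flush_dma_queues(dev);    // or an optimized subset for win
        wait_for_quiescence(dev);
        hw_copy_window(dev, win, dx, dy);
        dev->sarea->window_changed_id++;    // may not require the lock
        release_hardware_lock(dev);
    }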
When the application program makes use of the XFree86-DGA protocol extension, the X server will hold the lock on behalf of the DGA client. XFree86-DGA must be supported for clients that access the hardware directly but that do not have knowledge of the DRI.
A new protocol extension that allows fullscreen OpenGL access within the framework of the DRI will be provided. The server will issue an ioctl to halt all kernel DMA buffer queues for existing GLXContexts. From that point onward, all newly created GLXContexts will have operational (i.e., unhalted) kernel DMA buffer queues. This implies that the client must issue the OpenGL/Fullscreen request before creating any GLXContexts.
The X server, 3D clients, and the kernel will share the lock cooperatively. Since client processes can die while holding the lock, some process must detect client death and reclaim the lock. The X server can detect client death when the connection used for the X protocol fails. However, in the case of a local connection over a UNIX domain socket, timely notification requires active I/O on the connection. The kernel-level device driver, in contrast, knows about client death as soon as the client closes the connection to the kernel device driver. Timely notification of client death, together with other kernel-level issues (e.g., handling of SIGSTOP [FM99]), makes the use of the kernel as the ultimate lock arbiter compelling.
However, obtaining the hardware lock via an ioctl is a heavyweight operation. If an entity is performing several short operations, the lock will have to be taken and released for each operation in order to provide user-perceived interactivity. The next section explores methods for avoiding an ioctl for each locking operation by introducing a two-tiered lock. This section outlines the requirements for the two-tiered lock design by exploring the transitions between common entity states and identifying the transitions that are performance-critical according to the following general criteria:

    Individual GLXContexts should achieve maximal rendering throughput.

    The X server should remain responsive to the user.
For purposes of analysis, the system has the following states (A and B denote two distinct GLXContexts):

    DA: DMA buffer dispatch for context A
    DB: DMA buffer dispatch for context B
    SA: software fallback (direct hardware access) for context A
    SB: software fallback for context B
    2D: 2D rendering by the X server
    HC: hardware cursor movement
    RM: region modification (e.g., a window move)
The importance of optimization will be ranked on the following scale:

    1: optimization essential
    2: optimization desirable
    3: optimization relatively unimportant
The rankings are shown below for transitions between any two states:
         DA   DB   SA   SB   2D   HC   RM
    DA    1    2    3    3    2    2    3
    DB    2    1    3    3    2    2    3
    SA    3    3    3    3    2    2    3
    SB    3    3    3    3    2    2    3
    2D    2    2    2    2    1    2    3
    HC    2    2    2    2    2    2    3
    RM    3    3    3    3    3    3    3
Since our goals are to optimize single GLXContext throughput as well as X server responsiveness, it is most important that locking for DMA dispatch and for 2D rendering be optimized. If the kernel holds the lock for long periods of time, responsiveness is compromised. If the X server holds the lock for long periods of time, rendering throughput is compromised. Hence, the kernel and the X server must not hold the lock for more than the minimum necessary amount of time. However, because both DMA dispatch and 2D rendering operations are relatively short, the lock will be taken and released with great frequency: heavyweight ioctl-based locking may account for a significant percentage of the time used to perform the operations. In the case of advanced MMIO-based graphics hardware (i.e., as opposed to currently available DMA-based hardware), the overhead of the ioctl-based lock will be prohibitive.
Because of these requirements and observations, the design of a two-tiered lock was undertaken. We adopt operating system terminology and call the two methods of obtaining the lock ``heavyweight'' and ``lightweight''. The key observation from the above discussion is that locking must be optimized for a single GLXContext taking and releasing the lock many consecutive times (for this discussion, the X server will be considered as a special GLXContext). In particular, the lightweight lock will not be designed to be shared between GLXContexts, because such a design could complicate the algorithm at the expense of the more critical case (e.g., by requiring additional work in the lightweight routines to set flags or check for additional state that is important, but seldom or never used in the single GLXContext case).
The following assumptions will simplify lock design:

    Heavyweight locking will use the ioctl interface to the kernel (an ioctl has the approximate overhead of a system call).

[MCS91] discusses synchronization algorithms that make use of busy waiting and atomic operations, as abstracted by the fetch-and-phi primitive, but does not discuss a two-tiered locking mechanism.
[BKMS98] describes a ``thin lock'' that can be ``inflated'' to a ``fat lock''. However, a fat lock cannot ``deflate'' to a thin lock, so these methods do not apply to our work.
[L87] describes a method that, in the absence of contention, uses 7 memory accesses to provide mutual exclusion. If Lamport's algorithm is modified to provide for a kernel-level fallback in the case of contention, an algorithm with fewer reads and writes may be possible. However, since all modern architectures provide atomic fetch-and-phi primitives, there is limited value in exploring an algorithm that depends only on atomic reads and writes.
[LA93] explores ``two-phase'' algorithms that combine busy-waiting with blocking. The two-tiered lock described in this paper is a Lim and Agarwal two-phase lock with the polling value set to 1. At this time, however, we impose the restriction that only the last process to have held the lock can obtain it by checking a synchronization variable -- all other lock acquisitions must block. In the future, after we obtain more experience with DRI contention, we may extend our locking algorithm to be a completely general two-phase lock.
Lim and Agarwal note that the idea of combining the advantages of polling and signalling was first suggested by Ousterhout in [O82]. Unfortunately, we were not able to obtain a copy of this paper.
Algorithms and structures are presented in a C-like programming language.
The lock structure is a simple cache-line aligned integer. To avoid processor bus contention on a multiprocessor system, there should not be any other data stored in the same cache line.
typedef struct lock_s {
    unsigned int context;       // lock word: flag bits plus context ID
    unsigned int padding[3];    // ensure no other data shares the cache line
} lock_t;
Bits in the lock word will be used to claim the lock, and to notify the client that the kernel must be involved in contention resolution.
#define LOCK_HELD 0x80000000    // lock is held
#define LOCK_CONT 0x40000000    // lock is contended; kernel must arbitrate
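For concreteness, the remaining low-order bits hold the ID of the context that most recently held the lock. The helper macros below are illustrative only (they are not part of the interface described in this paper):

    #define LOCK_CONTEXT_MASK 0x3fffffff    // low 30 bits: context ID

    #define lock_holder(L)    ((L)->context & LOCK_CONTEXT_MASK)
    #define lock_is_held(L)   (((L)->context & LOCK_HELD) != 0)
    #define lock_contended(L) (((L)->context & LOCK_CONT) != 0)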
This is a standard 32-bit compare-and-swap (CAS) routine. It can be implemented atomically with a single Intel486 instruction. On a RISC processor, CAS is usually implemented with the instruction pair load-linked/store-conditional.
int CAS(int *address, int compare_value, int update_value)
{
    if (*address == compare_value) {
        *address = update_value;
        return SUCCEED;
    } else
        return FAIL;
}
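The C above conveys only the semantics; a real implementation must compile to the hardware's atomic primitive. One sketch uses GCC's __sync builtin (which postdates this document but performs the same operation; on the Intel486 and later it compiles to a single locked cmpxchg instruction):

    // Atomic CAS via a compiler builtin (illustrative)
    int CAS(int *address, int compare_value, int update_value)
    {
        return __sync_bool_compare_and_swap(address, compare_value,
                                            update_value) ? SUCCEED : FAIL;
    }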
void get_lightweight_lock(lock_t *L, int my_context)
{
    if (CAS(&L->context, my_context, LOCK_HELD|my_context) == FAIL) {
        // Contention, so we use the kernel to arbitrate
        get_heavyweight_lock(L, my_context);
    }
}
void release_lightweight_lock(lock_t *L, int my_context)
{
    if (CAS(&L->context, LOCK_HELD|my_context, my_context) == FAIL) {
        // Kernel is requesting the lock
        release_heavyweight_lock(L, my_context);
    }
}
void get_heavyweight_lock(lock_t *L, int my_context)
{
    unsigned int current_context, new;

    for (;;) {
        do {
            // If lock held, mark it as contended
            // Otherwise, try to get it
            current_context = L->context;
            if (current_context & LOCK_HELD)
                new = LOCK_CONT | current_context;
            else
                new = LOCK_HELD | my_context;
        } while (CAS(&L->context, current_context, new) == FAIL);
        if (new == (LOCK_HELD | my_context))
            break;    // Have lock

        // Didn't get lock
        // Suspend process until lock available
        place_current_process_on_queue();
        schedule();    // blocks until wake_up_queued_processes() called
        // Loop and try to obtain lock again
    }
}
void release_heavyweight_lock(lock_t *L, int my_context)
{
    L->context = 0;    // CAS can be used to detect multiple unlocks
    wake_up_queued_processes();
}
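Taken together, a client brackets each short hardware operation with the lightweight calls; only under contention (or when another context held the lock last) does the path fall through to the kernel. The following usage sketch is hypothetical:

    // Hypothetical usage: my_context identifies this client
    void render_one_operation(lock_t *hw_lock, int my_context)
    {
        get_lightweight_lock(hw_lock, my_context);    // usually one CAS
        // ... access the frame buffer or dispatch DMA ...
        release_lightweight_lock(hw_lock, my_context);
    }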
Both getting and releasing a lightweight lock require a CAS instruction, which may cause a processor bus LOCK cycle. The bus LOCK is avoided on Pentium Pro and later processors if the cache line is resident in the cache of the current processor. Since this is typically the case for the use of this lock, scalability in the number of processors should be good. On RISC processors, the load-linked/store-conditional instruction pair is used to implement CAS. This instruction pair does not cause bus locks and scales well in the number of processors.
If the lock release uses a simple write operation instead of a CAS, then there will be a race condition between the time the process determines that the kernel wants the lock, and the time the lock is released. During this time, the kernel could set the bit making the request. If the process does not realize the kernel is waiting for the lock, then the kernel will not obtain the lock until the same process or another process requests the lock again. Since this may never happen, deadlock could result.
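The race is easiest to see in a sketch of the broken, plain-write release (contrast with the CAS-based release_lightweight_lock above):

    // BROKEN: a plain-write release loses a concurrent LOCK_CONT
    void broken_release(lock_t *L, int my_context)
    {
        if (L->context & LOCK_CONT) {    // (1) no contention seen yet
            release_heavyweight_lock(L, my_context);
            return;
        }
        // (2) kernel sets LOCK_CONT here, after the check ...
        L->context = my_context;    // (3) ... and the write erases it:
                                    // the kernel is never woken, so
                                    // deadlock can result
    }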
Thanks to James H. Anderson for discussing lock implementation issues.
[AM97] James H. Anderson and Mark Moir. Universal constructions for large objects. Submitted to IEEE Transactions on Parallel and Distributed Systems, June 1997. Available from http://www.cs.pitt.edu/~moir/Papers/anderson-moir-tpds97.ps. (An earlier version of this paper was published in Proceedings of the Ninth International Workshop on Distributed Algorithms, Lecture Notes in Computer Science 972, Springer-Verlag, pp. 168-182, September 1995.)
[BKMS98] David F. Bacon, Ravi Konuru, Chet Murthy, and Mauricio Serrano. Thin locks: featherweight synchronization for Java. Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation (Montreal, Canada, 17-19 June 1998). Published as SIGPLAN Notices 33, 5 (May 1998), 258-268.
[FM99] Rickard E. Faith and Kevin E. Martin. A Security Analysis of the Direct Rendering Infrastructure. Cedar Park, Texas: Precision Insight, Inc., 1999.
[L87] Leslie Lamport. A fast mutual exclusion algorithm. ACM Transactions on Computer Systems 5, 1 (Feb. 1987), 1-11.
[LA93] Beng-Hong Lim and Anat Agarwal. Waiting algorithms for synchronization in large-scale multiprocessors. ACM Transactions on Computer Systems 11, 3 (Aug. 1993), 253-294.
[M92] Henry Massalin. Synthesis: An Efficient Implementation of Fundamental Operating System Services. Ph.D. dissertation, published as Technical Report CUCS-039-92. Graduate School of Arts and Sciences, Columbia University, 1992, 71-91. Available from ftp://ftp.cs.columbia.edu/reports/reports-1992/cucs-039-92.ps.gz.
[MCS91] John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems 9, 1 (Feb. 1991), 21-65.
[MFOA99] Kevin E. Martin, Rickard E. Faith, Jens Owen, Allen Akin. Direct Rendering Infrastructure, Low-Level Design Document. Cedar Park, Texas: Precision Insight, Inc., 1999.
[O82] J. K. Ousterhout. Scheduling techniques for concurrent systems. In Proceedings of the 3rd International Conference on Distributed Computing Systems. IEEE, New York, 1982, pp. 22-30.
[OM98] Jens Owen and Kevin E. Martin. A Multipipe Direct Rendering Architecture for 3D. Cedar Park, Texas: Precision Insight, Inc., 15 September 1998. Available from http://www.precisioninsight.com/dr/dr.html.