A Multipipe Direct Rendering Architecture for 3D

Jens Owen and Kevin E. Martin, Precision Insight, Inc.

15 September 1998


While rendering 2D primitives can be efficiently and effectively handled by a single rendering pipeline in the X server, multiple pipes are required for rendering 3D primitives in a fast and responsive manner. In this high-level design document, we present an overview of the issues involved in building an infrastructure to handle multiple pipes, a set of potential architectures, and a recommended solution. A separate, low-level design document will cover the implementation details and outline the process for adding support for hardware accelerated 3D video cards. Copyright 1998 by Precision Insight, Inc., Cedar Park, Texas.

1. Overview

The XFree86 2D X Server is fast, responsive, and well supported on a wide variety of graphics hardware and operating systems. Why expect any less with the addition of 3D support? OpenGL has become the 3D API of choice in the UNIX community, and Mesa, an OpenGL-like 3D library, provides the most complete open source implementation to date. This high-level design document details a framework that can provide well integrated, hardware accelerated, direct 3D rendering into multiple X11 windows. This project is focused on the integration of XFree86 and Mesa, and this document is meant to guide development and maintenance of this open source project.

The design and implementation of this infrastructure is no simple task. We reference the following well-written documents to give the reader a broader background on the tasks we are working on:

That's a lot of reading already, and we encourage the reader to read the first document, on SGI's direct rendering support, twice before worrying about anything in the rest of the list.

2. Project goals

The following prioritized goals are meant to guide the overall design:

  1. Maintain server interactivity to keep user interface snappy
  2. Allow 3D hardware to operate at peak performance potential
  3. Support multiple 3D applications rendering directly to hardware simultaneously
  4. Minimize the effort required by 3D application vendors to port to this architecture
  5. Reduce and simplify device dependent code where possible to encourage broad device support
  6. Reduce and simplify OS dependent code where possible to encourage broad OS support

It is important to recognize that these goals will often be at odds with one another, and higher priority goals should be given more consideration.

3. Managed resources

Context switching between two processes rendering directly to a single hardware graphics pipeline presents the largest challenge of this project. Allowing each process to act as if it were the sole user of the graphics hardware requires a detailed look at the following resources:

3.1 Command buffering

Hardware accelerated 3D rendering usually involves the notion of buffering primitives at some point in the software/hardware transition. 3D hardware can have multiple sets of commands queued while the host is freed to continue processing. Hardware FIFOs and DMA buffers are common mechanisms for buffering 3D commands.

A 3D command buffer has the potential to take a long time to process. High-end 3D workstations typically utilize an asynchronous 2D pipeline that is allowed to bypass the 3D pipeline to provide a highly interactive user interface. It is necessary for the hardware to be able to suspend its 3D rendering, move the cursor, draw pop-up windows and menus, and even move and reshape the 3D window without flushing the 3D pipeline. This type of high-end hardware allows deep 3D command buffers to be queued while still providing a very snappy feel from the X Server.

Today's low-end 3D hardware is becoming increasingly capable in both features and performance. Many PC solutions are giving the old notion of workstation performance a run for its money. To fully utilize these relatively inexpensive solutions, it is necessary to take additional steps to provide the same level of performance and interactivity that workstation users have grown to expect. For example, even as performance rises dramatically in simple discrete solutions, features such as asynchronous 2D pipelines are not pervasive. Many solutions require that 2D operations and window updates wait until the buffered 3D commands complete, creating a potentially unresponsive X Server.

Providing multiple smaller command buffers creates a window of opportunity for the X Server to get access to the device without having to wait on one large buffer to complete. The idea of small vs. large buffers is certainly relative, but for this document, a small buffer is one that can be processed without significantly compromising the X Server's level of interactivity. Unfortunately, managing only one or two small buffers doesn't usually provide enough overlap between 3D scene computations and lower level rendering. Multiple small buffers, however, can be managed to achieve the kind of host processor/graphics hardware overlap that larger buffers provide. Managing multiple smaller buffers incurs higher overhead than managing one or two larger buffers, but it provides the critical windows of opportunity the X Server needs to remain responsive.
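To make the buffer-cycling idea concrete, here is a minimal sketch of a fixed pool of small command buffers. All of the names (CmdBuffer, NextBuffer, WaitForDMACompletion) are hypothetical, and the count and size are placeholders for the tunables discussed above; this is an illustration, not code from XFree86 or Mesa.

    /*
     * Hypothetical sketch of a pool of small command buffers.
     * Buffer count and size are the tunables discussed above.
     */
    #include <stddef.h>
    #include <stdint.h>

    #define CMD_BUF_COUNT 8                  /* enough to keep hw busy */
    #define CMD_BUF_WORDS (16 * 1024 / sizeof(uint32_t))  /* "small" */

    typedef struct {
        uint32_t data[CMD_BUF_WORDS];
        size_t   used;                       /* words filled so far */
        int      queued;                     /* handed to the DMA queue? */
    } CmdBuffer;

    extern void WaitForDMACompletion(void);  /* assumed driver hook */

    static CmdBuffer pool[CMD_BUF_COUNT];
    static int current = CMD_BUF_COUNT - 1;  /* first call yields slot 0 */

    /* Grab the next buffer, waiting on the hardware only when every
     * buffer in the pool is still queued for DMA. */
    CmdBuffer *NextBuffer(void)
    {
        current = (current + 1) % CMD_BUF_COUNT;
        while (pool[current].queued)
            WaitForDMACompletion();          /* driver clears .queued */
        pool[current].used = 0;
        return &pool[current];
    }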

3.2 Synchronization

Coordinating access to 3D hardware for multiple software pipes running independently can be done either by virtualizing the device in the kernel or by implementing a good neighbor policy.

Virtualizing the device in the kernel involves setting up the virtual memory page tables to generate a page fault for any graphics process that doesn't currently "own" the device. Ownership in this sense means the device is set up for access by this process. When an independent pipe tries to access the device, a page fault is generated. The kernel saves the previous context and sets up the context for the new process. The kernel then sets up the page table for the new process to allow access, and the old process is set up to generate a page fault. In this way the kernel is involved every time the device needs to be context switched. Additional consideration needs to be given to a multiprocessor implementation to ensure that two processes aren't actively thrashing this context mechanism.
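The real mechanism lives in the kernel's virtual memory system, but the idea can be illustrated in user space with mprotect() and a SIGSEGV handler. This is an analogy only: SaveOldContext() and LoadNewContext() are hypothetical stand-ins for the kernel's device context switch, and a non-owner's register mapping is assumed to have been set to PROT_NONE elsewhere.

    /*
     * User-space analogy for the page-fault scheme (illustration
     * only -- a real implementation is kernel code).  Touching the
     * protected register mapping faults; the handler performs the
     * context switch before restoring access.
     */
    #include <signal.h>
    #include <string.h>
    #include <sys/mman.h>

    extern void SaveOldContext(void);   /* assumed: save prior owner */
    extern void LoadNewContext(void);   /* assumed: load our state */

    static void  *regs;                 /* mmap()ed device registers */
    static size_t regs_len;

    static void fault_handler(int sig, siginfo_t *si, void *ctx)
    {
        char *addr = (char *)si->si_addr;
        (void)sig; (void)ctx;

        if (addr >= (char *)regs && addr < (char *)regs + regs_len) {
            SaveOldContext();
            LoadNewContext();
            mprotect(regs, regs_len, PROT_READ | PROT_WRITE);
            return;                     /* faulting access now retries */
        }
        /* A fault outside the register window is a real crash; a
         * production handler would re-raise it here. */
    }

    void InstallFaultHandler(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = fault_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);
    }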

A good neighbor policy implementation is based on the idea that the software pipes can manage the context switch on their own if they work together following the same protocol. Shared memory can be used to hold device context information, and synchronization to the device can be managed by lightweight semaphores in shared memory. A key difference with this method is that fine-grained locking needs to be implemented explicitly in the pipeline around all areas where the device will be accessed directly. Also, the pipeline needs to recognize when the lock has been lost and regained so the device state can be reloaded into hardware.
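A minimal sketch of such a lightweight lock follows, assuming one word of shared memory visible to every rendering process and using GCC's atomic builtins. The layout and names are our own, not from any existing implementation: the word holds a "held" bit plus the id of the last owner, so a process that reacquires the lock can tell whether its device state survived.

    /*
     * Sketch of a good-neighbor hardware lock.  Context ids must
     * leave the top bit clear.  A real lock would sleep or yield
     * instead of spinning.
     */
    #include <stdint.h>

    #define LOCK_HELD 0x80000000u

    extern volatile uint32_t *hwlock;             /* word in shared memory */
    extern void ReloadDeviceState(uint32_t ctx);  /* assumed driver hook */

    void LockHardware(uint32_t myctx)
    {
        uint32_t prev;

        for (;;) {
            prev = *hwlock;
            if (!(prev & LOCK_HELD) &&
                __sync_bool_compare_and_swap(hwlock, prev,
                                             myctx | LOCK_HELD))
                break;
        }
        if (prev != myctx)           /* someone else held the device */
            ReloadDeviceState(myctx);
    }

    void UnlockHardware(uint32_t myctx)
    {
        __sync_synchronize();        /* make our rendering visible first */
        *hwlock = myctx;             /* clear held bit, record owner */
    }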

It is important to consider the depth and latency of the hardware pipe, and the potential software and hardware buffers backed up behind the pipe, when determining a synchronization strategy. Some hardware will allow 2D operations to happen independently of the 3D pipeline; others require the 2D primitives to be inserted in the potentially latent 3D stream; and still other hardware may require the entire 3D pipe to be flushed before any 2D operations are performed. The same issues exist for context switching between two 3D pipes.

3.3 State management

Hardware designed for 3D rendering has a significant amount of state, usually in the form of graphics attributes stored in read/write registers on the device. The term Rendering Node is coined to represent the state needed by the device for each rendering context accessing the device directly. The challenge presented with direct rendering is to keep the state consistent from each Rendering Node's point of view when it has access to the device. Depending on the synchronization mechanism used, it may be necessary for the state to be managed by the kernel (Virtual Device) or the rendering context (Good Neighbor).

For Virtual Device synchronization via the kernel, it is actually possible to defer device specific knowledge to the X Server. The kernel manages the page faults and sends a message to the X Server to manage the state context switching.

3.4 Clipping plane and clip rectangle management

The visible portions of windows on the display are always managed by the X Server. All requests to create/move/resize/destroy windows are handled by the X Server, and the corresponding regions in the frame buffer are updated, yielding new window regions asynchronously with respect to any client, direct rendering or not.

With Virtual Device support, the hardware state must be capable of clipping to arbitrarily sized regions and changing the window offset asynchronously from the rest of the rendering pipeline. In other words, the X Server needs to be able to context switch in, move and resize windows, then restore the rendering context without the rendering context depending on or caring about the window location on the screen or the visible area(s) into which it can render. This support is usually found in medium and high-end hardware through the use of clipping planes. The clipping planes are set up appropriately by the X Server. On low-end 3D hardware, where only a single hardware clipping rectangle is typically available, this requirement can be prohibitively expensive.

The good neighbor approach requires more involvement on the part of the rendering context, and therefore it is reasonable to extend the rendering context to handle low-end hardware by cycling through clip rectangles and rendering repeatedly for each one, as sketched below. Obviously, a long list of clipping rectangles would hurt performance, but the primary case of doing 3D rendering to the top window will continue to perform well.
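A sketch of the rectangle-cycling loop, assuming the X Server shares its clip list with the rendering context. SetHWScissor() and FireDMABuffer() are hypothetical driver hooks, and ClipRect simply mirrors the shape of an XRectangle.

    /*
     * Good-neighbor clipping on hardware with a single clip
     * rectangle: program the scissor and replay the same command
     * buffer once per visible rectangle.
     */
    typedef struct {
        short x, y;
        unsigned short width, height;
    } ClipRect;

    extern void SetHWScissor(int x, int y, int w, int h);      /* assumed */
    extern void FireDMABuffer(const void *buf, unsigned len);  /* assumed */

    void RenderClipped(const void *buf, unsigned len,
                       const ClipRect *clip, int nclip)
    {
        int i;

        for (i = 0; i < nclip; i++) {
            SetHWScissor(clip[i].x, clip[i].y,
                         clip[i].width, clip[i].height);
            FireDMABuffer(buf, len);    /* resend identical commands */
        }
    }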

A hybrid case should be considered as well. For low-end hardware, where virtual device support has been implemented, a clip list can be placed in shared memory by the X Server which would always be checked by the direct rendering contexts.

3.5 Double buffer management

The entire system needs to have the same view of a window's front and back buffers. That means a double buffer swap could come from any OpenGL context or from X's DBE functionality, and all renderers (including X) would realize that the swap has happened. The burden of synchronizing and flushing multiple rendering pipes (including X's 2D rendering) before a buffer swap is placed on the application(s) rendering in that window.

There are two primary ways of implementing double buffering. The most common low-end approach is to allocate a back buffer from available memory (usually host memory for software renderers or offscreen memory for hardware), and copy the back buffer to the front buffer when a swap occurs. This method is straightforward and easy to manage, but a performance penalty is paid for the time spent copying the buffer.

A faster method, known as bit plane double buffering, changes the source from which the displayed buffer is read. The technique of page swapping has been used by display hardware of all types for years, but the more sophisticated idea of adding a plane layer that controls which buffer is displayed on a pixel by pixel basis is usually found only on mid to high-end 3D hardware. The true strength of this implementation shows up with multiple double buffered windows, which can swap independently without paying the penalty of a back-to-front buffer copy. However, for hardware that cannot automatically change the buffer rendering pointer at double buffer swap time, it will be necessary to flush the pipeline before context switching to any other rendering context that uses the same drawable if a double buffer swap is already in the pipeline at context switch time.
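The two swap paths can be summarized in a small dispatch routine. This is a sketch only: hw_caps, SetDisplayBufferBits(), BlitRegion(), and the Drawable3D structure are all hypothetical names, not part of any existing driver interface.

    /*
     * Illustrative buffer-swap dispatch.  With bit plane double
     * buffering the swap is a per-pixel display change; otherwise
     * fall back to a back-to-front copy.
     */
    #define HW_BITPLANE_DB 0x1

    extern unsigned hw_caps;                                  /* assumed */
    extern void SetDisplayBufferBits(void *region, int which); /* flip */
    extern void BlitRegion(int from, int to, void *region);    /* copy */

    typedef struct {
        int   backbuf;       /* 0 or 1: the buffer rendering targets */
        void *region;        /* the window's visible region (opaque) */
    } Drawable3D;

    void SwapBuffers(Drawable3D *d)
    {
        if (hw_caps & HW_BITPLANE_DB)
            SetDisplayBufferBits(d->region, d->backbuf);
        else
            BlitRegion(d->backbuf, d->backbuf ^ 1, d->region);

        d->backbuf ^= 1;     /* rendering now targets the other buffer */
    }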

A special case for double buffering is full-screen display pointer swapping, which is supported by much of the current generation of 3D hardware. The rendering and display pointers are exchanged between the front and back buffers, usually when a vertical retrace occurs. Rendering occurs in the non-displayed buffer. When in this mode, only a single rendering context can be used -- all other rendering (including 2D) is halted. This mechanism will be useful for fast full screen 3D rendering. The Mesa/XFree86 DGA extension is a good example of this mode.

3.6 Auxiliary layer management

Layers are groups of planes oriented around pixel presentation. The primary frame buffer containing RGB(A) or indexed color palette data is often referred to as the image planes. A second set of frame buffer data can be overlaid on the image planes; these are called the overlay planes. Underlaid color data are called the underlay planes. Bit plane double buffering can add a layer for the bit planes that control which buffer is displayed, as well as an additional image layer for the second color buffer. Window clipping can be done by hardware clipping planes. Ancillary buffers include depth, stencil and accumulation buffers. All these plane layers need to be managed so they track the pixels associated with their drawable.

3.7 Texture management

Texture management needs to be considered for maintaining a decent frame rate in real-time OpenGL applications.

Texture hardware architectures

There are a number of different architectures available in today's hardware for storing texture data. Each of the architectures below adds its own set of requirements.

AGP texturing

Textures are stored in host memory and are read directly from host memory by the graphics chip when they are needed. DMA support is required for this type of architecture. The memory available for textures can be quite large, but is still limited to the size of the AGP aperture. For example, the AGP aperture on Intel's LX and BX chipsets can range from 4 to 256 MBytes.

Texturing from dedicated texture memory

Textures are stored in dedicated local memory on the card and are read directly by the graphics chip when they are needed. There can be multiple separate dedicated local texture memory buffers.

Texturing from local memory

Textures are stored in shared local memory with the front, back, Z and possibly other buffers. Textures are read directly from the shared local memory when they are needed by the graphics chip.

Texture resource sharing strategies

The direct rendering infrastructure manages the limited, shared resource of texture memory on each of these architectures. When multiple clients are trying to use texture memory simultaneously, there will be contention for texture memory. The texture management scheme chosen for a particular driver is a tradeoff between simplicity, efficiency and performance. Here are several ways texture memory can be managed:

Single texture region

Swap the entire texture memory in/out based on the currently active rendering context. Optimized for simplicity and single context rendering at the cost of multicontext performance.

Large segmented regions

Divide texture memory into a small number of large segments and allocate between active rendering contexts. Since the number of active contexts will vary, allocate segments based on an LRU-type algorithm, with an entire segment getting swapped out when the context is not active or the LRU algorithm removes the context from the active set. Optimized for simplicity and multiple context rendering at the cost of the amount of memory available to a single active rendering context.

Small cache regions

Divide texture memory into a large number of small segments and use the texture memory like a CPU's L2 cache. Swap active textures in/out based on an LRU-type algorithm. Organize the available texture memory into small pieces that can be quickly swapped in when they are needed by an active rendering context. Implement a good neighbor policy that allows older memory to be recovered without a context switch to the X Server. Optimized for efficient memory usage and performance of both single and multiple active rendering contexts at the expense of simplicity.
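A sketch of the block-cache bookkeeping for this last policy, assuming fixed-size blocks and hypothetical names throughout. A real driver would also track per-context ownership and dirty state, and the list is assumed to have been populated with every block at startup.

    /*
     * "Small cache regions": texture memory carved into fixed-size
     * blocks on an LRU list.  The least recently used block is
     * evicted to satisfy a new allocation.
     */
    #include <stddef.h>

    typedef struct TexBlock {
        struct TexBlock *prev, *next;   /* LRU list links */
        int              in_use;        /* holds a live texture? */
        unsigned         offset;        /* offset in texture memory */
    } TexBlock;

    static TexBlock *lru_head;          /* most recently used */
    static TexBlock *lru_tail;          /* least recently used */

    extern void EvictTexture(TexBlock *b);  /* assumed: notify owner */

    static void MoveToHead(TexBlock *b)
    {
        if (b == lru_head)
            return;
        /* unlink (b is not the head, so b->prev is non-NULL) */
        b->prev->next = b->next;
        if (b->next)
            b->next->prev = b->prev;
        if (b == lru_tail)
            lru_tail = b->prev;
        /* relink at the head */
        b->prev = NULL;
        b->next = lru_head;
        lru_head->prev = b;
        lru_head = b;
    }

    /* Claim a block for an incoming texture, evicting the LRU
     * block if everything is in use. */
    TexBlock *AllocTexBlock(void)
    {
        TexBlock *b = lru_tail;

        if (b->in_use)
            EvictTexture(b);
        b->in_use = 1;
        MoveToHead(b);                  /* it is now most recent */
        return b;
    }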

Any of these strategies can be used with the hardware architectures outlined above. The direct rendering infrastructure should not limit these options from a specific hardware driver suite's implementation. The sample implementation would serve this topic area well by addressing several complex cases including a policy for managing a large number of active small segments.

3.8 Cursor management

Software cursor management is an involved task that requires a broad, coordinated effort to remove and replace the cursor before and after rendering to the region the cursor occupies. Although the problem is solvable, this design will not address it because of the prevalence of hardware cursor support in modern RAMDACs.

Hardware cursor support still needs to be managed, but it is usually a straightforward task handled by the X Server, or on some operating systems by the kernel driver.

4. Potential architectures

Different solutions for managing each of the resources defined above are enumerated in this section.

4.1 Command buffer management

The most straightforward approach requires the direct rendering 3D library to allocate and manage command buffers. Each autonomous primitive is placed in the command buffer. If there is not room in the command buffer, then the command buffer is sent to the device, and the new command is placed at the beginning of a new buffer. The command buffer, while being filled by the client, belongs to the client alone. When the command buffer is full and ready to be sent to the device, the buffer needs to be managed by the direct rendering infrastructure. The mechanisms used to actually send the commands to the device are detailed below.

Some 3D primitives may require more contiguous space than the buffer management scheme can provide. In that case, those primitives should be broken into smaller primitives that can fit in a single contiguous buffer. The actual buffers should be of fixed length, with the length tunable to optimize the tradeoff between 3D rendering performance and X Server responsiveness.
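Continuing the CmdBuffer sketch from Section 3.1, filling and flushing might look like the following. EmitVertex() and FlushBuffer() are hypothetical (EmitVertex is assumed to advance buf->used), and the vertex layout is a placeholder; the point is that a long triangle list is split at a triangle boundary, never mid-primitive.

    /*
     * Fill a fixed-size command buffer, flushing whenever the next
     * triangle would not fit, so each buffer stays self-contained.
     */
    #define WORDS_PER_VERTEX 8          /* xyzw, color, texcoords... */

    extern void EmitVertex(CmdBuffer *buf, const float *v);  /* assumed */
    extern void FlushBuffer(CmdBuffer *buf);     /* queue for DMA */
    extern CmdBuffer *NextBuffer(void);          /* from Section 3.1 */

    /* verts holds 4 floats per vertex, 3 vertices per triangle. */
    CmdBuffer *EmitTriangles(CmdBuffer *buf, const float *verts,
                             int ntris)
    {
        int i;

        for (i = 0; i < ntris; i++) {
            if (buf->used + 3 * WORDS_PER_VERTEX > CMD_BUF_WORDS) {
                FlushBuffer(buf);
                buf = NextBuffer();     /* continue in a fresh buffer */
            }
            EmitVertex(buf, &verts[(3 * i + 0) * 4]);
            EmitVertex(buf, &verts[(3 * i + 1) * 4]);
            EmitVertex(buf, &verts[(3 * i + 2) * 4]);
        }
        return buf;
    }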

Managing a queue of DMA buffers

The two approaches evaluated here are rendering thread and wake-up thread. The rendering thread approach requires that command buffers are managed directly by the direct rendering client's main thread, while the wake-up thread approach utilizes an interrupt to wake up a separate thread when the device is ready for the next command buffer.

Rendering thread approach

The rendering thread approach requires the direct rendering client to manage and send the buffers to the hardware. To maintain interactivity and maximize overlap of client processing and hardware rendering, it would be ideal if the direct rendering library were able to asynchronously handle these buffers. However, the client process may not be capable of handling asynchronous signals because the direct rendering library is only one part of a larger client application process (for example, the client process may not allow the direct rendering library to receive a signal from the hardware when it completes a buffer and is ready for the next one). Thus, this approach must rely on managing the command buffers only while the rendering library has control of execution. To maximize the overlap of client processing and hardware rendering, control returns to the client code after a buffer is sent to the hardware. One extreme case occurs when multiple command buffers are ready for hardware rendering. Since only the first buffer is sent to the device before control is returned to the calling application and a large amount of host processing might take place in the client code, very little graphics hardware rendering could be done in parallel and the next buffer would not be processed until the next rendering call was made.

Wake-up thread approach

The wake-up thread approach increases the potential for software and hardware overlap by allowing the X Server and kernel driver to work in concert to manage queued buffers. For devices that can generate an interrupt upon completion of a DMA buffer, the kernel handles the interrupt and if no other processes are trying to get access to the device, then it sends a signal to the X Server to initiate the next DMA transfer in the queue. If the device does not support DMA or interrupts, it is possible to simulate this kind of handling with programmed I/O and a system timer. This approach creates an interrupt and a process context switch for every queued buffer. For scenarios where rendering is bound by the hardware, the extra overhead in the host is small compared to the benefit of keeping the hardware busy rendering most of the time. When the 3D application is host bound, it is advantageous for the rendering library to be able to initiate the first transfer and delay interrupts until such time that command buffers need to be queued. This small management overhead in the direct rendering library can lower the frequency of context switching without penalizing the hardware bound processes.
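Setting OS specifics aside, the kernel-side logic reduces to a small completion handler. DequeueNext(), DeviceContested(), StartDma(), and SignalServer() are hypothetical names; real interrupt handlers are OS-specific and this shows only the dispatch decision described above.

    /*
     * Logic-only sketch of the wake-up path.  On a DMA-complete
     * interrupt the kernel driver starts the next queued buffer
     * directly, or signals the X Server when another process is
     * waiting and the device must be arbitrated.
     */
    typedef struct DmaRequest DmaRequest;

    extern DmaRequest *DequeueNext(void);   /* assumed: FIFO of buffers */
    extern int  DeviceContested(void);      /* assumed: others waiting? */
    extern void StartDma(DmaRequest *r);    /* assumed */
    extern void SignalServer(DmaRequest *r);/* assumed */

    void OnDmaComplete(void)
    {
        DmaRequest *next = DequeueNext();

        if (next == NULL)
            return;                     /* queue drained; go idle */

        if (!DeviceContested())
            StartDma(next);             /* fast path: keep hw busy */
        else
            SignalServer(next);         /* let X arbitrate the switch */
    }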

Optimizing for programmed I/O

Command buffering can also be implemented via programmed I/O. For some devices, programmed I/O is the principal means of sending commands to the device. This is practical because most 3D devices have a decent sized command FIFO, and the hardware essentially buffers the commands at that point. This approach requires the direct rendering process to access the hardware much more frequently, and it becomes imperative to have an extremely fast synchronization mechanism. Even fast shared memory locks can add substantial overhead when locking around every vertex sent to the device. This type of approach requires asynchronous 2D rendering support to keep the X Server snappy. With DMA becoming prevalent but asynchronous 2D access lagging, this approach becomes less appealing. It is also possible to implement the multiple small buffer management tactics described above using programmed I/O and a system timer.

Dynamic command buffer sizing

One of the project goals is to maintain server interactivity and keep the user interface snappy. This goal can be easily compromised since different graphics primitives take varying lengths of time to complete. With DMA or any other buffering approach, simply filling up a 3D buffer and sending it off to the hardware to render could result in a significant delay in the processing of 2D primitives. While this simple fixed size command buffer approach is the easiest to implement and yields high 3D throughput, primitives that take a significant amount of time to process could dramatically decrease interactivity. Two approaches that attempt to control the length of the buffers sent to the hardware, keeping throughput high while still retaining a reasonable degree of interactivity with the X Server, are explored below.

Length threshold

One simple approach to controlling the length of the command buffer is to set a length threshold based on the type of primitive currently being rendered. Initially the threshold is set to the maximum length of a buffer, but it can be lowered by a primitive when it is added to the buffer. If two or more sets of primitives are accumulated in the same buffer, then the buffer threshold will be the minimum threshold of the primitives. The goal is to maintain a relatively fixed time to process one buffer. For example, the graphics pipe can process more flat shaded tris in a buffer than lit, textured tris in a fixed amount of time. So, if both types of tris are in the buffer, the threshold for the buffer would be set to the threshold for the slower primitive, i.e., the lit, textured tris.
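In code, the per-buffer threshold is just a running minimum over the primitives added. This sketch assumes the CmdBuffer from Section 3.1 gains a threshold field (reset to the maximum whenever a buffer is reused); the primitive types and threshold values are placeholders.

    /*
     * Length-threshold sketch: each primitive type carries a maximum
     * buffer length, and a mixed buffer takes the minimum.
     */
    #include <stddef.h>

    enum PrimType { FLAT_TRI, TEX_TRI, LIT_TEX_TRI, NUM_PRIM_TYPES };

    static const size_t prim_threshold[NUM_PRIM_TYPES] = {
        4096,   /* flat-shaded tris: fast, allow a long buffer */
        2048,   /* textured tris */
        1024    /* lit, textured tris: slow, keep buffers short */
    };

    /* Called as each primitive is added; flush once the running
     * minimum threshold is reached. */
    void NotePrimitive(CmdBuffer *buf, enum PrimType t)
    {
        if (prim_threshold[t] < buf->threshold)
            buf->threshold = prim_threshold[t];
        if (buf->used >= buf->threshold)
            FlushBuffer(buf);           /* from the sketch above */
    }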

Buffer weight threshold

Another simple approach is to assign a weight to each primitive type and accumulate the total weight of the buffer. Once this total weight reaches the weight threshold (or the buffer fills up), the buffer is sent to the hardware to be rendered. The weight assigned to each primitive is proportional to the time it takes the given hardware to render the primitive. In Mesa, each individual primitive (e.g., flat-shaded tris, textured tris, lit smooth-shaded textured tris, etc.) has its own function in the device driver. The primitive's weight can easily be added to the total command buffer weight at the beginning of each of these functions.
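A parallel sketch for the weight scheme, reusing the PrimType enum above. The weights and threshold are placeholders for values measured with the tuning tools mentioned below.

    /*
     * Weight-threshold sketch: each primitive adds a cost
     * proportional to its rendering time, and the buffer is flushed
     * once the accumulated weight crosses the threshold.
     */
    #define WEIGHT_THRESHOLD 2000       /* tuned per device */

    static const int prim_weight[NUM_PRIM_TYPES] = {
        1,      /* flat-shaded tri: cheap */
        3,      /* textured tri */
        5       /* lit, textured tri: most expensive */
    };

    static int buf_weight;

    void AddPrimitiveWeight(CmdBuffer *buf, enum PrimType t)
    {
        buf_weight += prim_weight[t];
        if (buf_weight >= WEIGHT_THRESHOLD) {
            FlushBuffer(buf);           /* hand the buffer to the queue */
            buf_weight = 0;
        }
    }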

With each dynamic command buffering approach, a threshold (and possibly weights) will need to be determined by the device driver author. A set of tools will be provided to ease the assignment of thresholds and weights.

4.2 Indirect rendering

The direct rendering infrastructure benefits local clients, but can also be used by the X Server to service GLX clients on remote machines. These remote clients must rely on indirect rendering, where GLX protocol is sent over the wire to an X Server. The X Server can utilize the multipipe nature of the direct rendering infrastructure to move these requests into a separate thread or threads, freeing the main X Server thread from these potentially time consuming operations. This approach allows the X Server to remain interactive with the user, ultimately keeping the GUI very snappy.

Two solutions for moving GLX requests to a separate thread are evaluated here: first, a three process model; second, a multi-rendering, single address space model.

Using three processes

The three process model contains the following three processes: first, a 3D client issuing GLX rendering requests; second, the server process which is servicing the requests; and third, a daemon process of the X Server that handles the time consuming GLX requests and utilizes the direct rendering infrastructure to access the graphics hardware directly. This implementation requires additional buffer and process management in the high level modules of the server and rendering daemon. The implementation can be made very portable, and adds no complication to the lower level direct rendering infrastructure.

Multi-rendering in a single address space

The multi-rendering, single address space model allows each client requiring GLX support to have its own thread servicing its requests. There is minimal overhead associated with managing the threads and requests, and this solution would work well with or without the support of a true multithreaded X Server. Unfortunately, it does require multithreading support in the operating system, which would limit the number of operating systems that could be supported.

4.3 Suspending rendering and DGA support

VT console switching and DGA clients will create situations where direct rendering clients will need to be denied access to the device for long periods of time. When full multipipe rendering is allowed to resume, each 3D direct rendering client should continue execution as if nothing had interrupted it.

A special case of direct rendering with a 3D DGA client should be recognized and control given exclusively to that client. In this case, buffer sizes can be increased because 2D rendering by the X Server is not required. Double buffering can be managed as video page swaps because the application window is full screen (as mentioned above), and the mechanisms for a separate management thread to keep the graphics engine busy still apply.

Some 3D hardware actually resides in a separate device from the 2D hardware. This type of hardware could easily be supported as a 3D DGA client.

5. Recommended solution

Dependencies exist between resource management schemes. Evaluating the full set of potential dependencies is beyond the scope of this document, but the management solution we intend to implement is specified here, and the dependencies between its resource management schemes are identified.

5.1 Queued command buffer managed by wake-up thread

Keeping the hardware busy while maintaining server interactivity on lower-end hardware is the primary motivator. We will need to tune buffer sizes to strike a balance between context switching overhead and server responsiveness.

5.2 Good neighbor synchronization and state management

This approach allows us to stay clear of the virtual memory subsystem. This in turn will allow us to support a wider range of operating systems with fewer resources.

5.3 Clipping plane and clip rectangle management

The initial implementation will support both clipping planes, typically found on higher-end hardware, and a clipping rectangle, as found on lower-end hardware. The multiple, small command buffer scheme will allow for low-end hardware to render to a window with multiple clipping rectangles by resending the command buffers for each rectangle. The design will be extensible to accommodate hardware with multiple hardware clipping rectangles.

5.4 Auxiliary layer management

Multilayer support in the first implementation will be limited to ancillary planes only. No overlay or underlay planes will be implemented.

5.5 Software only indirect rendering

We recommend the three process model for X Server indirect rendering; the approach is straightforward once the direct rendering infrastructure is in place. The third process (a daemon process of the X Server) can use the exact same shared library and direct rendering mechanisms that a direct rendering client would use. It frees the X Server to be responsive to user interaction, and allows 3D primitives to be queued by a hardware optimized pipeline.

This project's goals are focused on the direct rendering infrastructure, and consequently the first implementation may not include an optimized indirect rendering implementation. A first release will continue to rely on the single threaded, software only solution.

5.6 DGA support

DGA will not be obsoleted by this infrastructure. It will continue to be the primary method of supporting full screen 3D rendering (e.g., for games). It is our intention that the command queue management be available for optimizing this mode.

6. Open issues

  1. Does XFree86 support operating systems where the cursor has to be managed via a kernel driver? Are we targeting those implementations?

7. Revision history