May 16, 2022

We haven’t posted updates to the work done on the V3DV driver since
we announced the driver becoming Vulkan 1.1 Conformant.

But after reaching that milestone, we’ve been very busy working on more improvements, so let’s summarize the work done since then.

Multisync support

As mentioned in past posts, for the Vulkan driver we tried to focus as much as possible on the userspace part. So we tried to re-use the already existing kernel interface that we had for V3D, used by the OpenGL driver, without modifying/extending it.

This worked fine in general, except for synchronization. The V3D kernel interface only supported one synchronization object per submission. This didn’t map properly onto Vulkan synchronization, which is more detailed and complex, and allows defining several semaphores/fences. We initially handled the situation with workarounds, and left some optional features as unsupported.

After our 1.1 conformance work, our colleague Melissa Wen started to work on adding support for multiple semaphores on the V3D kernel side. Then she also implemented the changes on V3DV to use this new feature. If you want more technical info, she wrote a very detailed explanation on her blog (part1 and part2).

For now the driver has two codepaths, used depending on whether the kernel supports this new feature. That also means that, depending on the kernel, the V3DV driver could expose a slightly different set of supported features.

More common code – Migration to the common synchronization framework

For a while, Mesa developers have been making a great effort to refactor and move common functionality to a single place, so it can be used by all drivers, reducing the amount of code each driver needs to maintain.

During these months we have been porting V3DV to some of that infrastructure, from small bits (common VkShaderModule to NIR code), to a really big one: common synchronization framework.

As mentioned, the Vulkan synchronization model is really detailed and powerful. But that also means it is complex. V3DV support for Vulkan synchronization included heavy use of threads. For example, V3DV needed to rely on a CPU wait (polling with threads) to implement vkCmdWaitEvents, as the GPU lacked a mechanism for this.

This was common to several drivers. So at some point there were multiple versions of complex synchronization code, one per driver. But, some months ago, Jason Ekstrand refactored Anvil support and collaborated with other driver developers to create a common framework. Obviously each driver would have their own needs, but the framework provides enough hooks for that.

After some gitlab and IRC chats, Jason provided a Merge Request with the port of V3DV to this new common framework, that we iterated and tested through the review process.

Also, with this port we got timeline semaphore support for free. Thanks to this change, we got ~1.2k fewer total lines of code (and have more features!).

Again, we want to thank Jason Ekstrand for all his help.

Support for more extensions:

Since 1.1 was announced, the following extensions have been implemented and exposed:

  • VK_EXT_debug_utils
  • VK_KHR_timeline_semaphore
  • VK_KHR_create_renderpass2
  • VK_EXT_4444_formats
  • VK_KHR_driver_properties
  • VK_KHR_16bit_storage and VK_KHR_8bit_storage
  • VK_KHR_imageless_framebuffer
  • VK_KHR_depth_stencil_resolve
  • VK_EXT_image_drm_format_modifier
  • VK_EXT_line_rasterization
  • VK_EXT_inline_uniform_block
  • VK_EXT_separate_stencil_usage
  • VK_KHR_separate_depth_stencil_layouts
  • VK_KHR_pipeline_executable_properties
  • VK_KHR_shader_float_controls
  • VK_KHR_spirv_1_4

If you want more details about VK_KHR_pipeline_executable_properties, Iago recently wrote a blog post about it (here).

Android support

Android support for V3DV was added thanks to the work of Roman Stratiienko, who implemented this and submitted Mesa patches. We also want to thank the Android RPi team, and the Lineage RPi maintainer (Konsta) who also created and tested an initial version of that support, which was used as the baseline for the code that Roman submitted. I didn’t test it myself (it’s in my personal TO-DO list), but LineageOS images for the RPi4 are already available.


In addition to new functionality, we have also been working on improving performance. Most of the focus was on the V3D shader compiler, as improvements to it are shared between the OpenGL and Vulkan drivers.

But one of the features specific to the Vulkan driver (pending to be ported to OpenGL), is that we have implemented double buffer mode, only available if MSAA is not enabled. This mode would split the tile buffer size in half, so the driver could start processing the next tile while the current one is being stored in memory.

In theory this could improve performance by reducing tile store overhead, so it would be more beneficial when vertex/geometry shaders aren’t too expensive. However, it comes at the cost of reducing tile size, which also causes some overhead on its own.

Testing shows that this helps in some cases (e.g. the Vulkan Quake ports) but hurts in others (e.g. Unreal Engine 4), so for the time being we don’t enable this by default. It can be enabled selectively by adding V3D_DEBUG=db to the environment variables. The idea for the future would be to implement a heuristic that would decide when to activate this mode.
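To make the opt-in concrete, here is a minimal sketch of how a driver could check for a flag such as db in a debug variable. It assumes the variable holds a comma-separated list; the helper name debug_list_has_flag is ours for illustration, and Mesa has its own debug-option parser that this does not reproduce.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if `flag` appears as a whole token in the comma-separated
 * list `value` (for example, the contents of an environment variable).
 * Hypothetical helper for illustration, not Mesa's actual parser. */
bool debug_list_has_flag(const char *value, const char *flag)
{
    assert(flag != NULL && *flag != '\0');
    if (value == NULL)
        return false; /* variable not set: nothing is enabled */

    size_t flag_len = strlen(flag);
    const char *p = value;
    while (*p != '\0') {
        size_t token_len = strcspn(p, ","); /* length up to next comma */
        if (token_len == flag_len && strncmp(p, flag, flag_len) == 0)
            return true;
        p += token_len;
        if (*p == ',')
            p++; /* skip the separator */
    }
    return false;
}
```

A caller would pass the result of getenv("V3D_DEBUG") (from stdlib.h) as value. Matching whole tokens means that a longer option which merely starts with db does not enable double buffer mode by accident.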


If you are interested in watching an overview of the improvements and changes to the driver during the last year, we gave a presentation at FOSDEM 2022:
“v3dv: Status Update for Open Source Vulkan Driver for Raspberry Pi”

May 13, 2022

In late 2020, Apple debuted the M1 with Apple’s GPU architecture, AGX, rumoured to be derived from Imagination’s PowerVR series. Since then, we’ve been reverse-engineering AGX and building open source graphics drivers. Last January, I rendered a triangle with my own code, but there has since been a heinous bug lurking:

The driver fails to render large amounts of geometry.

Spinning a cube is fine, low polygon geometry is okay, but detailed models won’t render. Instead, the GPU renders only part of the model and then faults.

Partially rendered bunny

It’s hard to pinpoint how much we can render without faults. It’s not just the geometry complexity that matters. The same geometry can render with simple shaders but fault with complex ones.

That suggests rendering detailed geometry with a complex shader “takes too long”, and the GPU is timing out. Maybe it renders only the parts it finished in time.

Given the hardware architecture, this explanation is unlikely.

This hypothesis is easy to test, because we can control for timing with a shader that takes as long as we like:

for (int i = 0; i < LARGE_NUMBER; ++i) {
    /* some work to prevent the optimizer from removing the loop */
}

After experimenting with such a shader, we learn…

  • If shaders have a time limit to protect against infinite loops, it’s astronomically high. There’s no way our bunny hits that limit.
  • The symptoms of timing out differ from the symptoms of our driver rendering too much geometry.

That theory is out.

Let’s experiment more. Modifying the shader and seeing where it breaks, we find the only part of the shader contributing to the bug: the amount of data interpolated per vertex. Modern graphics APIs allow specifying “varying” data for each vertex, like the colour or the surface normal. Then, for each triangle the hardware renders, these “varyings” are interpolated across the triangle to provide smooth inputs to the fragment shader, allowing efficient implementation of common graphics techniques like Blinn-Phong shading.

Putting the pieces together, what matters is the product of the number of vertices (geometry complexity) times amount of data per vertex (“shading” complexity). That product is “total amount of per-vertex data”. The GPU faults if we use too much total per-vertex data.
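As a rough worked example of that product, the sketch below computes the total from a vertex count and a number of varying components. The 4 bytes per component is an assumption for illustration (32-bit values), not a documented AGX property, and total_varying_bytes is a made-up name.

```c
#include <assert.h>
#include <stdint.h>

/* Total bytes of per-vertex ("varying") data produced for a frame,
 * assuming each varying component occupies 4 bytes. */
uint64_t total_varying_bytes(uint64_t vertex_count,
                             uint32_t components_per_vertex)
{
    const uint64_t bytes_per_component = 4; /* assumed 32-bit components */
    uint64_t per_vertex = (uint64_t)components_per_vertex * bytes_per_component;

    /* Guard against overflow of the 64-bit product. */
    assert(per_vertex == 0 || vertex_count <= UINT64_MAX / per_vertex);
    return vertex_count * per_vertex;
}
```

Under these assumptions, 70,000 vertices with 4 components each produce the same total as 35,000 vertices with 8 components each, which matches the observation that neither factor alone decides whether we fault.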


When the hardware processes each vertex, the vertex shader produces per-vertex data. That data has to go somewhere. How this works depends on the hardware architecture. Let’s consider common GPU architectures.1

Traditional immediate mode renderers render directly into the framebuffer. They first run the vertex shader for each vertex of a triangle, then run the fragment shader for each pixel in the triangle. Per-vertex “varying” data is passed almost directly between the shaders, so immediate mode renderers are efficient for complex scenes.

There is a drawback: rendering directly into the framebuffer requires tremendous amounts of memory access to constantly write the results of the fragment shader and to read results back when blending. Immediate mode renderers are suited to discrete, power-hungry desktop GPUs with dedicated video RAM.

By contrast, tile-based deferred renderers split rendering into two passes. First, the hardware runs all vertex shaders for the entire frame, not just for a single model. Then the framebuffer is divided into small tiles, and dedicated hardware called a tiler determines which triangles are in each tile. Finally, for each tile, the hardware runs all relevant fragment shaders and writes the final blended result to memory.

Tilers reduce memory traffic required for the framebuffer. As the hardware renders a single tile at a time, it keeps a “cached” copy of that tile of the framebuffer (called the “tilebuffer”). The tilebuffer is small, just a few kilobytes, but tilebuffer access is fast. Writing to the tilebuffer is cheap, and unlike immediate renderers, blending is almost free. Because main memory access is expensive and mobile GPUs can’t afford dedicated video memory, tilers are suited to mobile GPUs, like Arm’s Mali, Imagination’s PowerVR, and Apple’s AGX.
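To give a feel for what the tiler decides, here is a deliberately simplified sketch that bins a triangle by its bounding box. Real hardware is more precise than this, and the 32-pixel tile size and all names are assumptions made for the example.

```c
#include <assert.h>

#define TILE_SIZE 32 /* pixels per tile edge; an assumed value */

struct tile_range {
    int x0, y0; /* first tile column and row touched (inclusive) */
    int x1, y1; /* last tile column and row touched (inclusive) */
};

int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

int max3(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Conservative binning: report every tile overlapped by the triangle's
 * bounding box. Coordinates are non-negative integer pixel positions. */
struct tile_range bin_triangle(const int xs[3], const int ys[3])
{
    for (int i = 0; i < 3; i++)
        assert(xs[i] >= 0 && ys[i] >= 0);

    struct tile_range r;
    r.x0 = min3(xs[0], xs[1], xs[2]) / TILE_SIZE;
    r.y0 = min3(ys[0], ys[1], ys[2]) / TILE_SIZE;
    r.x1 = max3(xs[0], xs[1], xs[2]) / TILE_SIZE;
    r.y1 = max3(ys[0], ys[1], ys[2]) / TILE_SIZE;
    return r;
}

/* Number of tiles in the range. */
int tiles_touched(struct tile_range r)
{
    return (r.x1 - r.x0 + 1) * (r.y1 - r.y0 + 1);
}
```

A small triangle lands in a single tile, while one spanning pixels (10,10) to (70,40) touches a 3 by 2 block of tiles in this model, so its fragment work is revisited once per tile.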

Yes, AGX is a mobile GPU, designed for the iPhone. The M1 is a screaming fast desktop, but its unified memory and tiler GPU have roots in mobile phones. Tilers work well on the desktop, but there are some drawbacks.

First, at the start of a frame, the contents of the tilebuffer are undefined. If the application needs to preserve existing framebuffer contents, the driver needs to load the framebuffer from main memory and store it into the tilebuffer. This is expensive.

Second, because all vertex shaders are run before any fragment shaders, the hardware needs a buffer to store the outputs of all vertex shaders. In general, there is much more data required than space inside the GPU, so this buffer must be in main memory. This is also expensive.

Ah-ha. Because AGX is a tiler, it requires a buffer of all per-vertex data. We fault when we use too much total per-vertex data, overflowing the buffer.

…So how do we allocate a larger buffer?

On some tilers, like older versions of Arm’s Mali GPU, the userspace driver computes how large this “varyings” buffer should be and allocates it.2 To fix the faults, we can try increasing the sizes of all buffers we allocate, in the hopes that one of them contains the per-vertex data.

No dice.

It’s prudent to observe what Apple’s Metal driver does. We can cook up a Metal program drawing variable amounts of geometry and trace all GPU memory allocations that Metal performs while running our program. Doing so, we learn that increasing the amount of geometry drawn does not increase the sizes of any allocated buffers. In fact, it doesn’t change anything in the command buffer submitted to the kernel, except for the single “number of vertices” field in the draw command.

We know that buffer exists. If it’s not allocated by userspace – and by now it seems that it’s not – it must be allocated by the kernel or firmware.

Here’s a funny thought: maybe we don’t specify the size of the buffer at all. Maybe it’s okay for it to overflow, and there’s a way to handle the overflow.

It’s time for a little reconnaissance. Digging through what little public documentation exists for AGX, we learn from one WWDC presentation:

The Tiled Vertex Buffer stores the Tiling phase output, which includes the post-transform vertex data…

But it may cause a Partial Render if full. A Partial Render is when the GPU splits the render pass in order to flush the contents of that buffer.

Bullseye. The buffer we’re chasing, the “tiled vertex buffer”, can overflow. To cope, the GPU stops accepting new geometry, renders the existing geometry, and restarts rendering.
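A toy model of that behaviour, under our own assumptions: geometry arrives in chunks, each chunk adds some bytes to a fixed-capacity buffer, and a partial render empties the buffer whenever the next chunk would not fit. The chunk granularity and the function name are invented for the sketch; the real hardware's behaviour at this level is not known to us.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the partial renders needed when chunks of geometry, each adding
 * `bytes_per_chunk[i]` bytes, are fed into a buffer of `capacity` bytes. */
unsigned count_partial_renders(const size_t *bytes_per_chunk,
                               size_t chunk_count, size_t capacity)
{
    unsigned partial_renders = 0;
    size_t used = 0;

    for (size_t i = 0; i < chunk_count; i++) {
        /* The model requires each individual chunk to fit on its own. */
        assert(bytes_per_chunk[i] <= capacity);

        if (used + bytes_per_chunk[i] > capacity) {
            partial_renders++; /* render what is buffered, then resume */
            used = 0;
        }
        used += bytes_per_chunk[i];
    }
    return partial_renders;
}
```

With a capacity of 100 bytes, two 40-byte chunks fit without any partial render, while a third chunk forces one.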

Since partial renders hurt performance, Metal application developers need to know about them to optimize their applications. There should be performance counters flagging this issue. Poking around, we find two:

  • Number of partial renders.
  • Number of bytes used of the parameter buffer.

Wait, what’s a “parameter buffer”?

Remember the rumours that AGX is derived from PowerVR? The public PowerVR optimization guides explain:

[The] list containing pointers to each vertex passed in from the application… is called the parameter buffer (PB) and is stored in system memory along with the vertex data.

Each varying requires additional space in the parameter buffer.

The Tiled Vertex Buffer is the Parameter Buffer. PB is the PowerVR name, TVB is the public Apple name, and PB is still an internal Apple name.

What happens when PowerVR overflows the parameter buffer?

An old PowerVR presentation says that when the parameter buffer is full, the “render is flushed”, meaning “flushed data must be retrieved from the frame buffer as successive tile renders are performed”. In other words, it performs a partial render.

Back to the Apple M1, it seems the hardware is failing to perform a partial render. Let’s revisit the broken render.

Partially rendered bunny, again

Notice parts of the model are correctly rendered. The parts that are not only have the black clear colour of the scene rendered at the start. Let’s consider the logical order of events.

First, the hardware runs vertex shaders for the bunny until the parameter buffer overflows. This works: the partial geometry is correct.

Second, the hardware rasterizes the partial geometry and runs the fragment shaders. This works: the shading is correct.

Third, the hardware flushes the partial render to the framebuffer. This must work for us to see anything at all.

Fourth, the hardware runs vertex shaders for the rest of the bunny’s geometry. This ought to work: the configuration is identical to the original vertex shaders.

Fifth, the hardware rasterizes and shades the rest of the geometry, blending with the old partial render. Because AGX is a tiler, to preserve that existing partial render, the hardware needs to load it back into the tilebuffer. We have no idea how it does this.

Finally, the hardware flushes the render to the framebuffer. This should work as it did the first time.

The only problematic step is loading the framebuffer back into the tilebuffer after a partial render. Usually, the driver supplies two “extra” fragment shaders. One clears the tilebuffer at the start, and the other flushes out the tilebuffer contents at the end.

If the application needs the existing framebuffer contents preserved, instead of writing a clear colour, the “load tilebuffer” program instead reads from the framebuffer to reload the contents. Handling this requires quite a bit of code, but it works in our driver.

Looking closer, AGX requires more auxiliary programs.

The “store” program is supplied twice. I noticed this when initially bringing up the hardware, but the reason for the duplication was unclear. Omitting each copy separately and seeing what breaks, the reason becomes clear: one program flushes the final render, and the other flushes a partial render.3

…What about the program that loads the framebuffer into the tilebuffer?

When a partial render is possible, there are two “load” programs. One writes the clear colour or loads the framebuffer, depending on the application setting. We understand this one. The other always loads the framebuffer.
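The rule for the two load programs can be written down as a small sketch. It treats a frame with k partial renders as k + 1 segments and records which load behaviour each segment needs; the enum and function names are ours, not anything from the hardware or from Metal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum load_action {
    LOAD_ACTION_CLEAR,           /* write the clear colour to the tilebuffer */
    LOAD_ACTION_LOAD_FRAMEBUFFER /* reload existing framebuffer contents */
};

/* Fills `out` with the load action for each segment of a frame.
 * Segment 0 follows the application's setting; every later segment
 * resumes after a partial render and must reload the framebuffer. */
void plan_load_actions(bool app_wants_clear, size_t segment_count,
                       enum load_action *out)
{
    assert(segment_count > 0 && out != NULL);

    out[0] = app_wants_clear ? LOAD_ACTION_CLEAR
                             : LOAD_ACTION_LOAD_FRAMEBUFFER;
    for (size_t i = 1; i < segment_count; i++)
        out[i] = LOAD_ACTION_LOAD_FRAMEBUFFER;
}
```

Even when the application asks for a clear, only the first segment clears. Clearing again on resume would throw away the geometry that the partial render already flushed.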

…Always loads the framebuffer, as in, for loading back with a partial render even if there is a clear at the start of the frame?

If this program is the issue, we can confirm easily. Metal must require it to draw the same bunny, so we can write a Metal application drawing the bunny and stomp over its GPU memory to replace this auxiliary load program with one always loading with black.

Metal drawing the bunny, stomping over its memory.

Doing so, Metal fails in a similar way. That means we’re at the root cause. Looking at our own driver code, we don’t specify any program for this partial render load. Up until now, that’s worked okay. If the parameter buffer is never overflowed, this program is unused. As soon as a partial render is required, however, failing to provide this program means the GPU dereferences a null pointer and faults. That explains our GPU faults at the beginning.

Following Metal, we supply our own program to load back the tilebuffer after a partial render…

Bunny with the fix

…which does not fix the rendering! Cursed, this GPU. The faults go away, but the render still isn’t quite right for the first few frames, indicating partial renders are still broken. Notice the weird artefacts on the feet.

Curiously, the render “repairs itself” after a few frames, suggesting the parameter buffer stops overflowing. This implies the parameter buffer can be resized (by the kernel or by the firmware), and the system is growing the parameter buffer after a few frames in response to overflow. This mechanism makes sense:

  • The hardware can’t allocate more parameter buffer space itself.
  • Overflowing the parameter buffer is expensive, as partial renders require tremendous memory bandwidth.
  • Overallocating the parameter buffer wastes memory for applications rendering simple geometry.

Starting the parameter buffer small and growing in response to overflow provides a balance, reducing the GPU’s memory footprint and minimizing partial renders.
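We don’t know the actual growth policy used by the kernel or firmware. Purely as an illustration of the trade-off, here is a hypothetical policy that doubles the buffer after any frame that overflowed it, up to a maximum; doubling is just a common choice for resizable buffers, and every name here is made up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical policy: keep the size if the last frame had no partial
 * renders, otherwise double it, clamped to `max_size`. */
size_t next_parameter_buffer_size(size_t current_size, size_t max_size,
                                  unsigned partial_renders_last_frame)
{
    assert(current_size > 0 && current_size <= max_size);

    if (partial_renders_last_frame == 0)
        return current_size; /* no overflow: keep the footprint small */
    if (current_size > max_size / 2)
        return max_size;     /* doubling would exceed the limit: clamp */
    return current_size * 2;
}
```

Under a policy like this, a scene that overflows a small initial buffer would see partial renders for a few frames and then none, which is consistent with the render “repairing itself”.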

Back to our misrendering. There are actually two buffers being used by our program, a colour buffer (framebuffer)… and a depth buffer. The depth buffer isn’t directly visible, but facilitates the “depth test”, which discards far pixels that are occluded by other close pixels. While the partial render mechanism discards geometry, the depth test discards pixels.

That would explain the missing pixels on our bunny. The depth test is broken with partial renders. Why? The depth test depends on the depth buffer, so the depth buffer must also be stored after a partial render and loaded back when resuming. Comparing a trace from our driver to a trace from Metal, looking for any relevant difference, we eventually stumble on the configuration required to make depth buffer flushes work.
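To see why losing the depth buffer across a partial render corrupts the image, consider this minimal sketch of a “less than” depth test. It assumes smaller depth values are closer and uses a plain float array as the depth buffer; real APIs make the comparison configurable, and the function name is ours.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A fragment passes, and updates the depth buffer, only if it is closer
 * than the depth already stored for that pixel. */
bool depth_test_and_update(float *depth_buffer, size_t pixel_count,
                           size_t pixel, float fragment_depth)
{
    assert(depth_buffer != NULL && pixel < pixel_count);

    if (fragment_depth < depth_buffer[pixel]) {
        depth_buffer[pixel] = fragment_depth;
        return true;  /* visible so far */
    }
    return false;     /* occluded by a closer pixel: discard */
}
```

If the stored depth is not carried across the partial render, the second half of the frame tests against stale or reset values, so fragments that should be hidden pass the test (or visible ones fail it).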

And with that, we get our bunny.

The final Phong shaded bunny

  1. These explanations are massive oversimplifications of how modern GPUs work, but it’s good enough for our purposes here.↩︎

  2. This is a worse idea than it sounds. Starting with the new Valhall architecture, Mali allocates varyings much more efficiently.↩︎

  3. Why the duplication? I have not yet observed Metal using different programs for each. However, for front buffer rendering, partial renders need to be flushed to a temporary buffer for this scheme to work. Of course, you may as well use double buffering at that point.↩︎

May 11, 2022

Today NVIDIA announced that they are releasing an open source kernel driver for their GPUs, so I want to share with you some background information and how I think this will impact Linux graphics and compute going forward.

One thing many people are not aware of is that Red Hat is the only Linux OS company with a strong presence in the Linux compute and graphics engineering space. There are of course a lot of other people working in the space too, like engineers working for Intel, AMD and NVIDIA, people working for consultancy companies like Collabora, and individual community members, but Red Hat as an OS integration company has been very active in trying to ensure we have a maintainable and shared upstream open source stack. This engineering presence is also what has allowed us to move important technologies forward, like getting hiDPI support for Linux some years ago, or working with NVIDIA to get glvnd implemented to remove a pain point for our users, since the original OpenGL design only allowed for one OpenGL implementation to be installed at a time. We see ourselves as the open source community’s partner here, fighting to keep the Linux graphics stack coherent and maintainable, and as a partner for the hardware OEMs to work with when they need help pushing major new initiatives around GPUs for Linux forward. And as the only Linux vendor with a significant engineering footprint in GPUs we have been working closely with NVIDIA. People like Kevin Martin, the manager for our GPU technologies team, Ben Skeggs, the maintainer of Nouveau, Dave Airlie, the upstream kernel maintainer for the graphics subsystem, Nouveau contributor Karol Herbst and our accelerator lead Tom Rix have all taken part in meetings, code reviews and discussions with NVIDIA. So let me talk a little about what this release means (and also what it doesn’t mean) and what we hope to see come out of this long term.

First of all, what is in this new driver?
What has been released is an out-of-tree source code kernel driver which has been tested to support CUDA usecases on datacenter GPUs. There is code in there to support display, but it is not complete or fully tested yet. Also, this is only the kernel part; a big part of a modern graphics driver is to be found in the firmware and userspace components, and those are still closed source. But it does mean we have an NVIDIA kernel driver now that will start being able to consume the GPL-only APIs in the Linux kernel, although this initial release doesn’t consume any APIs the old driver wasn’t already using. The driver also only supports NVIDIA Turing chip GPUs and newer, which means it is not targeting GPUs from before 2018. So for the average Linux desktop user, while this is a great first step and hopefully a sign of what is to come, it is not something you are going to start using tomorrow.

What does it mean for the NVIDIA binary driver?
Not too much immediately. This binary kernel driver will continue to be needed for older pre-Turing NVIDIA GPUs, and until the open source kernel module is fully tested and extended for display usecases you are likely to continue using it for your system even if you are on Turing or newer. Also, as mentioned above, the firmware and userspace bits are still closed source, so the binary driver is going to continue to be around even once the open source kernel driver is fully capable.

What does it mean for Nouveau?
Let me start with the obvious: this is actually great news for the Nouveau community and the Nouveau driver, and NVIDIA has done a great favour to the open source graphics community with this release. For those unfamiliar with Nouveau, it is the in-kernel graphics driver for NVIDIA GPUs today, which was originally developed as a reverse engineered driver, but which over recent years has actually had active support from NVIDIA. It is fully functional, but is severely hampered by not having had the ability to, for instance, re-clock the NVIDIA card, meaning that it can’t give you full performance like the binary driver can. This was something we were working with NVIDIA to remedy, but this new release provides us with a better path forward. So what does this new driver mean for Nouveau? Less initially, but a lot in the long run. To give a little background first: the Linux kernel does not allow multiple drivers for the same hardware, so in order for a new NVIDIA kernel driver to go in, the current one will have to go out or at least be limited to a different set of hardware. The current one is Nouveau. And just like the binary driver, a big chunk of Nouveau is not in the kernel, but consists of the userspace pieces found in Mesa and the Nouveau-specific firmware that NVIDIA currently kindly makes available. So regardless of the long term effort to create a new open source in-tree kernel driver based on this new open source driver for NVIDIA hardware, Nouveau will very likely be staying around to support pre-Turing hardware, just like the NVIDIA binary kernel driver will.

The plan we are working towards from our side, but which is likely to take a few years to come to full fruition, is to come up with a way for the NVIDIA binary driver and Mesa to share a kernel driver. The details of how we will do that are something we are still working on and discussing with our friends at NVIDIA to address both the needs of the NVIDIA userspace and the needs of the Mesa userspace. Along with that evolution we hope to work with NVIDIA engineers to refactor the userspace bits of Mesa that are now targeting just Nouveau to be able to interact with this new kernel driver, and also work so that the binary driver and Nouveau can share the same firmware. This has clear advantages for both the open source community and NVIDIA. For the open source community it means that we will now have a kernel driver and firmware that allows things like changing the clocking of the GPU to provide the kind of performance people expect from the NVIDIA graphics card, and it means that we will have an open source driver that will have access to the firmware and kernel updates from day one for new generations of NVIDIA hardware. For the ‘binary’ driver, and I put that in quotes because it will now be less binary :), it means as stated above that it can start taking advantage of the GPL-only APIs in the kernel, distros can ship it and enable secure boot, and it gets an open source consumer of its kernel driver allowing it to go upstream.
Whether this new shared kernel driver will be known as Nouveau or something completely different is still an open question, and of course it happening at all depends on whether we, the rest of the open source community and NVIDIA are able to find a path together to make it happen, but so far everyone seems to be of good will.

What does this release mean for Linux distributions like Fedora and RHEL?

Over time it provides a pathway to radically simplify supporting NVIDIA hardware due to the opportunities discussed elsewhere in this document. Long term we hope to be able to get a better user experience with NVIDIA hardware in terms of out-of-the-box functionality. That means day 1 support for new chipsets, a high performance open source Mesa driver for NVIDIA, and the ability to sign the NVIDIA driver alongside the rest of the kernel to enable things like secure boot support. Since this first release is targeting compute, one can expect that these options will first be available for compute users and then graphics at a later time.

What are the next steps?
Well, there is a lot of work to do here. NVIDIA needs to continue the effort to make this new driver feature complete for both Compute and Graphics Display usecases; we’d like to work together to come up with a plan for what the future unified kernel driver can look like and a model around it that works for both the community and NVIDIA; and we need to add things like a Mesa Vulkan driver. We at Red Hat will be playing an active part in this work as the only Linux vendor with the capacity to do so, and we will also work to ensure that the wider open source community has a chance to participate fully, like we do for all open source efforts we are part of.

If you want to hear more about this, I talked with Chris Fisher and Linux Action News about this topic. Note: I stated some timelines in that interview which I didn’t make clear were my guesstimates and not in any form official NVIDIA timelines, so I apologize for the confusion.

May 10, 2022

In the previous post, I described how we enable multiple syncobjs capabilities in the V3D kernel driver. Now I will tell you what was changed on the userspace side, where we reworked the V3DV sync mechanisms to use Vulkan multiple wait and signal semaphores directly. This change represents greater adherence to the Vulkan submission framework.

I was not used to Vulkan concepts and the V3DV driver. Fortunately, I could count on the guidance of Igalia’s Graphics team, mainly Iago Toral (thanks!), to understand the Vulkan Graphics Pipeline, sync scopes, and submission order. With that, we changed the original V3DV implementation of vkQueueSubmit and all related functions to allow direct mapping of multiple semaphores from V3DV to the V3D-kernel interface.

Disclaimer: Here’s a brief and probably inaccurate background, which we’ll go into in more detail later on.

In Vulkan, GPU work submissions are described as command buffers. These command buffers, with GPU jobs, are grouped in a command buffer submission batch, specified by VkSubmitInfo, and submitted to a queue for execution. vkQueueSubmit is the command called to submit command buffers to a queue. Besides command buffers, VkSubmitInfo also specifies semaphores to wait on before starting the batch execution and semaphores to signal when all command buffers in the batch are complete. Moreover, a fence in vkQueueSubmit can be signaled when all command buffer batches have completed execution.

From this sequence, we can see some implicit ordering guarantees. Submission order defines the start order of execution between command buffers; in other words, it is determined by the order in which pSubmits appear in vkQueueSubmit and pCommandBuffers appear in VkSubmitInfo. However, we don’t have any completion guarantees for jobs submitted to different GPU queues, which means they may overlap and complete out of order. Of course, jobs submitted to the same GPU engine follow start and finish order. For signal operation order, a fence is ordered after all semaphore signal operations. In addition to implicit sync, we also have some explicit sync resources, such as semaphores, fences, and events.

Considering these implicit and explicit sync mechanisms, we reworked the V3DV implementation of queue submissions to make better use of the multiple syncobjs capabilities from the kernel. In this merge request, you can find this work: v3dv: add support to multiple wait and signal semaphores. In this blog post, we run through each scope of change of this merge request for a V3D driver-guided description of the multisync support implementation.

Groundwork and basic code clean-up:

As the original V3D-kernel interface allowed only one semaphore, V3DV resorted to booleans to “translate” multiple semaphores into one. Consequently, if a command buffer batch had at least one semaphore, it needed to wait for all previously submitted jobs to complete before starting its execution. So, instead of just a boolean, we created and changed the structs that store semaphore information to accept the actual list of wait semaphores.
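The difference between the two representations can be sketched as follows. All struct and function names here are hypothetical stand-ins, the handles are plain integers, and the fixed cap of 8 semaphores is arbitrary; the real V3DV structures differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WAIT_SEMS 8 /* arbitrary cap for the sketch */

/* Before: only the fact that *some* wait semaphore exists is recorded. */
struct submit_info_legacy {
    bool has_wait_semaphore;
};

/* After: the actual list of wait semaphores is kept. */
struct submit_info_multisync {
    uint32_t wait_sems[MAX_WAIT_SEMS]; /* hypothetical sync object handles */
    size_t wait_sem_count;
};

void add_wait_semaphore(struct submit_info_multisync *info, uint32_t handle)
{
    assert(info->wait_sem_count < MAX_WAIT_SEMS);
    info->wait_sems[info->wait_sem_count++] = handle;
}

/* Collapsing the list into the legacy form loses which semaphores matter,
 * which is why the old path had to wait for everything. */
struct submit_info_legacy to_legacy(const struct submit_info_multisync *info)
{
    struct submit_info_legacy legacy = {
        .has_wait_semaphore = info->wait_sem_count > 0
    };
    return legacy;
}
```

Keeping the list is what lets each job later wait on exactly the semaphores it depends on, rather than on a single catch-all condition.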

Expose multisync kernel interface to the driver:

In the two commits below, we basically updated the DRM V3D interface to match the one defined in the kernel and checked whether the multisync capability is available for use.

Handle multiple semaphores for all GPU job types:

At this point, we were only changing the submission design to consider multiple wait semaphores. Before supporting multisync, V3DV was waiting for the last job submitted to be signaled when at least one wait semaphore was defined, even when serialization wasn’t required. V3DV handles GPU jobs according to the GPU queue in which they are submitted:

  • Control List (CL) for binning and rendering
  • Texture Formatting Unit (TFU)
  • Compute Shader Dispatch (CSD)

Therefore, we changed their submission setup so that jobs submitted to any GPU queue are able to handle more than one wait semaphore.

These commits created all mechanisms to set arrays of wait and signal semaphores for GPU job submissions:

  • Checking the conditions to define the wait_stage.
  • Wrapping them in a multisync extension.
  • Configuring the generic extension as a multisync extension, according to the kernel interface (described in the previous blog post).

Finally, we extended the ability of GPU jobs to handle multiple signal semaphores, but at this point, no GPU job is actually in charge of signaling them. With this in place, we could rework part of the code that tracks CPU and GPU job completions by verifying the GPU status and threads spawned by Event jobs.

Rework the QueueWaitIdle mechanism to track the syncobj of the last job submitted in each queue:

As we had only single in/out syncobj interfaces for semaphores, we used a single last_job_sync to synchronize job dependencies of the previous submission. Although the DRM scheduler guarantees the order in which jobs in the same queue start executing in the kernel space, the order of completion isn’t predictable. On the other hand, we still needed to use syncobjs to follow job completion since we have event threads on the CPU side. Therefore, a more accurate implementation requires last_job syncobjs to track when each engine (CL, TFU, and CSD) is idle. We also needed to keep the driver working on previous versions of the v3d kernel driver with single semaphores, so we kept tracking ANY last_job_sync to preserve the previous implementation.

Rework synchronization and submission design to let the jobs handle wait and signal semaphores:

With multiple semaphores support, the conditions for waiting on and signaling semaphores changed according to the particularities of each GPU job (CL, CSD, TFU) and CPU job restrictions (Events, CSD indirect, etc.). In this sense, we redesigned V3DV semaphore handling and job submissions for command buffer batches in vkQueueSubmit.

We carefully scrutinized the possible scenarios for submitting command buffer batches before changing the original implementation. This resulted in three more commits:

We keep track of whether we have submitted a job to each GPU queue (CSD, TFU, CL) and a CPU job for each command buffer. We use syncobjs to track the last job submitted to each GPU queue and a flag that indicates if this represents the beginning of a command buffer.

The first GPU job submitted to a GPU queue in a command buffer should wait on wait semaphores. The first CPU job submitted in a command buffer should call v3dv_QueueWaitIdle() to do the waiting and ignore semaphores (because it is waiting for everything).

If the job is not the first but has the serialize flag set, it should wait for the completion of the last job submitted to each GPU queue before running. In practice, this means using syncobjs to track the last job submitted per queue and adding these syncobjs as job dependencies of the serialized job.

If this job is the last job of a command buffer batch, it may be used to signal semaphores if this command buffer batch has only one type of GPU job (because we have guarantees of execution ordering). Otherwise, we emit a no-op job just to signal semaphores. It waits for the completion of the last job submitted to each GPU queue and then signals the semaphores. Note: at some point, we changed this approach to correctly deal with ordering changes caused by event threads. Whenever we have an event job in the command buffer, we cannot rely on the “last job in the last command buffer” assumption. We have to wait for all event threads to complete before signaling.

After submitting all command buffers, we emit a no-op job that waits for the last job of each queue to complete and then signals the fence. Note: at some point, we changed this approach to correctly deal with ordering changes caused by event threads, as mentioned before.
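The waiting rules described above can be sketched as a small dependency-gathering helper. This is only an illustration under stated assumptions: the names `batch_state`, `collect_wait_deps` and `record_submission` are hypothetical and do not come from the V3DV source, syncobjs are modeled as plain 32-bit handles with 0 meaning "no job yet", and event threads and CPU jobs are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not the V3DV source. */
enum queue { QUEUE_CL, QUEUE_CSD, QUEUE_TFU, QUEUE_COUNT };

struct batch_state {
    uint32_t last_job_sync[QUEUE_COUNT];  /* handle of the last job per queue, 0 if none */
    bool first_job_pending[QUEUE_COUNT];  /* true until the queue gets its first job */
};

/*
 * Fill `deps` with the syncobj handles a GPU job must wait on and return
 * how many were written. `deps` needs room for wait_count + QUEUE_COUNT.
 */
size_t collect_wait_deps(struct batch_state *st, enum queue q, bool serialize,
                         const uint32_t *wait_sems, size_t wait_count,
                         uint32_t *deps)
{
    size_t n = 0;

    assert(st != NULL && deps != NULL);

    if (st->first_job_pending[q]) {
        /* First job on this queue: wait on the batch's wait semaphores. */
        for (size_t i = 0; i < wait_count; i++)
            deps[n++] = wait_sems[i];
        st->first_job_pending[q] = false;
    } else if (serialize) {
        /* Serialized job: depend on the last job of every GPU queue. */
        for (int i = 0; i < QUEUE_COUNT; i++)
            if (st->last_job_sync[i] != 0)
                deps[n++] = st->last_job_sync[i];
    }
    return n;
}

/* Remember the syncobj that tracks the job just submitted to queue `q`. */
void record_submission(struct batch_state *st, enum queue q, uint32_t sync)
{
    st->last_job_sync[q] = sync;
}
```

A non-first job without the serialize flag gets no extra dependencies in this sketch, since the scheduler already orders job starts within a queue.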

Final considerations

With many changes and many rounds of reviews, the patchset was merged. After more validations and code review, we polished and fixed the implementation together with external contributions:

Also, multisync capabilities enabled us to add new features to V3DV and switch the driver to the common synchronization and submission framework:

  • v3dv: expose support for semaphore imports

    This was waiting for multisync support in the v3d kernel, which is already available. Exposing this feature however enabled a few more CTS tests that exposed pre-existing bugs in the user-space driver so we fix those here before exposing the feature.

  • v3dv: Switch to the common submit framework

    This should give you emulated timeline semaphores for free and kernel-assisted sharable timeline semaphores for cheap once you have the kernel interface wired in.

We used a set of games to ensure no performance regression in the new implementation. For this, we used GFXReconstruct to capture Vulkan API calls when playing those games. Then, we compared results with and without multisync caps in the kernel space and also with multisync enabled on v3dv. We didn’t observe any performance regression, and we even saw improvements when replaying scenes from the vkQuake game.

As you may already know, we at Igalia have been working on several improvements to the 3D rendering drivers for the Broadcom VideoCore GPU, found in Raspberry Pi 4 devices. One of our recent works focused on improving the V3D(V) drivers’ adherence to the Vulkan submission and synchronization framework. We had to cross various layers of the Linux graphics stack to add support for multiple syncobjs to V3D(V), from the Linux/DRM kernel to the Vulkan driver. We have delivered bug fixes, a generic gate to extend job submission interfaces, and a more direct sync mapping of the Vulkan framework. These changes did not impact the performance of the tested games and brought greater precision to the synchronization mechanisms. Ultimately, support for multiple syncobjs opened the door to new features and other improvements to the V3DV submission framework.

DRM Syncobjs

But, first, what are DRM syncobjs?

* DRM synchronization objects (syncobj, see struct &drm_syncobj) provide a
* container for a synchronization primitive which can be used by userspace
* to explicitly synchronize GPU commands, can be shared between userspace
* processes, and can be shared between different DRM drivers.
* Their primary use-case is to implement Vulkan fences and semaphores.
* At it's core, a syncobj is simply a wrapper around a pointer to a struct
* &dma_fence which may be NULL.

And Jason Ekstrand well-summarized dma_fence features in a talk at the Linux Plumbers Conference 2021:

A struct that represents a (potentially future) event:

  • Has a boolean “signaled” state
  • Has a bunch of useful utility helpers/concepts, such as refcount, callback wait mechanisms, etc.

Provides two guarantees:

  • One-shot: once signaled, it will be signaled forever
  • Finite-time: once exposed, it is guaranteed to signal in a reasonable amount of time

What does multiple semaphores support mean for Raspberry Pi 4 GPU drivers?

For our main purpose, the multiple syncobjs support means that V3DV can submit jobs with more than one wait and signal semaphore. In the kernel space, wait semaphores become explicit job dependencies to wait on before executing the job. Signal semaphores (or post dependencies), in turn, work as fences to be signaled when the job completes its execution, unlocking following jobs that depend on its completion.

The multisync support development comprised many decision-making points and steps, summarized as follows:

  • added capabilities to the v3d kernel driver to handle multiple syncobjs;
  • exposed multisync capabilities to userspace through a generic extension;
  • reworked synchronization mechanisms of the V3DV driver to benefit from this feature;
  • enabled the simulator to work with multiple semaphores; and
  • tested on Vulkan games to verify the correctness and possible performance enhancements.

We decided to refactor parts of the V3D(V) submission design in kernel space and userspace during this development. We improved job scheduling on the V3D kernel side and the V3DV job submission design. We also delivered more accurate synchronization mechanisms and further updates in the Broadcom Vulkan driver running on Raspberry Pi 4. Therefore, we summarize here the changes in the kernel space, describing the previous state of the driver, the decisions taken, side improvements, and fixes.

From single to multiple binary in/out syncobjs:

Initially, V3D was very limited in the number of syncobjs per job submission. V3D job interfaces (CL, CSD, and TFU) only supported one syncobj (in_sync) to be added as an execution dependency and one syncobj (out_sync) to be signaled when a submission completes. The only exception was CL submission, which accepts two in_syncs, one for the binner job and another for the render job, but that didn’t change the limited options.

Meanwhile in the userspace, the V3DV driver followed alternative paths to meet Vulkan’s synchronization and submission framework. It needed to handle multiple wait and signal semaphores, but the V3D kernel-driver interface only accepts one in_sync and one out_sync. In short, V3DV had to fit multiple semaphores into one when submitting every GPU job.

Generic ioctl extension

The first decision was how to extend the V3D interface to accept multiple in and out syncobjs. We could extend each ioctl with two entries of syncobj arrays and two entries for their counters. We could create new ioctls with multiple in/out syncobjs. But after examining other drivers’ solutions for extending their submission interfaces, we decided to extend the V3D ioctls (v3d_cl_submit_ioctl, v3d_csd_submit_ioctl, v3d_tfu_submit_ioctl) with a generic ioctl extension.

I found a curious commit message when I was examining how other developers handled the issue in the past:

Author: Chris Wilson <>
Date:   Fri Mar 22 09:23:22 2019 +0000

    drm/i915: Introduce the i915_user_extension_method
    An idea for extending uABI inspired by Vulkan's extension chains.
    Instead of expanding the data struct for each ioctl every time we need
    to add a new feature, define an extension chain instead. As we add
    optional interfaces to control the ioctl, we define a new extension
    struct that can be linked into the ioctl data only when required by the
    user. The key advantage being able to ignore large control structs for
    optional interfaces/extensions, while being able to process them in a
    consistent manner.
    In comparison to other extensible ioctls, the key difference is the
    use of a linked chain of extension structs vs an array of tagged
    pointers. For example,
    struct drm_amdgpu_cs_chunk {
            __u32           chunk_id;
            __u32           length_dw;
            __u64           chunk_data;
    };

So, inspired by amdgpu_cs_chunk and i915_user_extension, we opted to extend the V3D interface through a generic interface. After applying some suggestions from Iago Toral (Igalia) and Daniel Vetter, we reached the following struct:

struct drm_v3d_extension {
	__u64 next;
	__u32 id;
#define DRM_V3D_EXT_ID_MULTI_SYNC		0x01
	__u32 flags; /* mbz */
};

This generic extension has an id to identify the feature/extension we are adding to an ioctl (that maps the related struct type), a pointer to the next extension, and flags (if needed). Whenever we need to extend the V3D interface again for another specific feature, we subclass this generic extension into the specific one instead of extending ioctls indefinitely.
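To make the chaining idea concrete, here is a minimal userspace sketch of walking such an extension chain. It assumes a local mirror of the header (`struct v3d_extension`, with fixed-width types instead of `__u32`/`__u64`) and a hypothetical helper name `find_extension`. The real kernel code has to copy each header from user memory before reading it, which this sketch skips by treating `next` as a directly readable address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the generic extension header shown above. */
struct v3d_extension {
    uint64_t next;   /* address of the next extension, 0 ends the chain */
    uint32_t id;     /* selects the concrete extension type */
    uint32_t flags;  /* must be zero */
};

#define EXT_ID_MULTI_SYNC 0x01

/*
 * Walk the chain starting at address `first` and return the first
 * extension whose id matches. Returns NULL when nothing matches or when
 * a visited header carries non-zero flags.
 */
const struct v3d_extension *find_extension(uint64_t first, uint32_t id)
{
    uint64_t addr = first;

    while (addr != 0) {
        const struct v3d_extension *ext =
            (const struct v3d_extension *)(uintptr_t)addr;

        if (ext->flags != 0)
            return NULL;
        if (ext->id == id)
            return ext;
        addr = ext->next;
    }
    return NULL;
}
```

Because every specific extension starts with this header, a match can be cast to its concrete struct type once the id is known.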

Multisync extension

For the multiple syncobjs extension, we define a multi_sync extension struct that subclasses the generic extension struct. It has arrays of in and out syncobjs, the respective number of elements in each of them, and a wait_stage value used in CL submissions to determine which job needs to wait for syncobjs before running.

struct drm_v3d_multi_sync {
	struct drm_v3d_extension base;
	/* Array of wait and signal semaphores */
	__u64 in_syncs;
	__u64 out_syncs;

	/* Number of entries */
	__u32 in_sync_count;
	__u32 out_sync_count;

	/* set the stage (v3d_queue) to sync */
	__u32 wait_stage;

	__u32 pad; /* mbz */
};

And if a multisync extension is defined, the V3D driver ignores the previous interface of single in/out syncobjs.
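As a rough illustration of how userspace might fill this extension before a submission, the sketch below uses local mirrors of the two structs and a hypothetical helper `multi_sync_init`. One loud assumption: the arrays are modeled as plain 32-bit syncobj handles for simplicity, while the element type the kernel actually expects is defined in the uAPI header and is not shown in this post.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirrors of the structs above, with fixed-width types. */
struct v3d_extension {
    uint64_t next;
    uint32_t id;
    uint32_t flags;
};

struct v3d_multi_sync {
    struct v3d_extension base;
    uint64_t in_syncs;        /* user pointer to the wait array */
    uint64_t out_syncs;       /* user pointer to the signal array */
    uint32_t in_sync_count;
    uint32_t out_sync_count;
    uint32_t wait_stage;
    uint32_t pad;             /* must be zero */
};

/* Hypothetical helper: element type of the arrays is an assumption. */
void multi_sync_init(struct v3d_multi_sync *ms,
                     const uint32_t *waits, uint32_t wait_count,
                     const uint32_t *signals, uint32_t signal_count,
                     uint32_t wait_stage)
{
    assert(ms != NULL);

    memset(ms, 0, sizeof(*ms));   /* zeroes next, flags and pad */
    ms->base.id = 0x01;           /* DRM_V3D_EXT_ID_MULTI_SYNC */
    ms->in_syncs = (uint64_t)(uintptr_t)waits;
    ms->out_syncs = (uint64_t)(uintptr_t)signals;
    ms->in_sync_count = wait_count;
    ms->out_sync_count = signal_count;
    ms->wait_stage = wait_stage;
}
```

Zeroing the whole struct first keeps the "must be zero" fields valid even if the struct gains members later.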

Once we had the interface to support multiple in/out syncobjs, the v3d kernel driver needed to handle it. As V3D uses the DRM scheduler for job execution, changing from a single syncobj to multiple ones was quite straightforward. V3D copies the in syncobjs from userspace and uses drm_syncobj_find_fence() + drm_sched_job_add_dependency() to add all in_syncs (wait semaphores) as job dependencies, i.e. syncobjs to be checked by the scheduler before running the job. On CL submissions, we have the bin and render jobs, so V3D follows the value of wait_stage to determine which job depends on those in_syncs to start its execution.

When V3D defines the last job in a submission, it replaces dma_fence of out_syncs with the done_fence from this last job. It uses drm_syncobj_find() + drm_syncobj_replace_fence() to do that. Therefore, when a job completes its execution and signals done_fence, all out_syncs are signaled too.
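The effect of replacing the fence of every out_sync with the same done_fence can be shown with a toy model. The types and functions below (`struct fence`, `struct syncobj`, `attach_done_fence`) are simplified stand-ins invented for this sketch, not the kernel's dma_fence or drm_syncobj API, and they ignore reference counting and locking entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fence { bool signaled; };           /* stand-in for a dma_fence */
struct syncobj { struct fence *fence; };   /* wrapper around a fence pointer, may be NULL */

/* Point every out syncobj at the last job's done fence. */
void attach_done_fence(struct syncobj *out_syncs, size_t count,
                       struct fence *done)
{
    assert(done != NULL);
    for (size_t i = 0; i < count; i++)
        out_syncs[i].fence = done;
}

/* One-shot: once set, the signaled state is never cleared. */
void fence_signal(struct fence *f)
{
    f->signaled = true;
}

bool syncobj_is_signaled(const struct syncobj *s)
{
    return s->fence != NULL && s->fence->signaled;
}
```

Since all out syncobjs share one fence, a single signal on job completion is visible through each of them.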

Other improvements to v3d kernel driver

This work also made possible some improvements in the original implementation. Following Iago’s suggestions, we refactored the job’s initialization code to allocate memory and initialize a job in one go. With this, we started to clean up resources more cohesively, clearly distinguishing cleanups in case of failure from job completion. We also fixed the resource cleanup when a job is aborted before the DRM scheduler arms it - at that point, drm_sched_job_arm() had recently been introduced to job initialization. Finally, we prepared the semaphore interface to implement timeline syncobjs in the future.

Going Up

The patchset that adds multiple syncobjs support and improvements to V3D is available here and comprises four patches:

  • drm/v3d: decouple adding job dependencies steps from job init
  • drm/v3d: alloc and init job in one shot
  • drm/v3d: add generic ioctl extension
  • drm/v3d: add multiple syncobjs support

After extending the V3D kernel interface to accept multiple syncobjs, we worked on V3DV to benefit from V3D multisync capabilities. In the next post, I will describe a little of this work.

May 09, 2022

As a board, we have been working on several initiatives to make the Foundation a better asset for the GNOME Project. We’re working on a number of threads in parallel, so I wanted to explain the “big picture” a bit more to try and connect together things like the new ED search and the bylaw changes.

We’re all here to see free and open source software succeed and thrive, so that people can be truly empowered with agency over their technology, rather than being passive consumers. We want to bring GNOME to as many people as possible so that they have computing devices that they can inspect, trust, share and learn from.

In previous years we’ve tried to boost the relevance of GNOME (or technologies such as GTK) or solicit donations from businesses and individuals with existing engagement in FOSS ideology and technology. The problem with this approach is that we’re mostly addressing people and organisations who are already supporting or contributing FOSS in some way. To truly scale our impact, we need to look to the outside world, build better awareness of GNOME outside of our current user base, and find opportunities to secure funding to invest back into the GNOME project.

The Foundation supports the GNOME project with infrastructure, arranging conferences, sponsoring hackfests and travel, design work, legal support, managing sponsorships, advisory board, being the fiscal sponsor of GNOME, GTK, Flathub… and we will keep doing all of these things. What we’re talking about here are additional ways for the Foundation to support the GNOME project – we want to go beyond these activities, and invest into GNOME to grow its adoption amongst people who need it. This has a cost, and that means in parallel with these initiatives, we need to find partners to fund this work.

Neil has previously talked about themes such as education, advocacy, privacy, but we’ve not previously translated these into clear specific initiatives that we would establish in addition to the Foundation’s existing work. This is all a work in progress and we welcome any feedback from the community about refining these ideas, but here are the current strategic initiatives the board is working on. We’ve been thinking about growing our community by encouraging and retaining diverse contributors, and addressing evolving computing needs which aren’t currently well served on the desktop.

Initiative 1: Welcoming newcomers. The community is already spending a lot of time welcoming newcomers and teaching them the best practices. Those activities are as time consuming as they are important, but currently a handful of individuals are running initiatives such as GSoC, Outreachy and outreach to universities. These activities help bring diverse individuals and perspectives into the community, and help them develop skills and experience of collaborating to create Open Source projects. We want to make those efforts more sustainable by finding sponsors for these activities. With funding, we can hire people to dedicate their time to operating these programs, including paid mentors and creating materials to support newcomers in future, such as developer documentation, examples and tutorials. This is the initiative that needs to be refined the most before we can turn it into something real.

Initiative 2: Diverse and sustainable Linux app ecosystem. I spoke at the Linux App Summit about the work that GNOME and Endless have been supporting in Flathub, and this is an example of something which has a great overlap between commercial, technical and mission-based advantages. The key goal here is to improve the financial sustainability of participating in our community, which in turn has an impact on the diversity of who we can expect to afford to enter and remain in our community. We believe the existence of this is critically important for individual developers and contributors to unlock earning potential from our ecosystem, through donations or app sales. In turn, a healthy app ecosystem also improves the usefulness of the Linux desktop as a whole for potential users. We believe that we can build a case for commercial vendors in the space to join an advisory board alongside GNOME, KDE, etc. to input into the governance and contribute to the costs of growing Flathub.

Initiative 3: Local-first applications for the GNOME desktop. This is what Thib has been starting to discuss on Discourse, in this thread. There are many different threats to free access to computing and information in today’s world. The GNOME desktop and apps need to give users convenient and reliable access to technology which works similarly to the tools they already use everyday, but keeps them and their data safe from surveillance, censorship, filtering or just being completely cut off from the Internet. We believe that we can seek both philanthropic and grant funding for this work. It will make GNOME a more appealing and comprehensive offering for the many people who want to protect their privacy.

The idea is that these initiatives all sit on the boundary between the GNOME community and the outside world. If the Foundation can grow and deliver these kinds of projects, we are reaching out to new people, new contributors and new funding. These contributions and investments back into GNOME represent a true “win-win” for the newcomers and our existing community.

(Originally posted to GNOME Discourse, please feel free to join the discussion there.)

Sometimes you want to go and inspect details of the shaders that are used with specific draw calls in a frame. With RenderDoc this is really easy if the driver implements VK_KHR_pipeline_executable_properties. This extension allows applications to query the driver about various aspects of the executable code generated for a Vulkan pipeline.

I implemented this extension for V3DV, the Vulkan driver for Raspberry Pi 4, last week (it is currently in the review process) because I was tired of jumping through hoops to get the info I needed when looking at traces. For V3DV we expose the NIR and QPU assembly code as well as various other stats, some of which are quite relevant to performance, such as spill or thread counts.

Some shader statistics

Final NIR code

QPU assembly
May 02, 2022

TLDR: Hermetic /usr/ is awesome; let's popularize image-based OSes with modernized security properties built around immutability, SecureBoot, TPM2, adaptability, auto-updating, factory reset, uniformity – built from traditional distribution packages, but deployed via images.

Over the past years, systemd gained a number of components for building Linux-based operating systems. While these components individually have been adopted by many distributions and products for specific purposes, we did not publicly communicate a broader vision of how they should all fit together in the long run. In this blog story I hope to provide that from my personal perspective, i.e. explain how I personally would build an OS and where I personally think OS development with Linux should go.

I figure this is going to be a longer blog story, but I hope it will be equally enlightening. Please understand though that everything I write about OS design here is my personal opinion, and not that of my employer.

For the last 12 years or so I have been working on Linux OS development, mostly around systemd. In all those years I had a lot of time thinking about the Linux platform, and specifically traditional Linux distributions and their strengths and weaknesses. I have seen many attempts to reinvent Linux distributions in one way or another, to varying success. After all this most would probably agree that the traditional RPM or dpkg/apt-based distributions still define the Linux platform more than others (for 25+ years now), even though some Linux-based OSes (Android, ChromeOS) probably outnumber the installations overall.

And over all those 12 years I kept wondering, how would I actually build an OS for a system or for an appliance, and what are the components necessary to achieve that. And most importantly, how can we make these components generic enough so that they are useful in generic/traditional distributions too, and in other use cases than my own.

The Project

Before figuring out how I would build an OS it's probably good to figure out what type of OS I actually want to build, what purpose I intend to cover. I think a desktop OS is probably the most interesting. Why is that? Well, first of all, I use one of these for my job every single day, so I care immediately, it's my primary tool of work. But more importantly: I think building a desktop OS is one of the most complex overall OS projects you can work on, simply because desktops are so much more versatile and variable than servers or embedded devices. If one figures out the desktop case, I think there's a lot more to learn from, and reuse in the server or embedded case, than going the other way. After all, there's a reason why so much of the widely accepted Linux userspace stack comes from people with a desktop background (including systemd, BTW).

So, let's see how I would build a desktop OS. If you press me hard, and ask me why I would do that given that ChromeOS already exists and more or less is a Linux desktop OS: there's plenty I am missing in ChromeOS, but most importantly, I am a lot more interested in building something people can easily and naturally rebuild and hack on, i.e. Google-style over-the-wall open source with its skewed power dynamic is not particularly attractive to me. I much prefer building this within the framework of a proper open source community, out in the open, and basing all this strongly on the status quo ante, i.e. the existing distributions. I think it is crucial to provide a clear avenue to build a modern OS based on the existing distribution model, if there shall ever be a chance to make this interesting for a larger audience.

(Let me underline though: even though I am going to focus on a desktop here, most of this is directly relevant for servers as well, in particular container host OSes and suchlike, or embedded devices, e.g. car IVI systems and so on.)

Design Goals

  1. First and foremost, I think the focus must be on an image-based design rather than a package-based one. For robustness and security it is essential to operate with reproducible, immutable images that describe the OS or large parts of it in full, rather than operating always with fine-grained RPM/dpkg style packages. That's not to say that packages are not relevant (I actually think they matter a lot!), but I think they should be less of a tool for deploying code and more one for building the objects to deploy. A different way to see this: any OS built like this must be easy to replicate in a large number of instances, with minimal variability. Regardless of whether we talk about desktops, servers or embedded devices: the focus for my OS should be on "cattle", not "pets", i.e. that from the start it's trivial to reuse the well-tested, cryptographically signed combination of software over a large set of devices the same way, with a maximum of bit-exact reuse and a minimum of local variances.

  2. The trust chain matters, from the boot loader all the way to the apps. This means all code that is run must be cryptographically validated before it is run. All storage must be cryptographically protected: public data must be integrity checked; private data must remain confidential.

    This is in fact where big distributions currently fail pretty badly. I would go as far as saying that SecureBoot on Linux distributions is mostly security theater at this point, if you so will. That's because the initrd that unlocks your FDE (i.e. the cryptographic concept that protects the rest of your system) is not signed or protected in any way. It's trivial to modify for an attacker with access to your hard disk in an undetectable way, and collect your FDE passphrase. The involved bureaucracy around the implementation of UEFI SecureBoot of the big distributions is to a large degree pointless if you ask me, given that once the kernel is assumed to be in a good state, as the next step the system invokes completely unsafe code with full privileges.

    This is a fault of current Linux distributions though, not of SecureBoot in general. Other OSes use this functionality in more useful ways, and we should correct that too.

  3. Pretty much the same thing: offline security matters. I want my data to be reasonably safe at rest, i.e. cryptographically inaccessible even when I leave my laptop in my hotel room, suspended.

  4. Everything should be cryptographically measured, so that remote attestation is supported for as much software shipped on the OS as possible.

  5. Everything should be self-descriptive, and have single sources of truth that are closely attached to the object itself, instead of stored externally.

  6. Everything should be self-updating. Today we know that software is never bug-free, and thus requires a continuous update cycle. Not only the OS itself, but also any extensions, services and apps running on it.

  7. Everything should be robust in respect to aborted OS operations, power loss and so on. It should be robust towards hosed OS updates (regardless if the download process failed, or the image was buggy), and not require user interaction to recover from them.

  8. There must always be a way to put the system back into a well-defined, guaranteed safe state ("factory reset"). This includes that all sensitive data from earlier uses becomes cryptographically inaccessible.

  9. The OS should enforce clear separation between vendor resources, system resources and user resources: conceptually and when it comes to cryptographical protection.

  10. Things should be adaptive: the system should come up and make the best of the system it runs on, adapt to the storage and hardware. Moreover, the system should support execution on bare metal equally well as execution in a VM environment and in a container environment (i.e. systemd-nspawn).

  11. Things should not require explicit installation. i.e. every image should be a live image. For installation it should be sufficient to dd an OS image onto disk. Thus, strong focus on "instantiate on first boot", rather than "instantiate before first boot".

  12. Things should be reasonably minimal. The image the system starts its life with should be quick to download, and not include resources that can as well be created locally later.

  13. System identity, local cryptographic keys and so on should be generated locally, not be pre-provisioned, so that there's no leak of sensitive data during the transport onto the system possible.

  14. Things should be reasonably democratic and hackable. It should be easy to fork an OS, to modify an OS and still get reasonable cryptographic protection. Modifying your OS should not necessarily imply that your "warranty is voided" and you lose all good properties of the OS, if you so will.

  15. Things should be reasonably modular. The privileged part of the core OS must be extensible, including on the individual system. It's not sufficient to support extensibility just through high-level UI applications.

  16. Things should be reasonably uniform, i.e. ideally the same formats and cryptographic properties are used for all components of the system, regardless if for the host OS itself or the payloads it receives and runs.

  17. Even taking all these goals into consideration, it should still be close to traditional Linux distributions, and take advantage of what they are really good at: integration and security update cycles.

Now that we know our goals and requirements, let's start designing the OS along these lines.

Hermetic /usr/

First of all the OS resources (code, data files, …) should be hermetic in an immutable /usr/. This means that a /usr/ tree should carry everything needed to set up the minimal set of directories and files outside of /usr/ to make the system work. This /usr/ tree can then be mounted read-only into the writable root file system that then will eventually carry the local configuration, state and user data in /etc/, /var/ and /home/ as usual.

Thankfully, modern distributions are surprisingly close to working without issues in such a hermetic context. Specifically, Fedora works mostly just fine: it has adopted the /usr/ merge and the declarative systemd-sysusers and systemd-tmpfiles components quite comprehensively, which means the directory trees outside of /usr/ are automatically generated as needed if missing. In particular /etc/passwd and /etc/group (and related files) are appropriately populated, should they be missing entries.

In my model a hermetic OS is hence comprehensively defined within /usr/: combine the /usr/ tree with an empty, otherwise unpopulated root file system, and it will boot up successfully, automatically adding the strictly necessary files, and resources that are necessary to boot up.

Monopolizing vendor OS resources and definitions in an immutable /usr/ opens multiple doors to us:

  • We can apply dm-verity to the whole /usr/ tree, i.e. guarantee structural, cryptographic integrity on the whole vendor OS resources at once, with full file system metadata.

  • We can implement updates to the OS easily: by implementing an A/B update scheme on the /usr/ tree we can update the OS resources atomically and robustly, while leaving the rest of the OS environment untouched.

  • We can implement factory reset easily: erase the root file system and reboot. The hermetic OS in /usr/ has all the information it needs to set up the root file system afresh — exactly like in a new installation.

Initial Look at the Partition Table

So let's have a look at a suitable partition table, taking a hermetic /usr/ into account. Let's conceptually start with a table of four entries:

  1. A UEFI System Partition (required by firmware to boot)

  2. Immutable, Verity-protected, signed file system with the /usr/ tree in version A

  3. Immutable, Verity-protected, signed file system with the /usr/ tree in version B

  4. A writable, encrypted root file system

(This is just for initial illustration here, as we'll see later it's going to be a bit more complex in the end.)

The Discoverable Partitions Specification provides suitable partition type UUIDs for all of the above partitions. This is great, because it makes the image self-descriptive: simply by looking at the image's GPT table we know what to mount where. This means we do not need a manual /etc/fstab, and a multitude of tools such as systemd-nspawn and similar can operate directly on the disk image and boot it up.


Now that we have a rough idea how to organize the partition table, let's look a bit at how to boot into that. In my model "unified kernels" are the way to go, specifically those implementing Boot Loader Specification Type #2. These are basically kernel images that have an initial RAM disk attached to them, as well as a kernel command line, a boot splash image and possibly more, all wrapped into a single UEFI PE binary. By combining these into one we achieve two goals: they become extremely easy to update (i.e. drop in one file, and you update kernel+initrd) and more importantly, you can sign them as one for the purpose of UEFI SecureBoot.

In my model, each version of such a kernel would be associated with exactly one version of the /usr/ tree: both are always updated at the same time. An update then becomes relatively simple: drop in one new /usr/ file system plus one kernel, and the update is complete.

The boot loader used for all this would be systemd-boot, of course. It's a very simple loader, and implements the aforementioned boot loader specification. This means it requires no explicit configuration or anything: it's entirely sufficient to drop in one such unified kernel file, and it will be picked up, and be made a candidate to boot into.

You might wonder how to configure the root file system to boot from with such a unified kernel that contains the kernel command line and is signed as a whole and thus immutable. The idea here is to use the usrhash= kernel command line option implemented by systemd-veritysetup-generator and systemd-fstab-generator. It does two things: it will search and set up a dm-verity volume for the /usr/ file system, and then mount it. It takes the root hash value of the dm-verity Merkle tree as the parameter. This hash is then also used to find the /usr/ partition in the GPT partition table, under the assumption that the partition UUIDs are derived from it, as per the suggestions in the discoverable partitions specification (see above).
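As a rough illustration of how a partition can be located from the root hash alone, the following Python sketch derives two partition UUIDs from a 256-bit hash. It assumes the derivation suggested by the Discoverable Partitions Specification as I understand it: the first 128 bits of the root hash become the UUID of the data partition, the last 128 bits the UUID of the Verity partition. The function name and the example hash are made up.

```python
import uuid


def partition_uuids_from_root_hash(root_hash_hex: str) -> tuple[uuid.UUID, uuid.UUID]:
    """Derive (data partition UUID, Verity partition UUID) from a root hash."""
    raw = bytes.fromhex(root_hash_hex)
    if len(raw) < 32:
        raise ValueError("expected a root hash of at least 256 bits")
    data_uuid = uuid.UUID(bytes=raw[:16])     # first 128 bits (assumed convention)
    verity_uuid = uuid.UUID(bytes=raw[-16:])  # last 128 bits (assumed convention)
    return data_uuid, verity_uuid


# Made-up root hash, as it might appear in a usrhash= kernel command line option.
example_hash = "00112233445566778899aabbccddeeff" "ffeeddccbbaa99887766554433221100"
usr_uuid, usr_verity_uuid = partition_uuids_from_root_hash(example_hash)
print(usr_uuid)         # 00112233-4455-6677-8899-aabbccddeeff
print(usr_verity_uuid)  # ffeeddcc-bbaa-9988-7766-554433221100
```

With such a convention the signed kernel command line only needs to carry the hash: both the partition to mount and the data to verify it against follow from that one value.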

systemd-boot (if not told otherwise) will do a version sort of the kernel image files it finds, and then automatically boot the newest one. Picking a specific kernel to boot will also fixate which version of the /usr/ tree to boot into, because — as mentioned — the Verity root hash of it is built into the kernel command line the unified kernel image contains.

In my model I'd place the kernels directly into the UEFI System Partition (ESP), in order to simplify things. (systemd-boot also supports reading them from a separate boot partition, but let's not complicate things needlessly, at least for now.)

So, with all this, we now already have a boot chain that goes something like this: once the boot loader is run, it will pick the newest kernel, which includes the initial RAM disk and a secure reference to the /usr/ file system to use. This is already great. But a /usr/ alone won't make us happy, we also need a root file system. In my model, that file system would be writable, and the /etc/ and /var/ hierarchies would be located directly on it. Since these trees potentially contain secrets (SSH keys, …) the root file system needs to be encrypted. We'll use LUKS2 for this, of course. In my model, I'd bind this to the TPM2 chip (for compatibility with systems lacking one, we can find a suitable fallback, which then provides weaker guarantees, see below). A TPM2 is a security chip available in most modern PCs. Among other things it contains a persistent secret key that can be used to encrypt data, in a way that only if you possess access to it and can prove you are using validated software you can decrypt it again. The cryptographic measuring I mentioned earlier is what allows this to work. But … let's not get lost too much in the details of TPM2 devices, that'd be material for a novel, and this blog story is going to be way too long already.

What does using a TPM2 bound key for unlocking the root file system get us? We can encrypt the root file system with it, and you can only read or make changes to the root file system if you also possess the TPM2 chip and run our validated version of the OS. This protects us against an evil maid scenario to some level: an attacker cannot just copy the hard disk of your laptop while you leave it in your hotel room, because unless the attacker also steals the TPM2 device it cannot be decrypted. The attacker can also not just modify the root file system, because such changes would be detected on next boot because they aren't done with the right cryptographic key.

So, now we have a system that already can boot up somewhat completely, and run userspace services. All code that is run is verified in some way: the /usr/ file system is Verity protected, and the root hash of it is included in the kernel that is signed via UEFI SecureBoot. And the root file system is locked to the TPM2 where the secret key is only accessible if our signed OS + /usr/ tree is used.

(One brief intermission here: so far all the components I am referencing here exist already, and have been shipped in systemd and other projects already, including the TPM2 based disk encryption. There's one thing missing here however at the moment that still needs to be developed (happy to take PRs!): right now TPM2 based LUKS2 unlocking is bound to PCR hash values. This is hard to work with when implementing updates — what we'd need instead is unlocking by signatures of PCR hashes. TPM2 supports this, but we don't support it yet in our systemd-cryptsetup + systemd-cryptenroll stack.)

One of the goals mentioned above is that cryptographic key material should always be generated locally on first boot, rather than pre-provisioned. This of course has implications for the encryption key of the root file system: if we want to boot into this system we need the root file system to exist, and thus a key already generated that it is encrypted with. But where precisely would we generate it if we have no installer that could generate it while installing (as is done in traditional Linux distribution installers)? My proposed solution here is to use systemd-repart, which is a declarative, purely additive repartitioner. It can run from the initrd to create and format partitions on boot, before transitioning into the root file system. It can also format the partitions it creates and encrypt them, automatically enrolling a TPM2-bound key.

So, let's revisit the partition table we mentioned earlier. Here's what in my model we'd actually ship in the initial image:

  1. A UEFI System Partition (ESP)

  2. An immutable, Verity-protected, signed file system with the /usr/ tree in version A

And that's already it. No root file system, no B /usr/ partition, nothing else. Only two partitions are shipped: the ESP with the systemd-boot loader and one unified kernel image, and the A version of the /usr/ partition. Then, on first boot systemd-repart will notice that the root file system doesn't exist yet, and will create it, encrypt it, and format it, and enroll the key into the TPM2. It will also create the second /usr/ partition (B) that we'll need for later A/B updates (which will be created empty for now, until the first update operation actually takes place, see below). Once done the initrd will combine the fresh root file system with the shipped /usr/ tree, and transition into it. Because the OS is hermetic in /usr/ and contains all the systemd-tmpfiles and systemd-sysusers information it can then set up the root file system properly and create any directories and symlinks (and maybe a few files) necessary to operate.
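The "declarative, purely additive" behaviour on first boot can be modelled in a few lines of Python. This is only a toy model of the idea, not how systemd-repart is implemented or configured: `PartitionDef`, the partition names and `plan_first_boot` are hypothetical, and the real tool matches partitions by type rather than by name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionDef:
    name: str              # hypothetical identifier used by this sketch
    encrypt: bool = False  # whether the partition is encrypted when created


# Declarations shipped in the image (assumed layout, following the text above).
DEFINITIONS = [
    PartitionDef("esp"),
    PartitionDef("usr-a"),
    PartitionDef("usr-b"),
    PartitionDef("root", encrypt=True),
]


def plan_first_boot(existing: set[str], definitions=DEFINITIONS) -> list[str]:
    """List what must be created; existing partitions are never touched."""
    actions = []
    for definition in definitions:
        if definition.name in existing:
            continue  # additive only: leave what is already there alone
        step = f"create {definition.name}"
        if definition.encrypt:
            step += " (encrypt, enroll TPM2-bound key)"
        actions.append(step)
    return actions


# The shipped image only carries the ESP and the A version of /usr/.
print(plan_first_boot({"esp", "usr-a"}))
# ['create usr-b', 'create root (encrypt, enroll TPM2-bound key)']
```

Running the same plan on a disk that already has all four partitions yields an empty list, which is what makes it safe to run on every boot.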

Besides the fact that the root file system's encryption keys are generated on the system we boot from and never leave it, it is also pretty nice that the root file system will be sized dynamically, taking into account the physical size of the backing storage. This is perfect, because on first boot the image will automatically adapt to what it has been dd'ed onto.

Factory Reset

This is a good point to talk about the factory reset logic, i.e. the mechanism to place the system back into a known good state. This is important for two reasons: in our laptop use case, once you want to pass the laptop to someone else, you want to ensure your data is fully and comprehensively erased. Moreover, if you have reason to believe your device was hacked you want to revert the device to a known good state, i.e. ensure that exploits cannot persist. systemd-repart already has a mechanism for it. In the declarations of the partitions the system should have, entries may be marked to be candidates for erasing on factory reset. The actual factory reset is then requested by one of two means: by specifying a specific kernel command line option (which is not too interesting here, given we lock that down via UEFI SecureBoot; but then again, one could also add a second kernel to the ESP that is identical to the first, with the only difference that it lists this command line option: thus when the user selects this entry it will initiate a factory reset), or via an EFI variable that can be set and is honoured on the immediately following boot. So here's how a factory reset would then go down: once the factory reset is requested it's enough to reboot. On the subsequent boot systemd-repart runs from the initrd, where it will honour the request and erase the partitions marked for erasing. Once that is complete the system is back in the state we shipped the system in: only the ESP and the /usr/ file system will exist, but the root file system is gone. And from here we can continue as on the original first boot: create a new root file system (and any other partitions), and encrypt/set it up afresh.
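The two-phase flow described here (erase what is marked, then proceed as on first boot) can be sketched as follows. This is a simplified Python model; the `DECLARED` table, the partition names and `run_initrd_repart` are hypothetical and do not mirror systemd-repart's actual configuration format.

```python
# Hypothetical declarations: partition name -> erase on factory reset?
DECLARED = {"esp": False, "usr": False, "root": True}


def run_initrd_repart(disk: set[str], reset_requested: bool) -> tuple[set[str], list[str], list[str]]:
    """Return (resulting partitions, erased partitions, created partitions)."""
    erased: list[str] = []
    if reset_requested:
        # Only partitions explicitly marked for factory reset are erased.
        erased = [name for name, marked in DECLARED.items() if marked and name in disk]
        disk = disk - set(erased)
    # Afterwards the usual first-boot logic applies: create whatever is missing.
    created = [name for name in DECLARED if name not in disk]
    return disk | set(created), erased, created


result, erased, created = run_initrd_repart({"esp", "usr", "root"}, reset_requested=True)
print(erased)   # ['root']
print(created)  # ['root'], a fresh root file system with a fresh key
```

The interesting property is that no separate "reinstall" code path is needed: a factory reset simply returns the disk to the shipped state and lets the normal boot logic take over.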

So now we have a nice setup, where everything is either signed or encrypted securely. The system can adapt to the system it is booted on automatically on first boot, and can easily be brought back into a well defined state identical to the way it was shipped in.


But of course, such a monolithic, immutable system is only useful for very specific purposes. If /usr/ can't be written to (at least in the traditional sense), one cannot just go and install a new software package that one needs. So here two goals are superficially conflicting: on one hand one wants modularity, i.e. the ability to add components to the system, and on the other immutability, i.e. that precisely this is prohibited.

So let's see what I propose as a middle ground in my model. First, what's the precise use case for such modularity? I see a couple of different ones:

  1. For some cases it is necessary to extend the system itself at the lowest level, so that the components added in extend (or maybe even replace) the resources shipped in the base OS image, so that they live in the same namespace, and are subject to the same security restrictions and privileges. Exposure to the details of the base OS and its interface for this kind of modularity is at the maximum.

    Example: a module that adds a debugger or tracing tools into the system. Or maybe an optional hardware driver module.

  2. In other cases, more isolation is preferable: instead of extending the system resources directly, additional services shall be added in that bring their own files, can live in their own namespace (but with "windows" into the host namespaces), however still are system components, and provide services to other programs, whether local or remote. Exposure to the details of the base OS for this kind of modularity is restricted: it mostly focuses on the ability to consume and provide IPC APIs from/to the system. Components of this type can still be highly privileged, but the level of integration is substantially smaller than for the type explained above.

    Example: a module that adds a specific VPN connection service to the OS.

  3. Finally, there's the actual payload of the OS. This stuff is relatively isolated from the OS and definitely from each other. It mostly consumes OS APIs, and generally doesn't provide OS APIs. This kind of stuff runs with minimal privileges, and in its own namespace of concepts.

    Example: a desktop app, for reading your emails.

Of course, the lines between these three types of modules are blurry, but I think distinguishing them does make sense, as I think different mechanisms are appropriate for each. So here's what I'd propose in my model to use for this.

  1. For the system extension case I think the systemd-sysext images are appropriate. This tool operates on system extension images that are very similar to the host's disk image: they also contain a /usr/ partition, protected by Verity. However, they just include additions to the host image: binaries that extend the host. When such a system extension image is activated, it is merged via an immutable overlayfs mount into the host's /usr/ tree. Thus any file shipped in such a system extension will suddenly appear as if it was part of the host OS itself. For optional components that should be considered part of the OS more or less this is a very simple and powerful way to combine an immutable OS with an immutable extension. Note that most likely extensions for an OS matching this tool should be built at the same time within the same update cycle scheme as the host OS itself. After all, the files included in the extensions will have dependencies on files in the system OS image, and care must be taken that these dependencies remain in order.

  2. For adding in additional somewhat isolated system services in my model, Portable Services are the proposed tool of choice. Portable services are in most ways just like regular system services; they could be included in the system OS image or an extension image. However, portable services use RootImage= to run off separate disk images, thus within their own namespace. Images set up this way have various ways to integrate into the host OS, as they are in most ways regular system services, which just happen to bring their own directory tree. Also, unlike regular system services, for them sandboxing is opt-out rather than opt-in. In my model, here too the disk images are Verity protected and thus immutable. Just like the host OS they are GPT disk images that come with a /usr/ partition and Verity data, along with signing.

  3. Finally, the actual payload of the OS, i.e. the apps. To be useful in real life here it is important to hook into existing ecosystems, so that a large set of apps are available. Given that on Linux flatpak (or, on servers, OCI containers) is the established format that has pretty much won, these are probably the way to go. That said, I think both of these mechanisms have relatively weak properties, in particular when it comes to security, since immutability/measurements and similar are not provided. This means that, unlike for system extensions and portable services, a complete trust chain with attestation and per-app cryptographically protected data is much harder to implement sanely.

What I'd like to underline here is that the main system OS image, as well as the system extension images and the portable service images are put together the same way: they are GPT disk images, with one immutable file system and associated Verity data. The latter two should also contain a PKCS#7 signature for the top-level Verity hash. This uniformity has many benefits: you can use the same tools to build and process these images, but most importantly: by using a single way to validate them throughout the stack (i.e. Verity, in the latter cases with PKCS#7 signatures), validation and measurement is straightforward. In fact it's so obvious that we don't even have to implement it in systemd: the kernel has direct support for this Verity signature checking natively already (IMA).

So, by composing a system at runtime from a host image, extension images and portable service images we have a nicely modular system where every single component is cryptographically validated on every single IO operation, and every component is measured, in its entire combination, directly in the kernel's IMA subsystem.

(Of course, once you add the desktop apps or OCI containers on top, then these properties are lost further down the chain. But well, a lot is already won, if you can close the chain that far down.)

Note that system extensions are not designed to replicate the fine grained packaging logic of RPM/dpkg. Of course, systemd-sysext is a generic tool, so you can use it for whatever you want, but there's a reason it does not bring support for a dependency language: the goal here is not to replicate traditional Linux packaging (we have that already, in RPM/dpkg, and I think they are actually OK for what they do) but to provide delivery of larger, coarser sets of functionality, in lockstep with the underlying OS' life-cycle and in particular with no interdependencies, except on the underlying OS.

Also note that depending on the use case it might make sense to also use system extensions to modularize the initrd step. This is probably less relevant for a desktop OS, but for server systems it might make sense to package up support for specific complex storage in a systemd-sysext system extension, which can be applied to the initrd that is built into the unified kernel. (In fact, we have been working on implementing signed yet modular initrd support to general purpose Fedora this way.)

Note that portable services are composable from system extensions too, by the way. This makes them even more useful, as you can share a common runtime between multiple portable services, or even use the host image as common runtime for portable services. In this model a common runtime image is shared between one or more system extensions, and composed at runtime via an overlayfs instance.

More Modularity: Secondary OS Installs

Having an immutable, cryptographically locked down host OS is great I think, and if we have some moderate modularity on top, that's also great. But oftentimes it's useful to be able to depart/compromise for some specific use cases from that, i.e. provide a bridge for example to allow workloads designed around RPM/dpkg package management to coexist reasonably nicely with such an immutable host.

For this purpose in my model I'd propose using systemd-nspawn containers. The containers are focused on OS containerization, i.e. they allow you to run a full OS with init system and everything as payload (unlike for example Docker containers which focus on a single service, and where running a full OS in it is a mess).

Running systemd-nspawn containers for such secondary OS installs has various nice properties. One of course is that systemd-nspawn supports the same level of cryptographic image validation that we rely on for the host itself. Thus, to some level the whole OS trust chain is reasonably recursive if desired: the firmware validates the OS, and the OS can validate a secondary OS installed within it. In fact, we can run our trusted OS recursively on itself and get similar security guarantees! Besides these security aspects, systemd-nspawn also has really nice properties when it comes to integration with the host. For example the --bind-user= switch permits binding a host user record and their home directory into a container as a simple one-step operation. This makes it extremely easy to have a single user and $HOME but share it concurrently with the host and a zoo of secondary OSes in systemd-nspawn containers, which each could run different distributions even.

Developer Mode

Superficially, an OS with an immutable /usr/ appears much less hackable than an OS where everything is writable. Moreover, an OS where everything must be signed and cryptographically validated makes it hard to insert your own code, given you are unlikely to possess access to the signing keys.

To address this issue other systems have supported a "developer" mode: when entered the security guarantees are disabled, and the system can be freely modified, without cryptographic validation. While that's a great concept to have I doubt it's what most developers really want: the cryptographic properties of the OS are great after all, it sucks having to give them up once developer mode is activated.

In my model I'd thus propose two different approaches to this problem. First of all, I think there's value in allowing users to additively extend/override the OS via local developer system extensions. With this scheme the underlying cryptographic validation would remain intact, but (if this form of development mode is explicitly enabled) the developer could add in more resources from local storage, that are not tied to the OS builder's chain of trust, but a local one (i.e. simply backed by encrypted storage of some form).

The second approach is to make it easy to extend (or in fact replace) the set of trusted validation keys, with local ones that are under the control of the user, in order to make it easy to operate with kernel, OS, extension, portable service or container images signed by the local developer without involvement of the OS builder. This is relatively easy to do for components down the trust chain, i.e. the elements further up the chain should optionally allow additional certificates to allow validation with.

(Note that systemd currently has no explicit support for a "developer" mode like this. I think we should add that sooner or later however.)

Democratizing Code Signing

Closely related to the question of developer mode is the question of code signing. If you ask me, the status quo of UEFI SecureBoot code signing in the major Linux distributions is pretty sad. The work to get stuff signed is massive, but in effect it delivers very little in return: because initrds are entirely unprotected, and reside on partitions lacking any form of cryptographic integrity protection, any attacker can trivially modify the boot process of any such Linux system and freely collect the FDE passphrases entered. There's little value in signing the boot loader and kernel in a complex bureaucracy if it then happily loads entirely unprotected code that processes the actually relevant security credentials: the FDE keys.

In my model, through use of unified kernels this important gap is closed, hence UEFI SecureBoot code signing becomes an integral part of the boot chain from firmware to the host OS. Unfortunately, code signing and having something a user can locally hack are to some level conflicting goals. However, I think we can improve the situation here, and put more emphasis on enrolling developer keys in the trust chain easily. Specifically, I see one relevant approach here: enrolling keys directly in the firmware is something that we should make less of a theoretical exercise and more something we can realistically deploy. See this work in progress making this more automatic and eventually safe. Other approaches are thinkable (including some that build on existing MokManager infrastructure), but given the politics involved, are harder to conclusively implement.

Running the OS itself in a container

What I explain above is put together with running on a bare metal system in mind. However, one of the stated goals is to make the OS adaptive enough to also run in a container environment (specifically: systemd-nspawn) nicely. Booting a disk image on bare metal or in a VM generally means that the UEFI firmware validates and invokes the boot loader, and the boot loader invokes the kernel which then transitions into the final system. This is different for containers: here the container manager immediately calls the init system, i.e. PID 1. Thus the validation logic must be different: cryptographic validation must be done by the container manager. In my model this is solved by shipping the OS image not only with a Verity data partition (as is already necessary for the UEFI SecureBoot trust chain, see above), but also with another partition, containing a PKCS#7 signature of the root hash of said Verity partition. This of course is exactly what I propose for both the system extension and portable service image. Thus, in my model the images for all three uses are put together the same way: an immutable /usr/ partition, accompanied by a Verity partition and a PKCS#7 signature partition. The OS image itself then has two ways "into" the trust chain: either through the signed unified kernel in the ESP (which is used for bare metal and VM boots) or by using the PKCS#7 signature stored in the partition (which is used for container/systemd-nspawn boots).

Parameterizing Kernels

A fully immutable and signed OS has to establish trust in the user data it makes use of before doing so. In the model I describe here, for /etc/ and /var/ we do this via disk encryption of the root file system (in combination with integrity checking). But the point where the root file system is mounted comes relatively late in the boot process, and thus cannot be used to parameterize the boot itself. In many cases it's important to be able to parameterize the boot process however.

For example, for the implementation of the developer mode indicated above it's useful to be able to pass this fact safely to the initrd, in combination with other fields (e.g. hashed root password for allowing in-initrd logins for debug purposes). After all, if the initrd is pre-built by the vendor and signed as a whole together with the kernel it cannot be modified to carry such data directly (which is in fact how parameterizing of the initrd to a large degree was traditionally done).

In my model this is achieved through system credentials, which allow passing parameters to systems (and services for that matter) in an encrypted and authenticated fashion, bound to the TPM2 chip. This means that we can securely pass data into the initrd so that it can be authenticated and decrypted only on the system it is intended for and with the unified kernel image it was intended for.


In my model the OS would also carry a swap partition, for the simple reason that only then systemd-oomd.service can provide the best results. Also see In defence of swap: common misconceptions.

Updating Images

We have a rough idea how the system shall be organized now, let's next focus on the deployment cycle: software needs regular update cycles, and software that is not updated regularly is a security problem. Thus, I am sure that any modern system must be automatically updated, without this requiring avoidable user interaction.

In my model, this is the job for systemd-sysupdate. It's a relatively simple A/B image updater: it operates either on partitions, on regular files in a directory, or on subdirectories in a directory. Each entry has a version (which is encoded in the GPT partition label for partitions, and in the filename for regular files and directories): whenever an update is initiated the oldest version is erased, and the newest version is downloaded.

With the setup described above a system update becomes a really simple operation. On each update the systemd-sysupdate tool downloads a /usr/ file system partition, an accompanying Verity partition, a PKCS#7 signature partition, and drops them into the host's partition table (where they possibly replace the oldest version so far stored there). Then it downloads a unified kernel image and drops it into the EFI System Partition's /EFI/Linux (as per Boot Loader Specification; possibly erasing the oldest such file there). And that's already the whole update process: four files are downloaded from the server, unpacked and put in the most straightforward of ways into the partition table or file system. Unlike in other OS designs there's no mechanism required to explicitly switch to the newer version, the aforementioned systemd-boot logic will automatically pick the newest kernel once it is dropped in.
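The "erase the oldest, drop in the newest" rule of an A/B updater can be sketched in Python as below. This is an illustration of the selection logic only, not systemd-sysupdate's implementation. It assumes partition labels of the form `fooOS_<version>` with purely numeric dotted versions, and it represents a still-empty slot as `None`; all function names are made up.

```python
from typing import Optional


def parse_version(label: str) -> tuple[int, ...]:
    """Turn a label such as 'fooOS_0.7' into (0, 7) (assumes numeric dotted versions)."""
    return tuple(int(part) for part in label.rsplit("_", 1)[1].split("."))


def pick_slot_to_overwrite(slots: list[Optional[str]]) -> int:
    """Prefer an empty slot; otherwise pick the slot holding the oldest version."""
    for index, label in enumerate(slots):
        if label is None:
            return index
    return min(range(len(slots)), key=lambda i: parse_version(slots[i]))


def apply_update(slots: list[Optional[str]], new_label: str) -> list[Optional[str]]:
    """Return the slot table after dropping the new version into place."""
    updated = list(slots)
    updated[pick_slot_to_overwrite(slots)] = new_label
    return updated


slots = ["fooOS_0.7", None]                # state after first boot: B is still empty
slots = apply_update(slots, "fooOS_0.8")   # first update fills the empty slot
slots = apply_update(slots, "fooOS_0.9")   # second update replaces the oldest (0.7)
print(slots)                               # ['fooOS_0.9', 'fooOS_0.8']
```

Note that the currently running version is never the one overwritten as long as it is not the oldest, which is what keeps a known-good fallback around.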

Above we talked a lot about modularity, and how to put systems together as a combination of a host OS image, system extension images for the initrd and the host, portable service images and systemd-nspawn container images. I already emphasized that these image files are actually always the same: GPT disk images with partition definitions that match the Discoverable Partitions Specification. This comes in very handy when thinking about updating: we can use the exact same systemd-sysupdate tool for updating these other images as we use for the host image. The uniformity of the on-disk format allows us to update them uniformly too.

Boot Counting + Assessment

Automatic OS updates do not come without risks: if they happen automatically, and an update goes wrong, your system might be automatically updated into a brick. This of course is less than ideal. Hence it is essential to address this reasonably automatically. In my model, there's systemd's Automatic Boot Assessment for that. The mechanism is simple: whenever a new unified kernel image is dropped into the system it will be stored with a small integer counter value included in the filename. Whenever the unified kernel image is selected for booting by systemd-boot, the counter is decreased by one. Once the system booted up successfully (which is determined by userspace) the counter is removed from the file name (which indicates "this entry is known to work"). If the counter ever hits zero, this indicates that the system tried to boot the image a couple of times and failed each time, so the image is apparently "bad". In this case systemd-boot will not consider the kernel anymore, and revert to the next older one (that doesn't have a counter of zero).
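The counter handling can be sketched as pure filename manipulation. The Python below assumes the naming scheme used by systemd's Automatic Boot Assessment as I understand it, where `+LEFT` or `+LEFT-DONE` is inserted before the `.efi` suffix; the function names and the example filename are made up, and the real logic lives in the boot loader and in userspace services rather than in a script like this.

```python
import re

# <stem>+<tries left>[-<tries done>].efi
COUNTER_RE = re.compile(r"^(?P<stem>.+)\+(?P<left>\d+)(?:-(?P<done>\d+))?(?P<ext>\.efi)$")


def parse(filename: str):
    """Return (stem, tries_left, tries_done, ext), or None if there is no counter."""
    match = COUNTER_RE.match(filename)
    if match is None:
        return None
    return match["stem"], int(match["left"]), int(match["done"] or 0), match["ext"]


def on_boot_attempt(filename: str) -> str:
    """Rename performed when the entry is selected for booting."""
    parsed = parse(filename)
    if parsed is None or parsed[1] == 0:
        return filename  # known-good entry, or already marked bad
    stem, left, done, ext = parsed
    return f"{stem}+{left - 1}-{done + 1}{ext}"


def on_boot_success(filename: str) -> str:
    """Rename performed once userspace decides the boot was good."""
    parsed = parse(filename)
    return filename if parsed is None else parsed[0] + parsed[3]


def is_bad(filename: str) -> bool:
    parsed = parse(filename)
    return parsed is not None and parsed[1] == 0


name = "fooOS_0.8+3.efi"
name = on_boot_attempt(name)
print(name)                   # fooOS_0.8+2-1.efi
print(on_boot_success(name))  # fooOS_0.8.efi
```

Three failed attempts in a row would leave the file as `fooOS_0.8+0-3.efi`, for which `is_bad()` returns `True`, so the loader would fall back to the next older entry.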

By sticking the boot counter into the filename of the unified kernel we can directly attach this information to the kernel, and thus need not concern ourselves with cleaning up secondary information about the kernel when the kernel is removed. Updating with a tool like systemd-sysupdate remains a very simple operation hence: drop one old file, add one new file.

Picking the Newest Version

I already mentioned that systemd-boot automatically picks the newest unified kernel image to boot, by looking at the version encoded in the filename. This is done via a simple strverscmp() call (well, truth be told, it's a modified version of that call, different from the one implemented in libc, because real-life package managers use more complex rules for comparing versions these days, and hence it made sense to do that here too). The concept of having multiple entries of some resource in a directory, and picking the newest one automatically is a powerful concept, I think. It means adding/removing new versions is extremely easy (as we discussed above, in systemd-sysupdate context), and allows stateless determination of what to use.

If systemd-boot can do that, what about system extension images, portable service images, or systemd-nspawn container images that do not actually use systemd-boot as the entrypoint? All these tools actually implement the very same logic, but on the partition level: if multiple suitable /usr/ partitions exist, then the newest is determined by comparing the GPT partition label of them.

This is in a way the counterpart to the systemd-sysupdate update logic described above: we always need a way to determine which partition to actually use after the update took place, and this becomes very easy each time: enumerate possible entries, pick the newest as per the (modified) strverscmp() result.
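To show why a version-aware comparison is needed instead of a plain string comparison, here is a deliberately simplified Python sketch. It is not the modified strverscmp() that systemd uses: it merely splits names into digit and non-digit runs and compares digit runs numerically. The names `version_key` and `pick_newest` are hypothetical.

```python
import re


def version_key(name: str) -> list[tuple[int, int, str]]:
    """Sort key: digit runs compare numerically, everything else as text."""
    key = []
    for token in re.findall(r"\d+|\D+", name):
        if token.isdigit():
            key.append((1, int(token), ""))
        else:
            key.append((0, 0, token))
    return key


def pick_newest(candidates: list[str]) -> str:
    """Stateless selection: enumerate the entries and take the highest version."""
    return max(candidates, key=version_key)


entries = ["fooOS_0.7.efi", "fooOS_0.10.efi", "fooOS_0.9.efi"]
print(pick_newest(entries))  # fooOS_0.10.efi
print(max(entries))          # fooOS_0.9.efi (plain string comparison gets it wrong)
```

The same function works unchanged on GPT partition labels such as `fooOS_0.7`, which is the point of using one comparison rule across kernels and partitions.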

Home Directory Management

In my model the device's users and their home directories are managed by systemd-homed. This means they are relatively self-contained and can be migrated easily between devices. The numeric UID assignment for each user is done at the moment of login only, and the files in the home directory are mapped as needed via a uidmap mount. It also allows us to protect the data of each user individually with a credential that belongs to the user itself, i.e. instead of binding confidentiality of the user's data to the system-wide full-disk-encryption each user gets their own encrypted home directory where the user's authentication token (password, FIDO2 token, PKCS#11 token, recovery key…) is used as authentication and decryption key for the user's data. This brings a major improvement for security as it means the user's data is cryptographically inaccessible except when the user is actually logged in.

It also allows us to correct another major issue with traditional Linux systems: the way data encryption works during system suspend. Traditionally on Linux the disk encryption credentials (e.g. LUKS passphrase) are kept in memory also when the system is suspended. This is a bad choice for security, since many (most?) of us probably never turn off our laptops but suspend them instead. But if the decryption key is always present in unencrypted form during the suspended time, then it could potentially be read from there by a sufficiently equipped attacker.

By encrypting the user's home directory with the user's authentication token we can first safely "suspend" the home directory before going to the system suspend state (i.e. flush out the cryptographic keys needed to access it). This means any process currently accessing the home directory will be frozen for the time of the suspend, but that's expected anyway during a system suspend cycle. Why is this better than the status quo ante? In this model the home directory's cryptographic key material is erased during suspend, but it can be safely reacquired on resume, from system code. If the system is only encrypted as a whole however, then the system code itself couldn't reauthenticate the user, because it would be frozen too. By separating home directory encryption from the root file system encryption we can avoid this problem.

Partition Setup

So we discussed the organization of the partitions on OS images multiple times in the above, each time focusing on a specific aspect. Let's now summarize what this should look like all together.

In my model, the initial, shipped OS image should look roughly like this:

  • (1) A UEFI System Partition, with systemd-boot as boot loader and one unified kernel
  • (2) A /usr/ partition (version "A"), with a label fooOS_0.7 (under the assumption we called our project fooOS and the image version is 0.7).
  • (3) A Verity partition for the /usr/ partition (version "A"), with the same label
  • (4) A partition carrying the Verity root hash for the /usr/ partition (version "A"), along with a PKCS#7 signature of it, also with the same label

On first boot this is augmented by systemd-repart like this:

  • (5) A second /usr/ partition (version "B"), initially with a label _empty (which is the label systemd-sysupdate uses to mark partitions that currently carry no valid payload)
  • (6) A Verity partition for that (version "B"), similar to the above case, also labelled _empty
  • (7) And ditto a Verity root hash partition with a PKCS#7 signature (version "B"), also labelled _empty
  • (8) A root file system, encrypted and locked to the TPM2
  • (9) A home file system, integrity protected via a key also in TPM2 (encryption is unnecessary, since systemd-homed adds that on its own, and it's nice to avoid duplicate encryption)
  • (10) A swap partition, encrypted and locked to the TPM2

Then, on the first OS update the partitions 5, 6, 7 are filled with a new version of the OS (let's say 0.8) and thus get their label updated to fooOS_0.8. After a boot, this version is active.

On a subsequent update the three partitions fooOS_0.7 get wiped and replaced by fooOS_0.9 and so on.

On factory reset, the partitions 8, 9, 10 are deleted, so that systemd-repart recreates them, using a new set of cryptographic keys.
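The A/B rotation described above can be modelled in a few lines. This is only a sketch of the label bookkeeping, not of what systemd-sysupdate does internally: each "slot" stands for one set of three partitions (/usr/, Verity, signature), the label scheme is the fooOS_x.y one from the example, and the function names are made up.

```python
EMPTY = "_empty"  # label for slots that currently carry no valid payload


def version_key(label):
    """'fooOS_0.8' -> (0, 8); assumes the hypothetical fooOS_<x>.<y> scheme."""
    return tuple(int(part) for part in label.split("_", 1)[1].split("."))


def newest_slot(slots):
    """Index of the slot that gets booted: the filled one with the newest label."""
    filled = [i for i, label in enumerate(slots) if label != EMPTY]
    return max(filled, key=lambda i: version_key(slots[i]))


def apply_update(slots, new_label):
    """Write a new version into an empty slot, else into the oldest inactive one."""
    active = newest_slot(slots)
    empty = [i for i, label in enumerate(slots) if label == EMPTY]
    if empty:
        target = empty[0]
    else:
        candidates = [i for i in range(len(slots)) if i != active]
        target = min(candidates, key=lambda i: version_key(slots[i]))
    updated = list(slots)
    updated[target] = new_label  # the currently booted slot is never touched
    return updated
```

Starting from ["fooOS_0.7", "_empty"], an update to 0.8 fills the empty slot, and the following update to 0.9 replaces the 0.7 slot, which matches the sequence described in the text.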

Here's a graphic that hopefully illustrates the partition table from shipped image, through first boot, multiple update cycles and eventual factory reset:

Partitions Overview

Trust Chain

So let's summarize the intended chain of trust (for bare metal/VM boots) that ensures every piece of code in this model is signed and validated, and any system secret is locked to TPM2.

  1. First, firmware (or possibly shim) authenticates systemd-boot.

  2. Once systemd-boot picks a unified kernel image to boot, it is also authenticated by firmware/shim.

  3. The unified kernel image contains an initrd, which is the first userspace component that runs. It finds any system extensions passed into the initrd, and sets them up through Verity. The kernel will validate the Verity root hash signature of these system extension images against its usual keyring.

  4. The initrd also finds credentials passed in, then securely unlocks (which means: decrypts + authenticates) them with a secret from the TPM2 chip, locked to the kernel image itself.

  5. The kernel image also contains a kernel command line which contains a usrhash= option that pins the root hash of the /usr/ partition to use.

  6. The initrd then unlocks the encrypted root file system, with a secret bound to the TPM2 chip.

  7. The system then transitions into the main system, i.e. the combination of the Verity protected /usr/ and the encrypted root file system. It then activates two more encrypted (and/or integrity protected) volumes for /home/ and swap, also with a secret tied to the TPM2 chip.

Here's an attempt to illustrate the above graphically:

Trust Chain

This is the trust chain of the basic OS. Validation of system extension images, portable service images, systemd-nspawn container images always takes place the same way: the kernel validates these Verity images along with their PKCS#7 signatures against the kernel's keyring.

File System Choice

In the above I left the choice of file systems unspecified. For the immutable /usr/ partitions squashfs might be a good candidate, but any other that works nicely in a read-only fashion and generates reproducible results is a good choice, too. The home directories as managed by systemd-homed should certainly use btrfs, because it's the only general purpose file system supporting online grow and shrink, which systemd-homed can take advantage of to manage storage.

For the root file system btrfs is likely also the best idea. That's because we intend to use LUKS/dm-crypt underneath, which by default only provides confidentiality, not authenticity of the data (unless combined with dm-integrity). Since btrfs (unlike xfs/ext4) does full data checksumming it's probably the best choice here, since it means we don't have to use dm-integrity (which comes at a higher performance cost).

OS Installation vs. OS Instantiation

In the discussion above a lot of focus was put on setting up the OS and completing the partition layout and such on first boot. This means installing the OS becomes as simple as dd-ing (i.e. "streaming") the shipped disk image into the final HDD medium. Simple, isn't it?

Of course, such a scheme is just too simple for many setups in real life. Whenever multi-boot is required (i.e. co-installing an OS implementing this model with another unrelated one), dd-ing a disk image onto the HDD is going to overwrite user data that was supposed to be kept around.

In order to cover for this case, in my model, we'd use systemd-repart (again!) to allow streaming the source disk image into the target HDD in a smarter, additive way. The tool after all is purely additive: it will add in partitions or grow them if they are missing or too small. systemd-repart already has all the necessary provisions to not only create a partition on the target disk, but also copy blocks from a raw installer disk. An install operation would then become a two-step process: one invocation of systemd-repart that adds in the /usr/, its Verity and the signature partition to the target medium, populated with a copy of the same partition of the installer medium. And one invocation of bootctl that installs the systemd-boot boot loader in the ESP. (Well, there's one thing missing here: the unified OS kernel also needs to be dropped into the ESP. For now, this can be done with a simple cp call. In the long run, this should probably be something bootctl can do as well, if told so.)

So, with this we have a simple scheme to cover all bases: we can either just dd an image to disk, or we can stream an image onto an existing HDD, adding a couple of new partitions and files to the ESP.

Of course, in reality things are more complex than that even: there's a good chance that the existing ESP is simply too small to carry multiple unified kernels. In my model, the way to address this is by shipping two slightly different systemd-repart partition definition file sets: the ideal case when the ESP is large enough, and a fallback case, where it isn't and where we then add in an additional XBOOTLDR partition (as per the Discoverable Partitions Specification). In that mode the ESP carries the boot loader, but the unified kernels are stored in the XBOOTLDR partition. This scenario is not quite as simple as the XBOOTLDR-less scenario described first, but is equally well supported in the various tools. Note that systemd-repart can be told size constraints on the partitions it shall create or augment, thus to implement this scheme it's enough to invoke the tool with the fallback partition scheme if invocation with the ideal scheme fails.
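The "try the ideal scheme, fall back if it fails" logic could be wrapped roughly like this. Treat it as a sketch under assumptions: the definition directory names and the target device are made up, and the only systemd-repart switches used are --definitions= and --dry-run=, with a non-zero exit status taken to mean that the scheme did not fit.

```python
import subprocess


def repart_command(definitions_dir, target):
    """Build a systemd-repart invocation for one partition definition set."""
    return ["systemd-repart", f"--definitions={definitions_dir}",
            "--dry-run=no", target]


def install_partitions(target, schemes, run=subprocess.run):
    """Try each definition set in order and return the first one that succeeds.

    'run' is injectable so the logic can be exercised without touching a disk.
    """
    for definitions_dir in schemes:
        result = run(repart_command(definitions_dir, target))
        if result.returncode == 0:
            return definitions_dir
    raise RuntimeError(f"no partition scheme fits on {target}")
```

An installer would call something like install_partitions("/dev/sdX", ["/usr/lib/installer/ideal.repart.d", "/usr/lib/installer/fallback.repart.d"]), where both directory names are hypothetical. Keeping the order of schemes in a plain list leaves room for further fallbacks later.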

Either way: regardless of how the partitions, the boot loader and the unified kernels ended up on the system's hard disk, on first boot the code paths are the same again: systemd-repart will be called to augment the partition table with the root file system, and properly encrypt it, as was already discussed earlier here. This means: all cryptographic key material used for disk encryption is generated on first boot only, the installer phase does not encrypt anything.

Live Systems vs. Installer Systems vs. Installed Systems

Traditionally on Linux three types of systems were common: "installed" systems, i.e. that are stored on the main storage of the device and are the primary place people spend their time in; "installer" systems which are used to install them and whose job is to copy and set up the packages that make up the installed system; and "live" systems, which were a middle ground: a system that behaves like an installed system in most ways, but lives on removable media.

In my model I'd like to remove the distinction between these three concepts as much as possible: each of these three images should carry the exact same /usr/ file system, and should be suitable to be replicated the same way. Once installed the resulting image can also act as an installer for another system, and so on, creating a certain "viral" effect: if you have one image or installation it's automatically something you can replicate 1:1 with a simple systemd-repart invocation.

Building Images According to this Model

The above explains what the image should look like and how its first boot and update cycle will modify it. But this leaves one question unanswered: how to actually build the initial image for OS instances according to this model?

Note that there's nothing too special about the images following this model: they are ultimately just GPT disk images with Linux file systems, following the Discoverable Partitions Specification. This means you can use any set of tools of your choice that can put together GPT disk images for compliant images.

I personally would use mkosi for this purpose though. It's designed to generate compliant images, and has a rich toolset for SecureBoot and signed/Verity file systems already in place.

What is key here is that this model doesn't depart from RPM and dpkg, instead it builds on top of that: in this model they are excellent for putting together images on the build host, but deployment onto the runtime host does not involve individual packages.

I think one cannot overestimate the value traditional distributions bring, regarding security, integration and general polishing. The concepts I describe above are inherited from this, but depart from the idea that distribution packages are a runtime concept and make it a build-time concept instead.

Note that the above is pretty much independent from the underlying distribution.

Final Words

I have no illusions, general purpose distributions are not going to adopt this model as their default any time soon, and it's not even my goal that they do that. The above is my personal vision, and I don't expect people to buy into it 100%, and that's fine. However, what I am interested in is finding the overlaps, i.e. work with people who buy 50% into this vision, and share the components.

My goals here thus are to:

  1. Get distributions to move to a model where images like this can be built from the distribution easily. Specifically this means that distributions make their OS hermetic in /usr/.

  2. Find the overlaps, share components with other projects to revisit how distributions are put together. This is already happening, see systemd-tmpfiles and systemd-sysusers support in various distributions, but I think there's more to share.

  3. Make people interested in building actual real-world images based on general purpose distributions adhering to the model described above. I'd love a "GnomeBook" image with full trust properties, that is built from true Linux distros, such as Fedora or ArchLinux.

FAQ


  1. What about ostree? Doesn't ostree already deliver what this blog story describes?

    ostree is fine technology, but with respect to security and robustness properties it's not too interesting I think, because unlike image-based approaches it cannot really deliver integrity/robustness guarantees over the whole tree easily. To be able to trust an ostree setup you have to establish trust in the underlying file system first, and the complexity of the file system makes that challenging. To provide an effective offline-secure trust chain through the whole depth of the stack it is essential to cryptographically validate every single I/O operation. In an image-based model this is trivially easy, but in the ostree model it's not possible with current file system technology. Even if this is added in one way or another in the future (though I am not aware of anyone doing on-access file-based integrity that spans a whole hierarchy of files and is compatible with ostree's hardlink farm model), I think validation would still be at too high a level, since Linux file system developers made very clear their implementations are not robust to rogue images. (There's this stuff planned, but doing structural authentication ahead of time instead of on access makes the idea too weak, and I'd expect too slow, in my eyes.)

    With my design I want to deliver security guarantees similar to those of ChromeOS, but ostree is much weaker there, and I see no prospect of this changing. In a way ostree's integrity checks are similar to RPM's and enforced on download rather than on access. In the model I suggest above, it's always on access, and thus safe towards offline attacks (i.e. evil maid attacks). In today's world, I think offline security is absolutely necessary though.

    That said, ostree does have some benefits over the model described above: it naturally shares file system inodes if many of the modules/images involved share the same data. It's thus more space efficient on disk (and thus also in RAM/cache to some degree) by default. In my model it would be up to the image builders to minimize shipping overly redundant disk images, by making good use of suitably composable system extensions.

  2. What about configuration management?

    At first glance immutable systems and configuration management don't go that well together. However, do note that in the model I propose above the root file system with all its contents, including /etc/ and /var/, is actually writable and can be modified like on any other typical Linux distribution. The only exception is /usr/ where the immutable OS is hermetic. That means configuration management tools should work just fine in this model, up to the point where they are used to install additional RPM/dpkg packages, because that's something not allowed in the model above: packages need to be installed at image build time and thus on the image build host, not the runtime host.

  3. What about non-UEFI and non-TPM2 systems?

    The above is designed around the feature set of contemporary PCs, and this means UEFI and TPM2 being available (simply because the PC is pretty much defined by the Windows platform, and current versions of Windows require both).

    I think it's important to make the best of the features of today's PC hardware, and then find suitable fallbacks on more limited hardware. Specifically this means: if there's desire to implement something like this on non-UEFI or non-TPM2 hardware we should look for suitable fallbacks for the individual functionality, but generally try to add glue to the old systems so that conceptually they behave more like the new systems instead of the other way round. Or in other words: most of the above is not strictly tied to UEFI or TPM2, and for many cases there are already reasonable fallbacks in place for more limited systems. Of course, without TPM2 many of the security guarantees will be weakened.

  4. How would you name an OS built that way?

    I think a desktop OS built this way if it has the GNOME desktop should of course be called GnomeBook, to mimic the ChromeBook name. ;-)

    But in general, I'd call hermetic, adaptive, immutable OSes like this "particles".

How can you help?

  1. Help make Distributions Hermetic in /usr/!

    One of the core ideas of the approach described above is to make the OS hermetic in /usr/, i.e. make it carry a comprehensive description of what needs to be set up outside of it when instantiated. Specifically, this means that system users that are needed are declared in systemd-sysusers snippets, and skeleton files and directories are created via systemd-tmpfiles. Moreover additional partitions should be declared via systemd-repart drop-ins.

    At this point some distributions (such as Fedora) are (probably more by accident than on purpose) already mostly hermetic in /usr/, at least for the most basic parts of the OS. However, this is not complete: many daemons require specific resources to be set up in /var/ or /etc/ before they can work, and the relevant packages do not carry systemd-tmpfiles descriptions that add them if missing. So there are two ways you could help here: politically, it would be highly relevant to convince distributions that an OS that is hermetic in /usr/ is highly desirable and it's a worthy goal for packagers to get there. More specifically, it would be desirable if RPM/dpkg packages would ship with enough systemd-tmpfiles information so that configuration files the packages strictly need for operation are symlinked (or copied) from /usr/share/factory/ if they are missing (even better of course would be if packages from their upstream sources on would just work with an empty /etc/ and /var/, and create themselves what they need and default to good defaults in absence of configuration files).

    Note that distributions that adopted systemd-sysusers, systemd-tmpfiles and the /usr/ merge are already quite close to providing an OS that is hermetic in /usr/. These were the big, the major advancements: making the image fully hermetic should be less controversial – at least that's my guess.

    Also note that making the OS hermetic in /usr/ is not just useful in scenarios like the above. It also means that stuff like this and like this can work well.

  2. Fill in the gaps!

    I already mentioned a couple of missing bits and pieces in the implementation of the overall vision. In the systemd project we'd be delighted to review/merge any PRs that fill in the voids.

  3. Build your own OS like this!

    Of course, while we built all these building blocks and they have been adopted to various levels and various purposes in the various distributions, no one has so far built an OS that puts things together just like that. It would be excellent if we had communities that work on building images like what I propose above. I.e. if you want to work on making a secure GnomeBook as I suggest above a reality, that would be more than welcome.

    What could this look like specifically? Pick an existing distribution, write a set of mkosi descriptions plus some additional drop-in files, and then build this on some build infrastructure. While doing so, report the gaps, and help us address them.
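As a small illustration of the first item (shipping enough systemd-tmpfiles information for a package), here is a sketch of a helper a packager might use to render the needed lines. The helper and the foo.conf paths are hypothetical. The line format is the regular tmpfiles.d one: type C copies a file if the destination does not exist yet, and when the argument column is omitted the source defaults to the file of the same name below /usr/share/factory/.

```python
def tmpfiles_copy_lines(paths):
    """Render one tmpfiles.d 'C' line per configuration file.

    Columns are: type, path, mode, user, group, age. The trailing argument
    column is left out, so the copy source defaults to the same path below
    /usr/share/factory/.
    """
    return [f"C {path} - - - -" for path in sorted(paths)]


def render_snippet(package, paths):
    """Return (file name, file content) for a package's tmpfiles.d drop-in."""
    content = "\n".join(tmpfiles_copy_lines(paths)) + "\n"
    return f"/usr/lib/tmpfiles.d/{package}.conf", content
```

For a hypothetical package foo that needs /etc/foo.conf, this yields a drop-in /usr/lib/tmpfiles.d/foo.conf containing the single line "C /etc/foo.conf - - - -". The package would then ship its default configuration as /usr/share/factory/etc/foo.conf, so that an empty /etc/ gets populated on boot.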

Further Documentation of Used Components and Concepts

  1. systemd-tmpfiles
  2. systemd-sysusers
  3. systemd-boot
  4. systemd-stub
  5. systemd-sysext
  6. systemd-portabled, Portable Services Introduction
  7. systemd-repart
  8. systemd-nspawn
  9. systemd-sysupdate
  10. systemd-creds, System and Service Credentials
  11. systemd-homed
  12. Automatic Boot Assessment
  13. Boot Loader Specification
  14. Discoverable Partitions Specification
  15. Safely Building Images

Earlier Blog Stories Related to this Topic

  1. The Strange State of Authenticated Boot and Disk Encryption on Generic Linux Distributions
  2. The Wondrous World of Discoverable GPT Disk Images
  3. Unlocking LUKS2 volumes with TPM2, FIDO2, PKCS#11 Security Hardware on systemd 248
  4. Portable Services with systemd v239
  5. mkosi — A Tool for Generating OS Images

And that's all for now.

April 29, 2022

I've been working on kopper recently, which is a complementary project to zink. Just as zink implements OpenGL in terms of Vulkan, kopper seeks to implement the GL window system bindings - like EGL and GLX - in terms of the Vulkan WSI extensions. There are several benefits to doing this, which I'll get into in a future post, but today's story is really about libX11 and libxcb.

Yes, again.

One important GLX feature is the ability to set the swap interval, which is how you get tear-free rendering by syncing buffer swaps to the vertical retrace. A swap interval of 1 is the typical case, where an image update happens once per frame. The Vulkan way to do this is to set the swapchain present mode to FIFO, since FIFO updates are implicitly synced to vblank. Mesa's WSI code for X11 uses a swapchain management thread for FIFO present modes. This thread is started from inside the vulkan driver, and it only uses libxcb to talk to the X server. But libGL is a libX11 client library, so in this scenario there is always an "xlib thread" as well.

libX11 uses libxcb internally these days, because otherwise there would be no way to intermix xlib and xcb calls in the same process. But it does not use libxcb's reflection of the protocol: XGetGeometry does not call xcb_get_geometry, for example. Instead, libxcb has an API to allow other code to take over the write side of the display socket, with a callback mechanism to get it back when another xcb client issues a request. The callback function libX11 uses here is straightforward: lock the Display, flush out any internally buffered requests, and return the sequence number of the last request written. Both libraries need this sequence number for various reasons internally; xcb, for example, uses it to make sure replies go back to the thread that issued the request.

But "lock the Display" here really means call into a vtable in the Display struct. That vtable is filled in during XOpenDisplay, but the individual function pointers are only non-NULL if you called XInitThreads beforehand. And if you're libGL, you have no way to enforce that: your public-facing API operates on a Display that was already created.

So now we see the race. The queue management thread calls into libxcb while the main thread is somewhere inside libX11. Since libX11 has taken the socket, the xcb thread runs the release callback. Since the Display was not made thread-safe at XOpenDisplay time, the release callback does not block, so the xlib thread's work won't be correctly accounted. If you're lucky the two sides will at least write to the socket atomically with respect to each other, but at this point they have diverging opinions about the request sequence numbering, and it's a matter of time until you crash.

It turns out kopper makes this really easy to hit. Like "resize a glxgears window" easy. However, this isn't just a kopper issue: this race exists for every program that uses xcb on a not-necessarily-thread-safe Display. The only reasonable fix is for libX11 to just always be thread-safe.

So now, it is.

April 26, 2022

I recently blogged about how to run a volatile systemd-nspawn container from your host's /usr/ tree, for quickly testing stuff in your host environment, sharing your home directory, but all that without making a single modification to your host, and on an isolated node.

The one-liner discussed in that blog story is great for testing during system software development. Let's have a look at another systemd tool that I regularly use to test things during systemd development, in a relatively safe environment, but still taking full benefit of my host's setup.

For a while now, systemd has been shipping with a simple component called systemd-sysext. Its primary use case goes something like this: on one hand, operating systems with immutable /usr/ hierarchies are fantastic for security, robustness, updating and simplicity, but on the other hand not being able to quickly add stuff to /usr/ is just annoying.

systemd-sysext is supposed to bridge this contradiction: when invoked it will merge a bunch of "system extension" images into /usr/ (and /opt/ as a matter of fact) through the use of read-only overlayfs, making all files shipped in the image instantly and atomically appear in /usr/ during runtime, as if they always had been there. Now, let's say you are building your locked down OS, with an immutable /usr/ tree, and it comes without the ability to log in, without debugging tools, without anything you want and need when trying to debug and fix something in the system. With systemd-sysext you could use a system extension image that contains all this, drop it into the system, and activate it with systemd-sysext so that it genuinely extends the host system.

(There are many other use cases for this tool. For example, you could build systems that at their base use a generic image, but by installing one or more system extensions get extended with additional, more specific functionality, or drivers, or similar. The tool is generic, use it for whatever you want, but for now let's not get lost in listing all the possibilities.)

What's particularly nice about the tool is that it supports automatically discovered dm-verity images, with signatures and everything. So you can even do this in a fully authenticated, measured, safe way. But I am digressing…

Now that we (hopefully) have a rough understanding of what systemd-sysext is and does, let's discuss how specifically we can use this in the context of system software development, to safely use and test bleeding edge development code (built freshly from your project's build tree) in your host OS without having to risk that the host OS is corrupted or becomes unbootable by stuff that didn't quite yet work the way it was envisioned:

The images systemd-sysext merges into /usr/ can be of two kinds: disk images with a file system/verity/signature, or simple, plain directory trees. To make these images available to the tool, they can be placed or symlinked into /usr/lib/extensions/, /var/lib/extensions/, /run/extensions/ (and a bunch of others). So if we now install our freshly built development software into a subdirectory of one of those paths, then that's entirely sufficient to make it a valid system extension image in the sense of systemd-sysext, and it can thus be merged into /usr/ to try it out.

To be more specific: when I develop systemd itself, here's what I do regularly, to see how my new development version would behave on my host system. As preparation I checked out the systemd development git tree first of course, hacked around in it a bit, then built it with meson/ninja. And now I want to test what I just built:

sudo DESTDIR=/run/extensions/systemd-test meson install -C build --quiet --no-rebuild &&
        sudo systemd-sysext refresh --force

Explanation: first, we'll install my current build tree as a system extension into /run/extensions/systemd-test/. And then we apply it to the host via the systemd-sysext refresh command. This command will search for all installed system extension images in the aforementioned directories, then unmount (i.e. "unmerge") any previously merged dirs from /usr/ and then freshly mount (i.e. "merge") the new set of system extensions on top of /usr/. And just like that, I have installed my development tree of systemd into the host OS, and all that without actually modifying/replacing even a single file on the host at all. Nothing here actually hit the disk!

Note that all this works on any system really, it is not necessary that the underlying OS even is designed with immutability in mind. Just because the tool was developed with immutable systems in mind it doesn't mean you couldn't use it on traditional systems where /usr/ is mutable as well. In fact, my development box actually runs regular Fedora, i.e. is RPM-based and thus has a mutable /usr/ tree. As long as system extensions are applied the whole of /usr/ becomes read-only though.

Once I am done testing, when I want to revert to how things were without the image installed, it is sufficient to call:

sudo systemd-sysext unmerge

And there you go, all files my development tree generated are gone again, and the host system is as it was before (and /usr/ mutable again, in case one is on a traditional Linux distribution).

Also note that a reboot (regardless if a clean one or an abnormal shutdown) will undo the whole thing automatically, since we installed our build tree into /run/ after all, i.e. a tmpfs instance that is flushed on boot. And given that the overlayfs merge is a runtime thing, too, the whole operation was executed without any persistence. Isn't that great?

(You might wonder why I specified --force on the systemd-sysext refresh line earlier. That's because systemd-sysext actually does some minimal version compatibility checks when applying system extension images. For that it will compare the host's /etc/os-release file with the extension's /usr/lib/extension-release.d/extension-release.<name> file, and refuse operation if the image is not actually built for the host OS version. Here we don't want to bother with dropping that file in there, we know already that the extension image is compatible with the host, as we just built it on it. --force allows us to skip the version check.)
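To give an idea of what such a compatibility check amounts to, here is a simplified sketch. It is not systemd-sysext's actual code, and the real tool handles more cases. The sketch assumes the rule that the ID= fields must match, and that SYSEXT_LEVEL= is compared when both sides define it, with VERSION_ID= compared otherwise. The function names are made up.

```python
def parse_os_release(text):
    """Parse KEY=VALUE lines as found in os-release style files."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        fields[key] = value.strip().strip("\"'")  # drop optional quoting
    return fields


def extension_matches_host(host, ext):
    """Simplified version check between host os-release and extension-release."""
    if ext.get("ID") != host.get("ID"):
        return False
    if "SYSEXT_LEVEL" in ext and "SYSEXT_LEVEL" in host:
        return ext["SYSEXT_LEVEL"] == host["SYSEXT_LEVEL"]
    return ext.get("VERSION_ID") == host.get("VERSION_ID")
```

An extension tree installed via DESTDIR as above carries no extension-release file at all, so a check along these lines has nothing to compare against, which is why the --force switch is needed there.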

You might wonder: what about the combination of the idea from the previous blog story (regarding running containers off the host /usr/ tree) with system extensions? Glad you asked. Right now we have no support for this, but it's high on our TODO list (patches welcome, of course!). I.e. a new switch for systemd-nspawn called --system-extension= that would allow merging one or more such extensions into the container tree booted would be stellar. With that, with a single command I could run a container off my host OS but with a development version of systemd dropped in, all without any persistence. How awesome would that be?

(Oh, and in case you wonder, all of this only works with distributions that have completed the /usr/ merge. On legacy distributions that didn't do that and still place parts of /usr/ all over the hierarchy the above won't work, since merging /usr/ trees via overlayfs is pretty pointless if the OS is not hermetic in /usr/.)

And that's all for now. Happy hacking!

April 24, 2022

The title might be a bit hyperbolic here, but we’re indeed exploring a first step in that direction with radv. The impetus here is the ExecuteIndirect command in Direct3D 12 and some games that are using it in non-trivial ways (e.g. Halo Infinite).

ExecuteIndirect can be seen as an extension of what we have in Vulkan with vkCmdDrawIndirectCount. It adds extra capabilities. To support that with vkd3d-proton we need the following indirect Vulkan capabilities:

  1. Binding vertex buffers.
  2. Binding index buffers.
  3. Updating push constants.

This functionality happens to be a subset of VK_NV_device_generated_commands and hence I’ve been working on implementing a subset of that extension on radv. Unfortunately, we can’t really give the firmware an “extended indirect draw call” and execute stuff, so we’re stuck generating command buffers on the GPU.

The way the extension works, the application specifies a command “signature” on the CPU, which specifies that for each draw call the application is going to update A, B and C. Then, at runtime, the application provides a buffer containing the data for A, B and C for each draw call. The driver then processes that into a command buffer and executes it as a secondary command buffer.

The workflow is then as follows:

  1. The application (or vkd3d-proton) provides the command signature to the driver which creates an object out of it.
  2. The application queries how big a command buffer (“preprocess buffer”) of $n$ draws with that signature would be.
  3. The application allocates the preprocess buffer.
  4. The application does its stuff to generate some commands.
  5. The application calls vkCmdPreprocessGeneratedCommandsNV which converts the application buffer into a command buffer (in the preprocess buffer)
  6. The application calls vkCmdExecuteGeneratedCommandsNV to execute the generated command buffer.
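As a toy model of the signature idea, the following Python sketch treats a signature as an ordered list of tokens and the application buffer as a sequence of fixed-size records, one per draw. The token names and byte sizes are hypothetical placeholders chosen for illustration; the real sizes come from the Vulkan structures.

```python
from dataclasses import dataclass

# Hypothetical payload sizes in bytes, for illustration only.
TOKEN_SIZES = {
    "vertex_buffer": 16,
    "index_buffer": 16,
    "push_constant": 4,
    "draw": 16,
}


@dataclass(frozen=True)
class CommandSignature:
    tokens: tuple  # what every draw in the stream updates, in order

    @property
    def stream_stride(self):
        """Bytes of application data consumed per draw."""
        return sum(TOKEN_SIZES[t] for t in self.tokens)


def split_stream(signature, stream):
    """Cut the application buffer into one fixed-size record per draw."""
    stride = signature.stream_stride
    if len(stream) % stride:
        raise ValueError("stream is not a whole number of draw records")
    return [stream[i:i + stride] for i in range(0, len(stream), stride)]
```

Because every record has the same layout, each draw can be located by index alone, which is what later makes one shader invocation per draw possible.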

What goes into a draw in radv

When the application triggers a draw command in Vulkan, the driver generates GPU commands to do the following:

  1. Flush caches if needed
  2. Set some registers.
  3. Trigger the draw.

Of course we skip any of these steps (or parts of them) when they’re redundant. The majority of the complexity is in the register state we have to set. There are multiple parts here:

  1. Fixed function state:

    1. subpass attachments
    2. static/dynamic state (viewports, scissors, etc.)
    3. index buffers
    4. some derived state from the shaders (some tessellation stuff, fragment shader export types, varyings, etc.)
  2. shaders (start address, number of registers, builtins used)
  3. user SGPRs (i.e. registers that are available at the start of a shader invocation)

Overall, most of the pipeline state is fairly easy to emit: we just precompute it on pipeline creation and memcpy it over if we switch shaders. The most difficult part is probably the user SGPRs, because they are derived from a lot of the remaining API state. Note that the list above doesn’t include push constants, descriptor sets or vertex buffers. The driver computes all of these, and generates the user SGPR data from that.

Descriptor sets in radv are just a piece of GPU memory, and radv binds a descriptor set by providing the shader with a pointer to that GPU memory in a user SGPR. Similarly, we have no hardware support for vertex buffers, so radv generates a push descriptor set containing internal texel buffers and then provides a user SGPR with a pointer to that descriptor set.

For push constants, radv has two modes: a portion of the data can be passed in user SGPRs directly, but sometimes a chunk of memory gets allocated and then a pointer to that memory is provided in a user SGPR. This fallback exists because the hardware doesn’t always have enough user SGPRs to fit all the data.

On Vega and later there are 32 user SGPRs, and on earlier GCN GPUs there are 16. This needs to fit pointers to all the referenced descriptor sets (including internal ones like the one for vertex buffers), push constants, builtins like the start vertex and start instance etc. To get the best performance here, radv determines a mapping of API object to user SGPR at shader compile time and then at draw time radv uses that mapping to write user SGPRs.

This results in some interesting behavior, like switching pipelines does cause the driver to update all the user SGPRs because the mapping might have changed.
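To illustrate the kind of mapping described above, here is a simplified Python sketch of assigning API objects to user SGPRs under a fixed budget. The names, and the all-or-nothing choice between inline push constants and a pointer to memory, are assumptions made for the example; radv's real allocation is more fine-grained.

```python
def assign_user_sgprs(pointers, builtins, inline_push_dwords, budget):
    """Return (mapping, push_constants_in_memory) for one shader.

    pointers: names of descriptor sets (one 32-bit pointer each)
    builtins: names of builtins such as start vertex / start instance
    budget:   number of user SGPRs (e.g. 16 or 32)
    """
    mapping = {}
    next_reg = 0

    def take(name, count=1):
        nonlocal next_reg
        if next_reg + count > budget:
            raise ValueError(f"out of user SGPRs while placing {name}")
        mapping[name] = list(range(next_reg, next_reg + count))
        next_reg += count

    for name in list(pointers) + list(builtins):
        take(name)

    if inline_push_dwords == 0:
        return mapping, False
    if next_reg + inline_push_dwords <= budget:
        # Everything fits: pass push constants directly in registers.
        take("push_constants_inline", inline_push_dwords)
        return mapping, False
    # Fallback: a single pointer to a chunk of memory.
    take("push_constants_pointer")
    return mapping, True
```

Two shaders with different inputs get different mappings from this function, which is why a pipeline switch forces all user SGPRs to be rewritten.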

Furthermore, as an interesting performance hack radv allocates all upload buffers (for the push constant and push descriptor sets), shaders and descriptor pools in a single 4 GiB region of memory so that we can pass only the bottom 32 bits of all the pointers in a user SGPR, getting us farther with the limited number of user SGPRs. We will see later how that makes things difficult for us.
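The pointer trick can be sketched in a few lines of Python. The base address below is a made-up value, and the assumption is that the region is 4 GiB aligned so that the upper half of every pointer in it is the same known constant.

```python
REGION_SIZE = 1 << 32
# Hypothetical base address; the low 32 bits are zero (4 GiB aligned).
REGION_BASE = 0x7F00 << 32


def to_user_sgpr(address):
    """Keep only the bottom 32 bits of an address inside the region."""
    if not REGION_BASE <= address < REGION_BASE + REGION_SIZE:
        raise ValueError("address is outside the 32-bit addressable region")
    return address & 0xFFFF_FFFF


def from_user_sgpr(low32):
    """Rebuild the full address from the known upper half."""
    return REGION_BASE | low32
```

Each pointer then costs one user SGPR instead of two, at the price that everything referenced this way must live inside that one region.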

Generating a command buffer on the GPU

As shown above radv has a bunch of complexity around state for draw calls and if we start generating command buffers on the GPU that risks copying a significant part of that complexity to a shader. Luckily ExecuteIndirect and VK_NV_device_generated_commands have some limitations that make this easier. The app can only change

  1. vertex buffers
  2. index buffers
  3. push constants

VK_NV_device_generated_commands also allows changing shaders and the rotation winding of what is considered a primitive backface but we’ve chosen to ignore that for now since it isn’t needed for ExecuteIndirect (though especially the shader switching could be useful for an application).

The second curveball is that the buffer the application provides needs to provide the same set of data for every draw call. This avoids having to do a lot of serial processing to figure out what the previous state was, which allows processing every draw command in a separate shader invocation. Unfortunately we’re still a bit dependent on the old state that is bound before the indirect command buffer execution:

  1. The previously bound index buffer
  2. Previously bound vertex buffers.
  3. Previously bound push constants.

Remember that for vertex buffers and push constants we may put them in a piece of memory. That piece of memory needs to contain all the vertex buffers/push constants for that draw call, so even if we modify only one of them, we have to copy the rest over. The index buffer is different: in the draw packets for the GPU there is a field that is derived from the index buffer size.

So in vkCmdPreprocessGeneratedCommandsNV radv partitions the preprocess buffer into a command buffer and an upload buffer (for the vertex buffers & push constants), both with a fixed stride based on the command signature. Then it launches a shader which processes a draw call in each invocation:

   if (shader used vertex buffers && we change a vertex buffer) {
      copy all vertex buffers
      update the changed vertex buffers
      emit a new vertex descriptor set pointer
   }
   if (we change a push constant) {
      if (we change a push constant in memory) {
         copy all push constants
         update changed push constants
         emit a new push constant pointer
      }
      emit all changed inline push constants into user SGPRs
   }
   if (we change the index buffer) {
      emit new index buffers
   }
   emit a draw command
   insert NOPs up to the stride

In vkCmdExecuteGeneratedCommandsNV radv uses the internal equivalent of vkCmdExecuteCommands to execute as if the generated command buffer is a secondary command buffer.
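The fixed-stride partitioning can be illustrated with a small Python sketch. The ordering of the two regions (commands first, uploads after) and the use of 0 as a stand-in NOP dword are assumptions for the example, not necessarily what radv does.

```python
def preprocess_layout(draw_count, cmd_stride, upload_stride):
    """Return ([(cmd_offset, upload_offset)] per draw, total_size) in bytes."""
    upload_base = draw_count * cmd_stride  # upload region follows the commands
    total = upload_base + draw_count * upload_stride
    slots = [
        (i * cmd_stride, upload_base + i * upload_stride)
        for i in range(draw_count)
    ]
    return slots, total


def pad_to_stride(dwords, cmd_stride, nop=0):
    """Pad one draw's command dwords so it fills exactly its slot."""
    capacity = cmd_stride // 4
    if len(dwords) > capacity:
        raise ValueError("commands do not fit in the per-draw stride")
    return dwords + [nop] * (capacity - len(dwords))
```

Since each invocation only needs its draw index to find its slots, no invocation depends on how many commands another one emitted.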


Of course one does not simply move part of the driver to GPU shaders without any challenges. In fact we have a whole bunch of them. Some of them just need a bunch of work to solve, some need some extension specification tweaking and some are hard to solve without significant tradeoffs.

Code maintainability

A big problem is that the code needed for the limited subset of state that is supported is now in 3 places:

  1. The traditional CPU path
  2. For determining how large the preprocess buffer needs to be
  3. For the shader called in vkCmdPreprocessGeneratedCommandsNV to build the preprocess buffer.

Having the same functionality in multiple places is a recipe for things going out of sync. This makes it harder to change this code and much easier for bugs to sneak in. This can be mitigated with a lot of testing, but a bunch of GPU work gets complicated quickly. (e.g. the preprocess buffer being larger than needed still results in correct results, getting a second opinion from the shader to check adds significant complexity).

nir_builder gets old quickly

In the driver at the moment we have no good high level shader compiler. As a result a lot of the internal helper shaders are written using the nir_builder helper to generate NIR, the intermediate representation of the shader compiler. Example fragment:

   nir_push_loop(b);
      nir_ssa_def *curr_offset = nir_load_var(b, offset);

      nir_push_if(b, nir_ieq(b, curr_offset, cmd_buf_size));
         nir_jump(b, nir_jump_break);
      nir_pop_if(b, NULL);

      nir_ssa_def *packet_size = nir_isub(b, cmd_buf_size, curr_offset);
      packet_size = nir_umin(b, packet_size, nir_imm_int(b, 0x3ffc * 4));

      nir_ssa_def *len = nir_ushr_imm(b, packet_size, 2);
      len = nir_iadd_imm(b, len, -2);
      nir_ssa_def *packet = nir_pkt3(b, PKT3_NOP, len);

      nir_store_ssbo(b, packet, dst_buf, curr_offset, .write_mask = 0x1,
                     .access = ACCESS_NON_READABLE, .align_mul = 4);
      nir_store_var(b, offset, nir_iadd(b, curr_offset, packet_size), 0x1);
   nir_pop_loop(b, NULL);

It is clear that this all gets very verbose very quickly. This is somewhat fine as long as all the internal shaders are tiny. However, between this and raytracing our internal shaders are getting significantly bigger and the verbosity really becomes a problem.
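For comparison, here is roughly the same padding loop written as plain Python. The packet header layout and the PKT3_NOP opcode value are assumptions based on my reading of the AMD packet definitions; treat them as illustrative.

```python
PKT3_NOP = 0x10              # assumed opcode value
MAX_NOP_BYTES = 0x3ffc * 4   # same cap as in the NIR fragment


def pkt3(opcode, count):
    """Assumed type-3 header: type in bits 31:30, count in 29:16, opcode in 15:8."""
    return (3 << 30) | ((count & 0x3FFF) << 16) | ((opcode & 0xFF) << 8)


def pad_with_nops(curr_offset, cmd_buf_size):
    """Return [(byte_offset, header)] for NOP packets covering the remainder."""
    if curr_offset > cmd_buf_size or (cmd_buf_size - curr_offset) % 4:
        raise ValueError("offset must be dword aligned and within the buffer")
    packets = []
    while curr_offset != cmd_buf_size:
        packet_size = min(cmd_buf_size - curr_offset, MAX_NOP_BYTES)
        length = (packet_size >> 2) - 2  # dwords minus 2, as in the fragment
        packets.append((curr_offset, pkt3(PKT3_NOP, length)))
        curr_offset += packet_size
    return packets
```

The logic is the same handful of steps; the nir_builder version just has to spell out every load, store and arithmetic operation explicitly.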

Interesting things to explore here are to use glslang, or even to try writing our shaders in OpenCL C and then compiling it to SPIR-V at build time. The challenge there is that radv is built on a diverse set of platforms (including Windows, Android and desktop Linux) which can make significant dependencies a struggle.


Ideally your GPU work is very suitable for pipelining to avoid synchronization cost on the GPU. If we generate the command buffer and then execute it we need to have a full GPU sync point in between, which can get very expensive as it waits until the GPU is idle. To avoid this VK_NV_device_generated_commands has added the separate vkCmdPreprocessGeneratedCommandsNV command, so that the application can batch up a bunch of work before incurring the cost of a sync point.

However, in radv we have to do the command buffer generation in vkCmdExecuteGeneratedCommandsNV as our command buffer generation depends on some of the other state that is bound, but might not be bound yet when the application calls vkCmdPreprocessGeneratedCommandsNV.

Which brings up a slight spec problem: The extension specification doesn’t specify whether the application is allowed to execute vkCmdExecuteGeneratedCommandsNV on multiple queues concurrently with the same preprocess buffer. If all the writing of that happens in vkCmdPreprocessGeneratedCommandsNV that would result in correct behavior, but if the writing happens in vkCmdExecuteGeneratedCommandsNV this results in a race condition.

The 32-bit pointers

Remember that radv only passes the bottom 32 bits of some pointers around. As a result the application needs to allocate the preprocess buffer in that 4-GiB range. This in itself is easy: just add a new memory type and require it for this usage. However, the devil is in the details.

For example, what should we do for memory budget queries? That is per memory heap, not memory type. However, a new memory heap does not make sense, as the memory is also still subject to physical availability of VRAM, not only address space.

Furthermore, this 4-GiB region is more constrained than other memory, so it would be a shame if applications start allocating random stuff in it. If we look at the existing usage for a pretty heavy game (HZD) we get about

  1. 40 MiB of command buffers + upload buffers
  2. 200 MiB of descriptor pools
  3. 400 MiB of shaders
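A quick back-of-the-envelope calculation with the figures above, written as Python for convenience (the dictionary keys are just labels):

```python
MIB = 1 << 20
REGION_BYTES = 4 << 30  # the 4 GiB region

usage_mib = {
    "command + upload buffers": 40,
    "descriptor pools": 200,
    "shaders": 400,
}

used_mib = sum(usage_mib.values())          # 640 MiB
free_mib = REGION_BYTES // MIB - used_mib   # 3456 MiB
fraction_used = used_mib * MIB / REGION_BYTES
print(f"{used_mib} MiB used, {free_mib} MiB free, {fraction_used:.1%} of the region")
```

That is roughly 16% of the region in use for this game.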

So typically we have a lot of room available. Ideally the ordering of memory types would get an application to prefer another memory type when we do not need this special region. However, memory object caching poses a big risk here: Would you choose a memory object in the cache that you can reuse/suballocate (potentially in that limited region), or allocate new for a “better” memory type?

Luckily we have not seen that risk play out, but the only real tested user at this point has been vkd3d-proton.

Secondary command buffers

When executing the generated command buffer radv does that the same way as calling a secondary command buffer. This has a significant limitation: A secondary command buffer cannot call a secondary command buffer on the hardware. As a result the current implementation has a problem if vkCmdExecuteGeneratedCommandsNV gets called on a secondary command buffer.

It is possible to work around this. An example would be to split the secondary command buffer into 3 parts: pre, generated, post. However, that needs a bunch of refactoring to allow multiple internal command buffers per API command buffer.
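The splitting idea could look something like the following Python sketch, which operates on a list of command names. The marker string and the function are hypothetical; this only shows the bookkeeping, not how radv would implement it.

```python
def split_secondary(commands, marker="execute_generated"):
    """Split recorded commands into chunks around generated-command executions.

    Returns a list of (kind, commands) pairs, where kind is "recorded"
    or "generated". Each pair would become its own internal command
    buffer, called directly from the primary.
    """
    parts, current = [], []
    for cmd in commands:
        if cmd == marker:
            if current:
                parts.append(("recorded", current))
                current = []
            parts.append(("generated", [cmd]))
        else:
            current.append(cmd)
    if current:
        parts.append(("recorded", current))
    return parts
```

With a single generated-commands call this yields exactly the pre, generated and post parts, and none of them has to call another secondary command buffer.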

Where to go next

Don’t expect this upstream very quickly. The main reason for exploring this in radv is ExecuteIndirect support for Halo Infinite, and after some recent updates we’re back into GPU hang limbo with radv/vkd3d-proton there. So while we’re solving that I’m holding off on upstreaming in case the hangs are caused by the implementation of this extension.

Furthermore, this is only a partial implementation of the extension anyways, with a fair number of limitations that we’d ideally eliminate before fully exposing this extension.

April 20, 2022

Let Your Memes Be Dreams

With Mesa 22.1 RC1 firmly out the door, most eyes have turned towards Mesa 22.2.

But not all eyes.

No, while most expected me to be rocketing off towards the next shiny feature, one ticket caught my eye:

Mesa 22.1rc1: Zink on Windows doesn’t work even simple wglgears app fails..

Sadly, I don’t support Windows. I don’t have a test machine to run it, and I don’t even have a VM I could spin up to run Lavapipe. I knew that Kopper was going to cause problems with other frontends, but I didn’t know how many other frontends were actually being used.

The answer was not zero, unfortunately. Plenty of users were enjoying the slow, software driver speed of Zink on Windows to spin those gears, and I had just crushed their dreams.

As I had no plans to change anything here, it would take a new hero to set things right.

The Hero We Deserve

Who here loves X-Plane?

I love X-Plane. It’s my favorite flight simulator. If I could, I’d play it all day every day. And do you know who my favorite X-Plane developer is?

Friend of the blog and part-time Zink developer, Sidney Just.

Some of you might know him from his extensive collection of artisanal blog posts. Some might have seen his work enabling Vulkan<->OpenGL interop in Mesa on Windows.

But did you know that Sid’s latest project is much more groundbreaking than just bumping Zink’s supported extension count far beyond the reach of every other driver?

What if I told you that this image


is Zink running wglgears on a NVIDIA 2070 GPU on Windows at full speed? No software-copy scanout. Just Kopper.

Full Support: Windows Ultimate Home Professional Edition

Over the past couple days, Sid’s done the esoteric work of hammering out WSI support for Zink on Windows, making us the first hardware-accelerated, GL 4.6-capable Mesa driver to run natively on Windows.

Don’t believe me?

Recognize a little Aztec Ruins action from GFXBench?


The results are about what we’d expect of an app I’ve literally never run myself:





Not too bad at all!

In Summary

I think we can safely say that Sid has managed to fix the original bug. Thanks, Sid!

But why is an X-Plane developer working on Zink?

The man himself has this to say on the topic:

X-Plane has traditionally been using OpenGL directly for all of its rendering needs. As a result, for years our plugin SDK has directly exposed the game’s OpenGL context to third party plugins, which have used it to render custom avionic screens and GUI elements. When we finally did the switch to Vulkan and Metal in 2020, one of the big issues we faced was how to deal with plugins. Our solution so far has been to rely on native Vulkan/OpenGL driver interop via extensions, which has mostly worked and allowed us to ship with modern backends.

Unfortunately this puts us at the mercy of the driver to provide good interop. Sadly on some platforms, this just isn’t available at all. On others, the drivers are broken leading to artifacts when mixing Vulkan and GL rendering. To date, our solution has been to just shrug it off and hope for better drivers. X-Plane plugins make use of compatibility profile GL features, as well as core profile features, depending on the author’s skill, so libraries like ANGLE were not an option for us.

This is where Zink comes in for us: Being a real GL driver, it has support for all of the features that we need. Being open source also means that any issues that we do discover are much easier to fix ourselves. We’ve made some progress including Zink into the next major version of X-Plane, X-Plane 12, and it’s looking very promising so far. Our hope is to ship X-Plane 12 with Zink as the GL backend for plugins and leave driver interop issues in the past.

The roots of this interest can also be seen in his blog post from last year where he touches on the future of GL plugin support.


Big Triangle’s definitely paying attention now.

And if any of my readers think this work is cool, go buy yourself a copy of X-Plane to say thanks for contributing back to open source.

April 15, 2022

This article is part of a series on how to set up a bare-metal CI system for Linux driver development. Here are the different articles so far:

  • Part 1: The high-level view of the whole CI system, and how to fully control test machines remotely (power on, OS to boot, keyboard/screen emulation using a serial console);
  • Part 2: A comparison of the different ways to generate the rootfs of your test environment, and introducing the boot2container project;
  • Part 3: Analysis of the requirements for the CI gateway, catching regressions before deployment, easy roll-back, and netbooting the CI gateway securely over the internet.

In this article, we will finally focus on generating the rootfs/container image of the CI Gateway in a way that enables live patching the system without always needing to reboot.

This work is sponsored by the Valve Corporation.

Introduction: The impact of updates

System updates are a necessary evil for any internet-facing server, unless you want your system to become part of a botnet. This is especially true for CI systems since they let people on the internet run code on machines, often leading to unfair use such as cryptomining (this one is hard to avoid though)!

The problem with system updates is not the 2 or 3 minutes of downtime that it takes to reboot, it is that we cannot reboot while any CI job is running. Scheduling a reboot thus first requires us to stop accepting new jobs, wait for the current ones to finish, then finally reboot. This solution may be acceptable if your jobs take ~30 minutes, but what if they last 6h? A reboot suddenly gets close to a typical 8h work day, and we definitely want to have someone looking over the reboot sequence so they can revert to a previous boot configuration if the new one failed.
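To put numbers on this, here is a tiny Python sketch of the drain-then-reboot sequence. It assumes the running jobs execute in parallel, so the wait is set by the longest remaining job; the function name and the 3 minute default are illustrative.

```python
def minutes_until_rebooted(remaining_job_minutes, reboot_minutes=3):
    """Stop accepting jobs, wait for the longest running one, then reboot."""
    drain = max(remaining_job_minutes, default=0)
    return drain + reboot_minutes


print(minutes_until_rebooted([30, 12]))   # short jobs
print(minutes_until_rebooted([6 * 60]))   # one job that just started a 6h run
```

The reboot itself is negligible; the drain time dominates as soon as jobs run for hours.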

This problem may be addressed in a cloud environment by live-migrating services/containers/VMs from a non-updated host to an updated one. This is unfortunately a lot more complex to pull off for a bare-metal CI without having a second CI gateway and designing synchronization systems/hardware to arbitrate access to the test machines' power/serial consoles/boot configuration.

So, while we cannot always avoid the need to drain the CI jobs before rebooting, what we can do is reduce the cases in which we need to perform this action. Unfortunately, containers have been designed with atomic updates in mind (this is why we want to use them), but that means that trivial operations such as adding an ssh key, a Wireguard peer, or updating a firewall rule will require a reboot. A hacky solution may be for the admins to update the infra container, then log in to the different CI gateways and manually reproduce the changes they have done in the new container. These changes would be lost at the next reboot, but this is not a problem since the CI gateway would use the latest container when rebooting, which already contains the updates. While possible, this solution is error-prone and not testable ahead of time, which is against the requirements for the gateway we laid out in Part 3.

Live patching containers

An improvement to live-updating containers by hand would be to use tools such as Ansible, Salt, or even Puppet to manage and deploy non-critical services and configuration. This would enable live-updating the currently-running container but would need to be run after every reboot. An Ansible playbook may be run locally, so it is not inconceivable for a service to be run at boot that would download the latest playbook and run it. This solution is however forcing developers/admins to decide which services need to have their configuration baked in the container and which services should be deployed using a tool like Ansible... unless...

We could use a tool like Ansible to describe all the packages and services to install, along with their configuration. Creating a container would then be achieved by running the Ansible playbook on a base container image. Assuming that the playbook would truly be idempotent (running the playbook multiple times will lead to the same final state), this would mean that there would be no differences between the live-patched container and the new container we created. In other words, we simply morph the currently-running container to the wanted configuration by running the same Ansible playbook we used to create the container, but against the live CI gateway! This will not always remove the need to reboot the CI gateways from time to time (updating the kernel, or services which don't support live-updates without affecting CI jobs), but all the smaller changes can get applied in-situ!

The base container image has to contain the basic dependencies of the tool like Ansible, but if it were made to contain all the OS packages, it would split the final image into three container layers: the base OS container, the packages needed, and the configuration. Updating the configuration would thus result in only a few megabytes of update to download at the next reboot rather than the full OS image, thus reducing the reboot time.

Limits to live-patching containers

Ansible is perfectly-suited to morph a container into its newest version, provided that all the resources used remain static between when the new container was created and when the currently-running container gets live-patched. This is because of Ansible's core principle of idempotency of operations: Rather than running commands blindly like in a shell-script, it first checks what the current state is then, if needed, updates the state to match the desired target. This makes it safe to run the playbook multiple times, but will also allow us to restart a service only if its configuration or one of its dependencies changed.
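The check-then-update principle can be illustrated outside of Ansible with a few lines of Python. This is only a sketch of the idea, using an in-memory dict as the "system state" and hypothetical function names; it is not how Ansible modules are implemented.

```python
def ensure_setting(config, key, value):
    """Set config[key] = value only if needed; report whether it changed."""
    if config.get(key) == value:
        return False
    config[key] = value
    return True


def apply_desired_state(config, desired, restart_service):
    """Converge config towards desired and restart only on actual change."""
    changed = []
    for key, value in desired.items():
        if ensure_setting(config, key, value):
            changed.append(key)
    if changed:
        restart_service()
    return changed
```

Running apply_desired_state a second time with the same desired state changes nothing and triggers no restart, which is the property that makes live-patching a running gateway safe.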

When version pinning of packages is possible (Python, Ruby, Rust, Golang, ...), Ansible can guarantee the idempotency that makes live-patching safe. Unfortunately, package managers of Linux distributions are usually not idempotent: They were designed to ship updates, not pin software versions! In practice, this means that there are no guarantees that the package installed during live-patching will be the same as the one installed in the new base container, thus exposing oneself to potential differences in behaviour between the two deployment methods... The only way out of this issue is to create your own package repository and make sure its content will not change between the creation of the new container and the live-patching of all the CI Gateways. Failing that, all I can advise you to do is pick a stable distribution which will try its best to limit functional changes between updates within the same distribution version (Alpine Linux, CentOS, Debian, ...).

In the end, Ansible won't always be able to make live-updating your container strictly equivalent to rebooting into its latest version, but as long as you are aware of its limitations (or work around them), it will make updating your CI gateways way less of a trouble than it would be otherwise! You will need to find the right balance between live-updatability, and ease of maintenance of the code-base of your gateway.

Putting it all together: The example of valve-infra-container

At this point, you may be wondering how all of this looks in practice! Here is the example of the CI gateways we have been developing for Valve:

  • Ansible playbook: You will find here the entire configuration of our CI gateways. NOTE: we are still working on live-patching!;
  • Valve-infra-base-container: The buildah script used to generate the base container;
  • Valve-infra-container: The buildah script used to generate the final container by running the Ansible playbook.

And if you are wondering how we can go from these scripts to working containers, here is how:

$ podman run --rm -d -p 8088:5000 --name registry
$ env \
    IMAGE_NAME=localhost:8088/valve-infra-base-container \
    BASE_IMAGE=archlinux \
    buildah unshare -- .gitlab-ci/
$ env \
    IMAGE_NAME=localhost:8088/valve-infra-container \
    BASE_IMAGE=valve-infra-base-container \
    ANSIBLE_EXTRA_ARGS='--extra-vars service_mgr_override=inside_container -e development=true' \
    buildah unshare -- .gitlab-ci/

And if you were willing to use our Makefile, it gets even easier:

$ make valve-infra-base-container BASE_IMAGE=archlinux IMAGE_NAME=localhost:8088/valve-infra-base-container
$ make valve-infra-container BASE_IMAGE=localhost:8088/valve-infra-base-container IMAGE_NAME=localhost:8088/valve-infra-container

Not too bad, right?

PS: These scripts are constantly being updated, so make sure to check out their current version!


In this post, we highlighted the difficulty of keeping the CI Gateways up to date when CI jobs can take multiple hours to complete, preventing new jobs from starting until the current queue is emptied and the gateway has rebooted.

We have then shown that despite looking like competing solutions to deploy services in production, containers and tools like Ansible can actually work well together to reduce the need for reboots by morphing the currently-running container into the updated one. There are however some limits to this solution which are important to keep in mind when designing the system.

In the next post, we will be designing the executor service which is responsible for time-sharing the test machines between different CI/manual jobs. We will thus be talking about deploying test environments, BOOTP, and serial consoles!

That's all for now, thanks for making it to the end!

April 12, 2022

Another Quarter Down

As everyone who’s anyone knows, the next Mesa release branchpoint is coming up tomorrow. Like usual, here’s the rundown on what to expect from zink in this release:

  • zero performance improvements (that I’m aware of)
  • Kopper has landed: Vulkan WSI is now used and NVIDIA drivers can finally run at full speed
  • lots of bugs fixed
  • seriously so many bugs
  • I’m not even joking
  • literally this whole quarter was just fixing bugs

So if you find a zink problem in the 22.1 release of Mesa, it’s definitely because of Kopper and not actually anything zink-related.


But also this is sort-of-almost-maybe a lavapipe blog, and that driver has had a much more exciting quarter. Here’s a rundown.

New Extensions:

  • VK_EXT_debug_utils
  • VK_EXT_depth_clip_control
  • VK_EXT_graphics_pipeline_library
  • VK_EXT_image_2d_view_of_3d
  • VK_EXT_image_robustness
  • VK_EXT_inline_uniform_block
  • VK_EXT_pipeline_creation_cache_control
  • VK_EXT_pipeline_creation_feedback
  • VK_EXT_primitives_generated_query
  • VK_EXT_shader_demote_to_helper_invocation
  • VK_EXT_subgroup_size_control
  • VK_EXT_texel_buffer_alignment
  • VK_KHR_format_feature_flags2
  • VK_KHR_memory_model
  • VK_KHR_pipeline_library
  • VK_KHR_shader_integer_dot_product
  • VK_KHR_shader_terminate_invocation
  • VK_KHR_swapchain_mutable_format
  • VK_KHR_synchronization2
  • VK_KHR_zero_initialize_workgroup_memory

Vulkan 1.3 is now supported. We’ve landed a number of big optimizations as well, leading to massively improved CI performance.

Lavapipe: the cutting-edge software implementation of Vulkan.

…as long as you don’t need descriptor indexing.

April 07, 2022

Since Kopper got merged today upstream I wanted to write a little about it as I think the value it brings can be unclear for the uninitiated.

Adam Jackson in our graphics team has been working for the last months together with other community members like Mike Blumenkrantz on implementing Kopper. For those unaware, Zink is an OpenGL implementation running on top of Vulkan, and Kopper is the layer that allows you to translate OpenGL and GLX window handling to Vulkan WSI handling. This means that you can get full OpenGL support even if your GPU only has a Vulkan driver available, and it also means you can for instance run GNOME on top of this stack thanks to the addition of Kopper to Zink.

During the lifecycle of the soon to be released Fedora Workstation 36 we expect to allow you to turn on OpenGL via Kopper and Zink as an experimental feature once we update Fedora 36 to Mesa 22.1.

So you might ask why would I care about this as an end user? Well initially you probably will not care much, but over time it is likely that GPU makers will eventually stop developing native OpenGL drivers and just focus on their Vulkan drivers. At that point Zink and Kopper provide you with a backwards compatibility solution for your OpenGL applications. And for Linux distributions it will also at some point help reduce the amount of code we need to ship and maintain significantly as we can just rely on Zink and Kopper everywhere which of course reduces the workload for maintainers.

This is not going to be an overnight transition though; Zink and Kopper will need some time to stabilize and further improve performance. At the moment performance is generally a bit slower than the native drivers, although we have seen some examples of games which actually got better performance with specific driver combinations, and over time we expect to see the negative performance delta shrink. The delta is unlikely to ever fully go away due to the cost of translating between the two APIs, but on the other side we are going to be in a situation in a few years where all current/new applications use Vulkan natively (or through Proton) and thus the stuff that relies on OpenGL will be older software, so combined with faster GPUs you should still get more than good enough performance. And at that point Zink will be a lifesaver for your old OpenGL based applications and games.

April 06, 2022

Just In Time

By the time you read this, Kopper will have landed. This means a number of things have changed:

  • Zink now uses Vulkan WSI and has actual swapchains
  • Combinations of clunky Mesa environment variables are no longer needed; MESA_LOADER_DRIVER_OVERRIDE=zink will work for all drivers
  • Some things that didn’t used to work now work
  • Some things that used to work now don’t

In particular, lots of cases of garbled/flickering rendering (I’m looking at you, Supertuxkart on ANV) will now be perfectly smooth and without issue.

Also there’s no swapinterval control yet, so X11 clients will have no choice but to churn out the maximum amount of FPS possible at all times.

You (probably?) aren’t going to be able to run a compositor on zink just yet, but it’s on the 22.1 TODO list.

Big thanks to Adam Jackson for carrying this project on his back.

April 05, 2022

Apparently, in some parts of this world, the /usr/-merge transition is still ongoing. Let's take the opportunity to have a look at one specific way to take benefit of the /usr/-merge (and associated work) IRL.

I develop system-level software as you might know. Oftentimes I want to run my development code on my PC but be reasonably sure it cannot destroy or otherwise negatively affect my host system. Now I could set up a container tree for that, and boot into that. But often I am too lazy for that, I don't want to bother with a slow package manager setting up a new OS tree for me. So here's what I often do instead — and this only works because of the /usr/-merge.

I run a command like the following (without any preparatory work):

systemd-nspawn \
        --directory=/ \
        --volatile=yes \
        -U \
        --set-credential=passwd.hashed-password.root:$(mkpasswd mysecret) \
        --set-credential=firstboot.locale:C.UTF-8 \
        --bind-user=lennart

And then I very quickly get a login prompt on a container that runs the exact same software as my host — but is also isolated from the host. I do not need to prepare any separate OS tree or anything else. It just works. And my host user lennart is just there, ready for me to log into.

So here's what these systemd-nspawn options specifically do:

  • --directory=/ tells systemd-nspawn to run off the host OS' file hierarchy. That smells like danger of course, running two OS instances off the same directory hierarchy. But don't be scared, because:

  • --volatile=yes enables volatile mode. Specifically this means what we configured with --directory=/ as root file system is slightly rearranged. Instead of mounting that tree as it is, we'll mount a tmpfs instance as actual root file system, and then mount the /usr/ subdirectory of the specified hierarchy into the /usr/ subdirectory of the container file hierarchy in read-only fashion – and only that directory. So now we have a container directory tree that is basically empty, but imports all host OS binaries and libraries into its /usr/ tree. All software installed on the host is also available in the container with no manual work. This mechanism only works because on /usr/-merged OSes vendor resources are monopolized at a single place: /usr/. It's sufficient to share that one directory with the container to get a second instance of the host OS running. Note that this means /etc/ and /var/ will be entirely empty initially when this second system boots up. Thankfully, forward looking distributions (such as Fedora) have adopted systemd-tmpfiles and systemd-sysusers quite pervasively, so that system users and files/directories required for operation are created automatically should they be missing. Thus, even though at boot the mentioned directories are initially empty, once the system is booted up they are sufficiently populated for things to just work.

  • -U means we'll enable user namespacing, in fully automatic mode. This does three things: it picks a free host UID range dynamically for the container, it sets up user namespacing for the container processes, mapping that host UID range to UIDs 0…65534 in the container, and it sets up a similar UID mapped mount on the /usr/ tree of the container. Net effect: file ownerships as set on the host OS tree appear as if they belong to the very same users inside of the container environment, except that we use user namespacing for everything, and thus the users are actually neatly isolated from the host.

  • --set-credential=passwd.hashed-password.root:$(mkpasswd mysecret) passes a credential to the container. Credentials are bits of data that you can pass to systemd services and whole systems. They are actually awesome concepts (e.g. they support TPM2 authentication/encryption that just works!) but I am not going to go into details around that, given it's off-topic in this specific scenario. Here we just take advantage of the fact that systemd-sysusers looks for a credential called passwd.hashed-password.root to initialize the root password of the system from. We set it to mysecret. This means once the system is booted up we can log in as root with the supplied password. Yay. (Remember, /etc/ is initially empty on this container, and thus also carries no /etc/passwd or /etc/shadow, and thus has no root user record, and thus no root password.)

    mkpasswd is a tool that converts a plain text password into a UNIX hashed password, which is what this specific credential expects.

  • Similarly, --set-credential=firstboot.locale:C.UTF-8 tells the systemd-firstboot service in the container to initialize /etc/locale.conf with this locale.

  • --bind-user=lennart binds the host user lennart into the container, also as user lennart. This does two things: it mounts the host user's home directory into the container. It also copies a minimal user record of the specified user into the container that nss-systemd then picks up and includes in the regular user database. This means, once the container is booted up I can log in as lennart with my regular password, and once I have logged in I will see my regular host home directory, and can make changes to it. Yippieh! (This does a couple more things, such as UID mapping and things, but let's not get lost in too many details.)
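The UID range mapping that -U sets up can be pictured as plain offset arithmetic. The following Python sketch is only an illustration of that idea: HOST_BASE is a made-up example value (systemd-nspawn picks the real range dynamically), and the function names are hypothetical, not part of any systemd API.

```python
# Illustrative model of a user-namespace UID range mapping.
# HOST_BASE is a hypothetical example value; with -U, systemd-nspawn
# picks a free host UID range on its own.
HOST_BASE = 524288

# Container UIDs 0…65534, as described in the text above.
CONTAINER_UID_COUNT = 65535


def to_host_uid(container_uid):
    """Map a UID as seen inside the container to the backing host UID."""
    if not 0 <= container_uid < CONTAINER_UID_COUNT:
        raise ValueError("UID is outside the mapped container range")
    return HOST_BASE + container_uid


def to_container_uid(host_uid):
    """Map a host UID back into the container, or None if it is unmapped."""
    offset = host_uid - HOST_BASE
    if 0 <= offset < CONTAINER_UID_COUNT:
        return offset
    return None
```

In this model, root inside the container (UID 0) is just the unprivileged host UID HOST_BASE, which is why processes in the container stay isolated from the host's real users.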

So, if I run this, I will very quickly get a login prompt, where I can log in as my regular user. I have full access to my host home directory, but otherwise everything is nicely isolated from the host, and changes outside of the home directory are either prohibited or are volatile, i.e. go to a tmpfs instance whose lifetime is bound to the container's lifetime: when I shut down the container I just started, then any changes outside of my user's home directory are lost.

Note that while here I use --volatile=yes in combination with --directory=/ you can actually use it on any OS hierarchy, i.e. just about any directory that contains OS binaries.

Similarly, the --bind-user= stuff works with any OS hierarchy too (but do note that only systemd 249 and newer will pick up the user records passed to the container that way, i.e. this requires at least v249 both on the host and in the container to work).

Or in short: the possibilities are endless!


For this all to work, you need:

  1. A recent kernel (5.15 should suffice, as it brings UID mapped mounts for the most common file systems, so that -U and --bind-user= can work well.)

  2. A recent systemd (249 should suffice, which brings --bind-user=, and a -U switch backed by UID mapped mounts).

  3. A distribution that adopted the /usr/-merge, systemd-tmpfiles and systemd-sysusers so that the directory hierarchy and user databases are automatically populated when empty at boot. (Fedora 35 should suffice.)


While a lot of today's software actually works well out of the box on systems that come up with an unpopulated /etc/ and /var/, and either falls back to reasonable built-in defaults or deploys systemd-tmpfiles to create what is missing, things aren't perfect: some software typically installed on desktop OSes will fail to start when invoked in such a container, and be visible as ugly failed services, but it won't stop me from logging in and using the system for what I want to use it for. It would be excellent to get that fixed, though. This can either be fixed in the relevant software upstream (i.e. if opening your configuration file fails with ENOENT, then just default to reasonable defaults), or in the distribution packaging (i.e. add a tmpfiles.d/ file that copies or symlinks in skeleton configuration from /usr/share/factory/etc/ via the C or L line types).

And then there's certain software dealing with hardware management and similar that simply cannot reasonably work in a container (as device APIs on Linux are generally not virtualized for containers). It would be excellent if software like that were updated to carry ConditionVirtualization=!container or ConditionPathIsReadWrite=/sys conditionalization in its unit files, so that it is automatically and cleanly skipped when executed in such a container environment.

And that's all for now.

March 30, 2022

Do you want to start a career in open-source? Do you want to learn amazing skills while getting paid? Keep reading!

Igalia Coding Experience

Igalia logo

Igalia has a grant program that gives students with a background in Computer Science, Information Technology and Free Software their first exposure to the professional world, working hand in hand with Igalia programmers and learning with them. It is called Igalia Coding Experience.

While this experience is open for everyone, Igalia expressly invites women (both cis and trans), trans men, and genderqueer people to apply. The Coding Experience program gives preference to applications coming from underrepresented groups in our industry.

You can apply to any of the offered grants this year: Web Standards, WebKit, Chromium, Compilers and Graphics.

In the case of Graphics, the student will have the opportunity to deal with the Linux DRM subsystem. Specifically, the student will improve the test coverage of DRM drivers through IGT, a testing framework designed for this purpose. This includes learning how to contribute to Linux kernel/DRM, interacting with the DRI-devel community, understanding DRM core functionality, and increasing the test coverage of the IGT tool.

The conditions of our Coding Experience program are:

  • Mentorship by one of Igalia’s outstanding open source contributors in the field.
  • It is remote-friendly. Students can participate in it wherever they live.
  • Hours: 450h
  • Compensation: 6,500€
  • Usual timetables:
    • 3 months full-time
    • 6 months part-time

The submission period goes from March 16th until April 30th. Students will be selected in May. We will work with the student to arrange a suitable starting date during 2022, from June onwards, and finishing on a date to be agreed that suits their schedule.

Google Summer of Code (GSoC)

GSoC logo

The popular Google Summer of Code is another option for students. This year, the X.Org Foundation participates as an open source organization. We have some proposed ideas but you can propose any project idea as well.

The timeline for proposals is from April 4th to April 19th. However, you should contact us beforehand in order to discuss your ideas with potential mentors.

GSoC gives a stipend to students too (from 1,500 to 6,000 USD depending on the size of the project and your location). The number of hours needed to complete the project varies from 175 to 350, depending on the size of the project as well.

Of course, this is a remote-friendly program, so any student in the world can participate in it.

Outreachy

Outreachy logo

Outreachy is another internship program for applicants from around the world who face under-representation, systemic bias or discrimination in the technology industry of their country. Outreachy supports diversity in free and open source software!

Outreachy internships are remote, paid ($7,000), and last three months. Outreachy internships run from May to August and December to March. Applications open in January and August.

The projects listed cover many areas of the open-source software stack: from kernel to distributions work. Please check current proposals to find anything that is interesting for you!

X.Org Endless Vacation of Code (EVoC)

X.Org logo

The X.Org Foundation voted in 2008 to initiate a program known as the X.Org Endless Vacation of Code (EVoC), in order to give more flexibility to students: an EVoC mentorship can be initiated at any time during the calendar year, and the Board can fund as many of these mentorships as it sees fit.

Like the other programs, EVoC is remote-friendly as well. The stipend goes as follows: an initial payment of 500 USD and two further payments of 2,250 USD upon completion of project milestones. EVoC does not set limits in hours, but there are some requirements and steps to do before applying. Please read the X.Org Endless Vacation of Code website to learn more.


As you can see, there are many ways to enter the Open Source community. Although I focused on the programs related to the open source graphics stack, there are many others.

With all of these possibilities (and many more, including internships at companies), I hope that you can apply and that the experience will encourage you to start a career in the open-source community.

Happy hacking!

March 29, 2022

Ecosystem Victory

Today marks (at last) the release of some cool extensions I’ve had the pleasure of working on:

VK_EXT_graphics_pipeline_library

This extension revolutionizes how PSOs can be managed by the application, and it’s the first step towards solving the dreaded stuttering that zink suffers from when attempting to play any sort of game. There’s definitely going to be more posts from me on this in the future.

VK_EXT_primitives_generated_query

Currently, zink has to do some awfulness internally to replicate the awfulness of GL_PRIMITIVES_GENERATED. With this extension, at least some of that awfulness can be pushed down to the driver. And the spec, of course. You can’t scrub this filth out of your soul.


The Mesa community being awesome as it is, support for these extensions is already underway:

  • ANV has merge requests up already for both of them
  • RADV has a merge request up for preliminary support of VK_EXT_primitives_generated_query on certain hardware, and VK_EXT_graphics_pipeline_library support is nearing completion

But obviously Lavapipe, being the greatest of all drivers, will already have support landed by the time you read this post.

Let the bug reports flow!

March 22, 2022

OpenSSH has this very nice setting, VerifyHostKeyDNS, which when enabled, will pull SSH host keys from DNS, and you no longer need to either trust on first use, or copy host keys around out of band.

Naturally, trusting unsecured DNS is a bit scary, so this requires the record to be signed using DNSSEC. This has worked for a long time, but then broke, seemingly out of the blue. Running ssh -vvv gave output similar to

debug1: found 4 insecure fingerprints in DNS
debug3: verify_host_key_dns: checking SSHFP type 1 fptype 2
debug3: verify_host_key_dns: checking SSHFP type 4 fptype 2
debug1: verify_host_key_dns: matched SSHFP type 4 fptype 2
debug3: verify_host_key_dns: checking SSHFP type 4 fptype 1
debug1: verify_host_key_dns: matched SSHFP type 4 fptype 1
debug3: verify_host_key_dns: checking SSHFP type 1 fptype 1
debug1: matching host key fingerprint found in DNS

even though the zone was signed, the resolver was checking the signature and I even checked that the DNS response had the AD bit set.

The fix was to add options trust-ad to /etc/resolv.conf. Without this, glibc will discard the AD bit from any upstream DNS servers. Note that you should only add this if you actually have a trusted DNS resolver. I run unbound on localhost, so if somebody can do a man-in-the-middle attack on that traffic, I have other problems.
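As a quick way to see whether a given resolv.conf already carries the option, here is a small Python sketch. It is a simplified reading of the file format (one directive per line, comment lines starting with # or ;), and the function name is made up for illustration; it does not replicate everything glibc's own parser does.

```python
def resolv_conf_trusts_ad(text):
    """Return True if any 'options' line in the given resolv.conf text
    contains the 'trust-ad' option."""
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines.
        if not stripped or stripped[0] in "#;":
            continue
        tokens = stripped.split()
        if tokens[0] == "options" and "trust-ad" in tokens[1:]:
            return True
    return False


if __name__ == "__main__":
    # Check the system file; the path is the one named in the post.
    with open("/etc/resolv.conf") as handle:
        print(resolv_conf_trusts_ad(handle.read()))
```

The check only tells you that the option is set, not whether the configured resolver is actually trustworthy; that part remains a judgment call, as noted above.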

March 18, 2022

March Forward

Anyone who knows me knows that I hate cardio.

Full stop.

I’m not picking up and putting down all these heavy weights just so I can go for a jog afterwards and lose all my gains.

Similarly, I’m not trying to set a world record for speed-writing code. This stuff takes time, and it can’t be rushed.


Lavapipe: The Best Driver

Today we’re a Lavapipe blog.

Lavapipe is, of course, the software implementation of Vulkan that ships with Mesa, originally braindumped directly into the repo by graphics god and part-time Twitter executive, Dave Airlie. For a long time, the Lavapipe meme has been “Try it on Lavapipe—it’s not conformant, but it still works pretty good haha” and we’ve all had a good chuckle at the idea that anything not officially signed and stamped by Khronos could ever draw a single triangle properly.

But, pending a single MR that fixes the four outstanding failures for Vulkan 1.2 conformance, as of last week, Lavapipe passes 100% of conformance tests. Thus, pending a merge and a Mesa bugfix release, Lavapipe will achieve official conformance.

And then we’ll have a new meme: Vulkan 1.3 When?

Meme Over

As some have noticed, I’ve been quietly (very, very, very, very, very, very, very, very, very, very, very, very quietly) implementing a number of features for Lavapipe over the past week.

But why?

Khronos-fanatics will immediately recognize that these features are all part of Vulkan 1.3.

Which Lavapipe also now supports, pending more merges which I expect to happen early next week.

This is what a sprint looks like.

March 17, 2022

We had a busy 2021 within the GNU/Linux graphics stack at Igalia.

Would you like to know what we did last year? Keep reading!

Open Source Raspberry Pi GPU (VideoCore) drivers

Raspberry Pi 4, model B

Last year both the OpenGL and the Vulkan drivers received a lot of love. For example, we implemented several optimizations, such as improvements in the v3dv pipeline cache. In this blog post, Alejandro Piñeiro presents how we improved the v3dv pipeline cache times by replacing the two cache lookups done previously with a single one, and shows some numbers on both a synthetic test (modified CTS test), and some games.

We also made performance improvements to the v3d compilers for OpenGL and Vulkan. Iago Toral explains our work on optimizing the backend compiler with techniques such as improving memory lookup efficiency, reducing instruction counts, instruction packing, and uniform handling, among others. There are some numbers that show framerate improvements from ~6% to ~62% on different games / demos.

Framerate improvement after optimization (in %). Taken from Iago’s blogpost

Of course, there was work related to feature implementation. This blog post from Iago lists some Vulkan extensions implemented in the v3dv driver in 2021. Although not all the implemented extensions are listed there, you can see the driver is quickly catching up in its Vulkan extension support.

My colleague Juan A. Suárez implemented performance counters in the v3d driver (an OpenGL driver) which required modifications in the kernel and in the Mesa driver. More info in his blog post.

There was more work in other areas done in 2021 too, like the improved support for RenderDoc and GFXReconstruct. And not to forget the kernel contributions to the DRM driver done by Melissa Wen, who not only worked on developing features for it, but also reviewed all the patches that came from the community.

However, the biggest milestone for the v3dv driver was to be Vulkan 1.1 conformant in the last quarter of 2021. That was just one year after becoming Vulkan 1.0 conformant. As you can imagine, that implied a lot of work implementing features, fixing bugs and, of course, improving the driver in many different ways. Great job folks!

If you want to know more about all the work done on these drivers during 2021, there is an awesome talk from my colleague Alejandro Piñeiro at FOSDEM 2022: “v3dv: Status Update for Open Source Vulkan Driver for Raspberry Pi 4”, and another one from my colleague Iago Toral at XDC 2021: “Raspberry Pi Vulkan driver update”. Below you can find the video recordings of both talks.

FOSDEM 2022 talk: “v3dv: Status Update for Open Source Vulkan Driver for Raspberry Pi 4”

XDC 2021 talk: “Raspberry Pi Vulkan driver update”

Open Source Qualcomm Adreno GPU drivers

Photo of the Qualcomm® Robotics RB3 Platform embedded board that I use for Turnip development.

There were also several achievements by Igalians on both the Freedreno and Turnip drivers. These are reverse engineered open-source drivers for Qualcomm Adreno GPUs: Freedreno for OpenGL and Turnip for Vulkan.

Starting in 2021, my colleague Danylo Piliaiev helped with implementing the missing bits in Freedreno for supporting OpenGL 3.3 on Adreno 6xx GPUs. His blog post explains his work, such as implementing ARB_blend_func_extended, ARB_shader_stencil_export and fixing a variety of CTS test failures.

Related to this, my colleague Guilherme G. Piccoli worked on porting a recent kernel to one of the boards we use for Freedreno development: the Inforce 6640. He did an awesome job getting a 5.14 kernel booting on that embedded board. If you want to know more, please read the blog post he wrote explaining all the issues he found and how he fixed them!

Picture of the Inforce 6640 board that Guilherme used for his development. Image from his blog post.

However, the biggest chunk of work was done in the Turnip driver. We have implemented a long list of Vulkan extensions: VK_KHR_buffer_device_address, VK_KHR_depth_stencil_resolve, VK_EXT_image_view_min_lod, VK_KHR_spirv_1_4, VK_EXT_descriptor_indexing, VK_KHR_timeline_semaphore, VK_KHR_16bit_storage, VK_KHR_shader_float16, VK_KHR_uniform_buffer_standard_layout, VK_EXT_extended_dynamic_state, VK_KHR_pipeline_executable_properties, VK_VALVE_mutable_descriptor_type, VK_KHR_vulkan_memory_model and many others. Danylo Piliaiev and Hyunjun Ko are terrific developers!

But not all our work was related to feature development. For example, I implemented the Low-Resolution Z-buffer (LRZ) HW optimization, while Danylo fixed a long list of rendering bugs that happened in real-world applications (blog post 1, blog post 2) like D3D games run on Vulkan (thanks to DXVK and VKD3D) and instrumented the backend compiler to dump register values, among many other fixes and optimizations.

However, the biggest achievement was getting Vulkan 1.1 conformance for Turnip. Danylo wrote a blog post mentioning all the work we did to achieve that this year.

If you want to know more, don’t miss this FOSDEM 2022 talk given by my colleague Hyunjun Ko called “The status of turnip driver development. What happened in 2021 and will happen in 2022 for turnip.”. Video below.

FOSDEM 2022 talk: “The status of turnip driver development. What happened in 2021 and will happen in 2022 for turnip.”

Vulkan contributions

Our graphics work doesn’t cover only driver development; we also participate in the Khronos Group as Vulkan Conformance Test Suite developers and even as spec contributors.

My colleague Ricardo Garcia is a very productive developer. He worked on implementing tests for Vulkan Ray Tracing extensions (read his blog post about ray tracing for more info about this big Vulkan feature), implemented tests for a long list of Vulkan extensions like VK_KHR_present_id and VK_KHR_present_wait, VK_EXT_multi_draw (watch his talk at XDC 2021), VK_EXT_border_color_swizzle (watch his talk at FOSDEM 2022) among many others. In many of these extensions, he contributed to their respective specifications in a significant way (just search for his name in the Vulkan spec!).

XDC 2021 talk: “Quick Overview of VK_EXT_multi_draw”

FOSDEM 2022 talk: “Fun with border colors in Vulkan. An overview of the story behind VK_EXT_border_color_swizzle”

Similarly, I participated modestly in this effort by developing tests for some extensions like VK_EXT_image_view_min_lod (blog post). Of course, both Ricardo and I implemented many new CTS tests by adding coverage to existing ones, fixed lots of bugs in existing ones, and reported dozens of driver issues to the respective Mesa developers.

Not only that, both Ricardo and I appeared as Vulkan 1.3 spec contributors.

Vulkan 1.3

Another interesting piece of work we started in 2021 is Vulkan Video support on GStreamer. My colleague Víctor Jaquez presented the Vulkan Video extension at XDC 2021 and soon after he started working on Vulkan Video’s h264 decoder support. You can find more information in his blog post, or by watching his XDC 2021 talk below:

XDC 2021 talk: “Video decoding in Vulkan: VK_KHR_video_queue/decode APIs”

Before I leave this section, don’t forget to take a look at Ricardo’s blogpost on the debugPrintfEXT feature. If you are a Graphics developer, you will find this feature very interesting for debugging issues in your applications!

Along those lines, Danylo gave a talk at XDC 2021 about dissecting and fixing Vulkan rendering issues in drivers with RenderDoc. Very useful for driver developers! Watch the talk below:

XDC 2021 talk: “Dissecting Vulkan rendering issues in drivers with RenderDoc”

To finalize this blog post, remember that you now have vkrunner (the Vulkan shader tester created by Igalia) available for RPM-based GNU/Linux distributions. In case you are working with embedded systems, maybe my blog post about cross-compiling with icecream will help to speed up your builds.

This is just a summary of the highlights of the work we did last year. I’m sorry if I am missing more work from my colleagues.

March 16, 2022

At Last

Those of you in-the-know are well aware that Zink has always had a crippling addiction to seamless cubemaps. Specifically, Vulkan doesn’t support non-seamless cubemaps since nobody wants those anymore, but this is the default mode of sampling for OpenGL.

Thus, it is impossible for Zink to pass GL 4.6 conformance until this issue is resolved.

But what does this even mean?

Cubes: They Have Faces Just Like You And Me

As veterans of intro to geometry courses all know*, a cube is a 3D shape that has six identically-sized sides called “faces”. In computer graphics, each of these faces has its own content that can be read and written discretely.

When interpolating data from a cube-type texture, there are two methods:

  • Using a seamless interpretation of a cube yields cases where pixels may be interpolated across the boundaries of faces
  • Using a non-seamless interpretation of a cube yields cases where pixels may be clamped/wrapped at the boundary of a face

Since Vulkan only offers the seamless interpretation, Zink effectively ends up interpolating across the boundaries of cube faces when it should instead be clamping/wrapping pixel data to a single face.

But who cares about all that math nonsense when the result is that Zink is still failing CTS cases?

*Disclosure: I have been advised by my lawyer to state on the record that I have never taken an intro to geometry course and have instead copied this entire blog post off StackOverflow.

How To Make Cubes Worse 101

In order to replicate this basic OpenGL behavior, a substantial amount of code is required—most of it terrible.

The first step is to determine when a cube should be sampled as non-seamless. OpenGL helpfully has only one extension (plus this other extension) which control seamless access of cubemaps, so as long as that one state (plus the other state) isn’t enabled, a cube shouldn’t be interpreted seamlessly.

With this done, what happens at coordinates that lie at the edge of a face? The OpenGL wrap enum covers this. For the purposes of this blog post, only two wrap modes exist:

  • edge clamping - clamp the coordinate to the edge of the face (coord = extent or coord = 0)
  • repeat - pretend the face repeats infinitely (coord %= extent)
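For illustration only, the two wrap modes above can be written out as tiny Python helpers. This is a sketch of the simplified formulas from the list, operating on a single coordinate of one face; the function names are made up here, and it is not the shader rewrite that Zink actually performs.

```python
def wrap_clamp_to_edge(coord, extent):
    """Edge clamping: coordinates outside the face stick to the nearest
    edge, i.e. they become 0 or extent."""
    return min(max(coord, 0.0), extent)


def wrap_repeat(coord, extent):
    """Repeat: pretend the face tiles infinitely (coord %= extent).
    Python's % already yields a non-negative result for a positive extent."""
    return coord % extent


# A coordinate just past the right edge of a 4-wide face:
print(wrap_clamp_to_edge(4.5, 4.0))  # 4.0
print(wrap_repeat(4.5, 4.0))         # 0.5
```

Either way the sample stays on the same face, which is exactly what the seamless interpretation does not guarantee.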

So now non-seamless cubes are detected, and the behavior for handling their non-seamlessness is known, but how can this actually be done?


In short, this requires shader rewrites to handle coordinate clamping, then wrapping. Since it’s not going to be desirable to have a different shader variant for every time the wrap mode changes, this means loading the parameters from a UBO. Since it’s further not going to be desirable to have shader variants for each per-texture seamless/non-seamless cube combination, this means also making the rewrite handle the no-op case of continuing to use the original, seamless coordinates after doing all the calculations for the non-seamless case.

Worst of all, this has to be plumbed through the Rube Goldberg machine that is Mesa.

It was terrible, and it continues to be terrible.

If I were another blogger, I would probably take this opportunity to flex my Calculus credentials by putting all kinds of math here, but nobody really wants to read that, and the hell if I know how to make markdown do that whiteboard thing so I can doodle in fancy formulas or whatever from the spec.

Instead, you can read the merge request if you’re that deeply invested in cube mechanics.

March 09, 2022

Subgroup operations or wave intrinsics, such as reducing a value across the threads of a shader subgroup or wave, were introduced in GPU programming languages a while ago. They communicate with other threads of the same wave, for example to exchange the input values of a reduction, but not necessarily with all of them if there is divergent control flow.

In LLVM, we call such operations convergent. Unfortunately, LLVM does not define how the set of communicating threads in convergent operations -- the set of converged threads -- is affected by control flow.

If you're used to thinking in terms of structured control flow, this may seem trivial. Obviously, there is a tree of control flow constructs: loops, if-statements, and perhaps a few others depending on the language. Two threads are converged in the body of a child construct if and only if both execute that body and they are converged in the parent. Throw in some simple and intuitive rules about loop counters and early exits (nested return, break and continue, that sort of thing) and you're done.

In an unstructured control flow graph, the answer is not obvious at all. I gave a presentation at the 2020 LLVM Developers' Meeting that explains some of the challenges as well as a solution proposal that involves adding convergence control tokens to the IR.

Very briefly, convergent operations in the proposal use a token variable that is defined by a convergence control intrinsic. Two dynamic instances of the same static convergent operation from two different threads are converged if and only if the dynamic instances of the control intrinsic producing the used token values were converged.

(The published draft of the proposal talks of multiple threads executing the same dynamic instance. I have since been convinced that it's easier to teach this matter if we instead always give every thread its own dynamic instances and talk about a convergence equivalence relation between dynamic instances. This doesn't change the resulting semantics.)

The draft has three such control intrinsics: anchor, entry, and (loop) heart. Of particular interest here is the heart. For the most common and intuitive use cases, a heart intrinsic is placed in the header of natural loops. The token it defines is used by convergent operations in the loop. The heart intrinsic itself also uses a token that is defined outside the loop: either by another heart in the case of nested loops, or by an anchor or entry. The heart combines two intuitive behaviors:

  • It uses a token in much the same way that convergent operations do: two threads are converged for their first execution of the heart if and only if they were converged at the intrinsic that defined the used token.
  • Two threads are converged at subsequent executions of the heart if and only if they were converged for the first execution and they are currently at the same loop iteration, where iterations are counted by a virtual loop counter that is incremented at the heart.
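The two rules can be modelled with a very small Python sketch. It assumes a deliberately simplified setting that is not taken from the draft: each thread is described only by the converged group it belonged to at the intrinsic that defined the used token, and by how many times it executes the heart. The function name and the data layout are made up for illustration.

```python
from collections import defaultdict


def heart_convergence(threads):
    """threads maps thread id -> (entry_group, iterations), where
    entry_group identifies the converged set at the token definition and
    iterations is how often the thread executes the heart.

    Returns a dict mapping (entry_group, iteration) to the set of threads
    that are converged at that execution of the heart."""
    groups = defaultdict(set)
    for thread_id, (entry_group, iterations) in threads.items():
        # The virtual loop counter is incremented at each heart execution.
        for iteration in range(iterations):
            groups[(entry_group, iteration)].add(thread_id)
    return dict(groups)


threads = {"t0": ("g", 3), "t1": ("g", 2), "t2": ("h", 2)}
result = heart_convergence(threads)
print(sorted(result[("g", 0)]))  # ['t0', 't1']
print(sorted(result[("g", 2)]))  # ['t0']
```

Threads t0 and t1 stay converged for as long as both are still iterating, while t2 never joins them because it was not converged with them where the token was defined.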

Viewed from this angle, how about we define a weaker version of these rules that lies somewhere between an anchor and a loop heart? We could call it a "light heart", though I will stick with "iterating anchor". The iterating anchor defines a token but has no arguments. Like for the anchor, the set of converged threads is implementation-defined -- when the iterating anchor is first encountered. When threads encounter the iterating anchor again without leaving the dominance region of its containing basic block, they are converged if and only if they were converged during their previous encounter of the iterating anchor.

The notion of an iterating anchor came up when discussing the convergence behaviors that can be guaranteed for natural loops. Is it possible to guarantee that natural loops always behave in the natural way -- according to their loop counter -- when it comes to convergence?

Naively, this should be possible: just put hearts into loop headers! Unfortunately, that's not so straightforward when multiple natural loops are contained in an irreducible loop: 

Hearts in A and C must refer to a token defined outside the loops; that is, a token defined in E. The resulting program is ill-formed because it has a closed path that goes through two hearts that use the same token, but the path does not go through the definition of that token. This well-formedness rule exists because the rules about heart semantics are unsatisfiable if the rule is broken.

The underlying intuitive issue is that if the branch at E is divergent in a typical implementation, the wave (or subgroup) must choose whether A or C is executed first. Neither choice works. The heart in A indicates that (among the threads that are converged in E) all threads that visit A (whether immediately or via C) must be converged during their first visit of A. But if the wave executes A first, then threads which branch directly from E to A cannot be converged with those that first branch to C. The opposite conflict exists if the wave executes C first.

If we replace the hearts in A and C by iterating anchors, this problem goes away because the convergence during the initial visit of each block is implementation-defined. In practice, it should fall out of which of the blocks the implementation decides to execute first.

So it seems that iterating anchors can fill a gap in the expressiveness of the convergence control design. But are they really a sound addition? There are two main questions:

  • Satisfiability: Can the constraints imposed by iterating anchors be satisfied, or can they cause the sort of logical contradiction discussed for the example above? And if so, is there a simple static rule that prevents such contradictions?
  • Spooky action at a distance: Are there generic code transforms which change semantics while changing a part of the code that is distant from the iterating anchor?
The second question is important because we want to add convergence control to LLVM without having to audit and change the existing generic transforms. We certainly don't want to hurt compile-time performance by increasing the amount of code that generic transforms have to examine for making their decisions.


Consider the following simple CFG with an iterating anchor in A and a heart in B that refers back to a token defined in E:

Now consider two threads that are initially converged with execution traces:

  1. E - A - A - B - X
  2. E - A - B - A - X
The heart rule implies that the threads must be converged in B. The iterating anchor rule implies that if the threads are converged in their first dynamic instances of A, then they must also be converged in their second dynamic instances of A, which leads to a temporal paradox.

One could try to resolve the paradox by saying that the threads cannot be converged in A at all, but this would mean that the threads must diverge before a divergent branch occurs. That seems unreasonable, since typical implementations want to avoid divergence as long as control flow is uniform.

The example arguably breaks the spirit of the rule about convergence regions from the draft proposal linked above, and so a minor change to the definition of convergence region may be used to exclude it.

What if the CFG instead looks as follows, which does not break any rules about convergence regions:

For the same execution traces, the heart rule again implies that the threads must be converged in B. The convergence of the first dynamic instances of A is technically implementation-defined, but we'd expect most implementations to be converged there.

The second dynamic instances of A cannot be converged due to the convergence of the dynamic instances of B. That's okay: the second dynamic instance of A in thread 2 is a re-entry into the dominance region of A, and so its convergence is unrelated to any convergence of earlier dynamic instances of A.

Spooky action at a distance 

Unfortunately, we still cannot allow this second example. A program transform may find that the conditional branch in E is constant and the edge from E to B is dead. Removing that edge brings us back to the previous example which is ill-formed. However, a transform which removes the dead edge would not normally inspect the blocks A and B or their dominance relation in detail. The program becomes ill-formed by spooky action at a distance.

The following static rule forbids both example CFGs: if there is a closed path through a heart and an iterating anchor, but not through the definition of the token that the heart uses, then the heart must dominate the iterating anchor.
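The static rule can be checked mechanically. Below is a minimal sketch in Python, not anything from LLVM: the CFG figures are not reproduced here, so the edge lists are inferred from the two execution traces above (plus the E to B edge for the second example), and the helper names `reachable`, `dominators` and `rule_holds` are made up for illustration. Dominance is only tracked per basic block.

```python
def reachable(succ, start, banned):
    """Blocks reachable from `start` via one or more edges, never entering `banned`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in succ.get(node, ()):
            if nxt != banned and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def dominators(succ, entry):
    """Iterative dominator sets: dom[n] = {n} | intersection of dom[p] over predecessors p."""
    nodes = set(succ) | {s for targets in succ.values() for s in targets}
    preds = {n: set() for n in nodes}
    for n, targets in succ.items():
        for s in targets:
            preds[s].add(n)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in preds[n]]
            new = (set.intersection(*incoming) if incoming else set()) | {n}
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def rule_holds(succ, entry, token_def, heart, anchor):
    """True if the static rule is satisfied for one heart / iterating anchor pair."""
    if token_def in (heart, anchor):
        return True  # every closed path through that block also hits the token definition
    # A closed path through both blocks that avoids the token definition exists
    # exactly when each block can reach the other without entering token_def.
    closed_path = (anchor in reachable(succ, heart, token_def)
                   and heart in reachable(succ, anchor, token_def))
    return (not closed_path) or heart in dominators(succ, entry)[anchor]


# Edges inferred from the traces E-A-A-B-X and E-A-B-A-X (an assumption).
first_example = {"E": ["A"], "A": ["A", "B", "X"], "B": ["A", "X"], "X": []}
second_example = {"E": ["A", "B"], "A": ["A", "B", "X"], "B": ["A", "X"], "X": []}
# Hypothetical CFG in which the heart block B dominates the anchor block A.
dominated = {"E": ["B"], "B": ["A", "X"], "A": ["B", "X"], "X": []}

for name, cfg in [("first", first_example), ("second", second_example), ("dominated", dominated)]:
    print(name, rule_holds(cfg, entry="E", token_def="E", heart="B", anchor="A"))
```

Both example CFGs are rejected because B does not dominate A, while the hypothetical third CFG is accepted.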

There is at least one other issue of spooky action at a distance. If the iterating anchor is not the first (non-phi) instruction of its basic block, then it may be preceded by a function call in the same block. The callee may contain control flow that ends up being inlined. Back edges that previously pointed at the block containing the iterating anchor will then point to a different block, which changes the behavior quite drastically. Essentially, the iterating anchor is reduced to a plain anchor.

What can we do about that? It's tempting to decree that an iterating anchor must always be the first (non-phi) instruction of a basic block. Unfortunately, this is not easily done in LLVM in the face of general transforms that might sink instructions or merge basic blocks.

Preheaders to the rescue 

We could chew through some other ideas for making iterating anchors work, but that turns out to be unnecessary. The desired behavior of iterating anchors can be obtained by inserting preheader blocks. The initial example of two natural loops contained in an irreducible loop becomes: 

Place anchors in Ap and Cp and hearts in A and C that use the token defined by their respective dominating anchor. Convergence at the anchors is implementation-defined, but relative to this initial convergence at the anchor, convergence inside the natural loops headed by A and C behaves in the natural way, based on a virtual loop counter. The transform of inserting an anchor in the preheader is easily generalized.

To sum it up: We've concluded that defining an "iterating anchor" convergence control intrinsic is problematic, but luckily also unnecessary. The control intrinsics defined in the original proposal are sufficient. I hope that the discussion that led to those conclusions helps illustrate some aspects of the convergence control proposal for LLVM as well as the goals and principles that drove it.

March 07, 2022

They Say An Image Macro Conveys An Entire Day Of Shouting At The Computer


March 04, 2022

A quick reminder: libei is the library for emulated input. It comes as a pair of C libraries, libei for the client side and libeis for the server side.

libei has been sitting mostly untouched since the last status update. There are two use-cases we need to solve for input emulation in Wayland - the ability to emulate input (think xdotool, or Synergy/Barrier/InputLeap client) and the ability to capture input (think Synergy/Barrier/InputLeap server). The latter effectively blocked development in libei [1], until that use-case was sorted there wasn't much point investing too much into libei - after all it may get thrown out as a bad idea. And epiphanies were as elusive as toilet paper and RATs, so nothing much got done. This changed about a week or two ago when the required lightbulb finally arrived, pre-lit from the factory.

So, the solution to the input capturing use-case is going to be a so-called "passive context" for libei. In the traditional [2] "active context" approach for libei we have the EIS implementation in the compositor and a client using libei to connect to that. The compositor sets up a seat or more, then some devices within that seat that typically represent the available screens. libei then sends events through these devices, causing input to appear in the compositor which moves the cursor around. In a typical and simple use-case you'd get a 1920x1080 absolute pointer device and a keyboard with a $layout keymap, libei then sends events to position the cursor and/or happily type away on-screen.

In the "passive context" <deja-vu> approach for libei we have the EIS implementation in the compositor and a client using libei to connect to that. The compositor sets up a seat or more, then some devices within that seat </deja-vu> that typically represent the physical devices connected to the host computer. libei then receives events from these devices, causing input to be generated in the libei client. In a typical and simple use-case you'd get a relative pointer device and a keyboard device with a $layout keymap, the compositor then sends events matching the relative input of the connected mouse or touchpad.

The two notable differences are thus: events flow from EIS to libei and the devices don't represent the screen but rather the physical [3] input devices.

This changes libei from a library for emulated input to an input event transport layer between two processes. On a much higher level than e.g. evdev or HID and with more contextual information (seats, devices are logically abstracted, etc.). And of course, the EIS implementation is always in control of the events, regardless of which direction they flow. A compositor can implement an event filter or designate a key to break the connection to the libei client. In pseudocode, the compositor's input event processing function will look like this:

function handle_input_events():
    real_events = libinput.get_events()
    for e in real_events:
        if input_capture_active:
            send_event_to_passive_libei_client(e)
        else:
            process_event(e)

    emulated_events = eis.get_events_from_active_clients()
    for e in emulated_events:
        process_event(e)

Not shown here are the various appropriate filters and conversions in between (e.g. all relative events from libinput devices would likely be sent through the single relative device exposed on the EIS context). Again, the compositor is in control so it would be trivial to implement e.g. capturing of the touchpad only but not the mouse.
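To make the "capture the touchpad only" idea concrete, here is a small runnable sketch of that routing in Python. It is a toy model only: `Event`, `ToyCompositor` and the device names are hypothetical and do not correspond to any libei, libeis or libinput API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    device: str    # hypothetical device label, e.g. "mouse" or "touchpad"
    payload: str


@dataclass
class ToyCompositor:
    capture_devices: frozenset = frozenset()        # devices eligible for capture
    input_capture_active: bool = False
    sent_to_passive_client: list = field(default_factory=list)
    processed_locally: list = field(default_factory=list)

    def handle_input_events(self, real_events, emulated_events):
        for e in real_events:
            # The compositor stays in control: only the selected devices are
            # re-routed, and only while capture is active.
            if self.input_capture_active and e.device in self.capture_devices:
                self.sent_to_passive_client.append(e)
            else:
                self.processed_locally.append(e)
        # Events coming from active clients are processed like local input.
        for e in emulated_events:
            self.processed_locally.append(e)


comp = ToyCompositor(capture_devices=frozenset({"touchpad"}), input_capture_active=True)
comp.handle_input_events(
    real_events=[Event("touchpad", "dx=3"), Event("mouse", "dx=1")],
    emulated_events=[Event("emulated-pointer", "x=10,y=20")],
)
print([e.device for e in comp.sent_to_passive_client])
print([e.device for e in comp.processed_locally])
```

Only the touchpad event leaves the compositor; the mouse event and the emulated event are handled locally.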

In the current design, a libei context can only be active or passive, not both. The EIS context is both, it's up to the implementation to disconnect active or passive clients if it doesn't support those.

Notably, the above only caters for the transport of input events, it doesn't actually make any decision on when to capture events. This is handled by the CaptureInput XDG Desktop Portal [4]. The idea here is that an application like Synergy/Barrier/InputLeap server connects to the CaptureInput portal and requests a CaptureInput session. In that session it can define pointer barriers (left edge, right edge, etc.) and, in the future, maybe other triggers. In return it gets a libei socket that it can initialize a libei context from. When the compositor decides that the pointer barrier has been crossed, it re-routes the input events through the EIS context so they pop out in the application. Synergy/Barrier/InputLeap then converts that to the global position, passes it to the right remote Synergy/Barrier/InputLeap client and replays it there through an active libei context where it feeds into the local compositor.
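As a rough illustration of the trigger side, the sketch below models a left-edge or right-edge pointer barrier and the decision to start capturing. Everything here is an assumption made for illustration: the portal is only a draft, and `Barrier`, `crosses` and the coordinate convention (x from 0 to screen width minus 1) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Barrier:
    edge: str  # "left" or "right" (hypothetical encoding)


def crosses(barrier, x, dx, screen_width):
    """Would a relative motion of dx, starting at x, push the pointer past the barrier?"""
    new_x = x + dx
    if barrier.edge == "left":
        return new_x < 0
    if barrier.edge == "right":
        return new_x >= screen_width
    raise ValueError(f"unknown edge: {barrier.edge}")


def capture_should_start(barriers, x, dx, screen_width):
    # The compositor, not the application, makes this decision.
    return any(crosses(b, x, dx, screen_width) for b in barriers)


barriers = [Barrier("left")]
print(capture_should_start(barriers, x=5, dx=-10, screen_width=1920))
print(capture_should_start(barriers, x=1915, dx=10, screen_width=1920))
```

With only a left barrier defined, moving off the left edge triggers capture while moving off the right edge does not.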

Because the management of when to capture input is handled by the portal and the respective backends, it can be natively integrated into the UI. Because the actual input events are a direct flow between compositor and application, the latency should be minimal. Because it's a high-level event library, you don't need to care about hardware-specific details (unlike, say, the inputfd proposal from 2017). Because the negotiation of when to capture input is through the portal, the application itself can run inside a sandbox. And because libei only handles the transport layer, compositors that don't want to support sandboxes can set up their own negotiation protocol.

So overall, right now this seems like a workable solution.

[1] "blocked" is probably overstating it a bit but no-one else tried to push it forward, so..
[2] "traditional" is probably overstating it for a project that's barely out of alpha development
[3] "physical" is probably overstating it since it's likely to be a logical representation of the types of inputs, e.g. one relative device for all mice/touchpads/trackpoints
[4] "handled by" is probably overstating it since at the time of writing the portal is merely a draft of an XML file

Blogging: That Thing I Forgot About

Yeah, my b, I forgot this was a thing.

Fuck it though, I’m a professional, so I’m gonna pretend I didn’t just skip a month of blogs and get right back into it.


Gallivm is the nir/tgsi-to-llvm translation layer in Gallium that LLVMpipe (and thus Lavapipe) uses to generate the JIT functions which make triangles. It’s very old code in that it predates me knowing how triangles work, but that doesn’t mean it doesn’t have bugs.

And Gallivm bugs are the worst bugs.

For a long time, I’ve had SIGILL crashes on exactly one machine locally for the CTS glob*sampler2D_samplerCube*. These tests pass on everyone else’s machines including CI.

Like I said, Gallivm bugs are the worst bugs.


How does one debug JIT code? GDB can’t be used, valgrind doesn’t work, and, despite what LLVM developers would tell you, building an assert-enabled LLVM doesn’t help at all in most cases here since that will only catch invalid behavior, not questionably valid behavior that very obviously produces invalid results.

So we enter the world of lp_build_print debugging. Much like standard printf debugging, the strategy here is to just lp_build_print_value or lp_build_printf("I hate this part of the shader too") our way to figuring out where in the shader the crash occurs.

Here’s an example shader from dEQP-GLES31.functional.program_uniform.by_pointer.render.basic_struct.sampler2D_samplerCube_vertex that crashes:

#version 310 es
in highp vec4 a_position;
out mediump float v_vtxOut;

struct structType
{
	mediump sampler2D m0;
	mediump samplerCube m1;
};
uniform structType u_var;

mediump float compare_float    (mediump float a, mediump float b)  { return abs(a - b) < 0.05 ? 1.0 : 0.0; }
mediump float compare_vec4     (mediump vec4 a, mediump vec4 b)    { return compare_float(a.x, b.x)*compare_float(a.y, b.y)*compare_float(a.z, b.z)*compare_float(a.w, b.w); }

void main (void)
{
	gl_Position = a_position;
	v_vtxOut = 1.0;
	v_vtxOut *= compare_vec4(texture(u_var.m0, vec2(0.0)), vec4(0.15, 0.52, 0.26, 0.35));
	v_vtxOut *= compare_vec4(texture(u_var.m1, vec3(0.0)), vec4(0.88, 0.09, 0.30, 0.61));
}

Which, in llvmpipe NIR, is:

source_sha1: {0xcb00c93e, 0x64db3b0f, 0xf4764ad3, 0x12b69222, 0x7fb42437}
inputs: 1
outputs: 2
uniforms: 0
shared: 0
ray queries: 0
decl_var uniform INTERP_MODE_NONE sampler2D lower@u_var.m0 (0, 0, 0)
decl_var uniform INTERP_MODE_NONE samplerCube lower@u_var.m1 (0, 0, 1)
decl_function main (0 params)

impl main {
	block block_0:
	/* preds: */
	vec1 32 ssa_0 = deref_var &a_position (shader_in vec4) 
	vec4 32 ssa_1 = intrinsic load_deref (ssa_0) (access=0)
	vec1 16 ssa_2 = load_const (0xb0cd = -0.150024)
	vec1 16 ssa_3 = load_const (0x2a66 = 0.049988)
	vec1 16 ssa_4 = load_const (0xb829 = -0.520020)
	vec1 16 ssa_5 = load_const (0xb429 = -0.260010)
	vec1 16 ssa_6 = load_const (0xb59a = -0.350098)
	vec1 16 ssa_7 = load_const (0xbb0a = -0.879883)
	vec1 16 ssa_8 = load_const (0xadc3 = -0.090027)
	vec1 16 ssa_9 = load_const (0xb4cd = -0.300049)
	vec1 16 ssa_10 = load_const (0xb8e1 = -0.609863)
	vec2 32 ssa_13 = load_const (0x00000000, 0x00000000) = (0.000000, 0.000000)
	vec1 32 ssa_49 = load_const (0x00000000 = 0.000000)
	vec4 16 ssa_14 = (float16)txl ssa_13 (coord), ssa_49 (lod), 0 (texture), 0 (sampler)
	vec1 16 ssa_15 = fadd ssa_14.x, ssa_2
	vec1 16 ssa_16 = fabs ssa_15
	vec1 16 ssa_17 = fadd ssa_14.y, ssa_4
	vec1 16 ssa_18 = fabs ssa_17
	vec1 16 ssa_19 = fadd ssa_14.z, ssa_5
	vec1 16 ssa_20 = fabs ssa_19
	vec1 16 ssa_21 = fadd ssa_14.w, ssa_6
	vec1 16 ssa_22 = fabs ssa_21
	vec1 16 ssa_23 = fmax ssa_16, ssa_18
	vec1 16 ssa_24 = fmax ssa_23, ssa_20
	vec1 16 ssa_25 = fmax ssa_24, ssa_22
	vec3 32 ssa_27 = load_const (0x00000000, 0x00000000, 0x00000000) = (0.000000, 0.000000, 0.000000)
	vec1 32 ssa_50 = load_const (0x00000000 = 0.000000)
	vec4 16 ssa_28 = (float16)txl ssa_27 (coord), ssa_50 (lod), 1 (texture), 1 (sampler)
	vec1 16 ssa_29 = fadd ssa_28.x, ssa_7
	vec1 16 ssa_30 = fabs ssa_29
	vec1 16 ssa_31 = fadd ssa_28.y, ssa_8
	vec1 16 ssa_32 = fabs ssa_31
	vec1 16 ssa_33 = fadd ssa_28.z, ssa_9
	vec1 16 ssa_34 = fabs ssa_33
	vec1 16 ssa_35 = fadd ssa_28.w, ssa_10
	vec1 16 ssa_36 = fabs ssa_35
	vec1 16 ssa_37 = fmax ssa_30, ssa_32
	vec1 16 ssa_38 = fmax ssa_37, ssa_34
	vec1 16 ssa_39 = fmax ssa_38, ssa_36
	vec1 16 ssa_40 = fmax ssa_25, ssa_39
	vec1 32 ssa_41 = flt32 ssa_40, ssa_3
	vec1 32 ssa_42 = b2f32 ssa_41
	vec1 32 ssa_43 = deref_var &gl_Position (shader_out vec4) 
	intrinsic store_deref (ssa_43, ssa_1) (wrmask=xyzw /*15*/, access=0)
	vec1 32 ssa_44 = deref_var &v_vtxOut (shader_out float) 
	intrinsic store_deref (ssa_44, ssa_42) (wrmask=x /*1*/, access=0)
	/* succs: block_1 */
	block block_1:
}
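A side note on reading the dump: the 16-bit `load_const` values are half-precision floats holding the negated reference colors and the 0.05 threshold, and the chain of `fadd`/`fabs`/`fmax` followed by `flt32` is the product of `compare_float` calls folded into a single "largest absolute difference below the threshold" test. The Python sketch below decodes two of the constants and checks that the two formulations agree on a few sample inputs; `half_from_bits` and `folded_compare` are names invented here, not Mesa functions.

```python
import struct


def half_from_bits(bits):
    """Interpret a 16-bit integer as an IEEE-754 half-precision float."""
    return struct.unpack("<e", bits.to_bytes(2, "little"))[0]


def compare_float(a, b):
    return 1.0 if abs(a - b) < 0.05 else 0.0


def compare_vec4(a, b):
    # Same shape as the GLSL helper: product of the per-component results.
    result = 1.0
    for x, y in zip(a, b):
        result *= compare_float(x, y)
    return result


def folded_compare(a, b):
    # Shape of the NIR: add the negated reference, fabs, fmax everything, one flt.
    worst = max(abs(x + (-y)) for x, y in zip(a, b))
    return 1.0 if worst < 0.05 else 0.0


print(half_from_bits(0xB0CD))  # ssa_2, the negated 0.15 reference
print(half_from_bits(0x2A66))  # ssa_3, the 0.05 threshold

reference = (0.15, 0.52, 0.26, 0.35)
samples = [
    (0.15, 0.52, 0.26, 0.35),   # exact match
    (0.16, 0.50, 0.27, 0.36),   # within tolerance
    (0.15, 0.52, 0.26, 0.75),   # one component off
]
for texel in samples:
    print(compare_vec4(texel, reference), folded_compare(texel, reference))
```

The product form and the folded form agree because the product is 1.0 exactly when every component passes, which is the same as the largest difference passing.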

There’s two sample ops (txl), and since these tests only do simple texture() calls, it seems reasonable to assume that one of them is causing the crash. Sticking a lp_build_print_value on the texel values fetched by the sample operations will reveal whether the crash occurs before or after them.

What output does this yield?

Test case 'dEQP-GLES31.functional.program_uniform.by_pointer.render.basic_struct.sampler2D_samplerCube_vertex'..
texel 1.43279037e-322 6.95333598e-310 0 1.43279037e-322 1.08694442e-322 1.43279037e-322 1.08694442e-322 0
texel 1.43279037e-322 6.95333598e-310 0 1.43279037e-322 1.08694442e-322 1.43279037e-322 1.08694442e-322 0
texel 1.43279037e-322 6.95333598e-310 0 1.43279037e-322 1.08694442e-322 1.43279037e-322 1.08694442e-322 0
texel 1.43279037e-322 6.95333598e-310 0 1.43279037e-322 1.08694442e-322 1.43279037e-322 1.08694442e-322 0
[1]    3500332 illegal hardware instruction (core dumped)

Each txl op fetches four values, which means this is the result from the first instruction, but the second one isn’t reached before the crash. Unsurprisingly, the second one is also the cube sampling instruction, which makes sense given that all the crashes of this type I get are from cube sampling tests.

Now that it’s been determined the second txl is causing the crash, it’s reasonable to assume that the construction of that sampling op is the cause rather than the op itself, as proven by sticking some simple lp_build_printf("What am I doing with my life") calls in just before that op. Indeed, as the printfs confirm, I’m still questioning the life choices that led me to this point, so it’s now proven that the txl instruction itself is the problem.

Cube sampling has a lot of complex math involved for face selection, and I’ve spent a lot of time in there recently. My first guess was that the cube coordinates were bogus. Printing them yielded results:

Test case 'dEQP-GLES31.functional.program_uniform.by_pointer.render.basic_struct.sampler2D_samplerCube_vertex'..
texel 6.9008994e-310 0 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617
texel 6.9008994e-310 0 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617
texel 6.9008994e-310 0 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617
texel 6.9008994e-310 0 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617
cubecoords nan nan nan nan nan nan nan nan
cubecoords nan nan nan nan nan nan nan nan

These cube coords have more NaNs than a 1960s Batman TV series, so it looks like I was right in my hunch. Printing the cube S-face value next yields more NaNs. My printf search continued a couple more iterations until I wound up at this function:

static LLVMValueRef
lp_build_cube_imapos(struct lp_build_context *coord_bld, LLVMValueRef coord)
{
   /* ima = +0.5 / abs(coord); */
   LLVMValueRef posHalf = lp_build_const_vec(coord_bld->gallivm, coord_bld->type, 0.5);
   LLVMValueRef absCoord = lp_build_abs(coord_bld, coord);
   LLVMValueRef ima = lp_build_div(coord_bld, posHalf, absCoord);
   return ima;
}

Immediately, all of us multiverse-brain engineers spot something suspicious: this has a division operation with a user-provided divisor. Printing absCoord here yielded all zeroes, which was about where my remaining energy was at this Friday morning, so I mangled the code slightly:

static LLVMValueRef
lp_build_cube_imapos(struct lp_build_context *coord_bld, LLVMValueRef coord)
{
   /* ima = +0.5 / abs(coord); */
   LLVMValueRef posHalf = lp_build_const_vec(coord_bld->gallivm, coord_bld->type, 0.5);
   LLVMValueRef absCoord = lp_build_abs(coord_bld, coord);
   /* avoid div by zero */
   LLVMValueRef sel = lp_build_cmp(coord_bld, PIPE_FUNC_GREATER, absCoord, coord_bld->zero);
   LLVMValueRef div = lp_build_div(coord_bld, posHalf, absCoord);
   LLVMValueRef ima = lp_build_select(coord_bld, sel, div, coord_bld->zero);
   return ima;
}

And blammo, now that Gallivm could no longer divide by zero, the test was now passing. And so were a lot of others.
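For anyone wondering where the NaNs came from, here is a plain-Python model of the two variants. It is only a sketch under assumptions: `ieee_div` emulates IEEE-754 division (Python itself raises on float division by zero), and the step where the face coordinate is formed by multiplying the coordinate with `ima` is an assumption about the surrounding cube math, not a quote of the Gallivm code.

```python
import math


def ieee_div(a, b):
    """IEEE-754 style division: x/0 gives +/-inf (nan for 0/0) instead of raising."""
    if b == 0.0:
        if a == 0.0 or math.isnan(a):
            return math.nan
        return math.copysign(math.inf, a) * math.copysign(1.0, b)
    return a / b


def imapos_unguarded(coords):
    # ima = +0.5 / abs(coord), one value per vector lane
    return [ieee_div(0.5, abs(c)) for c in coords]


def imapos_guarded(coords):
    out = []
    for c in coords:
        abs_coord = abs(c)
        div = ieee_div(0.5, abs_coord)                 # still computed for every lane
        out.append(div if abs_coord > 0.0 else 0.0)    # select(abs > 0, div, 0)
    return out


coords = [0.0, 2.0]
unguarded = imapos_unguarded(coords)
guarded = imapos_guarded(coords)

# Assumed follow-up step: scaling the coordinate by ima.
print([c * i for c, i in zip(coords, unguarded)])  # the zero lane becomes nan (0 * inf)
print([c * i for c, i in zip(coords, guarded)])    # the zero lane stays 0.0
```

The select does not skip the division, it only discards the result for lanes where the divisor is zero, which matches the shape of the patched function.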


There’s been some speculation about how close Zink really is to being “useful”, where “useful” is determined by the majesty of passing GL4.6 CTS.

So how close is it? The answer might shock you.

Remaining Lavapipe Fails: 17

  • KHR-GL46.gpu_shader_fp64.builtin.mod_dvec2,Fail
  • KHR-GL46.gpu_shader_fp64.builtin.mod_dvec3,Fail
  • KHR-GL46.gpu_shader_fp64.builtin.mod_dvec4,Fail
  • KHR-GL46.pipeline_statistics_query_tests_ARB.functional_primitives_vertices_submitted_and_clipping_input_output_primitives,Fail
  • KHR-GL46.tessellation_shader.single.isolines_tessellation,Fail
  • KHR-GL46.tessellation_shader.tessellation_control_to_tessellation_evaluation.data_pass_through,Fail
  • KHR-GL46.tessellation_shader.tessellation_invariance.invariance_rule3,Fail
  • KHR-GL46.tessellation_shader.tessellation_shader_point_mode.points_verification,Fail
  • KHR-GL46.tessellation_shader.tessellation_shader_quads_tessellation.degenerate_case,Fail
  • KHR-GL46.tessellation_shader.tessellation_shader_tessellation.gl_InvocationID_PatchVerticesIn_PrimitiveID,Fail
  • KHR-GL46.tessellation_shader.vertex.vertex_spacing,Fail
  • KHR-GL46.texture_barrier.disjoint-texels,Fail
  • KHR-GL46.texture_barrier.overlapping-texels,Fail
  • KHR-GL46.texture_barrier_ARB.disjoint-texels,Fail
  • KHR-GL46.texture_barrier_ARB.overlapping-texels,Fail
  • KHR-GL46.texture_swizzle.functional,Fail
  • KHR-GL46.tessellation_shader.tessellation_shader_quads_tessellation.inner_tessellation_level_rounding,Crash

Remaining ANV Fails (Icelake): 9

  • KHR-GL46.pipeline_statistics_query_tests_ARB.functional_primitives_vertices_submitted_and_clipping_input_output_primitives,Fail
  • KHR-GL46.tessellation_shader.single.isolines_tessellation,Fail
  • KHR-GL46.tessellation_shader.tessellation_control_to_tessellation_evaluation.data_pass_through,Fail
  • KHR-GL46.tessellation_shader.tessellation_invariance.invariance_rule3,Fail
  • KHR-GL46.tessellation_shader.tessellation_shader_point_mode.points_verification,Fail
  • KHR-GL46.tessellation_shader.tessellation_shader_quads_tessellation.degenerate_case,Fail
  • KHR-GL46.tessellation_shader.tessellation_shader_quads_tessellation.inner_tessellation_level_rounding,Fail
  • KHR-GL46.tessellation_shader.tessellation_shader_tessellation.gl_InvocationID_PatchVerticesIn_PrimitiveID,Fail
  • KHR-GL46.tessellation_shader.vertex.vertex_spacing,Fail

Big Triangle better keep a careful eye on us now.

February 17, 2022

Around 2 years ago while I was working on tessellation support for llvmpipe, and running the heaven benchmark on my Ryzen, I noticed that heaven despite running slowly wasn't saturating all the cores. I dug in a bit, and found that llvmpipe despite threading rasterization, fragment shading and blending stages, never did anything else while those were happening.

I dug into the code as I clearly remembered seeing a concept of a "scene" where all the primitives were binned into and then dispatched. It turned out the "scene" was always executed synchronously.

At the time I wrote support to allow multiple scenes to exist, so while one scene was executing the vertex shading and binning for the next scene could execute, and it would be queued up. For heaven at the time I saw some places where it would build 36 scenes. However heaven was still 1fps with tess, and regressions in other areas were rampant, and I mostly left them in a branch.

The reason so many things were broken by the patches was that large parts of llvmpipe, and also lavapipe, weren't ready for the async pipeline processing. The concept of a fence after the pipeline finished was there, but wasn't used properly everywhere. A lot of operations assumed there was nothing going on behind the scenes so never fenced. Lots of things like queries broke due to the fact that a query would always be ready in the old model, but now query availability could return unavailable like a real hw driver. Resource tracking existed but was incomplete, so knowing when to flush wasn't always accurate. Presentation was broken due to incorrect waiting both for GL and Lavapipe. Lavapipe needed semaphore support that actually did things as apps used it between the render and present pipeline pieces.
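The behavioral change is easy to model outside of Mesa. The toy below is a sketch, not llvmpipe code: `ToyPipeline` is a made-up class in which "binning" runs on the calling thread, "rasterization" runs on a single worker, and the returned future plays the role of the fence. It shows the two things that broke callers: a second scene can be binned while the first is still in flight, and a query may legitimately report "not available yet".

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ToyPipeline:
    def __init__(self, rasterize):
        self._worker = ThreadPoolExecutor(max_workers=1)
        self._rasterize = rasterize

    def submit_scene(self, primitives):
        binned = sorted(primitives)   # stand-in for vertex shading + binning
        # Rasterization is queued; the Future acts as the scene's fence.
        return self._worker.submit(self._rasterize, binned)

    def query_available(self, fence):
        return fence.done()           # may be False now, unlike the old sync model

    def query_result(self, fence):
        return fence.result()         # waits on the fence before reading

    def shutdown(self):
        self._worker.shutdown()


gate = threading.Event()


def slow_rasterize(binned):
    gate.wait()                       # hold the worker until the main thread allows it
    return len(binned)


pipe = ToyPipeline(slow_rasterize)
fence_a = pipe.submit_scene([3, 1, 2])
fence_b = pipe.submit_scene([5, 4])   # binned while scene A is still pending
available_early = pipe.query_available(fence_a)
gate.set()
results = (pipe.query_result(fence_a), pipe.query_result(fence_b))
pipe.shutdown()
print(available_early, results)
```

Any code path that reads results without going through the fence, as the old synchronous model allowed, would see stale or missing data in this model too.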

Mesa CI recently got some paraview traces added to it, and I was doing some perf traces with them. Paraview is a data visualization tool, and it generates vertex heavy workloads, as opposed to compositors and even games. It turned out binning was most of the overhead, and I realized the overlapping series could help this sort of workload. I dusted off the patch series and nailed down all the issues.

Emma Anholt ran some benchmarks on the results with the paraview traces and got

  • pv-waveletvolume fps +13.9279% +/- 4.91667% (n=15)
  • pv-waveletcountour fps +67.8306% +/- 11.4762% (n=3)
which seems like a good return on the investment.

I've got it all lined up in a merge request and it doesn't break CI anymore, so hopefully I'll get it landed in the next while, once I clean up any misc bits.

February 16, 2022

Earlier this week, Neil McGovern announced that he is due to be stepping down as the Executive Director of the GNOME Foundation later this year. As the President of the board and Neil’s effective manager together with the Executive Committee, I wanted to take a moment to reflect on his achievements in the past 5 years and explain a little about what the next steps would be.

Since joining in 2017, Neil has overseen a productive period of growth and maturity for the Foundation, increasing our influence both within the GNOME project and the wider Free and Open Source Software community. Here’s a few highlights of what he’s achieved together with the Foundation team and the community:

  • Improved public perception of GNOME as a desktop and GTK as a development platform, helping to align interests between key contributors and wider ecosystem stakeholders and establishing an ongoing collaboration with KDE around the Linux App Summit.
  • Worked with the board to improve the maturity of the board itself and allow it to work at a more strategic level, instigating staggered two-year terms for directors providing much-needed stability, and established the Executive and Finance committees to handle specific topics and the Governance committee to take a longer-term look at the board’s composition and capabilities.
  • Arranged 3 major grants to the Foundation totaling $2M and raised a further $250k through targeted fundraising initiatives.
  • Grown the Foundation team to its largest ever size, investing in staff development, and established ongoing direct contributions to GNOME, GTK and Flathub by Foundation staff and contractors.
  • Launched and incubated Flathub as an inclusive and sustainable ecosystem for Linux app developers to engage directly with their users, and delivered the Community Engagement Challenge to invest in the sustainability of our contributor base – the Foundation’s largest and most substantial programs outside of GNOME itself since Outreachy.
  • Achieved a fantastic resolution for GNOME and the wider community, by negotiating a settlement which protects FOSS developers from patent enforcement by the Rothschild group of non-practicing entities.
  • Stood for a diverse and inclusive Foundation, implementing a code of conduct for GNOME events and online spaces, establishing our first code of conduct committee and updating the bylaws to be gender-neutral.
  • Established the GNOME Circle program together with the board, broadening the membership base of the foundation by welcoming app and library developers from the wider ecosystem.

Recognizing and appreciating the amazing progress that GNOME has made with Neil’s support, the search for a new Executive Director provides the opportunity for the Foundation board to set the agenda and next high-level goals we’d like to achieve together with our new Executive Director.

In terms of the desktop, applications, technology, design and development processes, whilst there are always improvements to be made, the board’s general feeling is that thanks to the work of our amazing community of contributors, GNOME is doing very well in terms of what we produce and publish. Recent desktop releases have looked great, highly polished and well-received, and the application ecosystem is growing and improving through new developers and applications bringing great energy at the moment. From here, our largest opportunity in terms of growing the community and our user base is being able to articulate the benefits of what we’ve produced to a wider public audience, and deliver impact which allows us to secure and grow new and sustainable sources of funding.

For individuals, we are able to offer an exceedingly high quality desktop experience and a broad range of powerful applications which are affordable to all, backed by a nonprofit which can be trusted to look after your data, digital security and your best interests as an individual. From the perspective of being a public charity in the US, we also have the opportunity to establish programs that draw upon our community, technology and products to deliver impact such as developing employable skills, incubating new Open Source contributors, learning to program and more.

For our next Executive Director, we will be looking for an individual with existing experience in that nonprofit landscape, ideally with prior experience establishing and raising funds for programs that deliver impact through technology, and appreciation for the values that bring people to Free, Open Source and other Open Culture organizations. Working closely with the existing members, contributors, volunteers and whole GNOME community, and managing our relationships with the Advisory Board and other key partners, we hope to find a candidate that can build public awareness and help people learn about, use and benefit from what GNOME has built over the past two decades.

Neil has agreed to stay in his position for a 6 month transition period, during which he will support the board in our search for a new Executive Director and support a smooth hand-over. Over the coming weeks we will publish the job description for the new ED, and establish a search committee who will be responsible for sourcing and interviewing candidates to make a recommendation to the board for Neil’s successor – a hard act to follow!

I’m confident the community will join me and the board in personally thanking Neil for his 5 years of dedicated service in support of GNOME and the Foundation. Should you have any queries regarding the process, or offers of assistance in the coming hiring process, please don’t hesitate to join the discussion or reach out directly to the board.

February 15, 2022

After roughly 20 years and counting up to 0.40 in release numbers, I've decided to call the next version of the xf86-input-wacom driver the 1.0 release. [1] This cycle has seen a bulk of development (>180 patches) which is roughly as much as the last 12 releases together. None of these patches actually added user-visible features, so let's talk about technical debt and what turned out to be an interesting way of reducing it.

The wacom driver's git history goes back to 2002 and the current batch of maintainers (Ping, Jason and I) have all been working on it for one to two decades. It used to be a Wacom-only driver but with the improvements made to the kernel over the years the driver should work with most tablets that have a kernel driver, albeit some of the more quirky niche features will be more limited (but your non-Wacom devices probably don't have those features anyway).

The one constant was always: the driver was extremely difficult to test, something common to all X input drivers. Development is a cycle of restarting the X server a billion times, testing is mostly plugging hardware in and moving things around in the hope that you can spot the bugs. On a driver that doesn't move much, this isn't necessarily a problem. Until a bug comes along that requires some core rework of the event handling - in the kernel, libinput and, yes, the wacom driver.

After years of libinput development, I wasn't really in the mood for the whole "plug every tablet in and test it, for every commit". In a rather caffeine-driven development cycle [2], the driver was separated into two logical entities: the core driver and the "frontend". The default frontend is the X11 one which is now a relatively thin layer around the core driver parts, primarily to translate events into the X Server's API. So, not unlike libinput + xf86-input-libinput in terms of architecture. In ascii-art:

                   +--------------------+  | big giant
/dev/input/event0->| core driver | x11  |->| X server
                   +--------------------+  | process

Now, that logical separation means we can have another frontend which I implemented as a relatively light GObject wrapper and is now a library creatively called libgwacom:

                   +-----------------------+  |
/dev/input/event0->| core driver | gwacom  |--| tools or test suites
                   +-----------------------+  |

This isn't a public library or API and it's very much focused on the needs of the X driver so there are some peculiarities in there. What it allows us though is a new wacom-record tool that can hook onto event nodes and print the events as they come out of the driver. So instead of having to restart X and move and click things, you get this:

$ ./builddir/wacom-record
version: 0.99.2
git: xf86-input-wacom-0.99.2-17-g404dfd5a
path: /dev/input/event6
name: "Wacom Intuos Pro M Pen"
- source: 0
event: new-device
name: "Wacom Intuos Pro M Pen"
type: stylus
keys: true
is-absolute: true
is-direct-touch: false
ntouches: 0
naxes: 6
- {type: x , range: [ 0, 44800], resolution: 200000}
- {type: y , range: [ 0, 29600], resolution: 200000}
- {type: pressure , range: [ 0, 65536], resolution: 0}
- {type: tilt_x , range: [ -64, 63], resolution: 57}
- {type: tilt_y , range: [ -64, 63], resolution: 57}
- {type: wheel , range: [ -900, 899], resolution: 0}
- source: 0
mode: absolute
event: motion
mask: [ "x", "y", "pressure", "tilt-x", "tilt-y", "wheel" ]
axes: { x: 28066, y: 17643, pressure: 0, tilt: [ -4, 56], rotation: 0, throttle: 0, wheel: -108, rings: [ 0, 0] }

This is YAML which means we can process the output for comparison or just to search for things.

A tool to quickly analyse data makes for faster development iterations but it's still a far cry from reliable regression testing (and writing a test suite is a daunting task at best). But one nice thing about GObject is that it's accessible from other languages, including Python. So our test suite can be in Python, using pytest and all its capabilities, plus all the advantages Python has over C. Most of driver testing comes down to: create a uinput device, set up the driver with some options, push events through that device and verify they come out of the driver in the right sequence and format. I don't need C for that. So there's a pull request sitting out there doing exactly that - adding a pytest test suite for a 20-year old X driver written in C. That this is a) possible and b) a lot less work than expected got me quite unreasonably excited. If you do have to maintain an old C library, maybe consider whether it's possible to do the same because there's nothing like the warm fuzzy feeling a green tick on a CI pipeline gives you.
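To give a feel for the shape of such a test, here is a minimal sketch. It does not use the real libgwacom bindings or uinput; FakeDevice and FakeDriver are hypothetical stand-ins invented for illustration, and the pressure ranges are made-up numbers. The point is only the pattern: push events in, check what comes out.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDevice:
    """Hypothetical stand-in for a uinput device: it only queues raw events."""
    queue: list = field(default_factory=list)

    def push(self, axis, value):
        self.queue.append((axis, value))


@dataclass
class FakeDriver:
    """Hypothetical stand-in for the driver core.

    The only 'driver logic' here is scaling pressure from the device range
    into the driver's output range; both ranges are assumptions.
    """
    device_pressure_max: int = 8192
    driver_pressure_max: int = 65536

    def process(self, device):
        out = []
        for axis, value in device.queue:
            if axis == "pressure":
                value = value * self.driver_pressure_max // self.device_pressure_max
            out.append({"event": "motion", "axis": axis, "value": value})
        device.queue.clear()
        return out


def test_pressure_is_scaled():
    dev = FakeDevice()
    drv = FakeDriver()
    dev.push("x", 100)
    dev.push("pressure", 4096)

    events = drv.process(dev)

    # Events must come out in the order they went in...
    assert [e["axis"] for e in events] == ["x", "pressure"]
    # ...and pressure must be scaled into the driver's range.
    assert events[1]["value"] == 32768
```

pytest picks up any function named test_*, so a file like this runs with a plain `pytest` invocation and no hardware plugged in.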

[1] As scholars of version numbers know, they make as much sense as your stereotypical uncle's facebook opinion, so why not.
[2] The Colombian GDP probably went up a bit

February 05, 2022

(I nearly went with clutterectomy, but that would be doing our old servant project a disservice.)

Yesterday, I finally merged the work-in-progress branch porting totem to GStreamer's GTK GL sink widget, undoing a lot of the work done in 2011 and 2014 to port the video widget and then to finally make use of its features.

But GTK has been modernised (in GTK3 but in GTK4 even more so), GStreamer grew a collection of GL plugins, Wayland and VA-API matured and clutter (and its siblings clutter-gtk, and clutter-gst) didn't get the resources they needed to follow.

A screenshot with practically no changes, as expected

The list of bug fixes and enhancements is substantial:

  • Makes some files that threw shader warnings playable
  • Fixes resize lag for the widgets embedded in the video widget
  • Fixes interactions with widgets on some HDR capable systems, or even widgets disappearing sometimes (!)
  • Gets rid of the floating blank windows under Wayland
  • Should help with tearing, although that's highly dependent on the system
  • Hi-DPI support
  • Hardware acceleration (through libva)

Until the port to GTK4, we expect an overall drop in performance on systems where there's no VA-API support, and the GTK4 port should bring it to par with the fastest of players available for GNOME.

You can install a Preview version right now by running:

$ flatpak install --user

and filing bugs in the GNOME GitLab.

Next stop, a GTK4 port!

February 03, 2022


I always do one of these big roundups for each Mesa release, so here’s what you can expect to see from zink in the upcoming release:

  • fewer hangs on RADV
  • massively improved usability on NVIDIA
  • greatly improved performance with unsupported texture download formats (e.g., CS:GO, L4D2)
  • more extensions: ARB_sparse_texture, ARB_sparse_texture2, ARB_sparse_texture_clamp, EXT_memory_object, EXT_memory_object_fd, GL_EXT_semaphore, GL_EXT_semaphore_fd
  • ~1000% improved glxgears performance (be sure to run with --i-know-this-is-not-a-benchmark to see the real speed)
  • tons and tons and tons of bug fixes

All around looking like another great release.

I Hate gl_PointSize And So Can You

Yes, we’re here.

After literally years of awfulness, I’ve finally solved (for good) the debacle that is point size conversion from GL to Vulkan.

What’s so awful about it, you might be asking. How hard can it be to just add gl_PointSize to a shader, you follow up with as you push your glasses higher up your nose.

Allow me to explain.

In Vulkan, there is exactly one method for setting the size of points: the gl_PointSize shader output controls it, and that’s it.

In OpenGL (core profile):

  • 14.4 Points If program point size mode is enabled, the derived point size is taken from the (potentially clipped) shader built-in gl_PointSize written by the last vertex processing stage and clamped to the implementation-dependent point size range. If the value written to gl_PointSize is less than or equal to zero, or if no value was written to gl_PointSize, results are undefined. If program point size mode is disabled, the derived point size is specified with the command

    void PointSize( float size );

  • Tessellation Evaluation Shader Outputs Tessellation evaluation shaders have a number of built-in output variables used to pass values to equivalent built-in input variables read by subsequent shader stages or to subsequent fixed functionality vertex processing pipeline stages. These variables are gl_Position, gl_PointSize, gl_ClipDistance, and gl_CullDistance, and all behave identically to equivalently named vertex shader outputs.
  • Geometry Shader Outputs The built-in output gl_PointSize, if written, holds the size of the point to be rasterized, measured in pixels

In short, if PROGRAM_POINT_SIZE is enabled, then points are sized based on the gl_PointSize shader output of the last vertex stage.

In OpenGL ES (versions 2.0, 3.0, 3.1):

  • (3.3 | 3.4 | 13.3) Points The point size is taken from the shader built-in gl_PointSize written by the vertex shader, and clamped to the implementation-dependent point size range.

In OpenGL ES (version 3.2):

  • 13.5 Points The point size is determined by the last vertex processing stage. If the last vertex processing stage is not a vertex shader, the point size is 1.0. If the last vertex processing stage is a vertex shader, the point size is taken from the shader built-in gl_PointSize written by the vertex shader, and is clamped to the implementation-dependent point size range.

Thus for an ES context, the point size always comes from the last vertex stage, which means it can be anything it wants to be if that stage is a vertex shader and cannot be written to for all other stages because it is not a valid output (this last, bolded part is going to be really funny in a minute or two).

What do the specs agree on?

  • If a vertex shader is the last vertex stage, it can write gl_PointSize

Literally that’s it.



As we know, Vulkan has a very simple and clearly defined model for point size:

The point size is taken from the (potentially clipped) shader built-in PointSize written by:
• the geometry shader, if active;
• the tessellation evaluation shader, if active and no geometry shader is active;
• the vertex shader, otherwise
- 27.10. Points

It really can be that simple.

So one would think that we can just hook up some conditionals based on the GL rules and then export the correct value.

That would be easy.


It would make sense.




It gets worse (obviously).

gl_PointSize is a valid XFB varying, which means it must be exported correctly to the transform feedback buffer. For the ES case, it’s simple, but for desktop GL, there’s a little something called PROGRAM_POINT_SIZE state which totally fucks that up. Because, as we know, Vulkan has exactly one way of setting point size, and it’s the shader variable.

Thus, if there is a desktop GL context using a vertex shader as its last vertex stage for a draw, and if that shader has its own gl_PointSize value, this value must be exported for XFB.

But not used for point rasterization.
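The rules quoted so far can be condensed into a small decision function. This is only a sketch of the spec text as quoted in this post, not zink's actual code: the function names, the string constants and the parameters are all made up for illustration, and clamping to the implementation-dependent range is left out.

```python
VERTEX = "vertex"


def rasterized_point_size(api, last_stage, shader_value,
                          program_point_size=False, state_size=1.0):
    """Return the point size the rasterizer should end up using.

    api                -- "gl" (desktop core profile) or "gles" (hypothetical labels)
    last_stage         -- last vertex processing stage, e.g. "vertex", "geometry"
    shader_value       -- what that stage wrote to gl_PointSize, or None
    program_point_size -- desktop GL PROGRAM_POINT_SIZE state
    state_size         -- value set with PointSize() on desktop GL
    """
    if api == "gl":
        if program_point_size:
            # Undefined by the spec if the shader wrote nothing (None here).
            return shader_value
        return state_size
    if api == "gles":
        if last_stage != VERTEX:
            # ES 3.2: not a vertex shader, so the size is 1.0.
            return 1.0
        return shader_value
    raise ValueError(f"unknown api: {api}")


def xfb_point_size(shader_value):
    """Transform feedback captures whatever the shader wrote, regardless of state."""
    return shader_value


# Desktop GL, PROGRAM_POINT_SIZE disabled, vertex shader writes 8.0:
# XFB sees 8.0, but the points are rasterized with the PointSize() state.
assert xfb_point_size(8.0) == 8.0
assert rasterized_point_size("gl", VERTEX, 8.0, state_size=2.0) == 2.0
```

Since Vulkan only has the shader output, both of those values have to come out of the same shader somehow, which is where the awfulness starts.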

It’s Actually Even Worse Than That

…Because in order to pass CTS for ES 3.2, your implementation also has to be able to violate spec.

Remember above when I said it was going to be funny that gl_PointSize is not a legal output for non-vertex stages in ES contexts?

CTS explicitly has “wide points” tests which verify illegal point sizes that are exported by the tessellation and geometry shader stages. Isn’t that cool?

Also, let’s be reasonable people for a moment, who actually wants a point that’s just one pixel? Nobody can see that on their 8k display.

To Sum Up

I hate GL point size, and so should you.

February 02, 2022

Checking In

I keep meaning to blog, but then I get sidetracked by not blogging.

Truly a tough life.

So what’s new in zink-land?

Nothing too exciting. Mostly bug fixes. I managed to sneak ARB_sparse_texture_clamp in for zink just before the branchpoint, so all the sparse texturing features supported by Mesa will be supported by zink. But only on NVIDIA since they’re the only driver that fully supports Vulkan sparse texturing.

The past couple days I’ve been doing some truly awful things with gl_PointSize to try and make this conformant for all possible cases. It’s a real debacle, and I’ll probably post more in-depth about it so everyone can get a good chuckle.

The one unusual part of my daily routine is that I haven’t rebased my testing branch in at least a couple weeks now since I’ve been trying to iron out regressions. Will I find that everything crashes and fails as soon as I do?


More posts to come.

February 01, 2022

There was an article on Open for Everyone today about Nobara, a Fedora-based distribution optimized for gaming. So I have no beef with Tomas Crider or any other creator/maintainer of a distribution targeting a specific use case. In fact they are usually trying to solve or work around real problems and make things easier for people. That said I have for years felt that the need for these things is a failing in itself and it has been a goal for me in the context of Fedora Workstation to figure out what we can do to remove the need for ‘usecase distros’. So I thought it would be of interest if I talk a bit about how I have been viewing these things and the concrete efforts we have taken to reduce the need for usecase oriented distributions. It is worth noting that the usecase distributions have of course proven useful for this too, in the sense that they to some degree also function as a very detailed ‘bug report’ for why the general case OS is not enough.

Before I start, you might say, but isn’t Fedora Workstation a usecase OS too? You often talk about having a developer focus? Yes, developers are something we care deeply about, but for instance that doesn’t mean we pre-install 50 IDEs in Fedora Workstation. Fedora Workstation should be a great general purpose OS out of the box and then we should have tools like GNOME Software and Toolbx available to let you quickly and easily tweak it into your ideal development system. But at the same time by being a general purpose OS at heart, it should be equally easy to install Steam and Lutris to start gaming or install Carla and Ardour to start doing audio production. Or install OBS Studio to do video streaming.

Looking back over the years one of the first conclusions I drew from looking at all the usecase distributions out there was that they often were mostly the standard distro, but with a carefully curated list of pre-installed software, for instance the old Fedora game spin was exactly that, a copy of Fedora with a lot of games pre-installed. So why was this valuable to people? For those of us who have been around for a while we remember that the average Linux ‘app store’ was a very basic GUI which listed available software by name (usually quite cryptic names) and at best with a small icon. There was almost no other metadata available and search functionality was limited at best. So finding software was not simple, as it was usually more of a ‘search the internet and if you find something interesting see if it’s packaged for your distro’. So the usecase distros that focused on having curated pre-installed software, be that games, or pro-audio software or graphics tools or whatever their focus was, were basically responding to the fact that finding software was non-trivial and a lot of people maybe missed out on software that could be useful to them since they simply never learned about its existence.

So when we kicked off the creation of GNOME Software one of the big focuses early on was to create a system for providing good metadata and displaying that metadata in a useful manner. So as an end user the most obvious change was of course the richer UI of GNOME Software, but maybe just as important was the creation of AppStream, which was a specification for how applications ship metadata to allow GNOME Software and others to display much more in-depth information about the application and provide screenshots and so on.

So I do believe that by working on a better ‘App Store’ story for Linux, both through GNOME Software as the actual UI and by working with many stakeholders in the Linux ecosystem to define metadata standards like AppStream, we made software a lot more discoverable on Linux and thus reduced the need for pre-loading significantly. This work also provided an important baseline for things like Flathub to thrive, as it then had a clear way to provide metadata about the applications it hosts.

We do continue to polish that user experience on an ongoing basis, but I do feel we reduced the need to pre-load a ton of software very significantly already with this.

Of course another aspect of this is application availability, which is why we worked to ensure things like Steam are available in GNOME Software on Fedora Workstation, and which we have now expanded on by starting to include more and more software listings from Flathub. These things make it easy for our users to find the software they want, but at the same time we are still staying true to our mission of only shipping free software by default in Fedora.

The second major reason for usecase distributions has been that the generic version of the OS didn’t really have the right settings or setup to handle an important usecase. I think pro-audio is the best example of this, where usecase distros like Fedora Jam or Ubuntu Studio popped up. Pre-installing a lot of relevant software was definitely part of their DNA too, but there were also other issues involved, like the need for a special audio setup with JACK and often also kernel real-time patches applied. When we decided to include Pro-audio support in PipeWire, resolving these issues was a big part of it. I strongly believe that we should be able to provide a simple and good out-of-the-box experience for musicians and audio engineers on Linux without needing the OS to be specifically configured for the task. The strong and positive response we have gotten from the Pro-audio community for PipeWire I believe shows that we are moving in the right direction there. Not claiming things are 100% yet, but we feel very confident that we will get there with PipeWire and make the Pro-Audio folks full-fledged members of the Fedora WS community. Interestingly we also spent quite a bit of time trying to ensure the pro-audio tools in Fedora have proper AppStream metadata so that they would appear in GNOME Software as part of this. One area we are still looking at is the real time kernel stuff. Our current take is that we do believe the remaining unmerged patches are not strictly needed anymore, as most of the important stuff has already been merged, but we are monitoring it as we keep developing and benchmarking PipeWire for the Pro-Audio usecase.

Another reason that I often saw that drove the creation of a usecase distribution is special hardware support, and not necessarily that special hardware; the NVidia driver for instance has triggered a lot of these attempts. The NVidia driver is challenging on a lot of levels and has been something we have been constantly working on. There were technical issues for instance, like the NVidia driver and Mesa fighting over who owned the implementation, which we fixed by the introduction of glvnd a few years ago. But for a distro like Fedora that also cares deeply about free and open source software it also provided us with a lot of philosophical challenges. We had to answer the question of how we could make sure our users had easy access to the driver without abandoning our principle of Fedora only shipping free software out of the box. I think we found a good compromise today where the NVidia driver is available in Fedora Workstation for easy install through GNOME Software, but at the same time we default to Nouveau out of the box. That said this is a part of the story where we are still hard at work to improve things further and while I am not at liberty to mention any details I think I can at least mention that we are meeting with our engineering counterparts at NVidia on almost a weekly basis to discuss how to improve things, not just for graphics, but around compute and other shared areas of interest. The most recent public result of that collaboration was of course the XWayland support in recent NVidia drivers, but I promise you that this is something we keep focusing on and I expect that we will be able to share more cool news and important progress over the course of the year, both for users of the NVidia binary driver and for users of Nouveau.

What are we still looking at in terms of addressing issues like this? Well one thing we are talking about is if there is value/need for a facility to install specific software based on hardware or software. For instance if we detect a high end gaming mouse connected to your system should we install Piper/ratbag or at least make GNOME Software suggest it? And if we detect that you installed Lutris and Steam are there other tools we should recommend you install, like the gamemode GNOME Shell extension? It is a somewhat hard question to answer, which is why we are still pondering it. On one side it seems like a nice addition, but such connections would mean that we need to have a big database we constantly maintain, which isn’t trivial, and also having something running on your system to, let’s say, check for those high end mice does add a little overhead that might be a waste for many users.

Another area that we are looking at is the issue of codecs. We did a big effort a couple of years ago and got AC3, mp3, AAC and mpeg2 video cleared for inclusion, and also got the OpenH264 implementation from Cisco made available. That solved a lot of issues, but today with so many more getting into media creation I believe we need to take another stab at it and for instance try to get reliable hardware accelerated encoding and decoding of video. I am not ready to announce anything, but we have a few ideas and leads we are looking at for how to move the needle there in a significant way.

So to summarize, I am not criticizing anyone for putting together what I call usecase distros, but at the same time I really want to get to a point where they are rarely needed, because we should be able to cater to most needs within the context of a general purpose Linux operating system. That said I do appreciate the effort of these distro makers both in terms of trying to help users have a better experience on Linux and in indirectly helping us showcase potential solutions and highlight the major pain points that still need addressing in a general purpose Linux desktop operating system.

January 26, 2022

(This post was first published with Collabora on Jan 25, 2022.)

A Pixel's Color

My work on Wayland and Weston color management and HDR support has been full of learning new concepts and terms. Many of them are crucial for understanding how color works. I started out so ignorant that I did not know how to blend two pixels together correctly. I did not even know that I did not know - I was just doing the obvious blend, and that was wrong. Now I think I know what I know and do not know, and I also feel that most developers around window systems and graphical applications are as uneducated as I was.

Color knowledge is surprisingly scarce in my field it seems. It is not enough that I educate myself. I need other people to talk to, to review my work, and to write patches that I will be reviewing. With the hope of making it even a little bit easier to understand what is going on with color I wrote the article: A Pixel's Color.

The article goes through most of the important concepts, trying to give you, a programmer, a vague idea of what they are. It does not explain everything too well, because I want you to be able to read through it, but it still got longer than I expected. My intention is to tell you about things you might not know about, so that you would at least know what you do not know.

A warm thank you to everyone who reviewed and commented on the article.

A New Documentation Repository

Originally, the Wayland CM&HDR extension merge request included documentation about how color management would work on Wayland. The actual protocol extension specification cannot even begin to explain all that.

To make that documentation easier to revise and contribute to, I proposed to move it into a new repository: color-and-hdr. That also allowed us to widen the scope of the documentation, so we can easily include things outside of Wayland: EGL, Vulkan WSI, DRM KMS, and more.

I hope that the color-and-hdr documentation repository gains traction and becomes a community maintained effort in gathering information about color and HDR on Linux, and that we can eventually move it out of my personal namespace to become truly community owned.

January 20, 2022

It’s Happening (For Real)

After weeks of hunting for the latest rumors of jekstrand’s future job prospects, I’ve finally done it: zink now supports more extensions than any other OpenGL driver in Mesa.

That’s right.

Check it on mesamatrix if you don’t believe me.

A couple days ago I merged support for the external memory extensions that I’d been putting off, and today we got sparse textures thanks to Qiang Yu at AMD doing 99% of the work to plumb the extensions through the rest of Mesa.

There’s even another sparse texture extension, which I’ve already landed all the support for in zink, that should be enabled for the upcoming release.

What’s Next?

Zink (sometimes) has the performance, now it has the features, so naturally the focus now is going to shift to compatibility and correctness. Kopper is going to mostly take care of the former, which leaves the latter. There aren’t a ton of CTS cases failing.

Ideally, by the end of the year, there won’t be any.

January 18, 2022


The last thing I remember Thursday was trying to get the truth out about Jason Ekstrand’s new role. Days have now passed, and I can’t remember what I was about to say or what I did over the extended weekend.

But Big Triangle sure has been busy. It’s clear I was on to something, because otherwise they wouldn’t have taken such drastic measures. Look at this: jekstrand is claiming Collabora has hired him. This is clearly part of a larger coverup, and the graphics news media are eating it up.

Congratulations to him, sure, but it’s obvious this is just another attempt to throw us off the trail. We may never find out what Jason’s real new job is, but that doesn’t mean we’re going to stop following the hints and clues as they accumulate. Sooner or later, Big Triangle is going to slip up, and then we’ll all know the truth.


In the meantime, zink goes on. I’ve spent quite a long while tinkering with NVIDIA and getting a solid baseline of CTS results. At present, I’m down to about 800 combined fails for GL 4.6 and ES 3.2. Given that lavapipe is at around 80 and RADV is just over 600, both excluding the confidential test suites, this is a pretty decent start.

This is probably going to be the last time I’m on nvidia for a while, and it hasn’t been too bad overall.

The Year’s First Rebrand

The (second) biggest news story for today is a rebrand.

Copper is being renamed.

It will, in fact, be named Kopper to match the zink/vulkan naming scheme.

I can’t overstate how significant this change is and how massive the ecosystem changes around it will be.

Just huge. Like the number of words in this blog post.

January 13, 2022

We Need To Talk

It’s come to my attention that there’s a lot of rumors flying around about what exactly I’m doing aside from posting the latest info about where Jason Ekstrand, who coined the phrase, “If it compiles, we should ship it.” is going to end up.

Everyone knows that jekstrand’s next career move is big news—the kind of industry-shaking maneuvering that has every BigCo from Alphabet to Meta on tenterhooks. This post is going to debunk a number of the most common nonsense I’ve been hearing as well as give some updates about what else I’ve been doing besides scouring the internet for even the tiniest clue about what’s coming for this man’s career in 2022.

Is Jason going to Apple to work on a modernized, open source implementation of Mac OS with a new Finder based on Vulkan?

My sources were very keen on this rumor up until Tuesday, when, in an undisclosed IRC channel, Jason himself had the following to say:

<jekstrand> Sachiel: Contrary to popular belief, I can't work on every idea in the multiverse simultaneously.  I'm limited to the same N dimensions as the rest of you.

This absolutely blew all the existing chatter out of the water. Until now, in the course of working on more sparse texturing extensions, I had the firm impression that we’d be seeing a return to form, likely with a Khronos member company, continuing to work on graphics. But now? With this? Clearly everyone was thinking too small.

Everyone except jekstrand himself, who will be taking up a position at CERN devising new display technology for particle accelerators.

Or at least, that’s what I thought until yesterday.

Is Jason really going to be working at CERN? How well does GPU knowledge translate to theoretical physics?

Unfortunately, this turned out to be bogus, no more than chaff deployed to stop us from getting to the truth because we were too close. Later, while I was pondering how buggy NVIDIA’s sparse image functionality was in the latest beta drivers and attempting to pass what few equally buggy CTS cases there were for ARB_sparse_texture2, I stumbled upon the obvious.

It’s so obvious, in fact, that everyone overlooked it because of how obvious it is.

Jason has left Intel and turned in his badge because he’s on vacation.

As everyone knows, he’s the kind of person who literally does not comprehend time in the same way that the rest of us do. It was his assessment of the HR policy that in order to take time off and leave the office, he had to quit. My latest intel (no pun intended) revealed that managers and executives alike were still scrambling, trying to figure out how to explain the company’s vacation policy using SSA-based compiler terminology, but optimizer passes left their attempts to engage him as no-ops.


So this whole thing was just a ruse?

I’ll be completely honest with you since you’ve read this far: I’ve just heard breaking news today. This is so fresh, so hot-off-the-presses that it’s almost as difficult to reveal as it is that I’ve implemented another 4 GL extensions. When the totality of all my MRs are landed, zink will become the GL driver in Mesa supporting the most extensions, and this is likely to be the case for the next release. Shocking, I know.

But not nearly as shocking as the fact that Jason is actually starting at Texas Instruments working on Vulkan for graphing calculators.

Think about it.

Anyone who knows jekstrand even the smallest amount knows how much sense this makes on both sides. He gets unlimited graphing calculators, and that’s all he had to hear before signing the contract. It’s that simple.

Graphing Calculators? Does Anyone Even Use Those Anymore?

I know at least one person who does, and it’s not Jason Ekstrand. Because in the time that I was writing out the last (and now deprecated) information I had available, there’s been more, even later breaking news.

Copper now has a real MR open for it.

I realize it’s entirely off-topic now to be talking about some measly merge request, but it has the WSI tag on it, which means Jason has no choice but to read through the entire thing.

That’s because he’ll be working for Khronos as the Assistant Deputy Director of Presentation. If there’s presentations to be done by anyone in the graphics space, for any reason, they’ll have to go through jekstrand first. I don’t envy the responsibility and accountability that this sort of role demands; when it comes to shedsmanship, people in the presentation space are several levels above the rest.

We can only hope he’s up to the challenge.

Or at least, we would if that were actually where he was going, because I’ve just heard from

January 10, 2022

This article is part of a series on how to set up a bare-metal CI system for Linux driver development. Here are the different articles so far:

  • Part 1: The high-level view of the whole CI system, and how to fully control test machines remotely (power on, OS to boot, keyboard/screen emulation using a serial console);
  • Part 2: A comparison of the different ways to generate the rootfs of your test environment, and introducing the boot2container project.

In this article, we will further discuss the role of the CI gateway, and which steps we can take to simplify its deployment, maintenance, and disaster recovery.

This work is sponsored by the Valve Corporation.

Requirements for the CI gateway

As seen in part 1 of this CI series, the testing gateway is sitting between the test machines and the public network/internet:

      Internet /   ------------------------------+
    Public network                               |
                                       +---------+--------+                USB
                                       |                  +-----------------------------------+
                                       |      Testing     | Private network                   |
Main power (120/240 V) -----+          |      Gateway     +-----------------+                 |
                            |          +------+--+--------+                 |                 |
                            |                 |  | Serial /                 |                 |
                            |            Main |  | Ethernet                 |                 |
                            |            Power|  |                          |                 |
                +-----------+-----------------|--+--------------+   +-------+--------+   +----+----+
                |              Switchable PDU |                |   |   RJ45 switch  |   | USB Hub |
                |  Port 0    Port 1        ...|         Port N  |   |                |   |         |
                +----+------------------------+-----------------+   +---+------------+   +-+-------+
                     |                                                  |                  |
                Main |                                                  |                  |
                Power|                                                  |                  |
            +--------|--------+               Ethernet                  |                  |
            |                 +-----------------------------------------+   +----+----+    |
            |  Test Machine 1 |            Serial (RS-232 / TTL)            |  Serial |    |
            |                 +---------------------------------------------+  2 USB  +----+ USB
            +-----------------+                                             +---------+

The testing gateway's role is to expose the test machines to the users, either directly or via GitLab/GitHub. As such, it will likely require the following components:

  • a host Operating System;
  • a config file describing the different test machines;
  • a bunch of services to expose said machines and deploy their test environment on demand.

Since the gateway is connected to the internet, both the OS and the different services need to be kept updated relatively often to prevent your CI farm from becoming part of a botnet. This creates interesting issues:

  1. How do we test updates ahead of deployment, to minimize downtime due to bad updates?
  2. How do we make updates atomic, so that we never end up with a partially-updated system?
  3. How do we roll back updates, so that broken updates can be quickly reverted?

These issues can thankfully be addressed by running all the services in a container (as systemd units), started using boot2container. Updating the operating system and the services would simply be done by generating a new container, running tests to validate it, pushing it to a container registry, rebooting the gateway, then waiting while the gateway downloads and executes the new services.

Using boot2container does not, however, fix the issue of how to update the kernel or boot configuration when the system fails to boot the current one. Indeed, if the kernel, boot2container, and the kernel command line are stored locally, they can only be modified via an SSH connection, which requires the machine to be reachable. If a bad update prevents the machine from booting, the gateway will be bricked until an operator boots an alternative operating system.

The easiest way not to brick your gateway after a broken update is to power it through a switchable PDU (so that we can power cycle the machine), and to download the kernel, initramfs (boot2container), and the kernel command line from a remote server at boot time. This is fortunately possible even through the internet by using fancy bootloaders, such as iPXE, and this will be the focus of this article!

Tune in for part 4 to learn more about how to create the container.

iPXE + boot2container: Netbooting your CI infrastructure from anywhere

iPXE is a tiny bootloader that packs a punch! Not only can it boot kernels from local partitions, but it can also connect to the internet, and download kernels/initramfs using HTTP(S). Even more impressive is the little scripting engine which executes boot scripts instead of declarative boot configurations like grub. This enables creating loops, endlessly trying to boot until one method finally succeeds!

Let's start with a basic example, and build towards a production-ready solution!

Netbooting from a local server

In this example, we will focus on netbooting the gateway from a local HTTP server. Let's start by reviewing a simple script that makes iPXE acquire an IP from the local DHCP server, then download and execute another iPXE script from http://<ip of your dev machine>:8000/boot.ipxe. If any step fails, the script will be restarted from the start until a successful boot is achieved.


#!ipxe

echo Welcome to Valve infra's iPXE boot script

:retry
echo Acquiring an IP
dhcp || goto retry # Keep retrying getting an IP, until we get one
echo Got the IP: $${netX/ip} / $${netX/netmask}


echo Chainloading from the iPXE server...
chain http://<ip of your dev machine>:8000/boot.ipxe

# The boot failed, let's restart!
goto retry

Neat, right? Now, we need to generate a bootable ISO image that starts iPXE and runs the above script by default. We will then flash this ISO to a USB pendrive:

$ git clone git://
$ make -C ipxe/src -j`nproc` bin/ipxe.iso EMBED=<boot script file>
$ sudo dd if=ipxe/src/bin/ipxe.iso of=/dev/sdX bs=1M conv=fsync status=progress

Once the pendrive is connected to the gateway, ensure that you boot from it, and you should see the iPXE bootloader trying to boot the kernel, but failing to download the script from http://<ip of your dev machine>:8000/boot.ipxe. So, let's write one:


#!ipxe
kernel /files/kernel b2c.container="docker://hello-world"
initrd /files/initrd
boot

This script specifies the following elements:

  • kernel: Download the kernel at http://<ip of your dev machine>:8000/files/kernel, and set the kernel command line to ask boot2container to start the hello-world container
  • initrd: Download the initramfs at http://<ip of your dev machine>:8000/files/initrd
  • boot: Boot the specified boot configuration

Assuming your gateway has an architecture supported by boot2container, you may now download the kernel and initrd from boot2container's releases page. In case it is unsupported, create an issue, or a merge request to add support for it!

Now that you have created all the necessary files for the boot, start the web server on your development machine:

$ ls
boot.ipxe  initrd  kernel
$ python -m http.server 8000
Serving HTTP on port 8000 ( ...
<ip of your gateway> - - [09/Jan/2022 15:32:52] "GET /boot.ipxe HTTP/1.1" 200 -
<ip of your gateway> - - [09/Jan/2022 15:32:56] "GET /kernel HTTP/1.1" 200 -
<ip of your gateway> - - [09/Jan/2022 15:32:54] "GET /initrd HTTP/1.1" 200 -

If everything went well, the gateway should, after a couple of seconds, start downloading the boot script, then the kernel, and finally the initramfs. Once done, your gateway should boot Linux, run docker's hello-world container, then shut down.

Congratulations on netbooting your gateway! However, the current solution has one annoying constraint: it requires a trusted local network and server because we are using HTTP rather than HTTPS... On an untrusted network, a man in the middle could override your boot configuration and take over your CI...

If we were using HTTPS, we could download our boot script/kernel/initramfs directly from any public server, even Git forges, without fear of any man in the middle! Let's try to achieve this!

Netbooting from public servers

In the previous section, we managed to netboot our gateway from the local network. In this section, we try to improve on it by netbooting using HTTPS. This enables booting from a public server hosted at places such as Linode for $5/month.

As I said earlier, iPXE supports HTTPS. However, if you are anything like me, you may be wondering how such a small bootloader could know which root certificates to trust. The answer is that iPXE generates an SSL certificate at compilation time which is then used to sign all of the root certificates trusted by Mozilla (default), or any number of certificates you may want. See iPXE's crypto page for more information.

WARNING: iPXE currently does not like certificates exceeding 4096 bits. This can be a limiting factor when trying to connect to existing servers. We hope to one day fix this bug, but in the meantime, you may be forced to use a 2048-bit Let's Encrypt certificate on a self-hosted web server. See our issue for more information.

WARNING 2: iPXE only supports a limited number of ciphers. You'll need to make sure they are listed in nginx's ssl_ciphers configuration: AES-128-CBC:AES-256-CBC:AES256-SHA256 and AES128-SHA256:AES256-SHA:AES128-SHA

To get started, install NGINX + Let's Encrypt on your server, following your favourite tutorial, copy the boot.ipxe, kernel, and initrd files to the root of the web server, then make sure you can download them using your browser.

With this done, we just need to edit iPXE's general config C header to enable HTTPS support:

$ sed -i 's/#undef\tDOWNLOAD_PROTO_HTTPS/#define\tDOWNLOAD_PROTO_HTTPS/' ipxe/src/config/general.h

Then, let's update our boot script to point to the new server:


#!ipxe

echo Welcome to Valve infra's iPXE boot script

:retry
echo Acquiring an IP
dhcp || goto retry # Keep retrying getting an IP, until we get one
echo Got the IP: $${netX/ip} / $${netX/netmask}


echo Chainloading from the iPXE server...
chain https://<your server>/boot.ipxe

# The boot failed, let's restart!
goto retry

And finally, let's re-compile iPXE, reflash the gateway pendrive, and boot the gateway!

$ make -C ipxe/src -j`nproc` bin/ipxe.iso EMBED=<boot script file>
$ sudo dd if=ipxe/src/bin/ipxe.iso of=/dev/sdX bs=1M conv=fsync status=progress

If all went well, the gateway should boot and run the hello world container once again! Let's continue our journey by provisioning and backing up the local storage of the gateway!

Provisioning and backups of the local storage

In the previous section, we managed to control the boot configuration of our gateway via a public HTTPS server. In this section, we will improve on that by provisioning and backing up any local file the gateway container may need.

Boot2container has a nice feature that enables you to create a volume, provision it from a bucket in an S3-compatible cloud storage, and sync back any local change. This is done by adding the following arguments to the kernel command line:

  • b2c.minio="s3,${s3_endpoint},${s3_access_key_id},${s3_access_key}": URL and credentials to the S3 service
  • b2c.volume="perm,mirror=s3/${s3_bucket_name},pull_on=pipeline_start,push_on=changes,overwrite,delete": Create a perm podman volume, mirror it from the bucket ${s3_bucket_name} when booting the gateway, then push any local change back to the bucket. Delete or overwrite any existing file when mirroring.
  • b2c.container="-ti -v perm:/mnt/perm docker://alpine": Start an alpine container, and mount the perm container volume to /mnt/perm
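
To make the three arguments above a little more concrete, here is a minimal Python sketch that assembles them into one kernel command line fragment. The function name, the default image, and the placeholder values are assumptions made for illustration; only the argument formats come from the list above, and the first field of b2c.minio is assumed to be the name that mirror= later refers to.

```python
def b2c_cmdline(endpoint, access_key_id, access_key, bucket,
                image="docker://alpine"):
    """Assemble the three boot2container arguments listed above."""
    args = [
        # URL and credentials of the S3 service; the first field ("s3") is
        # assumed to be the name that mirror= refers to below
        f'b2c.minio="s3,{endpoint},{access_key_id},{access_key}"',
        # "perm" volume: pulled from the bucket at boot, pushed back on change
        f'b2c.volume="perm,mirror=s3/{bucket},pull_on=pipeline_start,'
        'push_on=changes,overwrite,delete"',
        # Container to start, with the volume mounted at /mnt/perm
        f'b2c.container="-ti -v perm:/mnt/perm {image}"',
    ]
    return " ".join(args)


if __name__ == "__main__":
    # Placeholder values, not real credentials
    print(b2c_cmdline("https://s3.example.com", "KEY_ID", "SECRET",
                      "gateway-files"))
```

The resulting string would be appended to the kernel line of the boot script.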

Pretty, isn't it? Provided that your bucket is configured to save all the revisions of every file, this trick will kill three birds with one stone: initial provisioning, backup, and automatic recovery of the files in case the local disk fails and gets replaced with a new one!

The issue is that the boot configuration is currently open for everyone to see, if they know where to look. This means that anyone could tamper with your local storage or even use your bucket to store their files...

Securing the access to the local storage

To prevent attackers from stealing our S3 credentials by simply pointing their web browser to the right URL, we can authenticate incoming HTTPS requests by using an SSL client certificate. A different certificate would be embedded in every gateway's iPXE bootloader and checked by NGINX before serving the boot configuration for this precise gateway. By limiting access to a machine's boot configuration to its associated client certificate fingerprint, we even prevent compromised machines from accessing the data of other machines.
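
As a rough illustration of the per-gateway lookup, the sketch below maps a client certificate fingerprint to a configuration directory. It is not the valve-infra implementation: the function names and the files/<fingerprint>/ layout are hypothetical, and it assumes the fingerprint is the SHA-1 hash of the DER-encoded certificate, in lowercase hex, which is the form NGINX exposes in its $ssl_client_fingerprint variable.

```python
import hashlib
import re
from pathlib import Path

# Hypothetical layout: files/<fingerprint>/boot.ipxe
FILES_ROOT = Path("files")


def cert_fingerprint(der_bytes):
    """SHA-1 fingerprint of a DER-encoded certificate, as lowercase hex."""
    return hashlib.sha1(der_bytes).hexdigest()


def config_dir_for(fingerprint, root=FILES_ROOT):
    """Map a client certificate fingerprint to that gateway's directory.

    Anything that is not exactly 40 lowercase hex digits is rejected, so a
    crafted value cannot point outside the root directory.
    """
    if not re.fullmatch(r"[0-9a-f]{40}", fingerprint):
        raise ValueError("not a SHA-1 fingerprint")
    return root / fingerprint
```

Because each gateway can only present its own certificate, it can only ever be served files from its own directory.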

Additionally, secrets should not be kept in the kernel command line, as any process executed on the gateway could easily gain access to it by reading /proc/cmdline. To address this issue, boot2container has a b2c.extra_args_url argument to source additional parameters from this URL. If this URL is generated every time the gateway is downloading its boot configuration, can be accessed only once, and expires soon after being created, then secrets can be kept private inside boot2container and not be exposed to the containers it starts.
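
The "accessed only once, and expires soon" behaviour can be sketched in a few lines of Python. This is only an assumption about how such a service could work, not the code of valve-infra's web service; the class name, the 30-second default, and the in-memory dictionary are all hypothetical.

```python
import secrets
import time


class OneTimeSecrets:
    """In-memory store of payloads that can each be fetched at most once."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._pending = {}  # token -> (expiry time, payload)

    def issue(self, payload):
        """Store the payload and return an unguessable token for the URL."""
        token = secrets.token_urlsafe(32)
        self._pending[token] = (self._clock() + self._ttl, payload)
        return token

    def redeem(self, token):
        """Return the payload once; None if unknown, already used, or expired."""
        entry = self._pending.pop(token, None)
        if entry is None:
            return None
        expiry, payload = entry
        return payload if self._clock() <= expiry else None
```

The web service would call issue() while generating boot.ipxe, embed the token in the secrets URL, and call redeem() when boot2container fetches it.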

Implementing these suggestions in a blog post is a little tricky, so I suggest you check out valve-infra's ipxe-boot-server component for more details. It provides a Makefile that makes it super easy to generate working certificates and create bootable gateway ISOs, a small Python-based web service that will serve the right configuration to every gateway (including one-time secrets), and step-by-step instructions to deploy everything!

Assuming you decided to use this component and followed the README, you should then configure the gateway in this way:

$ pwd
/home/ipxe/valve-infra/ipxe-boot-server/files/<fingerprint of your gateway>/
$ ls
boot.ipxe  initrd  kernel  secrets
$ cat boot.ipxe

#!ipxe
kernel /files/kernel b2c.extra_args_url="${secrets_url}" b2c.container="-v perm:/mnt/perm docker://alpine" b2c.ntp_peer=auto b2c.cache_device=auto
initrd /files/initrd
boot
$ cat secrets
b2c.minio="bbz,${s3_endpoint},${s3_access_key_id},${s3_access_key}" b2c.volume="perm,mirror=bbz/${s3_bucket_name},pull_on=pipeline_start,push_on=changes,overwrite,delete"

And that's it! We finally made it to the end, and created a secure way to provision our CI gateways with the wanted kernel, Operating System, and even local files!

When Charlie Turner and I started designing this system, we felt it would be a clean and simple way to solve our problems with our CI gateways, but the implementation ended up being quite a bit trickier than the high-level view... especially the SSL certificates! However, the certainty that we can now deploy updates and fix our CI gateways even when they are physically inaccessible to us (provided the hardware and PDU are fine) definitely made it all worth it, and made the prospect of having users depend on our systems less scary!

Let us know how you feel about it!


In this post, we focused on provisioning the CI gateway with its boot configuration and local files via the internet. This drastically reduces the risk that updating the gateway's kernel results in an extended loss of service, as the kernel configuration can quickly be reverted by changing the boot config files, which are served from a cloud service provider.

The local file provisioning system also doubles as a backup, and disaster recovery system which will automatically kick in in case of hardware failure thanks to the constant mirroring of the local files with an S3-compatible cloud storage bucket.

In the next post, we will be talking about how to create the infra container, and how we can minimize downtime during updates by not needing to reboot the gateway.

That's all for now, thanks for making it to the end!

This Is A Serious Blog

I posted some fun fluff pieces last week to kick off the new year, but now it’s time to get down to brass tacks.

Everyone knows adding features is just flipping on the enable button. Now it’s time to see some real work.

If you don’t like real work, stop reading. Stop right now. Now.

Alright, now that all the haters are gone, let’s put on our bisecting snorkels and dive in.

Regressions Suck

The dream of 2022 was that I’d come back and everything would work exactly how I left it. All the same tests would pass, all the perf would be there, and my driver would compile.

I got two of those things, which isn’t too bad.

After spending a while bisecting and debugging last week, I categorized a number of regressions to RADV problems which probably only affect me since there’s no Vulkan CTS cases for them (yet). But today I came to the last of the problem cases:

There’s nothing too remarkable about the test. It’s XFB, so, according to Jason Ekstrand, future Head of Graphic Wows at Pixar, it’s terrible.

What is remarkable, however, is that the test passes fine when run in isolation.

Here We Go Again

Anyone who’s anyone knows what comes next.

  • You find the caselist of the other 499 tests that were run in this block
  • You run the caselist
  • You find out that the test still fails in that caselist
  • You tip your head back to stare at the ceiling and groan

Then it’s another X minutes (where X is usually between 5 and 180 depending on test runtimes) to slowly pare down the caselist to the sequence which actually triggers the failure. For those not in the know, this type of failure indicates a pathological driver bug where a sequence of commands triggers different results if tests are run in a different order.

There is, to my knowledge, no ‘automatic’ way to determine exactly which tests are required to trigger this type of failure from a caselist. It would be great if there was, and it would save me (and probably others who are similarly unaware) considerable time doing this type of caselist fuzzing.
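
For what it's worth, the manual paring described above could in principle be scripted as a greedy chunk removal, in the spirit of delta debugging. Here's a rough Python sketch. The still_fails callback is entirely hypothetical: it would have to run the given tests followed by the target test and report whether the target still fails, and every call is a full test run, so this is slow and only works if the failure is deterministic.

```python
def reduce_caselist(cases, still_fails):
    """Greedily shrink `cases` while the target test keeps failing.

    `cases` is the ordered list of tests run before the target test.
    `still_fails(candidate)` is a hypothetical callback that runs
    `candidate` followed by the target and returns True if the target fails.
    """
    chunk = max(len(cases) // 2, 1)
    while chunk >= 1:
        i = 0
        while i < len(cases):
            # Try the caselist with one chunk removed, order preserved
            candidate = cases[:i] + cases[i + chunk:]
            if still_fails(candidate):
                cases = candidate   # the chunk wasn't needed, drop it
            else:
                i += chunk          # the chunk is needed, keep it
        chunk //= 2                 # retry with smaller chunks
    return cases
```

The result is not guaranteed to be the smallest possible caselist, only one where removing any single remaining test makes the failure go away.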

Finally, I was left with this shortened caselist:


What Now?

Ideally, it would be great to be able to use something like gfxreconstruct for this. I could record two captures—one of the test failing in the caselist and one where it passes in isolation—and then compare them.

Here’s an excerpt from that attempt:

"[790]vkCreateShaderModule": {
    "return": "VK_SUCCESS",
    "device": "0x0x4",
    "pCreateInfo": {
        "pNext": null,
        "flags": 0,
        "codeSize": Unhandled VkFormatFeatureFlagBits2KHR,
        "pCode": "0x0x285c8e0"
    "pAllocator": null,
    "[out]pShaderModule": "0x0xe0"

Why is it trying to print an enum value for codeSize, you might ask?

I’m not the only one to ask, and it’s still an unresolved mystery.

I was successful in doing the comparison with gfxreconstruct, but it yielded nothing of interest.

Puzzled, I decided to try the test out on lavapipe. Would it pass?


It similarly fails on llvmpipe and IRIS.

But my lavapipe testing revealed an important clue. Given that there are no synchronization issues with lavapipe, this meant I could be certain this was a zink bug. Furthermore, the test failed both when the bug was exhibiting and when it wasn’t, meaning that I could actually see the “passing” values in addition to the failing ones for comparison.

Here’s the failing error output:

Verifying feedback results.
Element at index 0 (tessellation invocation 0) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.166663, 0.5, 1)
Element at index 1 (tessellation invocation 1) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.166663, 0.5, 1)
Element at index 2 (tessellation invocation 2) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.166663, 0.5, 1)
Element at index 3 (tessellation invocation 3) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0.5, 1)
Element at index 4 (tessellation invocation 4) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0.5, 1)
Element at index 5 (tessellation invocation 5) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0.5, 1)
Element at index 6 (tessellation invocation 6) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.4, -0.433337, 0.5, 1)
Element at index 7 (tessellation invocation 7) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.4, -0.433337, 0.5, 1)
Omitted 24 error(s).

And here’s the passing error output:

Verifying feedback results.
Element at index 3 (tessellation invocation 1) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0, 1)
Element at index 4 (tessellation invocation 2) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0, 1)
Element at index 5 (tessellation invocation 3) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0, 1)
Element at index 6 (tessellation invocation 4) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.4, -0.433337, 0, 1)
Element at index 7 (tessellation invocation 5) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.4, -0.433337, 0, 1)
Element at index 8 (tessellation invocation 6) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.4, -0.433337, 0, 1)
Element at index 9 (tessellation invocation 7) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0, 1)
Element at index 10 (tessellation invocation 8) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0, 1)
Omitted 18 error(s).

This might not look like much, but to any zinkologists, there’s an immediate red flag: the Z component of the vertex is 0.5 in the failing case.

What does this remind us of?

Naturally it reminds us of nir_lower_clip_halfz, the compiler pass which converts OpenGL Z coordinate ranges ([-1, 1]) to Vulkan ([0, 1]). This pass is run on the last vertex stage, but if it gets run more than once, a value of -1 becomes 0.5.
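
The arithmetic is quick to check. Here's a tiny Python sketch, assuming the pass computes the usual remapping z' = (z + w) * 0.5; the function name is just for illustration.

```python
def clip_halfz(z, w):
    """Remap clip-space Z from the OpenGL convention to the Vulkan one."""
    return (z + w) * 0.5


z_gl, w = -1.0, 1.0
once = clip_halfz(z_gl, w)    # what a single run of the pass produces
twice = clip_halfz(once, w)   # what happens if the pass runs again
print(once, twice)            # 0.0 0.5
```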

Thus, it looks like the pass is being run twice in this test. How can this be verified?

ZINK_DEBUG=spirv will export all spirv shaders used by an app. Therefore, dumping all the shaders for passing and failing runs should confirm that the conversion pass is being run an extra time when they’re compared. The verdict?

@@ -1,7 +1,7 @@
 ; Version: 1.5
 ; Generator: Khronos; 0
-; Bound: 23
+; Bound: 38
 ; Schema: 0
                OpCapability TransformFeedback
                OpCapability Shader
@@ -36,13 +36,28 @@
 %_ptr_Output_v4float = OpTypePointer Output %v4float
 %gl_Position = OpVariable %_ptr_Output_v4float Output
      %v4uint = OpTypeVector %uint 4
+%uint_1056964608 = OpConstant %uint 1056964608
        %main = OpFunction %void None %3
          %18 = OpLabel
                OpBranch %17
          %17 = OpLabel
          %19 = OpLoad %v4float %a_position
          %21 = OpBitcast %v4uint %19
-         %22 = OpBitcast %v4float %21
-               OpStore %gl_Position %22
+         %22 = OpCompositeExtract %uint %21 3
+         %23 = OpCompositeExtract %uint %21 3
+         %24 = OpCompositeExtract %uint %21 2
+         %25 = OpBitcast %float %24
+         %26 = OpBitcast %float %23
+         %27 = OpFAdd %float %25 %26
+         %28 = OpBitcast %uint %27
+         %30 = OpBitcast %float %28
+         %31 = OpBitcast %float %uint_1056964608
+         %32 = OpFMul %float %30 %31
+         %33 = OpBitcast %uint %32
+         %34 = OpCompositeExtract %uint %21 1
+         %35 = OpCompositeExtract %uint %21 0
+         %36 = OpCompositeConstruct %v4uint %35 %34 %33 %22
+         %37 = OpBitcast %v4float %36
+               OpStore %gl_Position %37
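
The odd-looking uint_1056964608 constant in the diff is just a float stored as raw bits. A small Python check, using only the standard struct module (the helper name is made up), shows which float it is:

```python
import struct


def uint_bits_to_float(bits):
    """Reinterpret a 32-bit unsigned integer as an IEEE 754 float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


print(hex(1056964608))                 # 0x3f000000
print(uint_bits_to_float(1056964608))  # 0.5
```

So the added instructions extract z and w, add them, and multiply by 0.5, which is exactly one extra run of the Z conversion.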

And, as is the rule for such things, the fix was a simple one-liner to unset values in the vertex shader key.

It wasn’t technically a regression, but it manifested as such, and fixing it yielded another dozen or so fixes for cases which were affected by the same issue.


January 04, 2022


It’s a busy week here at SGC. There’s emails to read, tickets to catch up on, rumors to spread about jekstrand’s impending move to Principal Engineer of Bose’s headphone compiler team, code to unwrite. The usual. Except now I’m actually around to manage everything instead of ignoring it.

Let’s do a brief catchup of today’s work items.

Sparse Textures

I said this was done yesterday, but the main CTS case for the extension is broken, so I didn’t adequately test it. Fortunately, Qiang Yu from AMD is on the case in addition to doing the original Gallium implementations for these extensions, and I was able to use a WIP patch to fix the test. And run it. And then run it again. And then run it in gdb. And then… And then…

Anyway, it all passes now, and sparse texture support is good to go once Australia comes back from vacation to review patches.

Also I fixed sparse buffer support, which I accidentally broke 6+ months ago but never noticed since only RADV implements these features and I have no games in my test list that use them.


I hate queries. Everyone knows I hate queries. The query code is the worst code in the entire driver. If I never have to open zink_query.c again, I will still have opened it too many times for a single lifetime.

But today I hucked myself back in yet again to try and stop a very legitimate and legal replay of a Switch game from crashing. Everyone knows that anime is the real primary driver of all technology, so as soon as anyone files an anime-related ticket, all driver developers drop everything they’re doing to solve it. Unless they’re on vacation.

In this case, the problem amounted to:

  • Vulkan query pools have a maximum number of queries
  • exceeding this causes a crash
  • trying not to exceed it also causes a crash if the person writing the code is dumb
  • 2021 me was much dumber than 2022 me

Rejoice, for you can now play all your weeb games on zink if for some reason that’s where you’re at in your life.

But I’m not judging.

Source Games: Do More Of Them Run On Gallium Nine In 2022?


I came back to the gift of a new CS:GO version which adds DXVK support, so now there’s also Gallium Nine support. It works fine.


Does it work better than other engines?

I don’t know, and I have real work to do so I’m not going to test it, but surely someone will take an interest in benchmarking such things now that I’ve heroically git added a 64bit wrapper to my repo that can be used for testing.

A quick reminder that all Gallium Nine blog post references and tests happen with RadeonSI.

January 03, 2022

It appears that Google created a handy tool that helps find the command which causes a GPU hang/crash. It is called Graphics Flight Recorder (GFR) and was open-sourced a year ago but didn’t receive any attention. From the readme:

The Graphics Flight Recorder (GFR) is a Vulkan layer to help trackdown and identify the cause of GPU hangs and crashes. It works by instrumenting command buffers with completion tags. When an error is detected a log file containing incomplete command buffers is written. Often the last complete or incomplete commands are responsible for the crash.

It requires VK_AMD_buffer_marker support; however, this extension is rather trivial to implement - I only had to copy-paste the code from our vkCmdSetEvent implementation and that was it. Note, at the moment of writing, GFR unconditionally uses VK_AMD_device_coherent_memory, which could be manually patched out for it to run on other GPUs.

GFR already helped me to fix hangs in “Alien: Isolation” and “Digital Combat Simulator”. In both cases the hang was in a compute shader and the output from GFR looked like:

- # Command:
        id: 6/9
        markerValue: 0x000A0006
        name: vkCmdBindPipeline
          - # parameter:
            name: commandBuffer
            value: 0x000000558CFD2A10
          - # parameter:
            name: pipelineBindPoint
            value: 1
          - # parameter:
            name: pipeline
            value: 0x000000558D3D6750
      - # Command:
        id: 6/9
        message: '>>>>>>>>>>>>>> LAST COMPLETE COMMAND <<<<<<<<<<<<<<'
      - # Command:
        id: 7/9
        markerValue: 0x000A0007
        name: vkCmdDispatch
          - # parameter:
            name: commandBuffer
            value: 0x000000558CFD2A10
          - # parameter:
            name: groupCountX
            value: 5
          - # parameter:
            name: groupCountY
            value: 1
          - # parameter:
            name: groupCountZ
            value: 1
            vkHandle: 0x000000558D3D6750
            bindPoint: compute
              - # shaderInfo:
                stage: cs
                module: (0x000000558F82B2A0)
                entry: "main"
            - # descriptorSet:
              index: 0
              set: 0x000000558E498728
      - # Command:
        id: 8/9
        markerValue: 0x000A0008
        name: vkCmdPipelineBarrier

After confirming that the corresponding vkCmdDispatch is indeed the call which hangs, in both cases I made an Amber test which fully simulated the call. For a compute shader, this is relatively easy to do since all you need is to save the decompiled shader and buffers being used by it. Luckily in both cases these Amber tests reproduced the hangs.
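
For long logs, the first command after the "LAST COMPLETE COMMAND" marker can be picked out with a few lines of Python. This is only a sketch that assumes the log looks like the excerpt above; it scans the text line by line rather than parsing it as YAML, and the function name is made up.

```python
def first_incomplete_command(log_text):
    """Return the name of the first command after the LAST COMPLETE marker.

    Returns None if the marker or a following named command is not found.
    """
    seen_marker = False
    in_next_command = False
    for raw in log_text.splitlines():
        line = raw.strip()
        if "LAST COMPLETE COMMAND" in line:
            seen_marker = True
        elif seen_marker and line.endswith("# Command:"):
            in_next_command = True
        elif in_next_command and line.startswith("name:"):
            # The command's own name comes before its parameters' names
            return line.split(":", 1)[1].strip()
    return None
```

On the excerpt above this would return vkCmdDispatch, which is the call to look at first.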

With standalone reproducers, the problems were much easier to debug, and fixes were made shortly: MR#14044 for “Alien: Isolation” and MR#14110 for “Digital Combat Simulator”.

Unfortunately this tool is not a panacea:

  • It likely would fail to help with unrecoverable hangs where it would be impossible to read the completion tags back.
  • Or when the mere addition of the tags could “fix” the issue which may happen with synchronization issues.
  • If draw/dispatch calls run in parallel on the GPU, writing tags may force them to execute sequentially or to be imprecise.

Anyway, it’s easy to use so you should give it a try.

We Back

The blog is back. I know everyone’s been furiously spamming F5 to see if there were any secret new posts, but no. There were not.

Today’s the first day of the new year, so I had to dig deep to remember how to do basic stuff like shitpost on IRC. And then someone told me jekstrand was going to Broadcom to work on Windows network drivers?

I’m just gonna say it now:

2022 has gone too far.

I know it’s early, I know some people are seeing this as a hot take, but I’m throwing the statement down before things get worse.

Knock it off, 2022.


Somehow the driver is still in the tree, still builds, and still runs. It’s a miracle.

Thus, since there were obviously no other matters more pressing than not falling behind on MesaMatrix, I spent the morning figuring out how to implement ARB_sparse_texture.

Was this the best decision when I didn’t even remember how to make meson clear its dependency cache? No. No it wasn’t.

But I did it anyway because here at SGC, we take bad ideas and turn them into code.

Your move, 2022.

December 09, 2021
Starting with kernel 5.17, the kernel supports the privacy screens built into the LCD panels of some new laptop models.

This means that the drm drivers will now return -EPROBE_DEFER from their probe() method on models with a builtin privacy screen when the privacy screen provider driver has not been loaded yet.

To avoid any regressions, distros should modify their initrd generation tools to include privacy screen provider drivers in the initrd (at least on systems with a privacy screen), before 5.17 kernels start showing up in their repos.

If this change is not made, then users using a graphical bootsplash (plymouth) will get an extra boot delay of up to 8 seconds (DeviceTimeout in plymouthd.defaults) before plymouth shows up. When using disk encryption where the LUKS password is requested from the initrd, the system will fall back to text mode after these 8 seconds.

I've written a patch with the necessary changes for dracut, which might be useful as an example for how to deal with this in other initrd generators, see:

I've also filed bugs for tracking this for Fedora, openSUSE, Arch, Debian and Ubuntu.


One of the big issues I have when working on Turnip driver development is that compiling either Mesa or VK-GL-CTS takes a lot of time to complete, no matter how powerful the embedded board is. There are reasons for that: typically those boards have a limited amount of RAM (8 GB in the best case), slow storage (typically UFS 2.1 on-board storage), and CPUs that are not so powerful compared with x86_64 desktop alternatives.

Photo of the Qualcomm® Robotics RB3 Platform embedded board that I use for Turnip development.

To fix this, cross-compilation is recommended. However, installing the development environment for cross-compilation can be cumbersome and error-prone, depending on the toolchain you use. One alternative is to use a distributed compilation system that allows cross-compilation, like Icecream.

Icecream is a distributed compilation system that is very useful when you have to compile big projects and/or on low-spec machines, while having powerful machines in the local network that can do that job instead. However, it is not perfect: the linking stage is still done on the machine that submits the job, which, depending on the available RAM, could be too much for it (you can alleviate this a bit by using ZRAM, for example).

One of the features that icecream has over its alternatives is that there is no need to install the same toolchain in all the machines as it is able to share the toolchain among all of them. This is very useful as we will see below in this post.


Debian-based systems

$ sudo apt install icecc

Fedora systems

$ sudo dnf install icecream

Compile it from sources

You can compile it from sources.

Configuration of icecc scheduler

You need to have an icecc scheduler in the local network that will balance the load among all the available nodes connected to it.

It does not matter which machine is the scheduler, you can use any of them as it is quite lightweight. To run the scheduler execute the following command:

$ sudo icecc-scheduler

Notice that the machine running this command is going to be the scheduler, but it will not participate in the compilation process unless you run the iceccd daemon on it as well (see next step).

Setup on icecc nodes

Launch daemon

First you need to run the iceccd daemon as root. This is not needed on Debian-based systems, as its systemd unit is enabled by default.

You can do that using systemd in the following way:

$ sudo systemctl start iceccd

Or you can enable the daemon at startup time:

$ sudo systemctl enable iceccd

The daemon will connect automatically to the scheduler that is running in the local network. If that’s not the case, or if there is more than one scheduler, you can run it standalone and give the scheduler’s IP as a parameter:

$ sudo iceccd -s <ip_scheduler>

Enable icecc compilation

With ccache

If you use ccache (recommended option), you just need to add the following in your .bashrc:

export CCACHE_PREFIX=icecc

Without ccache

To use it without ccache, you need to add its path to the $PATH environment variable so that it is picked up before the system compilers:

export PATH=/usr/lib/icecc/bin:$PATH


Same architecture

If you followed the previous steps, any time you compile anything in C/C++, the work will be distributed among the fastest nodes in the network. Notice that icecc takes into account system load, network connection and number of cores, among other variables, to decide which node will compile each object file.

Remember that the linking stage is always done in the machine that submits the job.

Different architectures (example cross-compiling for aarch64 on x86_64 nodes)

Icemon showing my x86_64 desktop (maxwell) cross-compiling a job for my aarch64 board (rb3).

Preparation on x86_64 machine

In one x86_64 machine, you need to create a toolchain. This is not automatically done by icecc as you can have different toolchains for cross-compilation.

Install cross-compiler

For example, you can install the cross-compiler from the distribution repositories:

For Debian-based systems:

$ sudo apt install crossbuild-essential-arm64

For Fedora:

$ sudo dnf install gcc-aarch64-linux-gnu gcc-c++-aarch64-linux-gnu

Create toolchain for icecc

Finally, to create the toolchain to share in icecc:

$ icecc-create-env --gcc /usr/bin/aarch64-linux-gnu-gcc /usr/bin/aarch64-linux-gnu-g++

This will create a <hash>.tar.gz file. The <hash> is used to identify the toolchain to distribute among the nodes in case there is more than one. But don’t worry, once it is copied to a node, it won’t be copied again as it detects it is already present.

Note: it is important that the toolchain is compatible with the target machine. For example, if my aarch64 board is using Debian 11 Bullseye, it is better if the cross-compilation toolchain is created from a Debian Bullseye x86_64 machine (a VM also works), because you avoid incompatibilities like having different glibc versions.

If you have installed Debian 11 Bullseye in your aarch64, you can use my own cross-compilation toolchain for x86_64 and skip this step.

Copy the toolchain to the aarch64 machine

$ scp <hash>.tar.gz aarch64-machine-hostname:

Preparation on aarch64

Once the toolchain (<hash>.tar.gz) is copied to the aarch64 machine, you just need to add the following to your .bashrc:

# Icecc setup for crosscompilation
export CCACHE_PREFIX=icecc
export ICECC_VERSION=x86_64:~/<hash>.tar.gz


Just compile on the aarch64 machine and the jobs will be distributed among your x86_64 machines as well. Note that the jobs will also be shared with other aarch64 machines if icecc decides so, with no extra steps needed.

Note that the cross-compilation toolchain only needs to be created once, as icecream will copy it to all the x86_64 machines that execute jobs launched by this aarch64 machine. However, you need to copy the toolchain to every aarch64 machine that will use icecream resources for cross-compiling.

Icecream monitor


This is an interesting graphical tool to see the status of the icecc nodes and the jobs under execution.

Install on Debian-based systems

$ sudo apt install icecc-monitor

Install on Fedora

$ sudo dnf install icemon

Install it from sources

You can compile it from sources.


Even though icecream has good cross-compilation documentation, it was the post written 8 years ago by my Igalia colleague Víctor Jáquez that convinced me to set up icecream as explained in this post.

Hope you find this info as useful as I did :-)

December 06, 2021

On the road to AppStream 1.0, a lot of items from the long todo list have been done so far – only one major feature is remaining, external release descriptions, which is a tricky one to implement and specify. For AppStream 1.0 it needs to be present or be rejected though, as it would be a major change in how release data is handled in AppStream.

Besides 1.0 preparation work, the recent 0.15 release and the releases before it come with their very own large set of changes, that are worth a look and may be interesting for your application to support. But first, for a change that affects the implementation and not the XML format:

1. Completely rewritten caching code

Keeping all AppStream data in memory is expensive, especially if the data is huge (as on Debian and Ubuntu with their large repositories generated from desktop-entry files as well) and if processes using AppStream are long-running. The latter is more and more the case: not only does GNOME Software run in the background, KDE uses AppStream in KRunner and Phosh will use it too for reading form factor information. Therefore, AppStream via libappstream provides an on-disk cache that is memory-mapped, so data only consumes RAM if we are actually doing anything with it.

Previously, AppStream used an LMDB-based cache in the background, with indices for fulltext search and other common search operations. This was a very fast solution, but it also came with limitations: LMDB’s maximum key size of 511 bytes became a problem quite often, adjusting the maximum database size (since it has to be set at opening time) was annoyingly tricky, and building dedicated indices for each search operation was very inflexible. In addition to that, the caching code was changed multiple times in the past to allow system-wide metadata to be cached per-user, as some distributions didn’t (want to) build a system-wide cache and therefore ran into performance issues when XML was parsed repeatedly for generation of a temporary cache. On top of all that, the cache was designed around the concept of “one cache for data from all sources”, which meant that we had to rebuild it entirely if just a small aspect changed, like a MetaInfo file being added to /usr/share/metainfo, which was very inefficient.

To shorten a long story, the old caching code was rewritten with the new concepts of caches not necessarily being system-wide and caches existing for more fine-grained groups of files in mind. The new caching code uses Richard Hughes’ excellent libxmlb internally for memory-mapped data storage. Unlike LMDB, libxmlb knows about the XML document model, so queries can be much more powerful and we do not need to build indices manually. The library is also already used by GNOME Software and fwupd for parsing of (refined) AppStream metadata, so it works quite well for that usecase. As a result, search queries via libappstream are now a bit slower (very much depends on the query, roughly 20% on average), but can be much more powerful. The caching code is a lot more robust, which should speed up startup time of applications. And in addition to all of that, the AsPool class has gained a flag to allow it to monitor AppStream source data for changes and refresh the cache fully automatically and transparently in the background.

All software written against the previous version of the libappstream library should continue to work with the new caching code, but to make use of some of the new features, software using it may need adjustments. A lot of methods have been deprecated too now.

2. Experimental compose support

Compiling MetaInfo and other metadata into AppStream collection metadata, extracting icons, language information, refining data and caching media is an involved process. The appstream-generator tool does this very well for data from Linux distribution sources, but the tool is also pretty “heavyweight” with lots of knobs to adjust, an underlying database and a complex algorithm for icon extraction. Embedding it into other tools via anything else but its command-line API is also not easy (due to D’s GC initialization, and because it was never written with that feature in mind). Sometimes a simpler tool is all you need, so the libappstream-compose library as well as appstreamcli compose are being developed at the moment. The library contains building blocks for developing a tool like appstream-generator, while the CLI tool simply extracts metadata from any directory tree, which can be used by e.g. Flatpak. For this to work well, a lot of appstream-generator’s D code is translated into plain C, so the implementation stays identical but the language changes.

Ultimately, the generator tool will use libappstream-compose for any general data refinement, and only implement things necessary to extract data from the archive of distributions. New applications (e.g. for new bundling systems and other purposes) can then use the same building blocks to implement new data generators similar to appstream-generator with ease, sharing much of the code that would be identical between implementations anyway.

2. Supporting user input controls

Want to advertise that your application supports touch input? Keyboard input? Has support for graphics tablets? Gamepads? Sure, nothing is easier than that with the new control relation item and supports relation kind (since 0.12.11 / 0.15.0, details):


3. Defining minimum display size requirements

Some applications are unusable below a certain window size, so you do not want to display them in a software center that is running on a device with a small screen, like a phone. In order to encode this information in a flexible way, AppStream now contains a display_length relation item to require or recommend a minimum (or maximum) display size that the described GUI application can work with. For example:

  <display_length compare="ge">360</display_length>

This will make the application require a display length greater than or equal to 360 logical pixels. A logical pixel (also device independent pixel) is the amount of pixels that the application can draw in one direction. Since screens, especially phone screens but also screens on a desktop, can be rotated, the display_length value will be checked against the longest edge of a display by default (by explicitly specifying the shorter edge, this can be changed).

This feature is available since 0.13.0, details. See also Tobias Bernard’s blog entry on this topic.
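
To make the check concrete, here is a minimal Python sketch of how a software center might evaluate a display_length relation against a screen. The function name and its parameters are made up for illustration and are not part of libappstream; only the comparison names (eq, ne, lt, gt, le, ge) and the choice between the longest and shortest edge come from AppStream.

```python
import operator

# Comparison names as they appear in the compare="..." attribute.
COMPARATORS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "lt": operator.lt,
    "gt": operator.gt,
    "le": operator.le,
    "ge": operator.ge,
}


def display_length_satisfied(width, height, required, compare="ge", side="longest"):
    """Return True if a width x height display (logical pixels) meets the relation.

    `side` selects the edge that is tested: the longest edge (the default
    behaviour described above) or the shortest one. Hypothetical helper.
    """
    edge = max(width, height) if side == "longest" else min(width, height)
    return COMPARATORS[compare](edge, required)


# A phone screen of 360x720 logical pixels, in either orientation.
print(display_length_satisfied(360, 720, 360))                   # True
print(display_length_satisfied(720, 360, 360))                   # True
print(display_length_satisfied(360, 720, 768))                   # False
print(display_length_satisfied(360, 720, 480, side="shortest"))  # False
```

Because the longest edge is used by default, rotating the device does not change the result.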

4. Tags

This is a feature that was originally requested for the LVFS/fwupd, but one of the great things about AppStream is that we can take very project-specific ideas and generalize them so something comes out of them that is useful for many. The new tags tag allows people to tag components with an arbitrary namespaced string. This can be useful for project-internal organization of applications, as well as to convey certain additional properties to a software center, e.g. an application could mark itself as “featured” in a specific software center only. Metadata generators may also add their own tags to components to improve organization. AppStream gives no recommendations as to how these tags are to be interpreted except for them being a strictly optional feature. So any meaning is something clients and metadata authors need to negotiate. It therefore is a more specialized usecase of the already existing custom tag, and I expect it to be primarily useful within larger organizations that produce a lot of software components that need sorting. For example:

  <tag namespace="lvfs">vendor-2021q1</tag>
  <tag namespace="plasma">featured</tag>

This feature is available since 0.15.0, details.

5. MetaInfo Creator changes

The MetaInfo Creator (source) tool is a very simple web application that provides you with a form to fill out and will then generate MetaInfo XML to add to your project after you have answered all of its questions. It is an easy way for developers to add the required metadata without having to read the specification or any guides at all.

Recently, I added support for the new control and display_length tags, resolved a few minor issues and also added a button to instantly copy the generated output to clipboard so people can paste it into their project. If you want to create a new MetaInfo file, this tool is the best way to do it!

The creator tool will also not transfer any data out of your web browser; it is strictly a client-side application.

And that is about it for the most notable changes in AppStream land! Of course there is a lot more: additional tags for the LVFS and content rating have been added, lots of bugs have been squashed, the documentation has been refined a lot and the library has gained a lot of new API to make building software centers easier. Still, there is a lot to do and quite a few open feature requests too. Onwards to 1.0!

December 02, 2021

Khronos submission indicating Vulkan 1.1 conformance for Turnip on Adreno 618 GPU.

It is a great feat, especially for a driver which is created without hardware documentation. And we support features far beyond the bare minimum required for conformance.

But first of all, I want to thank and congratulate everyone working on the driver: Connor Abbott, Rob Clark, Emma Anholt, Jonathan Marek, Hyunjun Ko, Samuel Iglesias. And special thanks to Samuel Iglesias and Ricardo Garcia for tirelessly improving Khronos Vulkan Conformance Tests.

At the start of the year, when I started working on Turnip, I looked at the list of failing tests and thought “It wouldn’t take a lot to fix them!”, right, sure… And so I started fixing issues alongside looking for missing features.

In June there were even more failures than there were in January, how could it be? Of course we were adding new features and it accounted for some of them. However, even this list was likely not exhaustive, because in GitLab CI we ran only 1/3 of the Vulkan CTS suite instead of the whole thing. We didn’t have enough devices to run the whole suite fast enough to make it usable in CI. So I just ran it locally from time to time.

1/3 of the tests doesn’t sound bad and for the most part it’s good enough since we have a huge amount of tests looking like this:


Every format, every operation, etc. Tens of thousands of them.

Unfortunately, the selection of tests for a fractional run is as straightforward as possible: just every third test. This bites us when there are single unique tests, like:


Most of them test something unique that has much higher probability of triggering a special path in a driver compared to uncountable image tests. And they fell through the cracks. I even had to fix one test twice because the CI didn’t run it.

A possible solution is to skip tests only when there is a large swath of them and run smaller groups as-is. But it’s likely more productive to just throw more hardware at the issue =).
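
The difference between the two selection strategies can be sketched in a few lines of Python. This is only an illustration and not how the CI scripts are actually written: the function names, the group-size threshold and the test names are all made up, and a test's group is assumed to be everything before the last dot in its name.

```python
from collections import defaultdict


def every_nth(tests, fraction=3):
    """Naive fractional run: keep every `fraction`-th test of the full list."""
    return tests[::fraction]


def thin_large_groups_only(tests, fraction=3, min_group_size=10):
    """Thin out only large groups of tests and run small groups in full.

    `min_group_size` is an invented threshold for this sketch.
    """
    groups = defaultdict(list)
    for name in tests:
        groups[name.rpartition(".")[0]].append(name)

    selected = []
    for members in groups.values():
        if len(members) >= min_group_size:
            selected.extend(members[::fraction])
        else:
            selected.extend(members)
    return selected


# Made-up test names: a big family of near-identical tests plus one unique test.
tests = [f"made_up.image.format_{i}" for i in range(31)]
tests.append("made_up.unique.only_case")

naive = every_nth(tests)
grouped = thin_large_groups_only(tests)

print(len(naive), "made_up.unique.only_case" in naive)      # 11 False
print(len(grouped), "made_up.unique.only_case" in grouped)  # 12 True
```

In this toy list the unique test happens to fall between two picks of the naive selection, so it never runs, while the grouped selection costs just one extra test.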

Not enough hardware in CI

Another trouble is that we had only one 6xx sub-generation present in CI - Adreno 630. We distinguish four sub-generations. Not only do they have some different capabilities, there are also differences in the existing ones, causing the same test to pass on CI while being broken on another, newer GPU. Presently in CI we test only Adreno 618 and 630, which are “Gen 1” GPUs, and we claimed conformance only for Adreno 618.

Yet another issue is that we could render in tiling and bypass (sysmem) modes. That’s because there are a few features we could support only when there is no tiling and we render directly into the sysmem, and sometimes rendering directly into sysmem is just faster. At the moment we use tiling rendering by default unless we meet an edge case, so by default CTS tests only tiling rendering.

We are forcing sysmem mode for a subset of tests on CI, however it’s not enough because the difference between modes is relevant for more than just a few tests. Thus ideally we should run twice as many tests, and even better would be thrice as many to account for tiling mode without binning vertex shader.

That issue became apparent when I implemented a magical eight-ball to choose between tiling and bypass modes depending on the run-time information in order to squeeze more performance (it’s still work-in-progress). The basic idea is that a single draw call or a few small draw calls are faster to render directly into system memory than to load the framebuffer into the tile memory and store it back. But almost every single CTS test does exactly this: a single draw call or a few per render pass, which causes all tests to run in bypass mode. Fun!

Now we would be forced to deal with this issue since, with the magic eight-ball, games would run partly in tiling mode and partly in bypass mode, making both modes equally important for real-world workloads.
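
As a rough illustration of the basic idea only, such a heuristic could look like the Python sketch below. The function name, its inputs and both thresholds are invented here; the real driver is written in C, is still work-in-progress, and takes more run-time information and edge cases into account than this.

```python
def choose_render_mode(draw_calls, vertices, max_draw_calls=2, max_vertices=1024):
    """Pick "sysmem" (bypass) for small render passes and "tiling" otherwise.

    Both thresholds are made up for this sketch and not taken from the driver.
    """
    if draw_calls <= max_draw_calls and vertices <= max_vertices:
        return "sysmem"
    return "tiling"


# A CTS-style render pass: one small draw call.
print(choose_render_mode(draw_calls=1, vertices=6))          # sysmem
# A game-like render pass: many draw calls with lots of geometry.
print(choose_render_mode(draw_calls=300, vertices=500_000))  # tiling
```

With any rule of this shape, a test suite made almost entirely of tiny render passes ends up exercising only the bypass path.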

Does conformance matter? Does it reflect anything real-world?

Unfortunately no test suite could wholly reflect what game developers do in their games. However, the amount of tests grows and new tests are getting contributed based on issues found in games and other applications.

When I ran my stash of D3D11 game traces through DXVK on Turnip for the first time, I found a bunch of new crashes and hangs, but it took fixing just a few of them for the majority of games to render correctly. This shows that Khronos Vulkan Conformance Tests are doing their job and we at Igalia are striving to make them even better.

One of the extensions released as part of Vulkan 1.2.199 was the VK_EXT_image_view_min_lod extension. I’m happy to see it published as I have participated in the release process of this extension: from reviewing the spec exhaustively (I even contributed a few things to improve it!) to developing CTS tests for it that will eventually be merged into the CTS repo.

This extension was proposed by Valve to mirror a feature present in Direct3D 12 (check ResourceMinLODClamp here) and Direct3D 11 (check SetResourceMinLOD here). In other words, this extension allows clamping the minimum LOD value accessed by an image view to a minLod value set at image view creation time.

That way, any library or API layer that translates Direct3D 11/12 calls to Vulkan can use the extension to mirror the behavior above on Vulkan directly without workarounds, facilitating the port of Direct3D applications such as games to Vulkan. For example, projects like Vkd3d, Vkd3d-proton and DXVK could benefit from it.

Going into more detail, this extension changes how the image level selection is calculated and, when enabled, sets an additional minimum required for the image level in integer texel coordinate operations.

The way to use this feature in an application is very simple:

  • Check that the extension is supported and that the physical device supports the respective feature:
// Provided by VK_EXT_image_view_min_lod
typedef struct VkPhysicalDeviceImageViewMinLodFeaturesEXT {
    VkStructureType    sType;
    void*              pNext;
    VkBool32           minLod;
} VkPhysicalDeviceImageViewMinLodFeaturesEXT;
  • Once you know everything is working, enable both the extension and the feature when creating the device.

  • When you want to create a VkImageView that defines a minLod for image accesses, then add the following structure filled with the value you want in VkImageViewCreateInfo’s pNext.

// Provided by VK_EXT_image_view_min_lod
typedef struct VkImageViewMinLodCreateInfoEXT {
    VkStructureType    sType;
    const void*        pNext;
    float              minLod;
} VkImageViewMinLodCreateInfoEXT;

And that’s all! As you see, it is a very simple extension.

Happy hacking!

November 24, 2021

I was interested in how much work a vaapi on top of vulkan video proof of concept would be.

My main reason for being interested is actually video encoding, there is no good vulkan video encoding demo yet, and I'm not experienced enough in the area to write one, but I can hack stuff. I think it is probably easier to hack a vaapi encode to vulkan video encode than write a demo app myself.

With that in mind I decided to see what decode would look like first. I talked to Mike B (most famous zink author) before he left for holidays, then I ignored everything he told me and wrote a super hack.

This morning I convinced zink vaapi on top of anv with iris GL doing the presents in mpv to show me some useful frames of video. However, zink vaapi on anv with zink GL is failing miserably (well, green jellyfish).

I'm not sure how much more I'll push on the decode side at this stage, I really wanted it to validate the driver side code, and I've found a few bugs in there already.

The WIP hacks are at [1]. I might push on to the encode side and see if I can work out what it entails, though the encode spec work is a lot more changeable at the moment.


November 19, 2021

Last Post Of The Year

Yes, we’ve finally reached that time. It’s mid-November, and I’ve been storing up all this random stuff to unveil now because I’m taking the rest of the year off.

This will be the final SGC post for 2021. As such, it has to be a good one, doesn’t it?

Zink Roundup

It’s been a wild year for zink. Does anybody even remember how many times I finished the project? I don’t, but it’s been at least a couple. Somehow there’s still more to do though.

I’ll be updating zink-wip one final time later today with the latest Copper snapshot. This is going to be crashier than the usual zink-wip, but that’s because zink-wip doesn’t have nearly as much cool future-zink stuff as it used to these days. Nearly everything is already merged into mainline, or at least everything that’s likely to help with general use, so just use that if you aren’t specifically trying to test out Copper.

One of those things that’s been a thorn in zink’s side for a long time is PBO handling, specifically for unsupported formats like ARGB/ABGR, ALPHA, LUMINANCE, and InTeNsItY. Vulkan has no analogs for any of these, and any app/game which tries to do texture upload or download from them with zink is going to have a very, very bad time, as has been the case with CS:GO, which would take literal days to reach menus due to performing fullscreen GL_LUMINANCE texture downloads.

This is now fixed in the course of landing compute PBO download support, which I blogged about forever ago since it also yields a 2-10x performance improvement for a number of other cases in all Gallium drivers. Or at least the ones that enable it.

CS:GO should now run out of the box in Mesa 22.0, and things like RPCS3 which do a lot of PBO downloading should also see huge improvements.

That’s all I’ve got here for zink, so now it’s time once again…


That’s right, it’s happening. Change your hats, we’re a Gallium blog again for the first time in nearly five months.

Everyone remembers when I promised that you’d be able to run native Linux D3D9 games on the Nine state tracker. Well, I suppose that’s a fancy way of saying Source Engine games, aka the ones Valve ships with native Linux ports, since probably nobody else has shipped any kind of native Linux app that uses the D3D9 API, but still!

That time is now.

Right now.

No more waiting, no new Mesa release required, you can just plug it in and test it out this second for instantly improved performance.

As long as you first acknowledge that this is not a Valve-official project, and it’s only to be used for educational purposes.

But also, please benchmark it lots and tell me your findings. Again, just for educational purposes. Wink.


This has been a long time in the making. After the original post, I knew that the goal here was to eventually be able to run these games without needing any kind of specialized Mesa build, since that’s annoying and also breaks compatibility with running Nine for other purposes.

Thus I enlisted the further help of Nine expert and image enthusiast, Axel Davy, to help smooth out the rough edges once I was done fingerpainting my way to victory.

The result is a simple wrapper which can be preloaded to run any DXVK-compatible (i.e., any of them that support -vulkan) Source Engine game on Nine—and obviously this won’t work on NVIDIA blob at all so don’t bother trying.

In short:

  • clone that repo
  • right click on Properties for e.g., Left 4 Dead 2
  • change the command line to LD_PRELOAD=/path/to/Xnine/ %command% -vulkan

For Portal 2 (at present, though this won’t always be the case), you’ll also need to add NINE_VHACKS=1 to work around some frogs that were accidentally added to the latest version of the game as a developer-only easter egg.

Then just run the game normally, and if everything went right and you have Nine installed in one of the usual places, you should load up the game with Gallium Nine. More details on that in the repo’s README.

GPU Goes Brrr?

Yes. Very brrr.

Here’s your normal GL performance from a simple Portal 2 benchmark:


Around 400 FPS.

Here’s Gallium Nine:


Around 600 FPS.

A 50% improvement with the exact same backend GPU driver isn’t too bad for a simple preload shim.

Can I Get A Side Of SHOTS FIRED With That?

You got it.

What about DXVK?

This isn’t an extensive benchmark, but here we go with that too:


Also around 600 FPS.

I say “around” here because the variation is quite extreme for both Nine and DXVK based on slight changes in variable clock speeds because I didn’t pin them: Nine ranges between 590-610 FPS, and DXVK is 590-620 FPS.

So now there’s two solid, open source methods for improving performance in these games over the normal GL version. But what if we go even deeper?

What if we check out some real performance numbers?

Power Consumption

If you’ve never checked out PowerTOP, it’s a nice way to get an overview of what’s using up system resources and consuming power.

If you’ve never used it for benchmarking, don’t worry, I took care of that too.

Here’s some PowerTOP figures for the same Portal 2 timedemo:

What’s interesting here is that DXVK uses 90%+ CPU, while Nine is only using about 25%. This is a gap that’s consistent across runs, and it likely explains why a number of people find that DXVK doesn’t work on their systems: you still need some amount of CPU to run the actual game calculations, so if you’re on older hardware, you might end up using all of your available CPU just on DXVK internals.

GPU Usage?

Got you covered. Here’s a per-second poll (one row per second) from radeontop.


DXVK:

GPU Usage, VGT Usage, TA Usage, SX Usage, SH Usage, SPI Usage, SC Usage, PA Usage, DB Usage, CB Usage, VRAM Usage (% and MB), GTT Usage (% and MB), Memory Clock (% and GHz), Shader Clock (% and GHz)
35.83% 17.50% 23.33% 28.33% 17.50% 29.17% 28.33% 5.00% 27.50% 26.67% 12.75% 1038.15mb 7.82% 638.53mb 52.19% 0.457ghz 33.52% 0.704ghz
35.83% 17.50% 23.33% 28.33% 17.50% 29.17% 28.33% 5.00% 27.50% 26.67% 12.75% 1038.15mb 7.82% 638.53mb 52.19% 0.457ghz 33.52% 0.704ghz
36.67% 30.00% 33.33% 35.00% 30.00% 35.00% 32.50% 18.33% 30.83% 28.33% 12.76% 1038.57mb 7.82% 638.53mb 48.88% 0.428ghz 36.95% 0.776ghz
75.83% 63.33% 62.50% 66.67% 63.33% 68.33% 65.00% 27.50% 60.83% 53.33% 12.76% 1038.73mb 7.82% 638.53mb 100.00% 0.875ghz 95.82% 2.012ghz
71.67% 60.00% 60.00% 64.17% 60.00% 66.67% 60.83% 23.33% 56.67% 51.67% 12.76% 1038.73mb 7.82% 638.53mb 100.00% 0.875ghz 96.31% 2.023ghz
75.00% 62.50% 66.67% 66.67% 62.50% 69.17% 68.33% 23.33% 65.83% 59.17% 12.76% 1038.73mb 7.82% 638.53mb 100.00% 0.875ghz 96.71% 2.031ghz
63.33% 55.00% 56.67% 58.33% 55.00% 59.17% 59.17% 17.50% 52.50% 50.00% 12.76% 1038.73mb 7.82% 638.53mb 100.00% 0.875ghz 89.77% 1.885ghz
78.33% 64.17% 64.17% 65.00% 64.17% 69.17% 70.83% 30.00% 63.33% 58.33% 12.76% 1038.73mb 7.82% 638.53mb 100.00% 0.875ghz 97.33% 2.044ghz
73.33% 60.83% 64.17% 65.00% 60.83% 67.50% 64.17% 29.17% 59.17% 51.67% 12.76% 1038.73mb 7.82% 638.53mb 100.00% 0.875ghz 97.39% 2.045ghz
60.83% 50.83% 50.00% 53.33% 50.83% 55.00% 50.83% 25.83% 48.33% 45.00% 12.76% 1038.73mb 7.82% 638.53mb 100.00% 0.875ghz 95.35% 2.002ghz
67.50% 50.00% 55.00% 59.17% 50.00% 60.00% 58.33% 28.33% 52.50% 45.00% 12.76% 1038.73mb 7.82% 638.53mb 100.00% 0.875ghz 87.91% 1.846ghz


Nine:

GPU Usage, VGT Usage, TA Usage, SX Usage, SH Usage, SPI Usage, SC Usage, PA Usage, DB Usage, CB Usage, VRAM Usage (% and MB), GTT Usage (% and MB), Memory Clock (% and GHz), Shader Clock (% and GHz)
17.50% 11.67% 15.00% 10.83% 11.67% 15.00% 10.83% 3.33% 10.83% 10.00% 7.38% 600.56mb 1.60% 130.48mb 50.38% 0.441ghz 15.76% 0.331ghz
17.50% 11.67% 15.00% 10.83% 11.67% 15.00% 10.83% 3.33% 10.83% 10.00% 7.38% 600.56mb 1.60% 130.48mb 50.38% 0.441ghz 15.76% 0.331ghz
70.83% 63.33% 65.83% 60.00% 63.33% 68.33% 57.50% 24.17% 56.67% 54.17% 7.35% 598.43mb 1.60% 130.48mb 89.50% 0.783ghz 77.09% 1.619ghz
74.17% 70.00% 67.50% 60.00% 70.00% 70.83% 61.67% 17.50% 60.83% 58.33% 7.35% 598.42mb 1.60% 130.47mb 100.00% 0.875ghz 91.03% 1.912ghz
78.33% 69.17% 72.50% 65.00% 69.17% 75.83% 65.83% 15.00% 65.83% 64.17% 7.37% 599.80mb 1.60% 130.47mb 100.00% 0.875ghz 93.92% 1.972ghz
70.83% 67.50% 64.17% 55.00% 67.50% 67.50% 57.50% 20.83% 55.83% 53.33% 7.35% 598.42mb 1.60% 130.47mb 100.00% 0.875ghz 91.93% 1.930ghz
65.00% 64.17% 60.00% 51.67% 64.17% 61.67% 53.33% 18.33% 52.50% 50.83% 7.37% 599.80mb 1.60% 130.47mb 100.00% 0.875ghz 89.95% 1.889ghz
74.17% 68.33% 70.00% 60.83% 68.33% 72.50% 65.00% 24.17% 64.17% 58.33% 7.35% 598.42mb 1.60% 130.47mb 100.00% 0.875ghz 92.53% 1.943ghz
77.50% 73.33% 73.33% 62.50% 73.33% 75.00% 61.67% 22.50% 62.50% 57.50% 7.35% 598.42mb 1.60% 130.47mb 100.00% 0.875ghz 91.21% 1.915ghz
70.00% 65.83% 60.00% 57.50% 65.83% 61.67% 59.17% 24.17% 55.00% 54.17% 7.35% 598.42mb 1.60% 130.47mb 100.00% 0.875ghz 92.69% 1.946ghz
70.00% 65.83% 60.00% 57.50% 65.83% 61.67% 59.17% 24.17% 55.00% 54.17% 7.35% 598.42mb 1.60% 130.47mb 100.00% 0.875ghz 92.69% 1.946ghz

Again, here we see a number of interesting things. DXVK consistently provokes slightly higher clock speeds (because I didn’t pin them), which may explain why it skews slightly higher in the benchmark results. DXVK also uses nearly 2x more VRAM and nearly 5x more GTT. On more modern hardware it’s unlikely that this would matter since we all have more GPU memory than we can possibly use in an OpenGL game, but on older hardware—or in cases where memory usage might lead to power consumption that should be avoided because we’re running on battery—this could end up being significant.
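
For anyone who wants the memory ratios spelled out, here is a small Python calculation using the steady-state values from the two radeontop tables. It assumes that the first table (the one with the larger VRAM and GTT figures) is the DXVK run and the second is the Nine run, which is what the paragraph above implies.

```python
# Steady-state memory usage in MB, read from the radeontop rows above.
dxvk = {"vram_mb": 1038.73, "gtt_mb": 638.53}
nine = {"vram_mb": 598.42, "gtt_mb": 130.47}

vram_ratio = dxvk["vram_mb"] / nine["vram_mb"]
gtt_ratio = dxvk["gtt_mb"] / nine["gtt_mb"]

print(f"VRAM: DXVK uses {vram_ratio:.2f}x as much as Nine")  # 1.74x
print(f"GTT:  DXVK uses {gtt_ratio:.2f}x as much as Nine")   # 4.89x
```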


Source Engine games run great on Linux. That’s what we all care about at the end of the day, isn’t it?

But also, if more Source Engine games get ported to DXVK, give them a try with Nine. Or just test the currently ported ones, Portal 2 and Left 4 Dead 2.

I want data.

Lots of data.

Post it here, email it to me, whatever.

Until 2022

Lots of cool projects still in the works, so stay tuned next year!

November 18, 2021

If you own a laptop (Dell, HP, Lenovo) with a WWAN module, it is very likely that the modules are FCC-locked on every boot, and the special FCC unlock procedure needs to be run before they can be used.

Until ModemManager 1.18.2, the procedure was automatically run for the FCC unlock procedures we knew about, but this will no longer happen. Once 1.18.4 is out, the procedure will need to be explicitly enabled by each user, under their own responsibility, or otherwise implicitly enabled after installing an official FCC unlock tool provided by the manufacturer itself.

See a full description of the rationale behind this change in the ModemManager documentation site and the suggested code changes in the gitlab merge request.

If you want to enable the ModemManager provided unofficial FCC unlock tools once you have installed 1.18.4, run (assuming sysconfdir=/etc and datadir=/usr/share) this command (*):

sudo ln -sft /etc/ModemManager/fcc-unlock.d /usr/share/ModemManager/fcc-unlock.available.d/*

The user-enabled tools in /etc should not be removed during package upgrades, so this should be a one-time setup.

(*) Updated to have one single command instead of a for loop; thanks heftig!

November 17, 2021

What If Zink Was Actually The Fastest GL Driver?

In an earlier post I talked about Copper and what it could do on the way to a zink future.

What I didn’t talk about was WSI, or the fact that I’ve already fully implemented it in the course of bashing Copper into a functional state.

Window System Integration

…was the final step for zink to become truly usable.

At present, zink has a very hacky architecture where it loads through the regular driver path, but then for every image that is presented on the screen, it keeps a shadow copy which it blits to just before scanout, and this is the one that gets displayed.

Usually this works great other than the obvious (but minor) overhead that the blit incurs.

Where it doesn’t work great, however, is on non-Mesa drivers.

That’s right. I’m looking at you, NVIDIA.

As long-time blog enthusiasts will remember, I had NVIDIA running on zink some time ago, but there was a problem as it related to performance. Specifically, that single blit turned into a blit and then a full-frame CPU copy, which made getting any sort of game running with usable FPS a bit of a challenge.

WSI solves this by letting the Vulkan driver handle the scanout image entirely, removing all the copies to let zink render more like a normal driver (or game/app).

So How Is it?

That’s what everyone’s probably wondering. I have zink. I have WSI. I have my RTX2070 with the NVIDIA blob driver.

How does NVIDIA’s Vulkan driver (with zink) stack up to NVIDIA’s GL driver?

Everything below is using the 495.44 beta driver, as that’s the latest one at the time of my testing, and the non-beta driver didn’t work at all.

But first, can NVIDIA’s GL driver even render the game I want to show?


Confusingly, the answer is no, this version of NVIDIA’s GL driver can’t correctly render Tomb Raider, which is my go-to for all things GL and benchmarking. I’m gonna let that slide though since it’s still pumping out those frames at a solid rate.

It’s frustrating, but sometimes just passing CTS isn’t enough to be able to run some games, or there are certain extensions (ARB_bindless_texture) which are under-covered.

The Numbers Don’t Lie

I’ll say as a prelude that it was a bit challenging to get a AAA game running in this environment. There are some very strange issues happening with the NVIDIA Vulkan driver which prevented me from running quite a lot of things. Tomb Raider was the first one I got going after two full days of hacking at it, and that’s about what my time budget allowed for the excursion, so I stopped at that.

Up first: NVIDIA’s GL driver (495.44) nvtr-gl.png

Second: NVIDIA’s Vulkan driver (495.44) nvtr.png

As we can see, zink with NVIDIA’s Vulkan driver is roughly 25-30% faster than NVIDIA’s GL driver for Tomb Raider.

In Closing

I doubt that zink maintains this performance gap for all titles, but now we know that there are already at least some cases where it can pull ahead. Given that most vendors are shifting resources towards current-year graphics APIs like Vulkan and D3D12, it won’t be surprising if maintenance-mode GL drivers start to fall behind actively developed Vulkan drivers.

In short, there’s a real possibility that zink can provide tangible benefits to vendors who only want to ship Vulkan drivers, and those benefits might be more than (eventually) providing a conformant GL implementation.

Stay tuned for tomorrow when I close out the week strong with one final announcement for the year.

November 15, 2021

Previously I mentioned having AMD VCN h264 support. Today I added initial support for the older UVD engine[1]. This is found on chips from Vega back to SI.

I've only tested it on my Vega so far.

I also worked out the "correct" answer to how to send the reset command, however the nvidia player I'm using as a demo doesn't do things that way yet, so I've forked it for now[2].

The answer is to use vkCmdControlVideoCodingKHR to send a reset the first time a session is used. However, I can't see how the app is meant to know this is necessary, so I've asked the appropriate people.

The initial anv branch I mentioned last week is now here[3].




Copper: It’s A Thing (Sort of)

Over the past months, I’ve been working with Adam “X Does What I Say” Jackson to try and improve zink’s path through the arcane system of chutes and ladders that comprises Gallium’s loader architecture. The recent victory in getting a Wayland system compositor running is the culmination of those efforts.

I wanted to write at least a short blog post detailing some of the Gallium changes that were necessary to make this happen, if only so I have something to refer back to when I inevitably break things later, so let’s dig in.

Pipes: How Do They Work?

It’s questionable to me whether anyone really knows how all the Gallium loader and DRI frontend interfaces work without first taking a deep read of the code and then having a nice, refreshing run around the block screaming to let all the crazy out. From what I understand of it, there’s the DRI (userspace) interface, which is used by EGL/GBM/GLX/SMH to manage buffers for scanout. DRI itself is split between software and platform; each DRI interface is a composite made of all the “extensions” which provide additional functionality to enable various API-level extensions.

It’s a real disaster to have to work with, and ideally the eventual removal of classic drivers will allow it to be simplified so that mere humans like me can comprehend its majesty.

Beyond all this, however, there’s the notion that the DRI frontend is responsible for determining the size of the scanout buffer as well as various other attributes. The software path through this is nice and simple since there’s no hardware to negotiate with, and the platform path exists.

Currently, zink runs on the platform path, which means that the DRI frontend is what “runs” zink. It chooses the framebuffer size, manages resizes, and handles multisample resolve blits as needed for every frame that gets rendered.

Too Many CooksPipes

The problem with this methodology is that there are effectively two WSI systems active simultaneously: the Mesa DRI architecture, and the (eventual) Vulkan WSI infrastructure. Vulkan WSI isn’t going to work at all if it isn’t in charge of deciding things like window size, which means that the existing DRI architecture can’t work, neither in the platform mode nor the software mode.

As we know, there can be only one.

Thus Adam has been toiling away behind the scenes, taking neither vacation nor lunch break for the past ten years in order to iterate on a more optimal solution.

The result?


If you’re a Mesa developer or just a metallurgist, you know why the name Copper was chosen.

The premise of Copper is that it’s a DRI interface extension which can be used exclusively by zink to avoid any of the problem areas previously mentioned. The application will create a window, create a GL context for it, and (eventually) Vulkan WSI can figure things out by just having the window/surface passed through. This shifts all the “driving” WSI code out of DRI and into Vulkan WSI, which is much more natural.

In addition to Copper, zink can now be bound to a slight variation of the Gallium software loader to skip all the driver querying bits. There’s no longer anything to query, as DRI doesn’t have to make decisions anymore. It just calls through to zink normally, and zink can handle everything using the Vulkan API.

Simple and clean.


This all requires a ton of code. Looking at the two largest commits:

  • 29 files changed, 1413 insertions(+), 540 deletions(-)
  • 23 files changed, 834 insertions(+), 206 deletions(-)

Is a big yikes.

I can say with certainty that these improvements won’t be landing before 2022, but eventually they will in one form or another, and then zink will become significantly more flexible.

November 12, 2021

Last week I mentioned I had the basics of h264 decode using the proposed vulkan video on radv. This week I attempted to do the same thing with Intel's Mesa vulkan driver "anv".

Now I'd previously unsuccessfully tried to get vaapi on crocus working but got sidetracked back into other projects. The Intel h264 decoder hasn't changed a lot between ivb/hsw/gen8/gen9 era. I ported what I had from crocus to anv and started trying to get something to decode on my WhiskeyLake.

I wrote the code pretty early on, figured out all the things I had to send the hardware.

The first anv-side bridge to cross was that Vulkan exposes an H264 picture-level decode API, which means you get handed the encoded slice data. However, to program the Intel hw you need to decode the slice header, so I wrote a slice header decoder in some common code. The other thing you need to give the Intel hw is the number of bits in the slice header, which in some encoding schemes is rounded to bytes and in some isn't. Slice headers also have a 3-byte header on them, which Intel hardware wants you to discard or skip before handing the data to it.
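As a rough illustration of those two fix-ups, here is a minimal Python sketch. It is not the anv or crocus code. It assumes the 3-byte header is the H.264 Annex B start code prefix (0x000001), and both helper names are hypothetical.

```python
# Assumption: the "3-byte header" is the Annex B start code prefix.
START_CODE = b"\x00\x00\x01"


def skip_start_code(slice_data: bytes) -> bytes:
    """Return the slice data without a leading 3-byte start code, if present."""
    if slice_data.startswith(START_CODE):
        return slice_data[len(START_CODE):]
    return slice_data


def round_header_bits_to_bytes(header_bits: int) -> int:
    """Round a slice header length in bits up to a whole number of bytes."""
    return (header_bits + 7) // 8 * 8


payload = skip_start_code(b"\x00\x00\x01\x65\x88\x84")
print(payload.hex())
print(round_header_bits_to_bytes(13))
```

Whether the rounded or the exact bit count is needed depends on the encoding scheme, so a real driver would pick between the two per stream rather than always rounding.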

Once I'd fixed up that sort of thing in anv + crocus, I started getting grey I-frames decoded with later B/P frames using the grey frames as references so you'd see this kinda weird motion.

That was I think 3 days ago. I've stared at this intently for those 3 days, blaming everything from bitstream encoding to rechecking all my packets (not enough times though). I had someone else verify they could see grey frames.

Today after a long discussion about possibilities, I was randomly comparing a frame from the intel-vaapi-driver and from crocus, and I spotted a packet header that the docs say is 34 dwords long, but intel-vaapi was only encoding 16 dwords. I switched crocus to explicitly state a 16-dword length and I started seeing my I-frames.

Now the B/P frames still have issues. I don't think I'm getting the ref frames logic right yet, but it felt like a decent win after 3 days of staring at it.

The crocus code is [1]. The anv code isn't cleaned up enough to post a pointer to yet, though enterprising people might find it. Next week I'll clean it all up, and then start to ponder upstream paths and shared code for radv + anv. Then h265 maybe.


A Long Time Coming

Zink can now run all display platform flavors of Weston (and possibly other compositors?). Expect it in zink-wip later today once it passes another round of my local CI.

Here it is in DRM running weston-simple-egl and weston-simple-dmabuf-egl all on zink:


Under Construction

This has a lot of rough edges, mostly as it relates to X11. In particular:

  • xservers (including xwayland) can’t run because GLAMOR is hard
  • some apps (e.g., Unigine Heaven) randomly get killed by the xserver for unknown reasons
  • if you’re very lucky, you can hit a Vulkan WSI deadlock


I’d go into details on this, but honestly it’s going to be like a week of posts to detail the sheer amount of chainsawing that’s gone into the project.

Stay tuned for that and more next week.

November 11, 2021

Everyone Knows…

That the one true benchmark for graphics is glxgears. It’s been the standard for 20+ years, and it’s going to remain the standard for a long time to come.

Gears Through Time

Zink has gone through a couple phases of glxgears performance.

Everyone remembers weird glxgears that was getting illegal amounts of frames due to its misrendering:


We salute you, old friend.

Now, however, some number of you have become aware of the new threat posed by heavy gears in the Mesa 21.3 release. Whereas glxgears is usually a lightweight, performant benchmarking tool, heavy gears is the opposite, chugging away at up to 20% of a single CPU core with none of the accompanying performance.



What Creates Such A Monster?

The answer won’t surprise you: GL_QUADS.

Indeed, because zink is a driver relying on the Vulkan API, only the primitive types supported by Vulkan can be directly drawn. This means any app using GL_QUADS is going to have a very bad time.

glxgears is exactly one of these apps, and (now that there’s a ticket open) I was forced to take action.


The root of the problem here is that gears passes its vertices into GL to be drawn as a rectangle, but zink can only draw triangles. This (currently) results in doing a very non-performant readback of the index buffer before every draw call to convert the draw to a triangle-based one.
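To make the conversion concrete, here is a minimal Python sketch of turning a GL_QUADS-style vertex stream into triangle-list indices. It is only an illustration, not Mesa’s actual translation code: the function name is hypothetical, and the split of each quad into (v0, v1, v2) and (v0, v2, v3) is an assumption (a real driver also has to respect provoking-vertex conventions).

```python
def quads_to_triangle_indices(vertex_count):
    """Build triangle-list indices for a GL_QUADS-style vertex stream.

    Each complete quad (v0, v1, v2, v3) becomes the two triangles
    (v0, v1, v2) and (v0, v2, v3). Trailing vertices that do not
    form a complete quad are dropped.
    """
    indices = []
    for quad in range(vertex_count // 4):
        base = quad * 4
        indices += [base, base + 1, base + 2,
                    base, base + 2, base + 3]
    return indices


print(quads_to_triangle_indices(8))
```

The index pattern depends only on the vertex count, so it can be generated without reading anything back from the GPU.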

A smart person might say “Why not just convert the vertices to triangles as you get them instead of waiting until they’re in the buffer?”

Thankfully, a smart person did say that and then do the accompanying work. The result is that finally, after all these years, zink can actually perform well in a real benchmark:


Stay Tuned

For more exciting zink updates. You won’t want to miss them.

November 10, 2021


In the course of working on more CI-related things for zink, I came across a series of troublesome tests (KHR-GL46.geometry_shader.rendering.rendering.triangles_*) that triggered a severe performance issue. Specifically, the LLVM optimizer spends absolute ages trying to optimize ubershaders like this one used in the tests:

#version 440

in  vec4 position;
out vec4 vs_gs_color;

uniform bool  is_lines_output;
uniform bool  is_indexed_draw_call;
uniform bool  is_instanced_draw_call;
uniform bool  is_points_output;
uniform bool  is_triangle_fan_input;
uniform bool  is_triangle_strip_input;
uniform bool  is_triangle_strip_adjacency_input;
uniform bool  is_triangles_adjacency_input;
uniform bool  is_triangles_input;
uniform bool  is_triangles_output;
uniform ivec2 renderingTargetSize;
uniform ivec2 singleRenderingTargetSize;

void main()
    gl_Position = position + vec4(float(gl_InstanceID) ) * vec4(0, float(singleRenderingTargetSize.y) / float(renderingTargetSize.y), 0, 0) * vec4(2.0);
    vs_gs_color = vec4(1.0, 0.0, 0.0, 0.0);

    if (is_lines_output)
        if (!is_indexed_draw_call)
            if (is_triangle_fan_input)
                   case 0:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 1:
                   case 5:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 2:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   case 3:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 4:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
            if (is_triangle_strip_input)
                   case 1:
                   case 6: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 0:
                   case 4:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 2:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   case 3:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 5:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
            if (is_triangle_strip_adjacency_input)
                   case 2:
                   case 12: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 0:
                   case 8:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 4:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   case 6:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 10: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
            if (is_triangles_input)
                   case 0: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 1: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 2: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   case 3: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 4: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   case 5: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 6: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 7: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 8: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   case 9:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 10: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   case 11: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
            if (is_triangles_adjacency_input)
                vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0);

                    case 0: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 2: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                    case 4: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                    case 6: vs_gs_color  = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 8: vs_gs_color  = vec4(0.3, 0.4, 0.5, 0.6); break;
                    case 10: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 12: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 14: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 16: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 18: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 20: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 22: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
            if (is_triangles_input)
                    case 11: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 10: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                    case 9:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                    case 8:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 7:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                    case 6:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 5:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 4:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 3:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 2:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 1:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 0:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
            if (is_triangle_fan_input)
                   case 5:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 4:
                   case 0:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 3:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   case 2:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 1:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
            if (is_triangle_strip_input)
                switch (gl_VertexID)
                   case 5:
                   case 0: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 6:
                   case 2:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 4:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   case 3:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 1:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
            if (is_triangle_strip_adjacency_input)
                   case 11:
                   case 1:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 13:
                   case 5:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 9:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   case 7:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 3:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
            if (is_triangles_adjacency_input)
                vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0);

                    case 23: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 21: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                    case 19: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                    case 17: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 15: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                    case 13: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 11: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 9:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 7:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 5: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 3: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 1: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
    if (is_points_output)
        if (!is_indexed_draw_call)
            if (is_triangles_adjacency_input)
                vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0);

                switch (gl_VertexID)
                    case 0:
                    case 6:
                    case 12:
                    case 18: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 2:
                    case 22: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 4:
                    case 8: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 10:
                    case 14: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                    case 16:
                    case 20: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
            if (is_triangle_fan_input)
                   case 0:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 1:
                   case 5:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   case 2:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 3:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 4:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
            if (is_triangle_strip_input)
                switch (gl_VertexID)
                   case 1:
                   case 4:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 0:
                   case 6:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   case 2:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 3:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 5:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
            if (is_triangle_strip_adjacency_input)
                switch (gl_VertexID)
                   case 2:
                   case 8:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 0:
                   case 12: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   case 4:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 6:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 10: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
            if (is_triangles_input)
                switch (gl_VertexID)
                    case 0:
                    case 3:
                    case 6:
                    case 9: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 1:
                    case 11: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 2:
                    case 4: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 5:
                    case 7: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                    case 8:
                    case 10: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
            if (is_triangle_fan_input)
                switch (gl_VertexID)
                   case 5:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 4:
                   case 0:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   case 3:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 2:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 1:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
            if (is_triangle_strip_input)
                switch (gl_VertexID)
                   case 5:
                   case 2:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 6:
                   case 0:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   case 4:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 3:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 1:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
            if (is_triangle_strip_adjacency_input)
                switch (gl_VertexID)
                   case 11:
                   case 5:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 13:
                   case 1:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   case 9:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 7:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 3:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
            if (is_triangles_adjacency_input)
                vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0);
                switch (gl_VertexID)
                    case 23:
                    case 17:
                    case 11:
                    case 5: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 21:
                    case 1: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 19:
                    case 15: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 13:
                    case 9: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                    case 7:
                    case 3: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
            if (is_triangles_input)
                switch (gl_VertexID)
                    case 11:
                    case 8:
                    case 5:
                    case 2: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 10:
                    case 0: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 9:
                    case 7: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 6:
                    case 4: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                    case 3:
                    case 1: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
    if (is_triangles_output)
        int vertex_id = 0;

        if (!is_indexed_draw_call && is_triangles_adjacency_input && (gl_VertexID % 2 == 0) )
            vertex_id = gl_VertexID / 2 + 1;
            vertex_id = gl_VertexID + 1;

        vs_gs_color = vec4(float(vertex_id) / 48.0, float(vertex_id % 3) / 2.0, float(vertex_id % 4) / 3.0, float(vertex_id % 5) / 4.0);

By ages I mean upwards of 10 minutes per test.


When In Doubt, Inline

Fortunately, zink already has tools to combat exactly this problem: ZINK_INLINE_UNIFORMS.

This feature analyzes shaders to determine if inlining uniform values will be beneficial, and if so, it rewrites the shader with the uniform values as constants rather than loads. This brings the resulting NIR for the shader from 4000+ lines down to just under 300. The tests all become near-instant to run as well.

Uniform inlining has been in zink for a while, but it’s been disabled by default (except on zink-wip for testing) as this isn’t a feature that’s typically desirable when running apps/games; every time the uniforms are updated, a new shader must be compiled, and this causes (even more) stuttering, making games on zink (even more) unplayable.

On CPU-based drivers like lavapipe, however, the time to compile a shader is usually less than the time to actually run a shader, so the trade-off becomes worth doing.
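The trade-off can be sketched with a toy model in Python. This is not zink’s implementation: the shader is reduced to a list of (guard uniform, body) pairs, and `inline_uniforms` and `VariantCache` are hypothetical names. The only point is that inlining shrinks the shader, while every new combination of uniform values costs another compile.

```python
def inline_uniforms(branches, uniforms):
    """Keep only the bodies whose guarding bool uniform is true."""
    return [body for guard, body in branches if uniforms[guard]]


class VariantCache:
    """Cache one specialized shader variant per set of uniform values."""

    def __init__(self, branches):
        self.branches = branches
        self.variants = {}
        self.compiles = 0  # counts how often a new variant had to be built

    def get(self, uniforms):
        key = tuple(sorted(uniforms.items()))
        if key not in self.variants:
            self.compiles += 1
            self.variants[key] = inline_uniforms(self.branches, uniforms)
        return self.variants[key]


ubershader = [
    ("is_lines_output", "lines body"),
    ("is_points_output", "points body"),
    ("is_triangles_output", "triangles body"),
]
cache = VariantCache(ubershader)
lines_only = {"is_lines_output": True, "is_points_output": False,
              "is_triangles_output": False}
print(cache.get(lines_only), cache.compiles)
```

Calling `cache.get` again with the same values reuses the variant, but flipping any uniform builds a new one, which is where the stuttering in games comes from.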

Stay tuned for exciting announcements in the next few days.

November 05, 2021

A few weeks ago I watched Victor's excellent talk on Vulkan Video. This made me question my skills in this area. I'm pretty vague on video processing hardware, I really have no understanding of H264 or any of the standards. I've been loosely following the Vulkan video group inside of Khronos, but I can't say I've understood it or been useful.

radeonsi has a gallium vaapi driver, that talks to firmware driver encoder on the hardware, surely copying what it is programming can't be that hard. I got an mpv/vaapi setup running and tested some videos on that setup just to get comfortable. I looked at what sort of data was being pushed about.

The thing is the firmware is doing all the work here, the driver is mostly just responsible for taking semi-parsed h264 bitstream data structures and giving them in memory buffers to the fw API. Then the resulting decoded image should be magically in a buffer.

I then got the demo nvidia video decoder application mentioned in Victor's talk.

I ported the code to radv in a couple of days, but then began a long journey into the unknown. The firmware is quite particular about exactly what it wants and when it wants it. After fixing some interactions with the video player, I started to dig.

Now vaapi and DXVA (Windows) are context based APIs. This means they are like OpenGL, where you create a context, do a bunch of work, and tear it down; the driver does all the hw queuing of commands internally. All the state is held in the context. Vulkan is a command buffer based API. The application records command buffers and then enqueues those command buffers to the hardware itself.

So the vaapi driver works like this for a video

create hw ctx, flush, decode, flush, decode, flush, decode, flush, decode, flush, destroy hw ctx, flush

However Vulkan wants things to be more like

Create Session, record command buffer with (begin, decode, end) send to hw, (begin, decode, end), send to hw, End Session

There is no way at the Create/End session time to submit things to the hardware.

After a week or two of hair removal and insightful irc chats I stumbled over a decent enough workaround to avoid the hw dying and managed to decode a H264 video of some jellyfish.

The work is based on a bunch of other stuff, and is in no way suitable for upstreaming yet, not to mention the Vulkan specification is only beta/provisional so can't be used anywhere outside of development.

The preliminary code is in my gitlab repo here[1]. It has a start on h265 decode, but it's not working at all yet, and I think the h264 code is a bit hangy randomly.

I'm not sure where this is going yet, but it was definitely an interesting experiment.


November 04, 2021

A basic example of the git alias function syntax looks like this.

    shortcut = "!f() { \
        echo Hello world!; \
    }; f"

This syntax defines a function f and then calls it. These aliases are executed in a sh shell, which means there's no access to Bash / Zsh specific functionality.

Every command is ended with a ; and each line ended with a \. This is easy enough to grok. But when we try to clean up the above snippet and add some quotes to "Hello world!", we hit this obtuse error message.

}; f: 1: Syntax error: end of file unexpected (expecting "}")

This syntax error is caused by quotes needing to be escaped. The reason for this comes down to how git tokenizes and executes these functions. If you're curious …