May 16, 2022
We haven’t posted updates to the work done on the V3DV driver since
we announced the driver becoming Vulkan 1.1 Conformant.
But after reaching that milestone, we’ve been very busy working on more improvements, so let’s summarize the work done since then.
As mentioned in past posts, for the Vulkan driver we tried to focus as much as possible on the userspace part, re-using the existing V3D kernel interface already used by the OpenGL driver without modifying or extending it.
This worked fine in general, except for synchronization. The V3D kernel interface only supported one synchronization object per submission. This didn't map well to Vulkan synchronization, which is more detailed and complex and allows defining several semaphores/fences. We initially handled the situation with workarounds and left some optional features unsupported.
After our 1.1 conformance work, our colleague Melissa Wen started to work on adding support for multiple semaphores on the V3D kernel side. She then also implemented the changes in V3DV to use this new feature. If you want more technical info, she wrote a very detailed explanation on her blog (part1 and part2).
For now the driver has two codepaths that are used depending on whether the kernel supports this new feature. That also means that, depending on the kernel, the V3DV driver could expose a slightly different set of supported features.
For a while, Mesa developers have been doing a great effort to refactor and move common functionality to a single place, so it can be used by all drivers, reducing the amount of code each driver needs to maintain.
During these months we have been porting V3DV to some of that infrastructure, from small bits (common VkShaderModule to NIR code), to a really big one: common synchronization framework.
As mentioned, the Vulkan synchronization model is really detailed and powerful. But that also means it is complex. V3DV support for Vulkan synchronization included heavy use of threads. For example, V3DV needed to rely on a CPU wait (polling with threads) to implement vkCmdWaitEvents, as the GPU lacked a mechanism for this.
This was common to several drivers. So at some point there were multiple versions of complex synchronization code, one per driver. But, some months ago, Jason Ekstrand refactored Anvil support and collaborated with other driver developers to create a common framework. Obviously each driver would have their own needs, but the framework provides enough hooks for that.
After some gitlab and IRC chats, Jason provided a Merge Request with the port of V3DV to this new common framework, that we iterated and tested through the review process.
Also, with this port we got timeline semaphore support for free. Thanks to this change, we have ~1.2k fewer total lines of code (and more features!).
Again, we want to thank Jason Ekstrand for all his help.
Since 1.1 was announced, the following extension was implemented and exposed:
If you want more details about VK_KHR_pipeline_executable_properties, Iago recently wrote a blog post about it (here).
Android support for V3DV was added thanks to the work of Roman Stratiienko, who implemented this and submitted Mesa patches. We also want to thank the Android RPi team, and the Lineage RPi maintainer (Konsta) who also created and tested an initial version of that support, which was used as the baseline for the code that Roman submitted. I didn’t test it myself (it’s in my personal TO-DO list), but LineageOS images for the RPi4 are already available.
In addition to new functionality, we have also been working on improving performance. Most of the focus was on the V3D shader compiler, as improvements there are shared between the OpenGL and Vulkan drivers.
One feature specific to the Vulkan driver for now (a port to OpenGL is pending) is double-buffer mode, which is only available when MSAA is not enabled. This mode splits the tile buffer size in half, so the driver can start processing the next tile while the current one is being stored in memory.
In theory this could improve performance by reducing tile store overhead, so it would be more beneficial when vertex/geometry shaders aren't too expensive. However, it comes at the cost of reducing the tile size, which also causes some overhead of its own.
Testing shows that this helps in some cases (e.g. the Vulkan Quake ports) but hurts in others (e.g. Unreal Engine 4), so for the time being we don't enable it by default. It can be enabled selectively by adding V3D_DEBUG=db to the environment variables. The idea for the future is to implement a heuristic that decides when to activate this mode.
If you are interested in watching an overview of the improvements and changes to the driver during the last year, we gave a presentation at FOSDEM 2022:
“v3dv: Status Update for Open Source Vulkan Driver for Raspberry Pi 4”
May 13, 2022
In late 2020, Apple debuted the M1 with Apple’s GPU architecture, AGX, rumoured to be derived from Imagination’s PowerVR series. Since then, we’ve been reverse-engineering AGX and building open source graphics drivers. Last January, I rendered a triangle with my own code, but there has since been a heinous bug lurking:
The driver fails to render large amounts of geometry.
Spinning a cube is fine, low polygon geometry is okay, but detailed models won’t render. Instead, the GPU renders only part of the model and then faults.
It’s hard to pinpoint how much we can render without faults. It’s not just the geometry complexity that matters. The same geometry can render with simple shaders but fault with complex ones.
That suggests rendering detailed geometry with a complex shader “takes too long”, and the GPU is timing out. Maybe it renders only the parts it finished in time.
Given the hardware architecture, this explanation is unlikely.
This hypothesis is easy to test, because we can control for timing with a shader that takes as long as we like:
for (int i = 0; i < LARGE_NUMBER; ++i) {
/* some work to prevent the optimizer from removing the loop */
}
After experimenting with such a shader, we learn…
That theory is out.
Let’s experiment more. Modifying the shader and seeing where it breaks, we find the only part of the shader contributing to the bug: the amount of data interpolated per vertex. Modern graphics APIs allow specifying “varying” data for each vertex, like the colour or the surface normal. Then, for each triangle the hardware renders, these “varyings” are interpolated across the triangle to provide smooth inputs to the fragment shader, allowing efficient implementation of common graphics techniques like Blinn-Phong shading.
Putting the pieces together, what matters is the product of the number of vertices (geometry complexity) times the amount of data per vertex (“shading” complexity). That product is the “total amount of per-vertex data”. The GPU faults if we use too much total per-vertex data.
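To make that concrete, here is a back-of-the-envelope calculation with made-up numbers (purely illustrative, not measured hardware limits):

/* Hypothetical figures, only to show the scaling: a detailed model with
 * 200,000 vertices, each writing 8 vec4 varyings (8 * 16 bytes), produces
 * ~25 MB of per-vertex data that has to live somewhere between the vertex
 * and fragment stages, no matter how cheap each shader invocation is. */
const unsigned num_vertices     = 200000;
const unsigned bytes_per_vertex = 8 * 16;
const unsigned total_bytes      = num_vertices * bytes_per_vertex;  /* 25,600,000 */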
Why?
When the hardware processes each vertex, the vertex shader produces per-vertex data. That data has to go somewhere. How this works depends on the hardware architecture. Let’s consider common GPU architectures.1
Traditional immediate mode renderers render directly into the framebuffer. They first run the vertex shader for each vertex of a triangle, then run the fragment shader for each pixel in the triangle. Per-vertex “varying” data is passed almost directly between the shaders, so immediate mode renderers are efficient for complex scenes.
There is a drawback: rendering directly into the framebuffer requires tremendous amounts of memory access to constantly write the results of the fragment shader and to read out back results when blending. Immediate mode renderers are suited to discrete, power-hungry desktop GPUs with dedicated video RAM.
By contrast, tile-based deferred renderers split rendering into two passes. First, the hardware runs all vertex shaders for the entire frame, not just for a single model. Then the framebuffer is divided into small tiles, and dedicated hardware called a tiler determines which triangles are in each tile. Finally, for each tile, the hardware runs all relevant fragment shaders and writes the final blended result to memory.
Tilers reduce the memory traffic required for the framebuffer. As the hardware renders a single tile at a time, it keeps a “cached” copy of that tile of the framebuffer (called the “tilebuffer”). The tilebuffer is small, just a few kilobytes, but tilebuffer access is fast. Writing to the tilebuffer is cheap, and unlike immediate renderers, blending is almost free. Because main memory access is expensive and mobile GPUs can’t afford dedicated video memory, tilers are suited to mobile GPUs, like Arm’s Mali, Imagination’s PowerVR, and Apple’s AGX.
Yes, AGX is a mobile GPU, designed for the iPhone. The M1 is a screaming fast desktop, but its unified memory and tiler GPU have roots in mobile phones. Tilers work well on the desktop, but there are some drawbacks.
First, at the start of a frame, the contents of the tilebuffer are undefined. If the application needs to preserve existing framebuffer contents, the driver needs to load the framebuffer from main memory and store it into the tilebuffer. This is expensive.
Second, because all vertex shaders are run before any fragment shaders, the hardware needs a buffer to store the outputs of all vertex shaders. In general, there is much more data required than space inside the GPU, so this buffer must be in main memory. This is also expensive.
Ah-ha. Because AGX is a tiler, it requires a buffer of all per-vertex data. We fault when we use too much total per-vertex data, overflowing the buffer.
…So how do we allocate a larger buffer?
On some tilers, like older versions of Arm’s Mali GPU, the userspace driver computes how large this “varyings” buffer should be and allocates it.2 To fix the faults, we can try increasing the sizes of all buffers we allocate, in the hopes that one of them contains the per-vertex data.
No dice.
It’s prudent to observe what Apple’s Metal driver does. We can cook up a Metal program drawing variable amounts of geometry and trace all GPU memory allocations that Metal performs while running our program. Doing so, we learn that increasing the amount of geometry drawn does not increase the sizes of any allocated buffers. In fact, it doesn’t change anything in the command buffer submitted to the kernel, except for the single “number of vertices” field in the draw command.
We know that buffer exists. If it’s not allocated by userspace – and by now it seems that it’s not – it must be allocated by the kernel or firmware.
Here’s a funny thought: maybe we don’t specify the size of the buffer at all. Maybe it’s okay for it to overflow, and there’s a way to handle the overflow.
It’s time for a little reconnaissance. Digging through what little public documentation exists for AGX, we learn from one WWDC presentation:
The Tiled Vertex Buffer stores the Tiling phase output, which includes the post-transform vertex data…
But it may cause a Partial Render if full. A Partial Render is when the GPU splits the render pass in order to flush the contents of that buffer.
Bullseye. The buffer we’re chasing, the “tiled vertex buffer”, can overflow. To cope, the GPU stops accepting new geometry, renders the existing geometry, and restarts rendering.
Since partial renders hurt performance, Metal application developers need to know about them to optimize their applications. There should be performance counters flagging this issue. Poking around, we find two:
Wait, what’s a “parameter buffer”?
Remember the rumours that AGX is derived from PowerVR? The public PowerVR optimization guides explain:
[The] list containing pointers to each vertex passed in from the application… is called the parameter buffer (PB) and is stored in system memory along with the vertex data.
Each varying requires additional space in the parameter buffer.
The Tiled Vertex Buffer is the Parameter Buffer. PB is the PowerVR name, TVB is the public Apple name, and PB is still an internal Apple name.
What happens when PowerVR overflows the parameter buffer?
An old PowerVR presentation says that when the parameter buffer is full, the “render is flushed”, meaning “flushed data must be retrieved from the frame buffer as successive tile renders are performed”. In other words, it performs a partial render.
Back to the Apple M1, it seems the hardware is failing to perform a partial render. Let’s revisit the broken render.
Notice that parts of the model are correctly rendered. The parts that aren’t show only the black clear colour of the scene rendered at the start. Let’s consider the logical order of events.
First, the hardware runs vertex shaders for the bunny until the parameter buffer overflows. This works: the partial geometry is correct.
Second, the hardware rasterizes the partial geometry and runs the fragment shaders. This works: the shading is correct.
Third, the hardware flushes the partial render to the framebuffer. This must work for us to see anything at all.
Fourth, the hardware runs vertex shaders for the rest of the bunny’s geometry. This ought to work: the configuration is identical to the original vertex shaders.
Fifth, the hardware rasterizes and shades the rest of the geometry, blending with the old partial render. Because AGX is a tiler, to preserve that existing partial render, the hardware needs to load it back into the tilebuffer. We have no idea how it does this.
Finally, the hardware flushes the render to the framebuffer. This should work as it did the first time.
The only problematic step is loading the framebuffer back into the tilebuffer after a partial render. Usually, the driver supplies two “extra” fragment shaders. One clears the tilebuffer at the start, and the other flushes out the tilebuffer contents at the end.
If the application needs the existing framebuffer contents preserved, instead of writing a clear colour, the “load tilebuffer” program instead reads from the framebuffer to reload the contents. Handling this requires quite a bit of code, but it works in our driver.
Looking closer, AGX requires more auxiliary programs.
The “store” program is supplied twice. I noticed this when initially bringing up the hardware, but the reason for the duplication was unclear. Omitting each copy separately and seeing what breaks, the reason becomes clear: one program flushes the final render, and the other flushes a partial render.3
…What about the program that loads the framebuffer into the tilebuffer?
When a partial render is possible, there are two “load” programs. One writes the clear colour or loads the framebuffer, depending on the application setting. We understand this one. The other always loads the framebuffer.
…Always loads the framebuffer, as in, for loading back with a partial render even if there is a clear at the start of the frame?
If this program is the issue, we can confirm easily. Metal must require it to draw the same bunny, so we can write a Metal application drawing the bunny and stomp over its GPU memory to replace this auxiliary load program with one always loading with black.
Doing so, Metal fails in a similar way. That means we’re at the root cause. Looking at our own driver code, we don’t specify any program for this partial render load. Up until now, that’s worked okay. If the parameter buffer is never overflowed, this program is unused. As soon as a partial render is required, however, failing to provide this program means the GPU dereferences a null pointer and faults. That explains our GPU faults at the beginning.
Following Metal, we supply our own program to load back the tilebuffer after a partial render…
…which does not fix the rendering! Cursed, this GPU. The faults go away, but the render still isn’t quite right for the first few frames, indicating partial renders are still broken. Notice the weird artefacts on the feet.
Curiously, the render “repairs itself” after a few frames, suggesting the parameter buffer stops overflowing. This implies the parameter buffer can be resized (by the kernel or by the firmware), and the system is growing the parameter buffer after a few frames in response to overflow. This mechanism makes sense:
Starting the parameter buffer small and growing in response to overflow provides a balance, reducing the GPU’s memory footprint and minimizing partial renders.
Back to our misrendering. There are actually two buffers being used by our program, a colour buffer (framebuffer)… and a depth buffer. The depth buffer isn’t directly visible, but facilitates the “depth test”, which discards far pixels that are occluded by other close pixels. While the partial render mechanism discards geometry, the depth test discards pixels.
That would explain the missing pixels on our bunny. The depth test is broken with partial renders. Why? The depth test depends on the depth buffer, so the depth buffer must also be stored after a partial render and loaded back when resuming. Comparing a trace from our driver to a trace from Metal, looking for any relevant difference, we eventually stumble on the configuration required to make depth buffer flushes work.
And with that, we get our bunny.
These explanations are massive oversimplifications of how modern GPUs work, but it’s good enough for our purposes here.↩︎
This is a worse idea than it sounds. Starting with the new Valhall architecture, Mali allocates varyings much more efficiently.↩︎
Why the duplication? I have not yet observed Metal using different programs for each. However, for front buffer rendering, partial renders need to be flushed to a temporary buffer for this scheme to work. Of course, you may as well use double buffering at that point.↩︎
May 11, 2022
Background
Today NVIDIA announced that they are releasing an open source kernel driver for their GPUs, so I want to share with you some background information and how I think this will impact Linux graphics and compute going forward.
One thing many people are not aware of is that Red Hat is the only Linux OS company with a strong presence in the Linux compute and graphics engineering space. There are of course a lot of other people working in the space too, like engineers working for Intel, AMD and NVIDIA, people working for consultancy companies like Collabora, or individual community members, but Red Hat as an OS integration company has been very active in trying to ensure we have a maintainable and shared upstream open source stack. This engineering presence is also what has allowed us to move important technologies forward, like getting HiDPI support for Linux some years ago, or working with NVIDIA to get glvnd implemented to remove a pain point for our users, since the original OpenGL design only allowed for one OpenGL implementation to be installed at a time. We see ourselves as the open source community’s partner here, fighting to keep the Linux graphics stack coherent and maintainable, and as a partner for the hardware OEMs to work with when they need help pushing major new initiatives around GPUs for Linux forward. And as the only Linux vendor with a significant engineering footprint in GPUs, we have been working closely with NVIDIA. People like Kevin Martin, the manager for our GPU technologies team, Ben Skeggs, the maintainer of Nouveau, Dave Airlie, the upstream kernel maintainer for the graphics subsystem, Nouveau contributor Karol Herbst, and our accelerator lead Tom Rix have all taken part in meetings, code reviews and discussions with NVIDIA. So let me talk a little about what this release means (and also what it doesn’t mean) and what we hope to see come out of this long term.
First of all, what is in this new driver?
What has been released is an out-of-tree source code kernel driver which has been tested to support CUDA use cases on datacenter GPUs. There is code in there to support display, but it is not complete or fully tested yet. Also, this is only the kernel part; a big part of a modern graphics driver is found in the firmware and userspace components, and those are still closed source. But it does mean we now have an NVIDIA kernel driver that will be able to start consuming the GPL-only APIs in the Linux kernel, although this initial release doesn’t consume any APIs the old driver wasn’t already using. The driver also only supports NVIDIA Turing chip GPUs and newer, which means it is not targeting GPUs from before 2018. So for the average Linux desktop user, while this is a great first step and hopefully a sign of what is to come, it is not something you are going to start using tomorrow.
What does it mean for the NVidia binary driver?
Not too much immediately. The binary kernel driver will continue to be needed for older pre-Turing NVIDIA GPUs, and until the open source kernel module is fully tested and extended for display use cases you are likely to continue using it for your system even if you are on Turing or newer. Also, as mentioned above regarding the firmware and userspace bits, the binary driver is going to continue to be around even once the open source kernel driver is fully capable.
What does it mean for Nouveau?
Let me start with the obvious: this is actually great news for the Nouveau community and the Nouveau driver, and NVIDIA has done a great favour to the open source graphics community with this release. For those unfamiliar with Nouveau, it is the in-kernel graphics driver for NVIDIA GPUs today, originally developed as a reverse-engineered driver, but which over recent years has actually had active support from NVIDIA. It is fully functional, but is severely hampered by not having the ability to, for instance, re-clock the NVIDIA card, meaning that it can’t give you full performance like the binary driver can. This was something we were working with NVIDIA to remedy, but this new release provides us with a better path forward. So what does this new driver mean for Nouveau? Less initially, but a lot in the long run. To give a little background first: the Linux kernel does not allow multiple drivers for the same hardware, so in order for a new NVIDIA kernel driver to go in, the current one will have to go out or at least be limited to a different set of hardware. The current one is Nouveau. And just like the binary driver, a big chunk of Nouveau is not in the kernel, but in the userspace pieces found in Mesa and the Nouveau-specific firmware that NVIDIA currently kindly makes available. So regardless of the long-term effort to create a new open source in-tree kernel driver based on this new open source driver for NVIDIA hardware, Nouveau will very likely stay around to support pre-Turing hardware, just like the NVIDIA binary kernel driver will.
The plan we are working towards from our side, but which is likely to take a few years to come to full fruition, is to come up with a way for the NVIDIA binary driver and Mesa to share a kernel driver. The details of how we will do that are something we are still working on and discussing with our friends at NVIDIA, to address both the needs of the NVIDIA userspace and the needs of the Mesa userspace. Along with that evolution we hope to work with NVIDIA engineers to refactor the userspace bits of Mesa that currently target just Nouveau to be able to interact with this new kernel driver, and also to work so that the binary driver and Nouveau can share the same firmware. This has clear advantages for both the open source community and NVIDIA. For the open source community it means that we will have a kernel driver and firmware that allow things like changing the clocking of the GPU to provide the kind of performance people expect from NVIDIA graphics cards, and it means that we will have an open source driver with access to the firmware and kernel updates from day one for new generations of NVIDIA hardware. For the “binary” driver, and I put that in quotes because it will now be less binary :), it means, as stated above, that it can start taking advantage of the GPL-only APIs in the kernel, that distros can ship it and enable Secure Boot, and that it gets an open source consumer of its kernel driver, allowing it to go upstream.
Whether this new shared kernel driver will be known as Nouveau or something completely different is still an open question, and of course it happening at all depends on whether we, the rest of the open source community, and NVIDIA are able to find a path together to make it happen, but so far everyone seems to be of good will.
What does this release mean for linux distributions like Fedora and RHEL?
Over time it provides a pathway to radically simplify supporting NVIDIA hardware, due to the opportunities discussed elsewhere in this document. Long term we hope to be able to offer a better out-of-the-box user experience with NVIDIA hardware: day-1 support for new chipsets, a high-performance open source Mesa driver for NVIDIA, and the ability to sign the NVIDIA driver alongside the rest of the kernel to enable things like Secure Boot support. Since this first release targets compute, one can expect these options to become available for compute users first and for graphics at a later time.
What are the next steps?
Well, there is a lot of work to do here. NVIDIA needs to continue the effort to make this new driver feature complete for both compute and graphics/display use cases, we’d like to work together to come up with a plan for what the future unified kernel driver can look like and a model around it that works for both the community and NVIDIA, and we need to add things like a Mesa Vulkan driver. We at Red Hat will be playing an active part in this work as the only Linux vendor with the capacity to do so, and we will also work to ensure that the wider open source community has a chance to participate fully, as we do for all open source efforts we are part of.
If you want to hear more about this, I talked with Chris Fisher on Linux Action News about this topic. Note: I stated some timelines in that interview which I didn’t make clear were my guesstimates and not in any way official NVIDIA timelines, so I apologize for the confusion.
May 10, 2022
In the previous post, I described how we enable multiple syncobjs capabilities in the V3D kernel driver. Now I will tell you what was changed on the userspace side, where we reworked the V3DV sync mechanisms to use Vulkan multiple wait and signal semaphores directly. This change represents greater adherence to the Vulkan submission framework.
I was not used to Vulkan concepts or the V3DV driver. Fortunately, I counted on the guidance of Igalia’s Graphics team, mainly Iago Toral (thanks!), to understand the Vulkan graphics pipeline, sync scopes, and submission order. Therefore, we changed the original V3DV implementation of vkQueueSubmit and all related functions to allow a direct mapping of multiple semaphores from V3DV to the V3D kernel interface.
Disclaimer: Here’s a brief and probably inaccurate background, which we’ll go into more detail later on.
In Vulkan, GPU work submissions are described as command buffers. These command buffers, with GPU jobs, are grouped in a command buffer submission batch, specified by vkSubmitInfo, and submitted to a queue for execution. vkQueueSubmit is the command called to submit command buffers to a queue. Besides command buffers, vkSubmitInfo also specifies semaphores to wait before starting the batch execution and semaphores to signal when all command buffers in the batch are complete. Moreover, a fence in vkQueueSubmit can be signaled when all command buffer batches have completed execution.
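For illustration, a minimal submission with two wait semaphores, one signal semaphore and a fence looks roughly like this at the API level (standard Vulkan 1.x structures; the handle names are hypothetical):

VkSemaphore wait_sems[2] = { sem_a, sem_b };          /* hypothetical handles */
VkPipelineStageFlags wait_stages[2] = {
        VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
        VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
};

VkSubmitInfo submit = {
        .sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .waitSemaphoreCount   = 2,
        .pWaitSemaphores      = wait_sems,
        .pWaitDstStageMask    = wait_stages,
        .commandBufferCount   = 1,
        .pCommandBuffers      = &cmd_buf,
        .signalSemaphoreCount = 1,
        .pSignalSemaphores    = &signal_sem,
};

/* The fence is signaled once all command buffer batches in this call complete. */
vkQueueSubmit(queue, 1, &submit, fence);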
From this sequence, we can see some implicit ordering guarantees. Submission order defines the start order of execution between command buffers; in other words, it is determined by the order in which pSubmits appear in vkQueueSubmit and pCommandBuffers appear in VkSubmitInfo. However, we don’t have any completion guarantees for jobs submitted to different GPU queues, which means they may overlap and complete out of order. Of course, jobs submitted to the same GPU engine follow start and finish order. In signal-operation order, a fence is ordered after all semaphore signal operations. In addition to implicit sync, we also have some explicit sync resources, such as semaphores, fences, and events.
Considering these implicit and explicit sync mechanisms, we reworked the V3DV implementation of queue submissions to better use the multiple syncobjs capabilities of the kernel. You can find this work in this merge request: v3dv: add support to multiple wait and signal semaphores. In this blog post, we run through each scope of change of this merge request for a V3D driver-guided description of the multisync support implementation.
As the original V3D kernel interface allowed only one semaphore, V3DV resorted to booleans to “translate” multiple semaphores into one. Consequently, if a command buffer batch had at least one semaphore, it needed to wait for all previously submitted jobs to complete before starting its execution. So, instead of just a boolean, we created and changed the structs that store semaphore information to accept an actual list of wait semaphores.
In the two commits below, we basically updated the DRM V3D interface to the one defined in the kernel and verified whether the multisync capability is available for use.
At this point, we were only changing the submission design to consider multiple wait semaphores. Before supporting multisync, V3DV waited for the last submitted job to be signaled when at least one wait semaphore was defined, even when serialization wasn’t required. V3DV handles GPU jobs according to the GPU queue to which they are submitted:
Therefore, we changed their submission setup so that jobs submitted to any GPU queue are able to handle more than one wait semaphore.
These commits created all mechanisms to set arrays of wait and signal semaphores for GPU job submissions:
Finally, we extended the ability of GPU jobs to handle multiple signal semaphores, but at this point, no GPU job is actually in charge of signaling them. With this in place, we could rework part of the code that tracks CPU and GPU job completions by verifying the GPU status and threads spawned by Event jobs.
As we had only single in/out syncobj interfaces for semaphores, we used a single last_job_sync to synchronize job dependencies of the previous submission. Although the DRM scheduler guarantees the order in which jobs on the same queue start executing in kernel space, the order of completion isn’t predictable. On the other hand, we still needed to use syncobjs to follow job completion, since we have event threads on the CPU side. Therefore, a more accurate implementation requires last_job syncobjs to track when each engine (CL, TFU, and CSD) is idle. We also needed to keep the driver working on previous versions of the v3d kernel driver with single semaphores, so we kept tracking ANY last_job_sync to preserve the previous implementation.
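A rough sketch of that idea (illustrative only, not the actual V3DV structs): keep one last-job syncobj per engine instead of a single one for the whole device.

/* Illustrative sketch: one syncobj handle per hardware engine, so the
 * completion of CL, TFU and CSD jobs can be tracked independently,
 * plus the legacy single syncobj for older kernels. */
enum v3dv_engine { V3DV_ENGINE_CL = 0, V3DV_ENGINE_TFU, V3DV_ENGINE_CSD, V3DV_ENGINE_COUNT };

struct v3dv_last_job_syncs {
        uint32_t any;                           /* legacy path: single-sync kernels */
        uint32_t by_engine[V3DV_ENGINE_COUNT];  /* multisync path: one per engine   */
};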
With multiple semaphores support, the conditions for waiting on and signaling semaphores changed according to the particularities of each GPU job (CL, CSD, TFU) and CPU job restrictions (Events, CSD indirect, etc.). In this sense, we redesigned V3DV semaphore handling and job submission for command buffer batches in vkQueueSubmit.
We scrutinized possible scenarios for submitting command buffer batches to change the original implementation carefully. It resulted in three more commits:
We keep track of whether we have submitted a job to each GPU queue (CSD, TFU, CL) and a CPU job for each command buffer. We use syncobjs to track the last job submitted to each GPU queue and a flag that indicates if this represents the beginning of a command buffer.
The first GPU job submitted to a GPU queue in a command buffer should wait on wait semaphores. The first CPU job submitted in a command buffer should call v3dv_QueueWaitIdle() to do the waiting and ignore semaphores (because it is waiting for everything).
If the job is not the first but has the serialize flag set, it should wait on the completion of the last jobs submitted to every GPU queue before running. In practice, this means using syncobjs to track the last job submitted per queue and adding these syncobjs as dependencies of the serialized job.
If this job is the last job of a command buffer batch, it may be used to signal semaphores, provided this command buffer batch has only one type of GPU job (because then we have guarantees about execution ordering). Otherwise, we emit a no-op job just to signal semaphores: it waits on the completion of the last jobs submitted to every GPU queue and then signals the semaphores. Note: we later changed this approach to correctly deal with ordering changes caused by event threads. Whenever we have an event job in the command buffer, we cannot rely on the last-job-in-the-last-command-buffer assumption; we have to wait for all event threads to complete before signaling.
After submitting all command buffers, we emit a no-op job to wait on the completion of the last jobs of every queue and signal the fence. Note: at some point, we changed this approach to correctly deal with the ordering changes caused by event threads, as mentioned before.
With many changes and many rounds of reviews, the patchset was merged. After more validations and code review, we polished and fixed the implementation together with external contributions:
Also, multisync capabilities enabled us to add new features to V3DV and switch the driver to the common synchronization and submission framework:
This was waiting for multisync support in the v3d kernel, which is already available. Exposing this feature however enabled a few more CTS tests that exposed pre-existing bugs in the user-space driver so we fix those here before exposing the feature.
This should give you emulated timeline semaphores for free and kernel-assisted sharable timeline semaphores for cheap once you have the kernel interface wired in.
We used a set of games to ensure there was no performance regression in the new implementation. For this, we used GFXReconstruct to capture Vulkan API calls while playing those games. Then, we compared results with and without the multisync caps in the kernel space and also with multisync enabled in v3dv. We didn’t observe any compromise in performance, and we saw improvements when replaying scenes of the vkQuake game.
As you may already know, we at Igalia have been working on several improvements to the 3D rendering drivers of Broadcom Videocore GPU, found in Raspberry Pi 4 devices. One of our recent works focused on improving V3D(V) drivers adherence to Vulkan submission and synchronization framework. We had to cross various layers from the Linux Graphics stack to add support for multiple syncobjs to V3D(V), from the Linux/DRM kernel to the Vulkan driver. We have delivered bug fixes, a generic gate to extend job submission interfaces, and a more direct sync mapping of the Vulkan framework. These changes did not impact the performance of the tested games and brought greater precision to the synchronization mechanisms. Ultimately, support for multiple syncobjs opened the door to new features and other improvements to the V3DV submission framework.
But, first, what are DRM sync objs?
* DRM synchronization objects (syncobj, see struct &drm_syncobj) provide a
* container for a synchronization primitive which can be used by userspace
* to explicitly synchronize GPU commands, can be shared between userspace
* processes, and can be shared between different DRM drivers.
* Their primary use-case is to implement Vulkan fences and semaphores.
[...]
* At it's core, a syncobj is simply a wrapper around a pointer to a struct
* &dma_fence which may be NULL.
And Jason Ekstrand well-summarized dma_fence features in a talk at the Linux Plumbers Conference 2021:
A struct that represents a (potentially future) event:
- Has a boolean “signaled” state
- Has a bunch of useful utility helpers/concepts, such as refcount, callback wait mechanisms, etc.
Provides two guarantees:
- One-shot: once signaled, it will be signaled forever
- Finite-time: once exposed, is guaranteed signal in a reasonable amount of time
For our main purpose, the multiple syncobjs support means that V3DV can submit jobs with more than one wait and signal semaphore. In the kernel space, wait semaphores become explicit job dependencies to wait on before executing the job. Signal semaphores (or post dependencies), in turn, work as fences to be signaled when the job completes its execution, unlocking following jobs that depend on its completion.
The development of multisync support comprised many decision-making points and steps, summarized as follows:
We decided to refactor parts of the V3D(V) submission design in kernel space and userspace during this development. We improved job scheduling in the V3D kernel driver and the V3DV job submission design. We also delivered more accurate synchronization mechanisms and further updates to the Broadcom Vulkan driver running on the Raspberry Pi 4. Here we summarize the changes in the kernel space, describing the previous state of the driver, the decisions taken, side improvements, and fixes.
Initially, V3D was very limited in the number of syncobjs per job submission. The V3D job interfaces (CL, CSD, and TFU) only supported one syncobj (in_sync) to be added as an execution dependency and one syncobj (out_sync) to be signaled when a submission completes. The only exception was CL submission, which accepts two in_syncs (one for the binner job and another for the render job), but that didn’t change the limited options much.
Meanwhile in the userspace, the V3DV driver followed alternative paths to meet Vulkan’s synchronization and submission framework. It needed to handle multiple wait and signal semaphores, but the V3D kernel-driver interface only accepts one in_sync and one out_sync. In short, V3DV had to fit multiple semaphores into one when submitting every GPU job.
The first decision was how to extend the V3D interface to accept multiple in and out syncobjs. We could extend each ioctl with two entries of syncobj arrays and two entries for their counters. We could create new ioctls with multiple in/out syncobjs. But after examining how other drivers extended their submission interfaces, we decided to extend the V3D ioctls (v3d_cl_submit_ioctl, v3d_csd_submit_ioctl, v3d_tfu_submit_ioctl) with a generic ioctl extension.
I found a curious commit message when I was examining how other developers handled the issue in the past:
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Fri Mar 22 09:23:22 2019 +0000
drm/i915: Introduce the i915_user_extension_method
An idea for extending uABI inspired by Vulkan's extension chains.
Instead of expanding the data struct for each ioctl every time we need
to add a new feature, define an extension chain instead. As we add
optional interfaces to control the ioctl, we define a new extension
struct that can be linked into the ioctl data only when required by the
user. The key advantage being able to ignore large control structs for
optional interfaces/extensions, while being able to process them in a
consistent manner.
In comparison to other extensible ioctls, the key difference is the
use of a linked chain of extension structs vs an array of tagged
pointers. For example,
struct drm_amdgpu_cs_chunk {
__u32 chunk_id;
__u32 length_dw;
__u64 chunk_data;
};
[...]
So, inspired by amdgpu_cs_chunk and i915_user_extension, we opted to extend the V3D interface through a generic interface. After applying some suggestions from Iago Toral (Igalia) and Daniel Vetter, we reached the following struct:
struct drm_v3d_extension {
__u64 next;
__u32 id;
#define DRM_V3D_EXT_ID_MULTI_SYNC 0x01
__u32 flags; /* mbz */
};
This generic extension has an id to identify the feature/extension we are adding to an ioctl (that maps the related struct type), a pointer to the next extension, and flags (if needed). Whenever we need to extend the V3D interface again for another specific feature, we subclass this generic extension into the specific one instead of extending ioctls indefinitely.
For the multiple syncobjs extension, we define a multi_sync extension struct that subclasses the generic extension struct. It has arrays of in and out syncobjs, the respective number of elements in each of them, and a wait_stage value used in CL submissions to determine which job needs to wait for syncobjs before running.
struct drm_v3d_multi_sync {
struct drm_v3d_extension base;
/* Array of wait and signal semaphores */
__u64 in_syncs;
__u64 out_syncs;
/* Number of entries */
__u32 in_sync_count;
__u32 out_sync_count;
/* set the stage (v3d_queue) to sync */
__u32 wait_stage;
__u32 pad; /* mbz */
};
And if a multisync extension is defined, the V3D driver ignores the previous interface of single in/out syncobjs.
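As a hedged userspace sketch, chaining the extension into a CL submission might look like this (the submit-struct field names and flags below follow the upstream interface but should be treated as assumptions, not a copy of the V3DV code):

/* Illustrative only: attach a multisync extension to a CL submission.
 * Assumes the submit struct carries an 'extensions' pointer and a
 * DRM_V3D_SUBMIT_EXTENSION flag, as introduced alongside this work. */
uint32_t in_syncs[2]  = { wait_syncobj_a, wait_syncobj_b };  /* hypothetical handles */
uint32_t out_syncs[1] = { signal_syncobj };

struct drm_v3d_multi_sync ms = {
        .base = {
                .id   = DRM_V3D_EXT_ID_MULTI_SYNC,
                .next = 0,               /* end of the extension chain */
        },
        .in_syncs       = (uintptr_t)in_syncs,
        .out_syncs      = (uintptr_t)out_syncs,
        .in_sync_count  = 2,
        .out_sync_count = 1,
        .wait_stage     = 0,             /* which stage (v3d_queue) waits on in_syncs */
};

struct drm_v3d_submit_cl submit = { 0 };
/* ... fill in the usual CL submission fields ... */
submit.flags      |= DRM_V3D_SUBMIT_EXTENSION;
submit.extensions  = (uintptr_t)&ms;

drmIoctl(fd, DRM_IOCTL_V3D_SUBMIT_CL, &submit);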
Once we had the interface to support multiple in/out syncobjs, the v3d kernel driver needed to handle it. As V3D uses the DRM scheduler for job execution, changing from a single syncobj to multiple ones is quite straightforward. V3D copies the in syncobjs from userspace and uses drm_syncobj_find_fence() + drm_sched_job_add_dependency() to add all in_syncs (wait semaphores) as job dependencies, i.e. syncobjs to be checked by the scheduler before running the job. On CL submissions, we have the bin and render jobs, so V3D follows the value of wait_stage to determine which job depends on those in_syncs to start its execution.
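In kernel terms, the wait side boils down to something like this simplified sketch (not the literal V3D code):

/* Simplified sketch: turn every in_sync handle into a scheduler dependency. */
static int attach_wait_syncobjs(struct drm_file *file_priv,
                                struct drm_sched_job *job,
                                u32 *handles, u32 count)
{
        struct dma_fence *fence;
        u32 i;
        int ret;

        for (i = 0; i < count; i++) {
                ret = drm_syncobj_find_fence(file_priv, handles[i], 0, 0, &fence);
                if (ret)
                        return ret;

                /* The scheduler will not run the job until this fence signals;
                 * it takes over the fence reference. */
                ret = drm_sched_job_add_dependency(job, fence);
                if (ret)
                        return ret;
        }

        return 0;
}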
When V3D defines the last job in a submission, it replaces the dma_fence of the out_syncs with the done_fence from this last job. It uses drm_syncobj_find() + drm_syncobj_replace_fence() to do that. Therefore, when a job completes its execution and signals done_fence, all out_syncs are signaled too.
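And the signal side, again as a simplified sketch rather than the actual driver code:

/* Simplified sketch: make every out_sync carry the last job's done fence. */
static void signal_out_syncobjs(struct drm_file *file_priv,
                                struct dma_fence *done_fence,
                                u32 *handles, u32 count)
{
        struct drm_syncobj *syncobj;
        u32 i;

        for (i = 0; i < count; i++) {
                syncobj = drm_syncobj_find(file_priv, handles[i]);
                if (!syncobj)
                        continue;

                /* Once done_fence signals, anything waiting on this syncobj wakes up. */
                drm_syncobj_replace_fence(syncobj, done_fence);
                drm_syncobj_put(syncobj);
        }
}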
This work also made possible some improvements in the original implementation. Following Iago’s suggestions, we refactored the job’s initialization code to allocate memory and initialize a job in one go. With this, we started to clean up resources more cohesively, clearly distinguishing cleanups in case of failure from job completion. We also fixed the resource cleanup when a job is aborted before the DRM scheduler arms it - at that point, drm_sched_job_arm() had recently been introduced to job initialization. Finally, we prepared the semaphore interface to implement timeline syncobjs in the future.
The patchset that adds multiple syncobjs support and improvements to V3D is available here and comprises four patches:
After extending the V3D kernel interface to accept multiple syncobjs, we worked on V3DV to benefit from V3D multisync capabilities. In the next post, I will describe a little of this work.
May 09, 2022
As a board, we have been working on several initiatives to make the Foundation a better asset for the GNOME Project. We’re working on a number of threads in parallel, so I wanted to explain the “big picture” a bit more to try and connect together things like the new ED search and the bylaw changes.
We’re all here to see free and open source software succeed and thrive, so that people can be truly empowered with agency over their technology, rather than being passive consumers. We want to bring GNOME to as many people as possible so that they have computing devices that they can inspect, trust, share and learn from.
In previous years we’ve tried to boost the relevance of GNOME (or technologies such as GTK) or solicit donations from businesses and individuals with existing engagement in FOSS ideology and technology. The problem with this approach is that we’re mostly addressing people and organisations who are already supporting or contributing FOSS in some way. To truly scale our impact, we need to look to the outside world, build better awareness of GNOME outside of our current user base, and find opportunities to secure funding to invest back into the GNOME project.
The Foundation supports the GNOME project with infrastructure, arranging conferences, sponsoring hackfests and travel, design work, legal support, managing sponsorships, advisory board, being the fiscal sponsor of GNOME, GTK, Flathub… and we will keep doing all of these things. What we’re talking about here are additional ways for the Foundation to support the GNOME project – we want to go beyond these activities, and invest into GNOME to grow its adoption amongst people who need it. This has a cost, and that means in parallel with these initiatives, we need to find partners to fund this work.
Neil has previously talked about themes such as education, advocacy, privacy, but we’ve not previously translated these into clear specific initiatives that we would establish in addition to the Foundation’s existing work. This is all a work in progress and we welcome any feedback from the community about refining these ideas, but here are the current strategic initiatives the board is working on. We’ve been thinking about growing our community by encouraging and retaining diverse contributors, and addressing evolving computing needs which aren’t currently well served on the desktop.
Initiative 1: Welcoming newcomers. The community is already spending a lot of time welcoming newcomers and teaching them the best practices. Those activities are as time consuming as they are important, but currently a handful of individuals are running initiatives such as GSoC, Outreachy and outreach to Universities. These activities help bring diverse individuals and perspectives into the community, and help them develop skills and experience of collaborating to create Open Source projects. We want to make those efforts more sustainable by finding sponsors for these activities. With funding, we can hire people to dedicate their time to operating these programs, including paid mentors and creating materials to support newcomers in future, such as developer documentation, examples and tutorials. This is the initiative that needs to be refined the most before we can turn it into something real.
Initiative 2: Diverse and sustainable Linux app ecosystem. I spoke at the Linux App Summit about the work that GNOME and Endless have been supporting in Flathub, but this is an example of something which has a great overlap between commercial, technical and mission-based advantages. The key goal here is to improve the financial sustainability of participating in our community, which in turn has an impact on the diversity of who we can expect to afford to enter and remain in our community. We believe this is critically important for individual developers and contributors to unlock earning potential from our ecosystem, through donations or app sales. In turn, a healthy app ecosystem also improves the usefulness of the Linux desktop as a whole for potential users. We believe that we can build a case for commercial vendors in the space to join an advisory board alongside GNOME, KDE, etc. to input into the governance and contribute to the costs of growing Flathub.
Initiative 3: Local-first applications for the GNOME desktop. This is what Thib has been starting to discuss on Discourse, in this thread. There are many different threats to free access to computing and information in today’s world. The GNOME desktop and apps need to give users convenient and reliable access to technology which works similarly to the tools they already use everyday, but keeps them and their data safe from surveillance, censorship, filtering or just being completely cut off from the Internet. We believe that we can seek both philanthropic and grant funding for this work. It will make GNOME a more appealing and comprehensive offering for the many people who want to protect their privacy.
The idea is that these initiatives all sit on the boundary between the GNOME community and the outside world. If the Foundation can grow and deliver these kinds of projects, we are reaching to new people, new contributors and new funding. These contributions and investments back into GNOME represent a true “win-win” for the newcomers and our existing community.
(Originally posted to GNOME Discourse, please feel free to join the discussion there.)
Sometimes you want to go and inspect details of the shaders that are used with specific draw calls in a frame. With RenderDoc this is really easy if the driver implements VK_KHR_pipeline_executable_properties. This extension allows applications to query the driver about various aspects of the executable code generated for a Vulkan pipeline.
I implemented this extension for V3DV, the Vulkan driver for Raspberry Pi 4, last week (it is currently in the review process) because I was tired of jumping through hoops to get the info I needed when looking at traces. For V3DV we expose the NIR and QPU assembly code as well as various other stats, some of which are quite relevant to performance, such as spill or thread counts.
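For reference, querying the extension from an application looks roughly like this (standard VK_KHR_pipeline_executable_properties entry points; the handles are hypothetical and the function pointers would normally be fetched with vkGetDeviceProcAddr):

/* Ask how many executables (compiled shader stages) back a pipeline and
 * fetch their names/descriptions. Per-executable statistics and internal
 * representations (e.g. NIR, QPU assembly) come from the sibling
 * vkGetPipelineExecutableStatisticsKHR and
 * vkGetPipelineExecutableInternalRepresentationsKHR entry points. */
VkPipelineInfoKHR pipeline_info = {
        .sType    = VK_STRUCTURE_TYPE_PIPELINE_INFO_KHR,
        .pipeline = pipeline,            /* hypothetical pipeline handle */
};

uint32_t count = 0;
vkGetPipelineExecutablePropertiesKHR(device, &pipeline_info, &count, NULL);

VkPipelineExecutablePropertiesKHR props[8];
if (count > 8)
        count = 8;
for (uint32_t i = 0; i < count; i++)
        props[i] = (VkPipelineExecutablePropertiesKHR) {
                .sType = VK_STRUCTURE_TYPE_PIPELINE_EXECUTABLE_PROPERTIES_KHR,
        };
vkGetPipelineExecutablePropertiesKHR(device, &pipeline_info, &count, props);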
May 02, 2022
TLDR: Hermetic /usr/ is awesome; let's popularize image-based OSes with modernized security properties built around immutability, SecureBoot, TPM2, adaptability, auto-updating, factory reset, uniformity – built from traditional distribution packages, but deployed via images.
Over the past years, systemd gained a number of components for building Linux-based operating systems. While these components individually have been adopted by many distributions and products for specific purposes, we did not publicly communicate a broader vision of how they should all fit together in the long run. In this blog story I hope to provide that from my personal perspective, i.e. explain how I personally would build an OS and where I personally think OS development with Linux should go.
I figure this is going to be a longer blog story, but I hope it will be equally enlightening. Please understand though that everything I write about OS design here is my personal opinion, and not one of my employer.
For the last 12 years or so I have been working on Linux OS development, mostly around systemd. In all those years I had a lot of time thinking about the Linux platform, and specifically traditional Linux distributions and their strengths and weaknesses. I have seen many attempts to reinvent Linux distributions in one way or another, to varying success. After all this most would probably agree that the traditional RPM or dpkg/apt-based distributions still define the Linux platform more than others (for 25+ years now), even though some Linux-based OSes (Android, ChromeOS) probably outnumber them in installations overall.
And over all those 12 years I kept wondering, how would I actually build an OS for a system or for an appliance, and what are the components necessary to achieve that. And most importantly, how can we make these components generic enough so that they are useful in generic/traditional distributions too, and in other use cases than my own.
Before figuring out how I would build an OS it's probably good to figure out what type of OS I actually want to build, what purpose I intend to cover. I think a desktop OS is probably the most interesting. Why is that? Well, first of all, I use one of these for my job every single day, so I care immediately, it's my primary tool of work. But more importantly: I think building a desktop OS is one of the most complex overall OS projects you can work on, simply because desktops are so much more versatile and variable than servers or embedded devices. If one figures out the desktop case, I think there's a lot more to learn from, and reuse in the server or embedded case, than going the other way. After all, there's a reason why so much of the widely accepted Linux userspace stack comes from people with a desktop background (including systemd, BTW).
So, let's see how I would build a desktop OS. If you press me hard, and ask me why I would do that given that ChromeOS already exists and more or less is a Linux desktop OS: there's plenty I am missing in ChromeOS, but most importantly, I am lot more interested in building something people can easily and naturally rebuild and hack on, i.e. Google-style over-the-wall open source with its skewed power dynamic is not particularly attractive to me. I much prefer building this within the framework of a proper open source community, out in the open, and basing all this strongly on the status quo ante, i.e. the existing distributions. I think it is crucial to provide a clear avenue to build a modern OS based on the existing distribution model, if there shall ever be a chance to make this interesting for a larger audience.
(Let me underline though: even though I am going to focus on a desktop here, most of this is directly relevant for servers as well, in particular container host OSes and suchlike, or embedded devices, e.g. car IVI systems and so on.)
First and foremost, I think the focus must be on an image-based design rather than a package-based one. For robustness and security it is essential to operate with reproducible, immutable images that describe the OS or large parts of it in full, rather than operating always with fine-grained RPM/dpkg style packages. That's not to say that packages are not relevant (I actually think they matter a lot!), but I think they should be less of a tool for deploying code but more one of building the objects to deploy. A different way to see this: any OS built like this must be easy to replicate in a large number of instances, with minimal variability. Regardless if we talk about desktops, servers or embedded devices: focus for my OS should be on "cattle", not "pets", i.e that from the start it's trivial to reuse the well-tested, cryptographically signed combination of software over a large set of devices the same way, with a maximum of bit-exact reuse and a minimum of local variances.
The trust chain matters, from the boot loader all the way to the apps. This means all code that is run must be cryptographically validated before it is run. All storage must be cryptographically protected: public data must be integrity checked; private data must remain confidential.
This is in fact where big distributions currently fail pretty badly. I would go as far as saying that SecureBoot on Linux distributions is mostly security theater at this point, if you so will. That's because the initrd that unlocks your FDE (i.e. the cryptographic concept that protects the rest of your system) is not signed or protected in any way. It's trivial to modify for an attacker with access to your hard disk in an undetectable way, and collect your FDE passphrase. The involved bureaucracy around the implementation of UEFI SecureBoot of the big distributions is to a large degree pointless if you ask me, given that once the kernel is assumed to be in a good state, as the next step the system invokes completely unsafe code with full privileges.
This is a fault of current Linux distributions though, not of SecureBoot in general. Other OSes use this functionality in more useful ways, and we should correct that too.
Pretty much the same thing: offline security matters. I want my data to be reasonably safe at rest, i.e. cryptographically inaccessible even when I leave my laptop in my hotel room, suspended.
Everything should be cryptographically measured, so that remote attestation is supported for as much software shipped on the OS as possible.
Everything should be self descriptive, have single sources of truths that are closely attached to the object itself, instead of stored externally.
Everything should be self-updating. Today we know that software is never bug-free, and thus requires a continuous update cycle. Not only the OS itself, but also any extensions, services and apps running on it.
Everything should be robust in respect to aborted OS operations, power loss and so on. It should be robust towards hosed OS updates (regardless if the download process failed, or the image was buggy), and not require user interaction to recover from them.
There must always be a way to put the system back into a well-defined, guaranteed safe state ("factory reset"). This includes that all sensitive data from earlier uses becomes cryptographically inaccessible.
The OS should enforce clear separation between vendor resources, system resources and user resources: conceptually and when it comes to cryptographical protection.
Things should be adaptive: the system should come up and make the best of the system it runs on, adapt to the storage and hardware. Moreover, the system should support execution on bare metal equally well as execution in a VM environment and in a container environment (i.e. systemd-nspawn).
Things should not require explicit installation, i.e. every image should be a live image. For installation it should be sufficient to dd an OS image onto disk. Thus, strong focus on "instantiate on first boot", rather than "instantiate before first boot".
Things should be reasonably minimal. The image the system starts its life with should be quick to download, and not include resources that can as well be created locally later.
System identity, local cryptographic keys and so on should be generated locally, not be pre-provisioned, so that there's no leak of sensitive data during the transport onto the system possible.
Things should be reasonably democratic and hackable. It should be easy to fork an OS, to modify an OS and still get reasonable cryptographic protection. Modifying your OS should not necessarily imply that your "warranty is voided" and you lose all good properties of the OS, if you so will.
Things should be reasonably modular. The privileged part of the core OS must be extensible, including on the individual system. It's not sufficient to support extensibility just through high-level UI applications.
Things should be reasonably uniform, i.e. ideally the same formats and cryptographic properties are used for all components of the system, regardless if for the host OS itself or the payloads it receives and runs.
Even taking all these goals into consideration, it should still be close to traditional Linux distributions, and take advantage of what they are really good at: integration and security update cycles.
Now that we know our goals and requirements, let's start designing the OS along these lines.
Hermetic /usr/
First of all the OS resources (code, data files, …) should be hermetic in an immutable /usr/. This means that a /usr/ tree should carry everything needed to set up the minimal set of directories and files outside of /usr/ to make the system work. This /usr/ tree can then be mounted read-only into the writable root file system that then will eventually carry the local configuration, state and user data in /etc/, /var/ and /home/ as usual.
Thankfully, modern distributions are surprisingly close to working without issues in such a hermetic context. Specifically, Fedora works mostly just fine: it has adopted the /usr/ merge and the declarative systemd-sysusers and systemd-tmpfiles components quite comprehensively, which means the directory trees outside of /usr/ are automatically generated as needed if missing. In particular /etc/passwd and /etc/group (and related files) are appropriately populated, should they be missing entries.
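To illustrate what this declarative metadata looks like, here's a minimal sketch of the two snippet types involved; the "foobar" name and its paths are made up for the example:

    # /usr/lib/sysusers.d/foobar.conf: declare a system user the service needs
    u foobar - "foobar daemon" /var/lib/foobar

    # /usr/lib/tmpfiles.d/foobar.conf: create its state directory if missing
    d /var/lib/foobar 0750 foobar foobar -

    # Both are applied automatically at boot, or manually:
    systemd-sysusers
    systemd-tmpfiles --create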
In my model a hermetic OS is hence comprehensively defined within /usr/: combine the /usr/ tree with an empty, otherwise unpopulated root file system, and it will boot up successfully, automatically adding the strictly necessary files and resources that are necessary to boot up.
Monopolizing vendor OS resources and definitions in an immutable /usr/ opens multiple doors to us:
We can apply dm-verity to the whole /usr/ tree, i.e. guarantee structural, cryptographic integrity on the whole vendor OS resources at once, with full file system metadata.
We can implement updates to the OS easily: by implementing an A/B update scheme on the /usr/ tree we can update the OS resources atomically and robustly, while leaving the rest of the OS environment untouched.
We can implement factory reset easily: erase the root file system and reboot. The hermetic OS in /usr/ has all the information it needs to set up the root file system afresh — exactly like in a new installation.
So let's have a look at a suitable partition table, taking a hermetic /usr/ into account. Let's conceptually start with a table of four entries:
An UEFI System Partition (required by firmware to boot)
Immutable, Verity-protected, signed file system with the /usr/ tree in version A
Immutable, Verity-protected, signed file system with the /usr/ tree in version B
A writable, encrypted root file system
(This is just for initial illustration here, as we'll see later it's going to be a bit more complex in the end.)
The Discoverable Partitions Specification provides suitable partition type UUIDs for all of the above partitions. Which is great, because it makes the image self-descriptive: simply by looking at the image's GPT table we know what to mount where. This means we do not need a manual /etc/fstab, and a multitude of tools such as systemd-nspawn and similar can operate directly on the disk image and boot it up.
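For instance, because the partition roles are encoded in the GPT type UUIDs, generic tools can introspect and boot such an image directly, without any fstab or manual mount configuration; a quick sketch (the image name is made up, and depending on the systemd version systemd-dissect may live in /usr/lib/systemd/):

    # Show what's inside the image, based purely on the partition type UUIDs
    systemd-dissect fooOS.raw

    # Boot the image as a container straight from the disk image
    systemd-nspawn --image=fooOS.raw --boot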
Now that we have a rough idea how to organize the partition table, let's look a bit at how to boot into that. Specifically, in my model "unified kernels" are the way to go, specifically those implementing Boot Loader Specification Type #2. These are basically kernel images that have an initial RAM disk attached to them, as well as a kernel command line, a boot splash image and possibly more, all wrapped into a single UEFI PE binary. By combining these into one we achieve two goals: they become extremely easy to update (i.e. drop in one file, and you update kernel+initrd) and more importantly, you can sign them as one for the purpose of UEFI SecureBoot.
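As a rough sketch of how such a unified kernel image can be put together with the systemd-boot stub and then signed for SecureBoot (file names, keys and the section offsets are placeholders for the example; newer systemd versions also ship dedicated tooling for this):

    # Wrap kernel, initrd, command line and os-release into one UEFI PE binary
    objcopy \
      --add-section .osrel=/etc/os-release      --change-section-vma .osrel=0x20000 \
      --add-section .cmdline=kernel-cmdline.txt --change-section-vma .cmdline=0x30000 \
      --add-section .linux=vmlinuz              --change-section-vma .linux=0x2000000 \
      --add-section .initrd=initrd.img          --change-section-vma .initrd=0x3000000 \
      /usr/lib/systemd/boot/efi/linuxx64.efi.stub fooOS_0.7.efi

    # Sign the result as a single object for UEFI SecureBoot
    sbsign --key db.key --cert db.crt --output fooOS_0.7.efi fooOS_0.7.efi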
In my model, each version of such a kernel would be associated with
exactly one version of the /usr/
tree: both are always updated at
the same time. An update then becomes relatively simple: drop in one
new /usr/
file system plus one kernel, and the update is complete.
The boot loader used for all this would be systemd-boot, of course. It's a very simple loader, and implements the aforementioned boot loader specification. This means it requires no explicit configuration or anything: it's entirely sufficient to drop in one such unified kernel file, and it will be picked up, and be made a candidate to boot into.
You might wonder how to configure the root file system to boot from
with such a unified kernel that contains the kernel command line and
is signed as a whole and thus immutable. The idea here is to use the usrhash= kernel command line option implemented by systemd-veritysetup-generator and systemd-fstab-generator. It
does two things: it will search and set up a dm-verity
volume for
the /usr/
file system, and then mount it. It takes the root hash
value of the dm-verity
Merkle tree as the parameter. This hash is
then also used to find the /usr/
partition in the GPT partition
table, under the assumption that the partition UUIDs are derived from
it, as per the suggestions in the discoverable partitions
specification (see above).
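A sketch of how that root hash comes about and ends up on the kernel command line (device paths are placeholders):

    # Generate the dm-verity Merkle tree for the /usr/ file system;
    # this prints a "Root hash:" line
    veritysetup format /dev/sdX2 /dev/sdX3

    # That hash then goes into the (signed, immutable) kernel command line, e.g.:
    #   usrhash=4ef37c9...   (truncated placeholder value)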
systemd-boot (if not told otherwise) will do a version sort of the kernel image files it finds, and then automatically boot the newest one. Picking a specific kernel to boot will also fixate which version
of the /usr/
tree to boot into, because — as mentioned — the Verity
root hash of it is built into the kernel command line the unified
kernel image contains.
In my model I'd place the kernels directly into the UEFI System
Partition (ESP), in order to simplify things. (systemd-boot
also
supports reading them from a separate boot partition, but let's not
complicate things needlessly, at least for now.)
So, with all this, we now already have a boot chain that goes
something like this: once the boot loader is run, it will pick the
newest kernel, which includes the initial RAM disk and a secure
reference to the /usr/
file system to use. This is already
great. But a /usr/ alone won't make us happy, we also need a root file system. In my model, that file system would be writable, and the /etc/ and /var/ hierarchies would be located directly on it. Since
these trees potentially contain secrets (SSH keys, …) the root file
system needs to be encrypted. We'll use LUKS2 for this, of course. In
my model, I'd bind this to the TPM2 chip (for compatibility with
systems lacking one, we can find a suitable fallback, which then
provides weaker guarantees, see below). A TPM2 is a security chip
available in most modern PCs. Among other things it contains a
persistent secret key that can be used to encrypt data, in a way that
only if you possess access to it and can prove you are using validated
software you can decrypt it again. The cryptographic measuring I
mentioned earlier is what allows this to work. But … let's not get
lost too much in the details of TPM2 devices, that'd be material for a
novel, and this blog story is going to be way too long already.
What does using a TPM2 bound key for unlocking the root file system get us? We can encrypt the root file system with it, and you can only read or make changes to the root file system if you also possess the TPM2 chip and run our validated version of the OS. This protects us against an evil maid scenario to some level: an attacker cannot just copy the hard disk of your laptop while you leave it in your hotel room, because unless the attacker also steals the TPM2 device it cannot be decrypted. The attacker can also not just modify the root file system, because such changes would be detected on next boot because they aren't done with the right cryptographic key.
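The TPM2 binding itself can be done with tooling that exists today; roughly like this (device path and crypttab line are illustrative):

    # Enroll a TPM2-bound key slot into the LUKS2 root partition,
    # tied to the SecureBoot policy state (PCR 7)
    systemd-cryptenroll --tpm2-device=auto --tpm2-pcrs=7 /dev/disk/by-partlabel/root

    # /etc/crypttab (or the initrd equivalent) then references the TPM2:
    #   root  /dev/disk/by-partlabel/root  none  tpm2-device=auto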
So, now we have a system that already can boot up somewhat completely,
and run userspace services. All code that is run is verified in some
way: the /usr/
file system is Verity protected, and the root hash of
it is included in the kernel that is signed via UEFI SecureBoot. And
the root file system is locked to the TPM2 where the secret key is
only accessible if our signed OS + /usr/
tree is used.
(One brief intermission here: so far all the components I am referencing here exist already, and have been shipped in systemd and other projects already, including the TPM2 based disk encryption. There's one thing missing here however at the moment that still needs to be developed (happy to take PRs!): right now TPM2 based LUKS2 unlocking is bound to PCR hash values. This is hard to work with when implementing updates — what we'd need instead is unlocking by signatures of PCR hashes. TPM2 supports this, but we don't support it yet in our systemd-cryptsetup + systemd-cryptenroll stack.)
One of the goals mentioned above is that cryptographic key material
should always be generated locally on first boot, rather than
pre-provisioned. This of course has implications for the encryption
key of the root file system: if we want to boot into this system we
need the root file system to exist, and thus a key already generated
that it is encrypted with. But where precisely would we generate it if we have no installer which could generate it while installing (as is done in traditional Linux distribution installers)? My proposed solution here is to use systemd-repart, which is a declarative, purely additive repartitioner. It can run from the initrd to create and format partitions on boot, before transitioning into the root file system. It can also format the partitions it creates and encrypt them, automatically enrolling a TPM2-bound key.
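A minimal sketch of what such systemd-repart drop-ins, shipped inside the immutable /usr/ tree, could look like (the sizes and file system choice are just examples):

    # /usr/lib/repart.d/50-root.conf
    [Partition]
    Type=root
    Format=btrfs
    Encrypt=tpm2
    FactoryReset=yes

    # /usr/lib/repart.d/60-usr-b.conf: empty slot for future A/B updates
    [Partition]
    Type=usr
    Label=_empty
    SizeMinBytes=1G
    SizeMaxBytes=1G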
So, let's revisit the partition table we mentioned earlier. Here's what in my model we'd actually ship in the initial image:
An UEFI System Partition (ESP)
An immutable, Verity-protected, signed file system with the /usr/ tree in version A
And that's already it. No root file system, no B /usr/ partition, nothing else. Only two partitions are shipped: the ESP with the systemd-boot loader and one unified kernel image, and the A version of the /usr/ partition. Then, on first boot systemd-repart will notice that the root file system doesn't exist yet, and will create it, encrypt it, and format it, and enroll the key into the TPM2. It will also create the second /usr/ partition (B) that we'll need for later A/B updates (which will be created empty for now, until the first update operation actually takes place, see below). Once done the initrd will combine the fresh root file system with the shipped /usr/ tree, and transition into it. Because the OS is hermetic in /usr/ and contains all the systemd-tmpfiles and systemd-sysusers information it can then set up the root file system properly and create any directories and symlinks (and maybe a few files) necessary to operate.
Besides the fact that the root file system's encryption keys are
generated on the system we boot from and never leave it, it is also
pretty nice that the root file system will be sized dynamically,
taking into account the physical size of the backing storage. This is
perfect, because on first boot the image will automatically adapt to what it has been dd'ed onto.
This is a good point to talk about the factory reset logic, i.e. the
mechanism to place the system back into a known good state. This is
important for two reasons: in our laptop use case, once you want to
pass the laptop to someone else, you want to ensure your data is fully
and comprehensively erased. Moreover, if you have reason to believe
your device was hacked you want to revert the device to a known good
state, i.e. ensure that exploits cannot persist. systemd-repart
already has a mechanism for it. In the declarations of the partitions
the system should have, entries may be marked to be candidates for
erasing on factory reset. The actual factory reset is then requested
by one of two means: by specifying a specific kernel command line option (which is not too interesting here, given we lock that down via UEFI SecureBoot; but then again, one could also add a second kernel to the ESP that is identical to the first, with the only difference that it lists this command line option: thus when the user selects this entry it will initiate a factory reset) — and via an EFI variable that can be set and is honoured on the immediately following boot. So here's how a factory reset would then go down: once the factory reset is requested it's enough to reboot. On the subsequent boot systemd-repart runs from the initrd, where it will honour the
request and erase the partitions marked for erasing. Once that is
complete the system is back in the state we shipped the system in:
only the ESP and the /usr/
file system will exist, but the root file
system is gone. And from here we can continue as on the original first
boot: create a new root file system (and any other partitions), and
encrypt/set it up afresh.
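A sketch of how that could be wired up, assuming the repart drop-ins mark the sensitive partitions accordingly (the kernel command line switch shown is the one systemd-repart understands; double-check the exact spelling against the documentation of your systemd version):

    # Partitions to be wiped carry this in their repart.d drop-in:
    #   [Partition]
    #   FactoryReset=yes

    # Request the reset for the next boot, e.g. by selecting a second boot
    # entry whose command line additionally carries:
    #   systemd.factory_reset=yes
    systemctl reboot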
So now we have a nice setup, where everything is either signed or encrypted securely. The system can adapt to the system it is booted on automatically on first boot, and can easily be brought back into a well defined state identical to the way it was shipped in.
But of course, such a monolithic, immutable system is only useful for very specific purposes. If /usr/ can't be written to – at least in the traditional sense – one cannot just go and install a new software package that one needs. So here two goals are superficially conflicting: on one hand one wants modularity, i.e. the ability to add components to the system, and on the other immutability, i.e. that precisely this is prohibited.
So let's see what I propose as a middle ground in my model. First, what's the precise use case for such modularity? I see a couple of different ones:
For some cases it is necessary to extend the system itself at the lowest level, so that the components added in extend (or maybe even replace) the resources shipped in the base OS image, so that they live in the same namespace, and are subject to the same security restrictions and privileges. Exposure to the details of the base OS and its interface for this kind of modularity is at the maximum.
Example: a module that adds a debugger or tracing tools into the system. Or maybe an optional hardware driver module.
In other cases, more isolation is preferable: instead of extending the system resources directly, additional services shall be added in that bring their own files, can live in their own namespace (but with "windows" into the host namespaces), however still are system components, and provide services to other programs, whether local or remote. Exposure to the details of the base OS for this kind of modularity is restricted: it mostly focuses on the ability to consume and provide IPC APIs from/to the system. Components of this type can still be highly privileged, but the level of integration is substantially smaller than for the type explained above.
Example: a module that adds a specific VPN connection service to the OS.
Finally, there's the actual payload of the OS. This stuff is relatively isolated from the OS and definitely from each other. It mostly consumes OS APIs, and generally doesn't provide OS APIs. This kind of stuff runs with minimal privileges, and in its own namespace of concepts.
Example: a desktop app, for reading your emails.
Of course, the lines between these three types of modules are blurry, but I think distinguishing them does make sense, as I think different mechanisms are appropriate for each. So here's what I'd propose in my model to use for this.
For the system extension case I think the systemd-sysext images are appropriate. This tool operates on system extension images that are very similar to the host's disk image: they also contain a /usr/ partition, protected by Verity. However, they just include additions to the host image: binaries that extend the host. When such a system extension image is activated, it is merged via an immutable overlayfs mount into the host's /usr/ tree. Thus any file shipped in such a system extension will suddenly appear as if it was part of the host OS itself. For optional components that should be considered part of the OS more or less, this is a very simple and powerful way to combine an immutable OS with an immutable extension. Note that most likely extensions for an OS matching this tool should be built at the same time within the same update cycle scheme as the host OS itself. After all, the files included in the extensions will have dependencies on files in the system OS image, and care must be taken that these dependencies remain in order.
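A sketch of what building and activating such an extension could look like, here simplified to a plain squashfs image (names are made up; in the full model the image would additionally carry Verity data and a signature):

    # Stage the extension's /usr/ additions
    mkdir -p debug-tools/usr/bin debug-tools/usr/lib/extension-release.d
    cp /usr/bin/strace /usr/bin/gdb debug-tools/usr/bin/

    # The extension must identify which OS release it extends
    cat > debug-tools/usr/lib/extension-release.d/extension-release.debug-tools <<EOF
    ID=fooos
    VERSION_ID=0.7
    EOF

    # Pack it up and activate it on the host
    mksquashfs debug-tools debug-tools.raw
    cp debug-tools.raw /var/lib/extensions/
    systemd-sysext merge
    systemd-sysext status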
For adding in additional somewhat isolated system services in my model, Portable Services are the proposed tool of choice. Portable services are in most ways just like regular system services; they could be included in the system OS image or an extension image. However, portable services use RootImage= to run off separate disk images, thus within their own namespace. Images set up this way have various ways to integrate into the host OS, as they are in most ways regular system services, which just happen to bring their own directory tree. Also, unlike regular system services, for them sandboxing is opt-out rather than opt-in. In my model, here too the disk images are Verity protected and thus immutable. Just like the host OS they are GPT disk images that come with a /usr/ partition and Verity data, along with signing.
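Attaching such a service image is a single operation with the existing tooling; a sketch (the image and unit names are made up):

    # Attach a portable service image; its units become available on the host
    portablectl attach /var/lib/portables/foo-vpn.raw

    # Then manage it like any other service
    systemctl enable --now foo-vpn.service

    # Under the hood the service runs from its own image, roughly:
    #   [Service]
    #   RootImage=/var/lib/portables/foo-vpn.raw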
Finally, the actual payload of the OS, i.e. the apps. To be useful in real life here it is important to hook into existing ecosystems, so that a large set of apps are available. Given that on Linux flatpak (or on servers OCI containers) are the established format that pretty much won they are probably the way to go. That said, I think both of these mechanisms have relatively weak properties, in particular when it comes to security, since immutability/measurements and similar are not provided. This means, unlike for system extensions and portable services a complete trust chain with attestation and per-app cryptographically protected data is much harder to implement sanely.
What I'd like to underline here is that the main system OS image, as well as the system extension images and the portable service images are put together the same way: they are GPT disk images, with one immutable file system and associated Verity data. The latter two should also contain a PKCS#7 signature for the top-level Verity hash. This uniformity has many benefits: you can use the same tools to build and process these images, but most importantly: by using a single way to validate them throughout the stack (i.e. Verity, in the latter cases with PKCS#7 signatures), validation and measurement is straightforward. In fact it's so obvious that we don't even have to implement it in systemd: the kernel has direct support for this Verity signature checking natively already (IMA).
So, by composing a system at runtime from a host image, extension images and portable service images we have a nicely modular system where every single component is cryptographically validated on every single IO operation, and every component is measured, in its entire combination, directly in the kernel's IMA subsystem.
(Of course, once you add the desktop apps or OCI containers on top, then these properties are lost further down the chain. But well, a lot is already won, if you can close the chain that far down.)
Note that system extensions are not designed to replicate the fine
grained packaging logic of RPM/dpkg. Of course, systemd-sysext
is a
generic tool, so you can use it for whatever you want, but there's a
reason it does not bring support for a dependency language: the goal
here is not to replicate traditional Linux packaging (we have that
already, in RPM/dpkg, and I think they are actually OK for what they
do) but to provide delivery of larger, coarser sets of functionality,
in lockstep with the underlying OS' life-cycle and in particular with
no interdependencies, except on the underlying OS.
Also note that depending on the use case it might make sense to also
use system extensions to modularize the initrd
step. This is
probably less relevant for a desktop OS, but for server systems it
might make sense to package up support for specific complex storage in
a systemd-sysext
system extension, which can be applied to the
initrd that is built into the unified kernel. (In fact, we have been
working on implementing signed yet modular initrd support to general
purpose Fedora this way.)
Note that portable services are composable from system extensions too, by the way. This makes them even more useful, as you can share a common runtime between multiple portable services, or even use the host image as common runtime for portable services. In this model a common runtime image is shared between one or more system extensions, and composed at runtime via an overlayfs instance.
Having an immutable, cryptographically locked down host OS is great I think, and if we have some moderate modularity on top, that's also great. But oftentimes it's useful to be able to depart/compromise from that for some specific use cases, i.e. provide a bridge for example to allow workloads designed around RPM/dpkg package management to coexist reasonably nicely with such an immutable host.
For this purpose in my model I'd propose using systemd-nspawn
containers. The containers are focused on OS containerization,
i.e. they allow you to run a full OS with init system and everything
as payload (unlike for example Docker containers which focus on a
single service, and where running a full OS in it is a mess).
Running systemd-nspawn
containers for such secondary OS installs has
various nice properties. One of course is that systemd-nspawn
supports the same level of cryptographic image validation that we rely
on for the host itself. Thus, to some level the whole OS trust chain
is reasonably recursive if desired: the firmware validates the OS, and the OS can
validate a secondary OS installed within it. In fact, we can run our
trusted OS recursively on itself and get similar security guarantees!
Besides these security aspects, systemd-nspawn also has really nice properties when it comes to integration with the host. For example the --bind-user= switch permits binding a host user record and their directory into a container as a simple one-step operation. This makes it extremely easy to have a single user and $HOME but share it concurrently with the host and a zoo of secondary OSes in systemd-nspawn containers, which each could run different distributions even.
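A sketch of what running such a secondary OS could look like (the image path and user name are placeholders):

    # Boot a full secondary OS from a disk image, binding the host user
    # "alice" and her home directory into it
    systemd-nspawn --image=/var/lib/machines/tumbleweed.raw \
                   --bind-user=alice \
                   --boot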
Superficially, an OS with an immutable /usr/
appears much less
hackable than an OS where everything is writable. Moreover, an OS
where everything must be signed and cryptographically validated makes
it hard to insert your own code, given you are unlikely to possess
access to the signing keys.
To address this issue other systems have supported a "developer" mode: when entered the security guarantees are disabled, and the system can be freely modified, without cryptographic validation. While that's a great concept to have I doubt it's what most developers really want: the cryptographic properties of the OS are great after all, it sucks having to give them up once developer mode is activated.
In my model I'd thus propose two different approaches to this problem. First of all, I think there's value in allowing users to additively extend/override the OS via local developer system extensions. With this scheme the underlying cryptographic validation would remain intact, but — if this form of development mode is explicitly enabled — the developer could add in more resources from local storage, that are not tied to the OS builder's chain of trust, but a local one (i.e. simply backed by encrypted storage of some form).
The second approach is to make it easy to extend (or in fact replace) the set of trusted validation keys, with local ones that are under the control of the user, in order to make it easy to operate with kernel, OS, extension, portable service or container images signed by the local developer without involvement of the OS builder. This is relatively easy to do for components down the trust chain, i.e. the elements further up the chain should optionally allow additional certificates to allow validation with.
(Note that systemd currently has no explicit support for a "developer" mode like this. I think we should add that sooner or later however.)
Closely related to the question of developer mode is the question of code signing. If you ask me, the status quo of UEFI SecureBoot code signing in the major Linux distributions is pretty sad. The work to get stuff signed is massive, but in effect it delivers very little in return: because initrds are entirely unprotected, and reside on partitions lacking any form of cryptographic integrity protection, any attacker can trivially modify the boot process of any such Linux system and freely collect FDE passphrases as they are entered. There's little value in signing the boot loader and kernel in a complex bureaucracy if it then happily loads entirely unprotected code that processes the actually relevant security credentials: the FDE keys.
In my model, through use of unified kernels this important gap is closed, hence UEFI SecureBoot code signing becomes an integral part of the boot chain from firmware to the host OS. Unfortunately, code signing – and having something a user can locally hack – are to some level conflicting. However, I think we can improve the situation here, and put more emphasis on enrolling developer keys in the trust chain easily. Specifically, I see one relevant approach here: enrolling keys directly in the firmware is something that we should make less of a theoretical exercise and more something we can realistically deploy. See this work in progress making this more automatic and eventually safe. Other approaches are conceivable (including some that build on existing MokManager infrastructure), but given the politics involved, are harder to conclusively implement.
What I explain above is put together with running on a bare metal
system in mind. However, one of the stated goals is to make the OS
adaptive enough to also run in a container environment (specifically: systemd-nspawn) nicely. Booting a disk image on bare metal or in a
VM generally means that the UEFI firmware validates and invokes the
boot loader, and the boot loader invokes the kernel which then
transitions into the final system. This is different for containers:
here the container manager immediately calls the init system, i.e. PID
1. Thus the validation logic must be different: cryptographic
validation must be done by the container manager. In my model this is
solved by shipping the OS image not only with a Verity data partition
(as is already necessary for the UEFI SecureBoot trust chain, see
above), but also with another partition, containing a PKCS#7 signature
of the root hash of said Verity partition. This of course is exactly
what I propose for both the system extension and portable service
image. Thus, in my model the images for all three uses are put
together the same way: an immutable /usr/
partition, accompanied by
a Verity partition and a PKCS#7 signature partition. The OS image
itself then has two ways "into" the trust chain: either through the
signed unified kernel in the ESP (which is used for bare metal and VM
boots) or by using the PKCS#7 signature stored in the partition
(which is used for container/systemd-nspawn
boots).
A fully immutable and signed OS has to establish trust in the user
data it makes use of before doing so. In the model I describe here,
for /etc/
and /var/
we do this via disk encryption of the root
file system (in combination with integrity checking). But the point
where the root file system is mounted comes relatively late in the
boot process, and thus cannot be used to parameterize the boot
itself. In many cases it's important to be able to parameterize the
boot process however.
For example, for the implementation of the developer mode indicated above it's useful to be able to pass this fact safely to the initrd, in combination with other fields (e.g. hashed root password for allowing in-initrd logins for debug purposes). After all, if the initrd is pre-built by the vendor and signed as a whole together with the kernel it cannot be modified to carry such data directly (which is in fact how parameterizing of the initrd to a large degree was traditionally done).
In my model this is achieved through system credentials, which allow passing parameters to systems (and services for the matter) in an encrypted and authenticated fashion, bound to the TPM2 chip. This means that we can securely pass data into the initrd so that it can be authenticated and decrypted only on the system it is intended for and with the unified kernel image it was intended for.
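A sketch with the existing systemd-creds tooling (the credential name and content are made up for the example):

    # Encrypt a secret so that only this machine's TPM2 can decrypt it
    echo -n "developer-mode=1" | \
        systemd-creds encrypt --with-key=tpm2 --name=boot.params - boot.params.cred

    # A service (or the initrd) can then consume it, e.g. via:
    #   LoadCredentialEncrypted=boot.params:/path/to/boot.params.cred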
In my model the OS would also carry a swap partition. For the simple reason that only then systemd-oomd.service can provide the best results. Also see In defence of swap: common misconceptions.
We have a rough idea how the system shall be organized now, let's next focus on the deployment cycle: software needs regular update cycles, and software that is not updated regularly is a security problem. Thus, I am sure that any modern system must be automatically updated, without this requiring avoidable user interaction.
In my model, this is the job for systemd-sysupdate. It's a relatively simple A/B image updater: it operates either on partitions, on regular files in a directory, or on subdirectories in a directory. Each entry has a version (which is encoded in the GPT partition label for partitions, and in the filename for regular files and directories): whenever an update is initiated the oldest version is erased, and the newest version is downloaded.
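A rough sketch of such a transfer definition (the URL and patterns are made up, and the exact option names should be checked against the systemd-sysupdate documentation of your systemd version):

    # /usr/lib/sysupdate.d/50-usr.conf
    [Source]
    Type=url-file
    Path=https://download.example.com/fooOS/
    MatchPattern=fooOS_@v.usr.raw.xz

    [Target]
    Type=partition
    Path=auto
    MatchPattern=fooOS_@v
    MatchPartitionType=usr

    # An update is then simply:
    #   systemd-sysupdate update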
With the setup described above a system update becomes a really simple operation. On each update the systemd-sysupdate tool downloads a /usr/ file system partition, an accompanying Verity partition and a PKCS#7 signature partition, and drops them into the host's partition table (where they possibly replace the oldest version so far stored there). Then it downloads a unified kernel image and drops it into the EFI System Partition's /EFI/Linux (as per Boot Loader Specification; possibly erasing the oldest such file there). And that's already the whole update process: four files are downloaded from the server, unpacked and put in the most straightforward of ways into the partition table or file system. Unlike in other OS designs there's no mechanism required to explicitly switch to the newer version, the aforementioned systemd-boot logic will automatically pick the newest kernel once it is dropped in.
Above we talked a lot about modularity, and how to put systems
together as a combination of a host OS image, system extension images
for the initrd and the host, portable service images and
systemd-nspawn
container images. I already emphasized that these
image files are actually always the same: GPT disk images with
partition definitions that match the Discoverable Partition
Specification. This comes very handy when thinking about updating: we
can use the exact same systemd-sysupdate
tool for updating these
other images as we use for the host image. The uniformity of the
on-disk format allows us to update them uniformly too.
Automatic OS updates do not come without risks: if they happen automatically and an update goes wrong, this might mean your system gets automatically updated into a brick. This of course is less than ideal. Hence it is essential to address this reasonably automatically. In my model, there's systemd's Automatic Boot Assessment for that. The mechanism is simple: whenever a new unified kernel image is
dropped into the system it will be stored with a small integer counter
value included in the filename. Whenever the unified kernel image is
selected for booting by systemd-boot
, it is decreased by one. Once
the system booted up successfully (which is determined by userspace)
the counter is removed from the file name (which indicates "this entry
is known to work"). If the counter ever hits zero, this indicates that
it tried to boot it a couple of times, and each time failed, thus is
apparently "bad". In this case systemd-boot
will not consider the
kernel anymore, and revert to the next older (that doesn't have a
counter of zero).
By sticking the boot counter into the filename of the unified kernel
we can directly attach this information to the kernel, and thus need
not concern ourselves with cleaning up secondary information about the
kernel when the kernel is removed. Updating with a tool like
systemd-sysupdate
remains a very simple operation hence: drop one
old file, add one new file.
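As an illustration of how the counter is encoded in the filename (version numbers and the ESP mount point are placeholders):

    # Freshly installed, 3 boot attempts left:
    #   /efi/EFI/Linux/fooOS_0.8+3.efi
    # After one failed attempt (2 tries left, 1 done):
    #   /efi/EFI/Linux/fooOS_0.8+2-1.efi
    # After a successful boot the counter is dropped ("known to work"):
    #   /efi/EFI/Linux/fooOS_0.8.efi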
I already mentioned that systemd-boot
automatically picks the newest
unified kernel image to boot, by looking at the version encoded in the
filename. This is done via a simple
strverscmp()
call (well, truth be told, it's a modified version of that call,
different from the one implemented in libc, because real-life package
managers use more complex rules for comparing versions these days, and
hence it made sense to do that here too). The concept of having
multiple entries of some resource in a directory, and picking the
newest one automatically is a powerful concept, I think. It means
adding/removing new versions is extremely easy (as we discussed above,
in systemd-sysupdate
context), and allows stateless determination of
what to use.
If systemd-boot
can do that, what about system extension images,
portable service images, or systemd-nspawn
container images that do
not actually use systemd-boot
as the entrypoint? All these tools
actually implement the very same logic, but on the partition level: if
multiple suitable /usr/ partitions exist, then the newest is determined by comparing their GPT partition labels.
This is in a way the counterpart to the systemd-sysupdate
update
logic described above: we always need a way to determine which
partition to actually then use after the update took place: and this
becomes very easy each time: enumerate possible entries, pick the
newest as per the (modified) strverscmp()
result.
In my model the device's users and their home directories are managed
by systemd-homed. This means they are relatively self-contained and can be migrated easily between devices. The numeric UID assignment for each user is done at
the moment of login only, and the files in the home directory are
mapped as needed via a uidmap
mount. It also allows us to protect
the data of each user individually with a credential that belongs to
the user itself. i.e. instead of binding confidentiality of the user's
data to the system-wide full-disk-encryption each user gets their own
encrypted home directory where the user's authentication token
(password, FIDO2 token, PKCS#11 token, recovery key…) is used as
authentication and decryption key for the user's data. This brings
a major improvement for security as it means the user's data is
cryptographically inaccessible except when the user is actually logged
in.
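Creating such a self-contained, individually encrypted home area is a one-liner with the existing tooling; a sketch (the user name is a placeholder):

    # Create a LUKS2-backed home directory whose encryption is unlocked
    # with the user's own authentication credential
    homectl create alice --storage=luks --fs-type=btrfs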
It also allows us to correct another major issue with traditional Linux systems: the way data encryption works during system suspend. Traditionally on Linux the disk encryption credentials (e.g. the LUKS passphrase) are kept in memory also while the system is suspended. This is a bad choice for security, since many (most?) of us probably never turn off their laptop but suspend it instead. But if the decryption key is always present in unencrypted form during the suspended time, then it could potentially be read from there by a sufficiently equipped attacker.
By encrypting the user's home directory with the user's authentication token we can first safely "suspend" the home directory before going to the system suspend state (i.e. flush out the cryptographic keys needed to access it). This means any process currently accessing the home directory will be frozen for the time of the suspend, but that's expected anyway during a system suspend cycle. Why is this better than the status quo ante? In this model the home directory's cryptographic key material is erased during suspend, but it can be safely reacquired on resume, from system code. If the system is only encrypted as a whole however, then the system code itself couldn't reauthenticate the user, because it would be frozen too. By separating home directory encryption from the root file system encryption we can avoid this problem.
So we discussed the organization of the partitions of OS images multiple times in the above, each time focusing on a specific aspect. Let's now summarize how this should look all together.
In my model, the initial, shipped OS image should look roughly like this:
1. An UEFI System Partition (ESP), with systemd-boot as boot loader and one unified kernel
2. A /usr/ partition (version "A"), with a label fooOS_0.7 (under the assumption we called our project fooOS and the image version is 0.7)
3. A Verity partition for the /usr/ partition (version "A"), with the same label
4. A partition carrying the Verity root hash for the /usr/ partition (version "A"), along with a PKCS#7 signature of it, also with the same label
On first boot this is augmented by systemd-repart like this:
5. A second /usr/ partition (version "B"), initially with the label _empty (which is the label systemd-sysupdate uses to mark partitions that currently carry no valid payload)
6. A matching Verity partition for it (version "B"), also with the label _empty
7. A matching Verity root hash/signature partition for it (version "B"), also with the label _empty
8. A root file system, encrypted and locked to the TPM2
9. A home file system, integrity protected via a key also tied to the TPM2 (encryption is unnecessary here, since systemd-homed adds that on its own, and it's nice to avoid duplicate encryption)
10. A swap partition, encrypted and locked to the TPM2
Then, on the first OS update the partitions 5, 6, 7 are filled with a new version of the OS (let's say 0.8) and thus get their label updated to fooOS_0.8. After a boot, this version is active.
On a subsequent update the three partitions fooOS_0.7
get wiped and
replaced by fooOS_0.9
and so on.
On factory reset, the partitions 8, 9, 10 are deleted, so that
systemd-repart
recreates them, using a new set of cryptographic
keys.
Here's a graphic that hopefully illustrates how the partition table evolves from the shipped image, through first boot, multiple update cycles and eventual factory reset:
So let's summarize the intended chain of trust (for bare metal/VM boots) that ensures every piece of code in this model is signed and validated, and any system secret is locked to TPM2.
First, firmware (or possibly shim) authenticates systemd-boot.
Once systemd-boot picks a unified kernel image to boot, it is also authenticated by firmware/shim.
The unified kernel image contains an initrd, which is the first userspace component that runs. It finds any system extensions passed into the initrd, and sets them up through Verity. The kernel will validate the Verity root hash signature of these system extension images against its usual keyring.
The initrd also finds credentials passed in, then securely unlocks (which means: decrypts + authenticates) them with a secret from the TPM2 chip, locked to the kernel image itself.
The kernel image also contains a kernel command line which contains
a usrhash=
option that pins the root hash of the /usr/
partition
to use.
The initrd then unlocks the encrypted root file system, with a secret bound to the TPM2 chip.
The system then transitions into the main system, i.e. the combination of the Verity protected /usr/ and the encrypted root file system. It then activates two more encrypted (and/or integrity protected) volumes for /home/ and swap, also with a secret tied to the TPM2 chip.
Here's an attempt to illustrate the above graphically:
This is the trust chain of the basic OS. Validation of system
extension images, portable service images, systemd-nspawn
container
images always takes place the same way: the kernel validates these
Verity images along with their PKCS#7 signatures against the kernel's
keyring.
In the above I left the choice of file systems unspecified. For the
immutable /usr/
partitions squashfs
might be a good candidate, but
any other that works nicely in a read-only fashion and generates
reproducible results is a good choice, too. The home directories as managed by systemd-homed should certainly use btrfs, because it's the only general purpose file system supporting online grow and shrink, which systemd-homed can take advantage of to manage storage.
For the root file system btrfs
is likely also the best idea. That's
because we intend to use LUKS/dm-crypt
underneath, which by default
only provides confidentiality, not authenticity of the data (unless
combined with dm-integrity
). Since btrfs
(unlike xfs/ext4) does
full data checksumming it's probably the best choice here, since it
means we don't have to use dm-integrity
(which comes at a higher
performance cost).
In the discussion above a lot of focus was put on setting up the OS
and completing the partition layout and such on first boot. This means
installing the OS becomes as simple as dd
-ing (i.e. "streaming") the
shipped disk image into the final HDD medium. Simple, isn't it?
Of course, such a scheme is just too simple for many setups in real
life. Whenever multi-boot is required (i.e. co-installing an OS
implementing this model with another unrelated one), dd
-ing a disk
image onto the HDD is going to overwrite user data that was supposed
to be kept around.
In order to cover this case, in my model we'd use systemd-repart (again!) to allow streaming the source disk image into the target HDD in a smarter, additive way. The tool after all is
purely additive: it will add in partitions or grow them if they are
missing or too small. systemd-repart
already has all the necessary
provisions to not only create a partition on the target disk, but also
copy blocks from a raw installer disk. An install operation would then become a two-step process: one invocation of systemd-repart that
adds in the /usr/
, its Verity and the signature partition to the
target medium, populated with a copy of the same partition of the
installer medium. And one invocation of bootctl
that installs the
systemd-boot
boot loader in the ESP. (Well, there's one thing
missing here: the unified OS kernel also needs to be dropped into the
ESP. For now, this can be done with a simple cp
call. In the long
run, this should probably be something bootctl
can do as well, if
told so.)
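A sketch of that two-step install flow (device paths are placeholders, and the block-copying repart drop-ins are assumed to be provided by the installer image):

    # Step 1: replicate the /usr/, Verity and signature partitions from the
    # booted installer medium onto the target disk (purely additive)
    systemd-repart --dry-run=no /dev/sdb

    # Step 2: install the boot loader into the target ESP, and copy in the
    # unified kernel (which bootctl cannot do for us yet)
    bootctl install --esp-path=/mnt/target-esp
    cp fooOS_0.7.efi /mnt/target-esp/EFI/Linux/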
So, with this we have a simple scheme to cover all bases: we
can either just dd
an image to disk, or we can stream an image onto
an existing HDD, adding a couple of new partitions and files to the
ESP.
Of course, in reality things are more complex than that even: there's
a good chance that the existing ESP is simply too small to carry
multiple unified kernels. In my model, the way to address this is by
shipping two slightly different systemd-repart
partition definition
file sets: the ideal case when the ESP is large enough, and a
fallback case, where it isn't and where we then add in an additional XBOOTLDR partition (as per the Discoverable Partitions
Specification). In that mode the ESP carries the boot loader, but the
unified kernels are stored in the XBOOTLDR partition. This scenario is
not quite as simple as the XBOOTLDR-less scenario described first, but
is equally well supported in the various tools. Note that
systemd-repart
can be told size constraints on the partitions it
shall create or augment, thus to implement this scheme it's enough to
invoke the tool with the fallback partition scheme if invocation with
the ideal scheme fails.
Either way: regardless how the partitions, the boot loader and the
unified kernels ended up on the system's hard disk, on first boot the
code paths are the same again: systemd-repart
will be called to
augment the partition table with the root file system, and properly
encrypt it, as was already discussed earlier here. This means: all
cryptographic key material used for disk encryption is generated on
first boot only, the installer phase does not encrypt anything.
Traditionally on Linux three types of systems were common: "installed" systems, i.e. that are stored on the main storage of the device and are the primary place people spend their time in; "installer" systems which are used to install them and whose job is to copy and set up the packages that make up the installed system; and "live" systems, which were a middle ground: a system that behaves like an installed system in most ways, but lives on removable media.
In my model I'd like to remove the distinction between these three
concepts as much as possible: each of these three images should carry
the exact same /usr/
file system, and should be suitable to be
replicated the same way. Once installed the resulting image can also
act as an installer for another system, and so on, creating a certain
"viral" effect: if you have one image or installation it's
automatically something you can replicate 1:1 with a simple
systemd-repart
invocation.
The above explains how the image should look and how its first boot and update cycle will modify it. But this leaves one question unanswered: how to actually build the initial image for OS instances according to this model?
Note that there's nothing too special about the images following this model: they are ultimately just GPT disk images with Linux file systems, following the Discoverable Partition Specification. This means you can use any set of tools of your choice that can put together GPT disk images for compliant images.
I personally would use mkosi for this purpose though. It's designed to generate compliant images, and has a rich toolset for SecureBoot and signed/Verity file systems already in place.
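As a very rough sketch of what a build description could look like (option names vary between mkosi versions, so treat this purely as illustration):

    # mkosi.default
    [Distribution]
    Distribution=fedora
    Release=36

    [Output]
    Format=gpt_squashfs
    Bootable=yes

    [Packages]
    Packages=systemd,kernel-core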
What is key here is that this model doesn't depart from RPM and dpkg, instead it builds on top of that: in this model they are excellent for putting together images on the build host, but deployment onto the runtime host does not involve individual packages.
I think one should not underestimate the value traditional distributions bring regarding security, integration and general polish. The concepts I describe above are inherited from this, but depart from the idea that distribution packages are a runtime concept and make them a build-time concept instead.
Note that the above is pretty much independent from the underlying distribution.
I have no illusions, general purpose distributions are not going to adopt this model as their default any time soon, and it's not even my goal that they do that. The above is my personal vision, and I don't expect people to buy into it 100%, and that's fine. However, what I am interested in is finding the overlaps, i.e. work with people who buy 50% into this vision, and share the components.
My goals here thus are to:
Get distributions to move to a model where images like this can be built from the distribution easily. Specifically this means that distributions make their OS hermetic in /usr/.
Find the overlaps, share components with other projects to revisit how distributions are put together. This is already happening, see systemd-tmpfiles and systemd-sysusers support in various distributions, but I think there's more to share.
Make people interested in building actual real-world images based on general purpose distributions adhering to the model described above. I'd love a "GnomeBook" image with full trust properties, that is built from true Linux distros, such as Fedora or ArchLinux.
What about ostree? Doesn't ostree already deliver what this blog story describes?
ostree is fine technology, but with respect to security and robustness properties it's not too interesting I think, because unlike image-based approaches it cannot really deliver integrity/robustness guarantees over the whole tree easily. To be able to trust an ostree setup you have to establish trust in the underlying file system first, and the complexity of the file system makes that challenging. To provide an effective offline-secure trust chain through the whole depth of the stack it is essential to cryptographically validate every single I/O operation. In an image-based model this is trivially easy, but in the ostree model it's not possible with current file system technology, and even if this is added in one way or another in the future (though I am not aware of anyone doing on-access file-based integrity that spans a whole hierarchy of files and that is compatible with ostree's hardlink farm model) I think validation is still at too high a level, since Linux file system developers made very clear their implementations are not robust to rogue images. (There's this stuff planned, but doing structural authentication ahead of time instead of on access makes the idea too weak — and I'd expect too slow — in my eyes.)
With my design I want to deliver similar security guarantees as ChromeOS does, but ostree is much weaker there, and I see no perspective of this changing. In a way ostree's integrity checks are similar to RPM's and enforced on download rather than on access. In the model I suggest above, it's always on access, and thus safe towards offline attacks (i.e. evil maid attacks). In today's world, I think offline security is absolutely necessary though.
That said, ostree
does have some benefits over the model
described above: it naturally shares file system inodes if many of
the modules/images involved share the same data. It's thus more
space efficient on disk (and thus also in RAM/cache to some
degree) by default. In my model it would be up to the image
builders to minimize shipping overly redundant disk images, by
making good use of suitably composable system extensions.
What about configuration management?
At first glance immutable systems and configuration management
don't go that well together. However, do note, that in the model
I propose above the root file system with all its contents,
including /etc/
and /var/
is actually writable and can be
modified like on any other typical Linux distribution. The only
exception is /usr/
where the immutable OS is hermetic. That
means configuration management tools should work just fine in this
model – up to the point where they are used to install additional
RPM/dpkg packages, because that's something not allowed in the
model above: packages need to be installed at image build time and
thus on the image build host, not the runtime host.
What about non-UEFI and non-TPM2 systems?
The above is designed around the feature set of contemporary PCs, and this means UEFI and TPM2 being available (simply because the PC is pretty much defined by the Windows platform, and current versions of Windows require both).
I think it's important to make the best of the features of today's PC hardware, and then find suitable fallbacks on more limited hardware. Specifically this means: if there's desire to implement something like this on non-UEFI or non-TPM2 hardware we should look for suitable fallbacks for the individual functionality, but generally try to add glue to the old systems so that conceptually they behave more like the new systems instead of the other way round. Or in other words: most of the above is not strictly tied to UEFI or TPM2, and for many cases there already are reasonable fallbacks in place for more limited systems. Of course, without TPM2 many of the security guarantees will be weakened.
How would you name an OS built that way?
I think a desktop OS built this way if it has the GNOME desktop should of course be called GnomeBook, to mimic the ChromeBook name. ;-)
But in general, I'd call hermetic, adaptive, immutable OSes like this "particles".
Help making Distributions Hermetic in /usr/!
One of the core ideas of the approach described above is to make
the OS hermetic in /usr/
, i.e. make it carry a comprehensive
description of what needs to be set up outside of it when
instantiated. Specifically, this means that system users that are
needed are declared in systemd-sysusers
snippets, and skeleton
files and directories are created via systemd-tmpfiles
. Moreover
additional partitions should be declared via systemd-repart
drop-ins.
At this point some distributions (such as Fedora) are (probably more by accident than on purpose) already mostly hermetic in /usr/, at least for the most basic parts of the OS. However, this is not complete: many daemons require specific resources to be set up in /var/ or /etc/ before they can work, and the relevant packages do not carry systemd-tmpfiles descriptions that add them if missing. So there are two ways you could help
here: politically, it would be highly relevant to convince
distributions that an OS that is hermetic in /usr/
is highly
desirable and it's a worthy goal for packagers to get there. More
specifically, it would be desirable if RPM/dpkg packages would
ship with enough systemd-tmpfiles
information so that
configuration files the packages strictly need for operation are
symlinked (or copied) from /usr/share/factory/
if they are
missing (even better of course would be if packages from their upstream sources would just work with an empty /etc/ and /var/, creating what they need themselves and falling back to good defaults in the absence of configuration files).
Note that distributions that adopted systemd-sysusers, systemd-tmpfiles and the /usr/ merge are already quite close to providing an OS that is hermetic in /usr/. These were the big, major advancements: making the image fully hermetic should be less controversial – at least that's my guess.
Also note that making the OS hermetic in /usr/ is not just useful in scenarios like the above. It also means that stuff like this and like this can work well.
Fill in the gaps!
I already mentioned a couple of missing bits and pieces in the implementation of the overall vision. In the systemd project we'd be delighted to review/merge any PRs that fill in the voids.
Build your own OS like this!
Of course, while we built all these building blocks and they have been adopted to various levels and various purposes in the various distributions, no one so far built an OS that puts things together just like that. It would be excellent if we had communities that work on building images like what I propose above. i.e. if you want to work on making a secure GnomeBook as I suggest above a reality that would be more than welcome.
How could this look specifically? Pick an existing distribution, write a set of mkosi descriptions plus some additional drop-in files, and then build this on some build infrastructure. While doing so, report the gaps, and help us address them.
systemd-tmpfiles
systemd-sysusers
systemd-boot
systemd-stub
systemd-sysext
systemd-portabled, Portable Services Introduction
systemd-repart
systemd-nspawn
systemd-sysupdate
systemd-creds, System and Service Credentials
systemd-homed
And that's all for now.
![]() |
|
April 29, 2022 | |
![]() |
I've been working on kopper recently, which is a complementary project to zink. Just as zink implements OpenGL in terms of Vulkan, kopper seeks to implement the GL window system bindings - like EGL and GLX - in terms of the Vulkan WSI extensions. There are several benefits to doing this, which I'll get into in a future post, but today's story is really about libX11 and libxcb.
Yes, again.
One important GLX feature is the ability to set the swap interval, which is how you get tear-free rendering by syncing buffer swaps to the vertical retrace. A swap interval of 1 is the typical case, where an image update happens once per frame. The Vulkan way to do this is to set the swapchain present mode to FIFO, since FIFO updates are implicitly synced to vblank. Mesa's WSI code for X11 uses a swapchain management thread for FIFO present modes. This thread is started from inside the vulkan driver, and it only uses libxcb to talk to the X server. But libGL is a libX11 client library, so in this scenario there is always an "xlib thread" as well.
libX11 uses libxcb internally these days, because otherwise there would be no way to intermix xlib and xcb calls in the same process. But it does not use libxcb's reflection of the protocol: XGetGeometry does not call xcb_get_geometry, for example. Instead, libxcb has an API to allow other code to take over the write side of the display socket, with a callback mechanism to get it back when another xcb client issues a request. The callback function libX11 uses here is straightforward: lock the Display, flush out any internally buffered requests, and return the sequence number of the last request written. Both libraries need this sequence number for various reasons internally; xcb for example uses it to make sure replies go back to the thread that issued the request.
But "lock the Display" here really means call into a vtable in the Display struct. That vtable is filled in during XOpenDisplay, but the individual function pointers are only non-NULL if you called XInitThreads beforehand. And if you're libGL, you have no way to enforce that; your public-facing API operates on a Display that was already created.
So now we see the race. The queue management thread calls into libxcb while the main thread is somewhere inside libX11. Since libX11 has taken the socket, the xcb thread runs the release callback. Since the Display was not made thread-safe at XOpenDisplay time, the release callback does not block, so the xlib thread's work won't be correctly accounted for. If you're lucky the two sides will at least write to the socket atomically with respect to each other, but at this point they have diverging opinions about the request sequence numbering, and it's a matter of time until you crash.
It turns out kopper makes this really easy to hit. Like "resize a glxgears window" easy. However, this isn't just a kopper issue; this race exists for every program that uses xcb on a not-necessarily-thread-safe Display. The only reasonable fix is for libX11 to just always be thread-safe.
April 26, 2022
I recently blogged about how to run a volatile systemd-nspawn container from your host's /usr/ tree, for quickly testing stuff in your host environment, sharing your home directory, but all that without making a single modification to your host, and on an isolated node.
The one-liner discussed in that blog story is great for testing during system software development. Let's have a look at another systemd tool that I regularly use to test things during systemd development, in a relatively safe environment, but still taking full benefit of my host's setup.
For a while now, systemd has been shipping with a simple component called systemd-sysext. Its primary use case goes something like this: on one hand, operating systems with immutable /usr/ hierarchies are fantastic for security, robustness, updating and simplicity, but on the other hand not being able to quickly add stuff to /usr/ is just annoying.
systemd-sysext is supposed to bridge this contradiction: when invoked it will merge a bunch of "system extension" images into /usr/ (and /opt/, as a matter of fact) through the use of read-only overlayfs, making all files shipped in the image instantly and atomically appear in /usr/ during runtime — as if they had always been there. Now, let's say you are building your locked-down OS, with an immutable /usr/ tree, and it comes without the ability to log in, without debugging tools, without anything you want and need when trying to debug and fix something in the system. With systemd-sysext you could use a system extension image that contains all this, drop it into the system, and activate it with systemd-sysext so that it genuinely extends the host system.
(There are many other use cases for this tool. For example, you could build systems that at their base use a generic image, but by installing one or more system extensions get extended with additional, more specific functionality, or drivers, or similar. The tool is generic, use it for whatever you want, but for now let's not get lost in listing all the possibilities.)
What's particularly nice about the tool is that it supports automatically discovered dm-verity images, with signatures and everything. So you can even do this in a fully authenticated, measured, safe way. But I am digressing…
Now that we (hopefully) have a rough understanding of what systemd-sysext is and does, let's discuss how specifically we can use this in the context of system software development: to safely use and test bleeding-edge development code — built freshly from your project's build tree — in your host OS, without having to risk that the host OS is corrupted or becomes unbootable by stuff that didn't quite yet work the way it was envisioned:
The images systemd-sysext merges into /usr/ can be of two kinds: disk images with a file system/verity/signature, or simple, plain directory trees. To make these images available to the tool, they can be placed or symlinked into /usr/lib/extensions/, /var/lib/extensions/, /run/extensions/ (and a bunch of others). So if we now install our freshly built development software into a subdirectory of those paths, then that's entirely sufficient to make it a valid system extension image in the sense of systemd-sysext, and it can thus be merged into /usr/ to try it out.
To be more specific: when I develop systemd itself, here's what I do regularly to see how my new development version would behave on my host system. As preparation I checked out the systemd development git tree first of course, hacked around in it a bit, then built it with meson/ninja. And now I want to test what I just built:
sudo DESTDIR=/run/extensions/systemd-test meson install -C build --quiet --no-rebuild &&
sudo systemd-sysext refresh --force
Explanation: first, we'll install my current build tree as a system extension into /run/extensions/systemd-test/. And then we apply it to the host via the systemd-sysext refresh command. This command will search for all installed system extension images in the aforementioned directories, then unmount (i.e. "unmerge") any previously merged dirs from /usr/ and then freshly mount (i.e. "merge") the new set of system extensions on top of /usr/. And just like that, I have installed my development tree of systemd into the host OS, and all that without actually modifying/replacing even a single file on the host at all. Nothing here actually hit the disk!
Note that all this works on any system really; it is not necessary that the underlying OS even is designed with immutability in mind. Just because the tool was developed with immutable systems in mind doesn't mean you couldn't use it on traditional systems where /usr/ is mutable as well. In fact, my development box actually runs regular Fedora, i.e. is RPM-based and thus has a mutable /usr/ tree. As long as system extensions are applied, the whole of /usr/ becomes read-only though.
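If you want to double-check what is currently applied at any given moment, the tool can tell you. A quick sketch:

systemd-sysext status    # lists the hierarchies (/usr/, /opt/) and the extensions merged into them
findmnt /usr             # while extensions are merged, this shows the read-only overlayfs mount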
Once I am done testing, when I want to revert to how things were without the image installed, it is sufficient to call:
sudo systemd-sysext unmerge
And there you go, all files my development tree generated are gone again, and the host system is as it was before (and /usr/ is mutable again, in case one is on a traditional Linux distribution).
Also note that a reboot (regardless of whether a clean one or an abnormal shutdown) will undo the whole thing automatically, since we installed our build tree into /run/ after all, i.e. a tmpfs instance that is flushed on boot. And given that the overlayfs merge is a runtime thing too, the whole operation was executed without any persistence. Isn't that great?
(You might wonder why I specified --force on the systemd-sysext refresh line earlier. That's because systemd-sysext actually does some minimal version compatibility checks when applying system extension images. For that it will compare the host's /etc/os-release file with the image's /usr/lib/extension-release.d/extension-release.<name>, and refuse operation if the image is not actually built for the host OS version. Here we don't want to bother with dropping that file in there; we know already that the extension image is compatible with the host, as we just built it on it. --force allows us to skip the version check.)
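If you'd rather not rely on --force, here's a small sketch of the alternative (the systemd-test name matches the directory used above; the exact fields required may differ between systemd versions): drop a matching extension-release file into the extension so the version check passes on its own:

. /etc/os-release
sudo mkdir -p /run/extensions/systemd-test/usr/lib/extension-release.d
printf 'ID=%s\nVERSION_ID=%s\n' "$ID" "$VERSION_ID" |
    sudo tee /run/extensions/systemd-test/usr/lib/extension-release.d/extension-release.systemd-test >/dev/null
sudo systemd-sysext refresh    # no --force needed now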
You might wonder: what about combining the idea from the previous blog story (regarding running containers off the host /usr/ tree) with system extensions? Glad you asked. Right now we have no support for this, but it's high on our TODO list (patches welcome, of course!), i.e. a new switch for systemd-nspawn called --system-extension= that would allow merging one or more such extensions into the booted container tree would be stellar. With that, with a single command I could run a container off my host OS but with a development version of systemd dropped in, all without any persistence. How awesome would that be?
(Oh, and in case you wonder, all of this only works with distributions that have completed the /usr/-merge. On legacy distributions that didn't do that and still place parts of /usr/ all over the hierarchy, the above won't work, since merging /usr/ trees via overlayfs is pretty pointless if the OS is not hermetic in /usr/.)
And that's all for now. Happy hacking!
April 24, 2022
The title might be a bit hyperbolic here, but we’re indeed exploring a first step in that direction with radv. The impetus here is the ExecuteIndirect command in Direct3D 12 and some games that are using it in non-trivial ways (e.g. Halo Infinite).
ExecuteIndirect can be seen as an extension of what we have in Vulkan with vkCmdDrawIndirectCount. It adds extra capabilities. To support that with vkd3d-proton we need the following indirect Vulkan capabilities:
This functionality happens to be a subset of VK_NV_device_generated_commands and hence I’ve been working on implementing a subset of that extension on radv. Unfortunately, we can’t really give the firmware an “extended indirect draw call” and execute stuff, so we’re stuck generating command buffers on the GPU.
The way the extension works, the application specifies a command “signature” on the CPU, which specifies that for each draw call the application is going to update A, B and C. Then, at runtime, the application provides a buffer with the data for A, B and C for each draw call. The driver then processes that into a command buffer and then executes it as a secondary command buffer.
The workflow is then as follows:

1. vkCmdPreprocessGeneratedCommandsNV, which converts the application buffer into a command buffer (in the preprocess buffer).
2. vkCmdExecuteGeneratedCommandsNV, to execute the generated command buffer.

When the application triggers a draw command in Vulkan, the driver generates GPU commands to do the following:
Of course we skip any of these steps (or parts of them) when they’re redundant. The majority of the complexity is in the register state we have to set. There are multiple parts here:
Fixed function state:
Overall, most of the pipeline state is fairly easy to emit: we just precompute it on pipeline creation and memcpy it over if we switch shaders. The most difficult part is probably the user SGPRs, and the reason for that is that they are derived from a lot of the remaining API state. Note that the list above doesn’t include push constants, descriptor sets or vertex buffers. The driver computes all of these, and generates the user SGPR data from that.
Descriptor sets in radv are just a piece of GPU memory, and radv binds a descriptor set by providing the shader with a pointer to that GPU memory in a user SGPR. Similarly, we have no hardware support for vertex buffers, so radv generates a push descriptor set containing internal texel buffers and then provides a user SGPR with a pointer to that descriptor set.
For push constants, radv has two modes: a portion of the data can be passed in user SGPRs directly, but sometimes a chunk of memory gets allocated and then a pointer to that memory is provided in a user SGPR. This fallback exists because the hardware doesn’t always have enough user SGPRs to fit all the data.
On Vega and later there are 32 user SGPRs, and on earlier GCN GPUs there are 16. This needs to fit pointers to all the referenced descriptor sets (including internal ones like the one for vertex buffers), push constants, builtins like the start vertex and start instance etc. To get the best performance here, radv determines a mapping of API object to user SGPR at shader compile time and then at draw time radv uses that mapping to write user SGPRs.
This results in some interesting behavior, like switching pipelines causing the driver to update all the user SGPRs, because the mapping might have changed.
Furthermore, as an interesting performance hack radv allocates all upload buffers (for the push constants and push descriptor sets), shaders and descriptor pools in a single 4 GiB region of memory, so that we can pass only the bottom 32 bits of all the pointers in a user SGPR, getting us further with the limited number of user SGPRs. We will see later how that makes things difficult for us.
As shown above, radv has a bunch of complexity around state for draw calls, and if we start generating command buffers on the GPU that risks copying a significant part of that complexity to a shader. Luckily ExecuteIndirect and VK_NV_device_generated_commands have some limitations that make this easier. The app can only change:
VK_NV_device_generated_commands also allows changing shaders and the rotation winding of what is considered a primitive backface, but we’ve chosen to ignore that for now since it isn’t needed for ExecuteIndirect (though especially the shader switching could be useful for an application).
The second curveball is that the buffer the application provides needs to provide the same set of data for every draw call. This avoids having to do a lot of serial processing to figure out what the previous state was, which allows processing every draw command in a separate shader invocation. Unfortunately we’re still a bit dependent on the old state that is bound before the indirect command buffer execution:
Remember that for vertex buffers and push constants we may put them in a piece of memory. That piece of memory needs to contain all the vertex buffers/push constants for that draw call, so even if we modify only one of them, we have to copy the rest over. The index buffer is different: in the draw packets for the GPU there is a field that is derived from the index buffer size.
So in vkCmdPreprocessGeneratedCommandsNV radv partitions the preprocess buffer into a command buffer and an upload buffer (for the vertex buffers & push constants), both with a fixed stride based on the command signature. Then it launches a shader which processes a draw call in each invocation:
if (shader uses vertex buffers && we change a vertex buffer) {
   copy all vertex buffers
   update the changed vertex buffers
   emit a new vertex descriptor set pointer
}
if (we change a push constant) {
   if (we change a push constant in memory) {
      copy all push constants
      update changed push constants
      emit a new push constant pointer
   }
   emit all changed inline push constants into user SGPRs
}
if (we change the index buffer) {
   emit new index buffers
}
emit a draw command
insert NOPs up to the stride
In vkCmdExecuteGeneratedCommandsNV radv uses the internal equivalent of vkCmdExecuteCommands to execute the generated command buffer as if it were a secondary command buffer.
Of course one does not simply move part of the driver to GPU shaders without any challenges. In fact we have a whole bunch of them. Some of them just need a bunch of work to solve, some need some extension specification tweaking and some are hard to solve without significant tradeoffs.
A big problem is that the code needed for the limited subset of state that is supported is now in 3 places:
- vkCmdPreprocessGeneratedCommandsNV to build the preprocess buffer.

Having the same functionality in multiple places is a recipe for things going out of sync. This makes it harder to change this code and much easier for bugs to sneak in. This can be mitigated with a lot of testing, but a bunch of GPU work gets complicated quickly. (e.g. the preprocess buffer being larger than needed still results in correct results, getting a second opinion from the shader to check adds significant complexity).
nir_builder gets old quickly
In the driver at the moment we have no good high level shader compiler. As a result a lot of the internal helper shaders are written using the nir_builder helper to generate nir, the intermediate IR of the shader compiler. Example fragment:
nir_push_loop(b);
{
   nir_ssa_def *curr_offset = nir_load_var(b, offset);

   nir_push_if(b, nir_ieq(b, curr_offset, cmd_buf_size));
   {
      nir_jump(b, nir_jump_break);
   }
   nir_pop_if(b, NULL);

   nir_ssa_def *packet_size = nir_isub(b, cmd_buf_size, curr_offset);
   packet_size = nir_umin(b, packet_size, nir_imm_int(b, 0x3ffc * 4));

   nir_ssa_def *len = nir_ushr_imm(b, packet_size, 2);
   len = nir_iadd_imm(b, len, -2);

   nir_ssa_def *packet = nir_pkt3(b, PKT3_NOP, len);

   nir_store_ssbo(b, packet, dst_buf, curr_offset, .write_mask = 0x1,
                  .access = ACCESS_NON_READABLE, .align_mul = 4);
   nir_store_var(b, offset, nir_iadd(b, curr_offset, packet_size), 0x1);
}
nir_pop_loop(b, NULL);
It is clear that this all gets very verbose very quickly. This is somewhat fine as long as all the internal shaders are tiny. However, between this and raytracing our internal shaders are getting significantly bigger and the verbosity really becomes a problem.
Interesting things to explore here are to use glslang, or even to try writing our shaders in OpenCL C and then compiling it to SPIR-V at build time. The challenge there is that radv is built on a diverse set of platforms (including Windows, Android and desktop Linux) which can make significant dependencies a struggle.
Ideally your GPU work is very suitable for pipelining, to avoid synchronization cost on the GPU. If we generate the command buffer and then execute it, we need a full GPU sync point in between, which can get very expensive as it waits until the GPU is idle. To avoid this VK_NV_device_generated_commands has added the separate vkCmdPreprocessGeneratedCommandsNV command, so that the application can batch up a bunch of work before incurring the cost of a sync point.
However, in radv we have to do the command buffer generation in vkCmdExecuteGeneratedCommandsNV as our command buffer generation depends on some of the other state that is bound, but might not be bound yet when the application calls vkCmdPreprocessGeneratedCommandsNV.
Which brings up a slight spec problem: the extension specification doesn’t specify whether the application is allowed to execute vkCmdExecuteGeneratedCommandsNV on multiple queues concurrently with the same preprocess buffer. If all the writing of that buffer happens in vkCmdPreprocessGeneratedCommandsNV that would result in correct behavior, but if the writing happens in vkCmdExecuteGeneratedCommandsNV this results in a race condition.
Remember that radv only passes the bottom 32-bits of some pointers around. As a result the application needs to allocate the preprocess buffer in that 4-GiB range. This in itself is easy: just add a new memory type and require it for this usage. However, the devil is in the details.
For example, what should we do for memory budget queries? That is per memory heap, not memory type. However, a new memory heap does not make sense, as the memory is also still subject to physical availability of VRAM, not only address space.
Furthermore, this 4-GiB region is more constrained than other memory, so it would be a shame if applications start allocating random stuff in it. If we look at the existing usage for a pretty heavy game (HZD) we get about
So typically we have a lot of room available. Ideally the ordering of memory types would get an application to prefer another memory type when we do not need this special region. However, memory object caching poses a big risk here: Would you choose a memory object in the cache that you can reuse/suballocate (potentially in that limited region), or allocate new for a “better” memory type?
Luckily we have not seen that risk play out, but the only real tested user at this point has been vkd3d-proton.
When executing the generated command buffer, radv does so the same way as calling a secondary command buffer. This has a significant limitation: a secondary command buffer cannot call a secondary command buffer on the hardware. As a result the current implementation has a problem if vkCmdExecuteGeneratedCommandsNV gets called on a secondary command buffer.
It is possible to work around this. An example would be to split the secondary command buffer into 3 parts: pre, generated, post. However, that needs a bunch of refactoring to allow multiple internal command buffers per API command buffer.
Don’t expect this upstream very quickly. The main reason for exploring this in radv is ExecuteIndirect support for Halo Infinite, and after some recent updates we’re back in GPU hang limbo with radv/vkd3d-proton there. So while we’re solving that I’m holding off on upstreaming, in case the hangs are caused by the implementation of this extension.
Furthermore, this is only a partial implementation of the extension anyways, with a fair number of limitations that we’d ideally eliminate before fully exposing this extension.
April 20, 2022
With Mesa 22.1 RC1 firmly out the door, most eyes have turned towards Mesa 22.2.
But not all eyes.
No, while most expected me to be rocketing off towards the next shiny feature, one ticket caught my eye:
Mesa 22.1rc1: Zink on Windows doesn’t work even simple wglgears app fails..
Sadly, I don’t support Windows. I don’t have a test machine to run it, and I don’t even have a VM I could spin up to run Lavapipe. I knew that Kopper was going to cause problems with other frontends, but I didn’t know how many other frontends were actually being used.
The answer was not zero, unfortunately. Plenty of users were enjoying the slow, software driver speed of Zink on Windows to spin those gears, and I had just crushed their dreams.
As I had no plans to change anything here, it would take a new hero to set things right.
Who here loves X-Plane?
I love X-Plane. It’s my favorite flight simulator. If I could, I’d play it all day every day. And do you know who my favorite X-Plane developer is?
Friend of the blog and part-time Zink developer, Sidney Just.
Some of you might know him from his extensive collection of artisanal blog posts. Some might have seen his work enabling Vulkan<->OpenGL interop in Mesa on Windows.
But did you know that Sid’s latest project is much more groundbreaking than just bumping Zink’s supported extension count far beyond the reach of every other driver?
What if I told you that this image is Zink running wglgears on an NVIDIA 2070 GPU on Windows at full speed? No software-copy scanout. Just Kopper.
Over the past couple days, Sid’s done the esoteric work of hammering out WSI support for Zink on Windows, making us the first hardware-accelerated, GL 4.6-capable Mesa driver to run natively on Windows.
Don’t believe me?
Recognize a little Aztec Ruins action from GFXBench?
The results are about what we’d expect of an app I’ve literally never run myself (Zink vs. the native NVIDIA driver).
Not too bad at all!
I think we can safely say that Sid has managed to fix the original bug. Thanks, Sid!
But why is an X-Plane developer working on Zink?
The man himself has this to say on the topic:
X-Plane has traditionally been using OpenGL directly for all of its rendering needs. As a result, for years our plugin SDK has directly exposed the game’s OpenGL context to third party plugins, which have used it to render custom avionic screens and GUI elements. When we finally did the switch to Vulkan and Metal in 2020, one of the big issues we faced was how to deal with plugins. Our solution so far has been to rely on native Vulkan/OpenGL driver interop via extensions, which has mostly worked and allowed us to ship with modern backends.
Unfortunately this puts us at the mercy of the driver to provide good interop. Sadly on some platforms, this just isn’t available at all. On others, the drivers are broken, leading to artifacts when mixing Vulkan and GL rendering. To date, our solution has been to just shrug it off and hope for better drivers. X-Plane plugins make use of compatibility profile GL features, as well as core profile features, depending on the author’s skill, so libraries like ANGLE were not an option for us.
This is where Zink comes in for us: Being a real GL driver, it has support for all of the features that we need. Being open source also means that any issues that we do discover are much easier to fix ourselves. We’ve made some progress including Zink into the next major version of X-Plane, X-Plane 12, and it’s looking very promising so far. Our hope is to ship X-Plane 12 with Zink as the GL backend for plugins and leave driver interop issues in the past.
The roots of this interest can also be seen in his blog post from last year where he touches on the future of GL plugin support.
Awesome!
Big Triangle’s definitely paying attention now.
And if any of my readers think this work is cool, go buy yourself a copy of X-Plane to say thanks for contributing back to open source.
April 15, 2022
This article is part of a series on how to setup a bare-metal CI system for Linux driver development. Here are the different articles so far:
In this article, we will finally focus on generating the rootfs/container image of the CI Gateway in a way that enables live patching the system without always needing to reboot.
This work is sponsored by the Valve Corporation.
System updates are a necessary evil for any internet-facing server, unless you want your system to become part of a botnet. This is especially true for CI systems since they let people on the internet run code on machines, often leading to unfair use such as cryptomining (this one is hard to avoid though)!
The problem with system updates is not the 2 or 3 minutes of downtime that it takes to reboot, it is that we cannot reboot while any CI job is running. Scheduling a reboot thus first requires us to stop accepting new jobs, wait for the current ones to finish, then finally reboot. This solution may be acceptable if your jobs take ~30 minutes, but what if they last 6h? A reboot suddenly gets close to a typical 8h work day, and we definitely want to have someone looking over the reboot sequence so they can revert to a previous boot configuration if the new one fails.
This problem may be addressed in a cloud environment by live-migrating services/containers/VMs from a non-updated host to an updated one. This is unfortunately a lot more complex to pull off for a bare-metal CI without having a second CI gateway and designing synchronization systems/hardware to arbitrate access to the test machines' power/serial consoles/boot configuration.
So, while we cannot always avoid the need to drain the CI jobs before rebooting, what we can do is reduce the cases in which we need to perform this action. Unfortunately, containers have been designed with atomic updates in mind (this is why we want to use them), but that means that trivial operations such as adding an ssh key, a Wireguard peer, or updating a firewall rule will require a reboot. A hacky solution may be for the admins to update the infra container, then log into the different CI gateways and manually reproduce the changes they have done in the new container. These changes would be lost at the next reboot, but this is not a problem since the CI gateway would use the latest container when rebooting, which already contains the updates. While possible, this solution is error-prone and not testable ahead of time, which is against the requirements for the gateway we laid out in Part 3.
An improvement to live-updating containers by hand would be to use tools such as Ansible, Salt, or even Puppet to manage and deploy non-critical services and configuration. This would enable live-updating the currently-running container but would need to be run after every reboot. An Ansible playbook may be run locally, so it is not inconceivable for a service to be run at boot that would download the latest playbook and run it. This solution is however forcing developers/admins to decide which services need to have their configuration baked in the container and which services should be deployed using a tool like Ansible... unless...
We could use a tool like Ansible to describe all the packages and services to install, along with their configuration. Creating a container would then be achieved by running the Ansible playbook on a base container image. Assuming that the playbook is truly idempotent (running the playbook multiple times will lead to the same final state), this would mean that there would be no differences between the live-patched container and the new container we created. In other words, we simply morph the currently-running container to the wanted configuration by running the same Ansible playbook we used to create the container, but against the live CI gateway! This will not always remove the need to reboot the CI gateways from time to time (updating the kernel, or services which don't support live-updates without affecting CI jobs), but all the smaller changes can get applied in-situ!
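As a sketch of what that morphing step can look like (the playbook name and invocation here are made up for illustration, not taken from the actual Valve infrastructure): the same playbook used to build the container image is simply re-run on the live gateway over a local connection:

# At container build time the playbook runs against the base image;
# at live-patch time the very same playbook runs against the running gateway.
ansible-playbook -i localhost, -c local gateway.yml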
The base container image has to contain the basic dependencies of a tool like Ansible, but if it were also made to contain all the OS packages, the final image would effectively be split into three container layers: the base OS container, the packages needed, and the configuration. Updating the configuration would thus result in only a few megabytes of updates to download at the next reboot rather than the full OS image, thus reducing the reboot time.
Ansible is perfectly suited to morph a container into its newest version, provided that all the resources used remain static between when the new container was created and when the currently-running container gets live-patched. This is because of Ansible's core principle of idempotency of operations: rather than running commands blindly like in a shell script, it first checks what the current state is and then, if needed, updates the state to match the desired target. This makes it safe to run the playbook multiple times, but it also allows us to only restart services if their configuration or one of their dependencies changed.
When version pinning of packages is possible (Python, Ruby, Rust, Golang, ...), Ansible can guarantee the idempotency that makes live-patching safe. Unfortunately, package managers of Linux distributions are usually not idempotent: they were designed to ship updates, not pin software versions! In practice, this means that there are no guarantees that the package installed during live-patching will be the same as the one installed in the new base container, thus exposing oneself to potential differences in behaviour between the two deployment methods... The only way out of this issue is to create your own package repository and make sure its content will not change between the creation of the new container and the live-patching of all the CI gateways. Failing that, all I can advise you to do is pick a stable distribution which will try its best to limit functional changes between updates within the same distribution version (Alpine Linux, CentOS, Debian, ...).
In the end, Ansible won't always be able to make live-updating your container strictly equivalent to rebooting into its latest version, but as long as you are aware of its limitations (or work around them), it will make updating your CI gateways far less troublesome than it would be otherwise! You will need to find the right balance between live-updatability and ease of maintenance of your gateway's code base.
At this point, you may be wondering how all of this looks in practice! Here is the example of the CI gateways we have been developing for Valve:
And if you are wondering how we can go from these scripts to working containers, here is how:
$ podman run --rm -d -p 8088:5000 --name registry docker.io/library/registry:2
$ env \
IMAGE_NAME=localhost:8088/valve-infra-base-container \
BASE_IMAGE=archlinux \
buildah unshare -- .gitlab-ci/valve-infra-base-container-build.sh
$ env \
IMAGE_NAME=localhost:8088/valve-infra-container \
BASE_IMAGE=valve-infra-base-container \
ANSIBLE_EXTRA_ARGS='--extra-vars service_mgr_override=inside_container -e development=true' \
buildah unshare -- .gitlab-ci/valve-infra-container-build.sh
And if you were willing to use our Makefile, it gets even easier:
$ make valve-infra-base-container BASE_IMAGE=archlinux IMAGE_NAME=localhost:8088/valve-infra-base-container
$ make valve-infra-container BASE_IMAGE=localhost:8088/valve-infra-base-container IMAGE_NAME=localhost:8088/valve-infra-container
Not too bad, right?
PS: These scripts are constantly being updated, so make sure to check out their current version!
In this post, we highlighted the difficulty of keeping the CI Gateways up to date when CI jobs can take multiple hours to complete, preventing new jobs from starting until the current queue is emptied and the gateway has rebooted.
We have then shown that despite looking like competing solutions to deploy services in production, containers and tools like Ansible can actually work well together to reduce the need for reboots by morphing the currently-running container into the updated one. There are however some limits to this solution which are important to keep in mind when designing the system.
In the next post, we will be designing the executor service which is responsible for time-sharing the test machines between different CI/manual jobs. We will thus be talking about deploying test environments, BOOTP, and serial consoles!
That's all for now, thanks for making it to the end!
April 12, 2022
As everyone who’s anyone knows, the next Mesa release branchpoint is coming up tomorrow. Like usual, here’s the rundown on what to expect from zink in this release:
So if you find a zink problem in the 22.1 release of Mesa, it’s definitely because of Kopper and not actually anything zink-related.
But also this is sort-of-almost-maybe a lavapipe blog, and that driver has had a much more exciting quarter. Here’s a rundown.
New Extensions:
Vulkan 1.3 is now supported. We’ve landed a number of big optimizations as well, leading to massively improved CI performance.
Lavapipe: the cutting-edge software implementation of Vulkan.
…as long as you don’t need descriptor indexing.
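If you want to give Lavapipe a spin yourself, here's a small sketch (the ICD path below is the usual distro install location, adjust as needed): point the Vulkan loader at its ICD manifest and any Vulkan app runs on the CPU:

VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/lvp_icd.x86_64.json vulkaninfo | grep -i deviceName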
April 07, 2022
Since Kopper got merged today upstream I wanted to write a little about it as I think the value it brings can be unclear for the uninitiated.
Adam Jackson in our graphics team has been working for the last months, together with other community members like Mike Blumenkrantz, on implementing Kopper. For those unaware, Zink is an OpenGL implementation running on top of Vulkan, and Kopper is the layer that allows you to translate OpenGL and GLX window handling to Vulkan WSI handling. This means that you can get full OpenGL support even if your GPU only has a Vulkan driver available, and it also means you can for instance run GNOME on top of this stack thanks to the addition of Kopper to Zink.
During the lifecycle of the soon-to-be-released Fedora Workstation 36 we expect to allow you to turn on doing OpenGL using Kopper and Zink as an experimental feature, once we update Fedora 36 to Mesa 22.1.
So you might ask why would I care about this as an end user? Well initially you probably will not care much, but over time it is likely that GPU makers will eventually stop developing native OpenGL drivers and just focus on their Vulkan drivers. At that point Zink and Kopper provides you with a backwards compatibility solution for your OpenGL applications. And for Linux distributions it will also at some point help reduce the amount of code we need to ship and maintain significantly as we can just rely on Zink and Kopper everywhere which of course reduces the workload for maintainers.
This is not going to be an overnight transition though; Zink and Kopper will need some time to stabilize and further improve performance. At the moment performance is generally a bit slower than the native drivers, although we have seen some examples of games which actually got better performance with specific driver combinations, and over time we expect to see the negative performance delta shrink. The delta is unlikely to ever fully go away due to the cost of translating between the two APIs, but on the other side we are going to be in a situation in a few years where all current/new applications use Vulkan natively (or through Proton), and thus the stuff that relies on OpenGL will be older software, so combined with faster GPUs you should still get more than good enough performance. And at that point Zink will be a lifesaver for your old OpenGL based applications and games.
April 06, 2022
By the time you read this, Kopper will have landed. This means a number of things have changed:
- MESA_LOADER_DRIVER_OVERRIDE=zink will work for all drivers

In particular, lots of cases of garbled/flickering rendering (I’m looking at you, Supertuxkart on ANV) will now be perfectly smooth and without issue.
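Trying the new path is as simple as overriding Mesa’s loader, with glxgears standing in for any GL app:

MESA_LOADER_DRIVER_OVERRIDE=zink glxgears    # run the GL app on zink, which now sits on top of your Vulkan driver via Kopper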
Also there’s no swapinterval control yet, so X11 clients will have no choice but to churn out the maximum amount of FPS possible at all times.
You (probably?) aren’t going to be able to run a compositor on zink just yet, but it’s on the 22.1 TODO list.
Big thanks to Adam Jackson for carrying this project on his back.
April 05, 2022
Apparently, in some parts of this world, the /usr/-merge transition is still ongoing. Let's take the opportunity to have a look at one specific way to take benefit of the /usr/-merge (and associated work) IRL.
I develop system-level software as you might know. Oftentimes I want to run my development code on my PC but be reasonably sure it cannot destroy or otherwise negatively affect my host system. Now I could set up a container tree for that, and boot into that. But often I am too lazy for that; I don't want to bother with a slow package manager setting up a new OS tree for me. So here's what I often do instead — and this only works because of the /usr/-merge.
I run a command like the following (without any preparatory work):
systemd-nspawn \
--directory=/ \
--volatile=yes \
-U \
--set-credential=passwd.hashed-password.root:$(mkpasswd mysecret) \
--set-credential=firstboot.locale:C.UTF-8 \
--bind-user=lennart \
-b
And then I very quickly get a login prompt on a container that runs the exact same software as my host — but is also isolated from the host. I do not need to prepare any separate OS tree or anything else. It just works. And my host user lennart is just there, ready for me to log into.
So here's what these systemd-nspawn options specifically do:
- --directory=/ tells systemd-nspawn to run off the host OS' file hierarchy. That smells like danger of course, running two OS instances off the same directory hierarchy. But don't be scared, because:

- --volatile=yes enables volatile mode. Specifically this means that what we configured with --directory=/ as the root file system is slightly rearranged. Instead of mounting that tree as it is, we'll mount a tmpfs instance as the actual root file system, and then mount the /usr/ subdirectory of the specified hierarchy into the /usr/ subdirectory of the container file hierarchy in read-only fashion – and only that directory. So now we have a container directory tree that is basically empty, but imports all host OS binaries and libraries into its /usr/ tree. All software installed on the host is also available in the container with no manual work. This mechanism only works because on /usr/-merged OSes vendor resources are monopolized at a single place: /usr/. It's sufficient to share that one directory with the container to get a second instance of the host OS running. Note that this means /etc/ and /var/ will be entirely empty initially when this second system boots up. Thankfully, forward-looking distributions (such as Fedora) have adopted systemd-tmpfiles and systemd-sysusers quite pervasively, so that system users and files/directories required for operation are created automatically should they be missing. Thus, even though at boot the mentioned directories are initially empty, once the system is booted up they are sufficiently populated for things to just work.

- -U means we'll enable user namespacing, in fully automatic mode. This does three things: it picks a free host UID range dynamically for the container, then sets up user namespacing for the container processes, mapping that host UID range to UIDs 0…65534 in the container. It then sets up a similar UID-mapped mount on the /usr/ tree of the container. Net effect: file ownerships as set on the host OS tree appear as if they belong to the very same users inside of the container environment, except that we use user namespacing for everything, and thus the users are actually neatly isolated from the host.

- --set-credential=passwd.hashed-password.root:$(mkpasswd mysecret) passes a credential to the container. Credentials are bits of data that you can pass to systemd services and whole systems. They are actually awesome concepts (e.g. they support TPM2 authentication/encryption that just works!) but I am not going to go into details around that, given it's off-topic in this specific scenario. Here we just take benefit of the fact that systemd-sysusers looks for a credential called passwd.hashed-password.root to initialize the root password of the system from. We set it to mysecret. This means once the system is booted up we can log in as root with the supplied password. Yay. (Remember, /etc/ is initially empty on this container, and thus also carries no /etc/passwd or /etc/shadow, and thus has no root user record, and thus no root password.) mkpasswd is a tool that converts a plain text password into a UNIX hashed password, which is what this specific credential expects.

- Similarly, --set-credential=firstboot.locale:C.UTF-8 tells the systemd-firstboot service in the container to initialize /etc/locale.conf with this locale.

- --bind-user=lennart binds the host user lennart into the container, also as user lennart. This does two things: it mounts the host user's home directory into the container. It also copies a minimal user record of the specified user into the container, which nss-systemd then picks up and includes in the regular user database. This means that once the container is booted up I can log in as lennart with my regular password, and once I am logged in I will see my regular host home directory, and can make changes to it. Yippieh! (This does a couple more things, such as UID mapping and so on, but let's not get lost in too many details.)
So, if I run this, I will very quickly get a login prompt, where I can log in as my regular user. I have full access to my host home directory, but otherwise everything is nicely isolated from the host, and changes outside of the home directory are either prohibited or volatile, i.e. they go to a tmpfs instance whose lifetime is bound to the container's lifetime: when I shut down the container I just started, any changes outside of my user's home directory are lost.
Note that while here I use --volatile=yes in combination with --directory=/, you can actually use it on any OS hierarchy, i.e. just about any directory that contains OS binaries.
Similarly, the --bind-user= stuff works with any OS hierarchy too (but do note that only systemd 249 and newer will pick up the user records passed to the container that way, i.e. this requires at least v249 both on the host and in the container to work).
Or in short: the possibilities are endless!
For this all to work, you need:
- A recent kernel (5.15 should suffice, as it brings UID-mapped mounts for the most common file systems, so that -U and --bind-user= can work well).
- A recent systemd (249 should suffice, which brings --bind-user=, and a -U switch backed by UID-mapped mounts).
- A distribution that adopted the /usr/-merge, systemd-tmpfiles and systemd-sysusers, so that the directory hierarchy and user databases are automatically populated when empty at boot. (Fedora 35 should suffice.)
While a lot of today's software actually works well out of the box on systems that come up with an unpopulated /etc/ and /var/, and either falls back to reasonable built-in defaults or deploys systemd-tmpfiles to create what is missing, things aren't perfect: some software typically installed on desktop OSes will fail to start when invoked in such a container, and be visible as ugly failed services, but that won't stop me from logging in and using the system for what I want to use it for. It would be excellent to get that fixed, though. This can either be fixed in the relevant software upstream (i.e. if opening your configuration file fails with ENOENT, then just fall back to reasonable defaults), or in the distribution packaging (i.e. add a tmpfiles.d/ file that copies or symlinks in skeleton configuration from /usr/share/factory/etc/ via the C or L line types).
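For illustration, a minimal sketch of that packaging-side fix (the package and file names are invented for the example): a tmpfiles.d drop-in that copies skeleton configuration from the factory tree when the file is missing:

sudo tee /usr/lib/tmpfiles.d/mytool.conf >/dev/null <<'EOF'
# Copy skeleton configuration from the factory tree if /etc/mytool.conf is missing
C /etc/mytool.conf - - - - /usr/share/factory/etc/mytool.conf
EOF
sudo systemd-tmpfiles --create    # normally run at boot; creates the missing file from the factory copy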
And then there's certain software dealing with hardware management and similar that simply cannot work reasonably in a container (as device APIs on Linux are generally not virtualized for containers). It would be excellent if software like that were updated to carry ConditionVirtualization=!container or ConditionPathIsReadWrite=/sys conditionalization in their unit files, so that it is automatically – cleanly – skipped when executed in such a container environment.
And that's all for now.
March 30, 2022
Do you want to start a career in open-source? Do you want to learn amazing skills while getting paid? Keep reading!
Igalia has a grant program that gives students with a background in Computer Science, Information Technology and Free Software their first exposure to the professional world, working hand in hand with Igalia programmers and learning with them. It is called Igalia Coding Experience.
While this experience is open for everyone, Igalia expressly invites women (both cis and trans), trans men, and genderqueer people to apply. The Coding Experience program gives preference to applications coming from underrepresented groups in our industry.
You can apply to any of the offered grants this year: Web Standards, WebKit, Chromium, Compilers and Graphics.
In the case of Graphics, the student will have the opportunity to deal with the Linux DRM subsystem. Specifically, the student will improve the test coverage of DRM drivers through IGT, a testing framework designed for this purpose. This includes learning how to contribute to the Linux kernel/DRM, interacting with the DRI-devel community, understanding DRM core functionality, and increasing the test coverage of the IGT tool.
The conditions of our Coding Experience program are:
The submission period goes from March 16th until April 30th. Students will be selected in May. We will work with the student to arrange a suitable starting date during 2022, from June onwards, and finishing on a date to be agreed that suits their schedule.
The popular Google Summer of Code is another option for students. This year, the X.Org Foundation participates as an open source organization. We have some proposed ideas, but you can propose any project idea as well.
The timeline for proposals is from April 4th to April 19th. However, you should contact us beforehand in order to discuss your ideas with potential mentors.
GSoC gives a stipend to students too (from 1,500 to 6,000 USD depending on the size of the project and your location). The hours to complete the project vary from 175 to 350 depending on the size of the project as well.
Of course, this is a remote-friendly program, so any student in the world can participate in it.
Outreachy is another internship program for applicants from around the world who face under-representation, systemic bias or discrimination in the technology industry of their country. Outreachy supports diversity in free and open source software!
Outreachy internships are remote, paid ($7,000), and last three months. Outreachy internships run from May to August and December to March. Applications open in January and August.
The projects listed cover many areas of the open-source software stack: from kernel to distributions work. Please check current proposals to find anything that is interesting for you!
X.Org Foundation voted in 2008 to initiate a program known as the X.Org Endless Vacation of Code (EVoC) program, in order to give more flexibility to students: an EVoC mentorship can be initiated at any time during the calendar year, the Board can fund as many of these mentorships as it sees fit.
Like the other programs, EVoC is remote-friendly as well. The stipend goes as follows: an initial payment of 500 USD and two further payments of 2,250 USD upon completion of project milestones. EVoC does not set limits in hours, but there are some requirements and steps to do before applying. Please read X.Org Endless Vacation of Code website to learn more.
As you can see, there are many ways to enter the open source community. Although I focused on the programs related to the open source graphics stack, there are many more.
With all of these possibilities (and many more, including internships at companies), I hope that you can apply and that the experience will encourage you to start a career in the open-source community.
Happy hacking!
March 29, 2022
Today marks (at last) the release of some cool extensions I’ve had the pleasure of working on:
This extension revolutionizes how PSOs can be managed by the application, and it’s the first step towards solving the dreaded stuttering that zink suffers from when attempting to play any sort of game. There’s definitely going to be more posts from me on this in the future.
Currently, zink has to do some awfulness internally to replicate the awfulness of GL_PRIMITIVES_GENERATED. With this extension, at least some of that awfulness can be pushed down to the driver. And the spec, of course. You can’t scrub this filth out of your soul.
The mesa community being awesome as it is, support for these extensions is already underway:
But obviously Lavapipe, being the greatest of all drivers, will already have support landed by the time you read this post.
Let the bug reports flow!
March 22, 2022
OpenSSH has this very nice setting, VerifyHostKeyDNS, which, when enabled, will pull SSH host keys from DNS, and you no longer need to either trust on first use or copy host keys around out of band. Naturally, trusting unsecured DNS is a bit scary, so this requires the record to be signed using DNSSEC. This has worked for a long time, but then broke, seemingly out of the blue. Running ssh -vvv gave output similar to
debug1: found 4 insecure fingerprints in DNS
debug3: verify_host_key_dns: checking SSHFP type 1 fptype 2
debug3: verify_host_key_dns: checking SSHFP type 4 fptype 2
debug1: verify_host_key_dns: matched SSHFP type 4 fptype 2
debug3: verify_host_key_dns: checking SSHFP type 4 fptype 1
debug1: verify_host_key_dns: matched SSHFP type 4 fptype 1
debug3: verify_host_key_dns: checking SSHFP type 1 fptype 1
debug1: matching host key fingerprint found in DNS
even though the zone was signed, the resolver was checking the signature, and I had even checked that the DNS response had the AD bit set.
The fix was to add options trust-ad to /etc/resolv.conf. Without this, glibc will discard the AD bit from any upstream DNS servers. Note that you should only add this if you actually have a trusted DNS resolver. I run unbound on localhost, so if somebody can do a man-in-the-middle attack on that traffic, I have other problems.
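For completeness, a quick sketch of the pieces involved (host names are placeholders): publish SSHFP records for the host, check that your resolver returns them with the AD flag, and turn the option on:

ssh-keygen -r host.example.org                 # run on the server; prints the SSHFP records to put in the zone
dig +dnssec host.example.org SSHFP             # the reply should have the "ad" flag set in its header
ssh -o VerifyHostKeyDNS=yes host.example.org   # or set VerifyHostKeyDNS yes in ~/.ssh/config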
March 18, 2022
Anyone who knows me knows that I hate cardio.
Full stop.
I’m not picking up and putting down all these heavy weights just so I can go for a jog afterwards and lose all my gains.
Similarly, I’m not trying to set a world record for speed-writing code. This stuff takes time, and it can’t be rushed.
Unless…
Today we’re a Lavapipe blog.
Lavapipe is, of course, the software implementation of Vulkan that ships with Mesa, originally braindumped directly into the repo by graphics god and part-time Twitter executive, Dave Airlie. For a long time, the Lavapipe meme has been “Try it on Lavapipe—it’s not conformant, but it still works pretty good haha” and we’ve all had a good chuckle at the idea that anything not officially signed and stamped by Khronos could ever draw a single triangle properly.
But, pending a single MR that fixes the four outstanding failures for Vulkan 1.2 conformance, as of last week, Lavapipe passes 100% of conformance tests. Thus, pending a merge and a Mesa bugfix release, Lavapipe will achieve official conformance.
And then we’ll have a new meme: Vulkan 1.3 When?
As some have noticed, I’ve been quietly (very, very, very, very, very, very, very, very, very, very, very, very quietly) implementing a number of features for Lavapipe over the past week.
But why?
Khronos-fanatics will immediately recognize that these features are all part of Vulkan 1.3.
Which Lavapipe also now supports, pending more merges which I expect to happen early next week.
This is what a sprint looks like.
March 17, 2022
We had a busy 2021 within GNU/Linux graphics stack at Igalia.
Would you like to know what we have done last year? Keep reading!
Last year both the OpenGL and the Vulkan drivers received a lot of love. For example, we implemented several optimizations, such as improvements in the v3dv pipeline cache. In this blog post, Alejandro Piñeiro presents how we improved the v3dv pipeline cache times by reducing the two cache lookups done previously to only one, and shows some numbers on both a synthetic test (a modified CTS test) and some games.
We also made performance improvements in the v3d compilers for OpenGL and Vulkan. Iago Toral explains our work on optimizing the backend compiler with techniques such as improving memory lookup efficiency, reducing instruction counts, instruction packing, uniform handling, among others. There are numbers that show framerate improvements from ~6% to ~62% on different games/demos.
Framerate improvement after optimization (in %). Taken from Iago’s blogpost
Of course, there was work related to feature implementation. This blog post from Iago lists some Vulkan extensions implemented in the v3dv driver in 2021… Although not all the implemented extensions are listed there, you can see the driver is quickly catching up in its Vulkan extension support.
My colleague Juan A. Suárez implemented performance counters in the v3d driver (an OpenGL driver) which required modifications in the kernel and in the Mesa driver. More info in his blog post.
There was more work in other areas done in 2021 too, like the improved support for RenderDoc and GFXReconstruct. And not to forget the kernel contributions to the DRM driver done by Melissa Wen, who not only worked on developing features for it, but also reviewed all the patches that came from the community.
However, the biggest milestone for the v3dv driver was becoming Vulkan 1.1 conformant in the last quarter of 2021. That was just one year after becoming Vulkan 1.0 conformant. As you can imagine, that implied a lot of work implementing features, fixing bugs and, of course, improving the driver in many different ways. Great job folks!
If you want to know more about all the work done on these drivers during 2021, there is an awesome talk from my colleague Alejandro Piñeiro at FOSDEM 2022: “v3dv: Status Update for Open Source Vulkan Driver for Raspberry Pi 4”, and another one from my colleague Iago Toral at XDC 2021: “Raspberry Pi Vulkan driver update”. Below you can find the video recordings of both talks.
FOSDEM 2022 talk: “v3dv: Status Update for Open Source Vulkan Driver for Raspberry Pi 4”
XDC 2021 talk: “Raspberry Pi Vulkan driver update”
Photo of the Qualcomm® Robotics RB3 Platform embedded board that I use for Turnip development.
There were also several achievements by Igalians on both the Freedreno and Turnip drivers. These are reverse-engineered open source drivers for Qualcomm Adreno GPUs: Freedreno for OpenGL and Turnip for Vulkan.
Starting in 2021, my colleague Danylo Piliaiev helped with implementing the missing bits in Freedreno for supporting OpenGL 3.3 on Adreno 6xx GPUs. His blog post explains his work, such as implementing ARB_blend_func_extended, ARB_shader_stencil_export and fixing a variety of CTS test failures.
Related to this, my colleague Guilherme G. Piccoli worked on porting a recent kernel to one of the boards we use for Freedreno development: the Inforce 6640. He did an awesome job getting a 5.14 kernel booting on that embedded board. If you want to know more, please read the blog post he wrote explaining all the issues he found and how he fixed them!
Picture of the Inforce 6640 board that Guilherme used for his development. Image from his blog post.
However the biggest chunk of work was done in Turnip driver. We have implemented a long list of Vulkan extensions: VK_KHR_buffer_device_address, VK_KHR_depth_stencil_resolve, VK_EXT_image_view_min_lod, VK_KHR_spirv_1_4, VK_EXT_descriptor_indexing, VK_KHR_timeline_semaphore, VK_KHR_16bit_storage, VK_KHR_shader_float16, VK_KHR_uniform_buffer_standard_layout, VK_EXT_extended_dynamic_state, VK_KHR_pipeline_executable_properties, VK_VALVE_mutable_descriptor_type, VK_KHR_vulkan_memory_model and many others. Danylo Piliaiev and Hyunjun Ko are terrific developers!
But not all our work was related to feature development. For example, I implemented the Low-Resolution Z-buffer (LRZ) HW optimization, and Danylo fixed a long list of rendering bugs that happened in real-world applications (blog post 1, blog post 2), like D3D games run on Vulkan (thanks to DXVK and VKD3D), and instrumented the backend compiler to dump register values, among many other fixes and optimizations.
However, the biggest achievement was getting Vulkan 1.1 conformance for Turnip. Danylo wrote a blog post mentioning all the work we did to achieve that this year.
If you want to know more, don’t miss this FOSDEM 2022 talk given by my colleague Hyunjun Ko called “The status of turnip driver development. What happened in 2021 and will happen in 2022 for turnip.”. Video below.
FOSDEM 2022 talk: “The status of turnip driver development. What happened in 2021 and will happen in 2022 for turnip.”
Our graphics work doesn’t only cover driver development; we also participate in the Khronos Group as Vulkan Conformance Test Suite developers and even as spec contributors.
My colleague Ricardo Garcia is a very productive developer. He worked on implementing tests for Vulkan Ray Tracing extensions (read his blog post about ray tracing for more info about this big Vulkan feature), implemented tests for a long list of Vulkan extensions like VK_KHR_present_id and VK_KHR_present_wait, VK_EXT_multi_draw (watch his talk at XDC 2021), VK_EXT_border_color_swizzle (watch his talk at FOSDEM 2022) among many others. In many of these extensions, he contributed to their respective specifications in a significant way (just search for his name in the Vulkan spec!).
XDC 2021 talk: “Quick Overview of VK_EXT_multi_draw”
FOSDEM 2022 talk: “Fun with border colors in Vulkan. An overview of the story behind VK_EXT_border_color_swizzle”
Similarly, I participated modestly in this effort by developing tests for some extensions like VK_EXT_image_view_min_lod (blog post). Of course, both Ricardo and I implemented many new CTS tests by adding coverage to existing ones, fixed lots of bugs in existing tests and reported dozens of driver issues to the respective Mesa developers.
Not only that, both Ricardo and I appeared as Vulkan 1.3 spec contributors.
Another interesting piece of work we started in 2021 is Vulkan Video support in GStreamer. My colleague Víctor Jaquez presented the Vulkan Video extension at XDC 2021 and soon after he started working on Vulkan Video’s h264 decoder support. You can find more information in his blog post, or by watching his talk below:
FOSDEM 2022 talk: “Video decoding in Vulkan: VK_KHR_video_queue/decode APIs”
Before I leave this section, don’t forget to take a look at Ricardo’s blogpost on debugPrintfEXT feature. If you are a Graphics developer, you will find this feature very interesting for debugging issues in your applications!
Along those lines, Danylo presented at XDC 2021 a talk about dissecting and fixing Vulkan rendering issues in drivers with RenderDoc. Very useful for driver developers! Watch the talk below:
XDC 2021 talk: “Dissecting Vulkan rendering issues in drivers with RenderDoc”
To finalize this blog post, remember that you now have vkrunner (the Vulkan shader tester created by Igalia) available for RPM-based GNU/Linux distributions. In case you are working with embedded systems, maybe my blog post about cross-compiling with icecream will help to speed up your builds.
This is just a summary of the highlights we did last year. I’m sorry if I am missing more work from my colleagues.
![]() |
|
March 16, 2022 | |
![]() |
Those of you in-the-know are well aware that Zink has always had a crippling addiction to seamless cubemaps. Specifically, Vulkan doesn’t support non-seamless cubemaps since nobody wants those anymore, but this is the default mode of sampling for OpenGL.
Thus, it is impossible for Zink to pass GL 4.6 conformance until this issue is resolved.
But what does this even mean?
As veterans of intro to geometry courses all know*, a cube is a 3D shape that has six identically-sized sides called “faces”. In computer graphics, each of these faces has its own content that can be read and written discretely.
When interpolating data from a cube-type texture, there are two methods:
- seamless: when a sample lands on the edge of a face, texels from the adjacent face are used in the filtering
- non-seamless: filtering never crosses a face boundary; the coordinates are clamped/wrapped so that only texels from the selected face are used
This effectively results in Zink interpolating across the boundaries of cube faces when it should instead be clamping/wrapping pixel data to a single face.
But who cares about all that math nonsense when the result is that Zink is still failing CTS cases?
*Disclosure: I have been advised by my lawyer to state on the record that I have never taken an intro to geometry course and have instead copied this entire blog post off StackOverflow.
In order to replicate this basic OpenGL behavior, a substantial amount of code is required—most of it terrible.
The first step is to determine when a cube should be sampled as non-seamless. OpenGL helpfully has only one extension (plus this other extension) which control seamless access of cubemaps, so as long as that one state (plus the other state) isn’t enabled, a cube shouldn’t be interpreted seamlessly.
With this done, what happens at coordinates that lie at the edge of a face? The OpenGL wrap enum covers this. For the purposes of this blog post, only two wrap modes exist:
- clamp-type modes: the coordinate is clamped to the selected face (coord = extent or coord = 0)
- repeat-type modes: the coordinate wraps around within the face (coord %= extent)
In short, this requires shader rewrites to handle coordinate clamping, then wrapping. Since it’s not going to be desirable to have a different shader variant for every time the wrap mode changes, this means loading the parameters from a UBO. Since it’s further not going to be desirable to have shader variants for each per-texture seamless/non-seamless cube combination, this means also making the rewrite handle the no-op case of continuing to use the original, seamless coordinates after doing all the calculations for the non-seamless case.
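To make that a bit more concrete, here is a minimal sketch in plain C of the per-coordinate fixup the rewrite has to emit (hypothetical names, not the actual Mesa pass); the wrap mode and the non-seamless flag stand in for the values the shader would read from that UBO, and the early return covers the no-op seamless case:

#include <math.h>
#include <stdbool.h>

enum wrap_mode { WRAP_CLAMP, WRAP_REPEAT };

/* sketch: clamp or wrap a face-local coordinate for a non-seamless cube,
 * or pass it through untouched when the cube is sampled seamlessly */
static float
fixup_cube_coord(float coord, float extent, enum wrap_mode wrap, bool nonseamless)
{
   if (!nonseamless)
      return coord;
   switch (wrap) {
   case WRAP_CLAMP:
      return coord < 0.0f ? 0.0f : (coord > extent ? extent : coord);
   case WRAP_REPEAT:
      /* double fmodf so negative coordinates also wrap into [0, extent) */
      return fmodf(fmodf(coord, extent) + extent, extent);
   }
   return coord;
}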
Worst of all, this has to be plumbed through the Rube Goldberg machine that is Mesa.
It was terrible, and it continues to be terrible.
If I were another blogger, I would probably take this opportunity to flex my Calculus credentials by putting all kinds of math here, but nobody really wants to read that, and the hell if I know how to make markdown do that whiteboard thing so I can doodle in fancy formulas or whatever from the spec.
Instead, you can read the merge request if you’re that deeply invested in cube mechanics.
![]() |
|
March 09, 2022 | |
![]() |
Subgroup operations or wave intrinsics, such as reducing a value across the threads of a shader subgroup or wave, were introduced in GPU programming languages a while ago. They communicate with other threads of the same wave, for example to exchange the input values of a reduction, but not necessarily with all of them if there is divergent control flow.
In LLVM, we call such operations convergent. Unfortunately, LLVM does not define how the set of communicating threads in convergent operations -- the set of converged threads -- is affected by control flow.
If you're used to thinking in terms of structured control flow, this may seem trivial. Obviously, there is a tree of control flow constructs: loops, if-statements, and perhaps a few others depending on the language. Two threads are converged in the body of a child construct if and only if both execute that body and they are converged in the parent. Throw in some simple and intuitive rules about loop counters and early exits (nested return, break and continue, that sort of thing) and you're done.
In an unstructured control flow graph, the answer is not obvious at all. I gave a presentation at the 2020 LLVM Developers' Meeting that explains some of the challenges as well as a solution proposal that involves adding convergence control tokens to the IR.
Very briefly, convergent operations in the proposal use a token variable that is defined by a convergence control intrinsic. Two dynamic instances of the same static convergent operation from two different threads are converged if and only if the dynamic instances of the control intrinsic producing the used token values were converged.
(The published draft of the proposal talks of multiple threads executing the same dynamic instance. I have since been convinced that it's easier to teach this matter if we instead always give every thread its own dynamic instances and talk about a convergence equivalence relation between dynamic instances. This doesn't change the resulting semantics.)
The draft has three such control intrinsics: anchor, entry, and (loop) heart. Of particular interest here is the heart. For the most common and intuitive use cases, a heart intrinsic is placed in the header of natural loops. The token it defines is used by convergent operations in the loop. The heart intrinsic itself also uses a token that is defined outside the loop: either by another heart in the case of nested loops, or by an anchor or entry. The heart combines two intuitive behaviors:
- on the first encounter, threads are converged at the heart if and only if they were converged at the definition of the token that the heart uses (and both reach the loop); and
- on subsequent encounters, threads are converged at the heart if and only if they were converged during their previous encounter -- that is, convergence follows the loop iteration count.
Viewed from this angle, how about we define a weaker version of these rules that lies somewhere between an anchor and a loop heart? We could call it a "light heart", though I will stick with "iterating anchor". The iterating anchor defines a token but has no arguments. Like for the anchor, the set of converged threads is implementation-defined -- when the iterating anchor is first encountered. When threads encounter the iterating anchor again without leaving the dominance region of its containing basic block, they are converged if and only if they were converged during their previous encounter of the iterating anchor.
The notion of an iterating anchor came up when discussing the convergence behaviors that can be guaranteed for natural loops. Is it possible to guarantee that natural loops always behave in the natural way -- according to their loop counter -- when it comes to convergence?
Naively, this should be possible: just put hearts into loop headers! Unfortunately, that's not so straightforward when multiple natural loops are contained in an irreducible loop:
Hearts in A and C must refer to a token defined outside the loops; that is, a token defined in E. The resulting program is ill-formed because it has a closed path that goes through two hearts that use the same token, but the path does not go through the definition of that token. This well-formedness rule exists because the rules about heart semantics are unsatisfiable if the rule is broken.
The underlying intuitive issue is that if the branch at E is divergent in a typical implementation, the wave (or subgroup) must choose whether A or C is executed first. Neither choice works. The heart in A indicates that (among the threads that are converged in E) all threads that visit A (whether immediately or via C) must be converged during their first visit of A. But if the wave executes A first, then threads which branch directly from E to A cannot be converged with those that first branch to C. The opposite conflict exists if the wave executes C first.
If we replace the hearts in A and C by iterating anchors, this problem goes away because the convergence during the initial visit of each block is implementation-defined. In practice, it should fall out of which of the blocks the implementation decides to execute first.
So it seems that iterating anchors can fill a gap in the expressiveness of the convergence control design. But are they really a sound addition? There are two main questions:
Consider the following simple CFG with an iterating anchor in A and a heart in B that refers back to a token defined in E:
Now consider two threads that are initially converged with execution traces:
One could try to resolve the paradox by saying that the threads cannot be converged in A at all, but this would mean that the threads must diverge before a divergent branch occurs. That seems unreasonable, since typical implementations want to avoid divergence as long as control flow is uniform.
The example arguably breaks the spirit of the rule about convergence regions from the draft proposal linked above, and so a minor change to the definition of convergence region may be used to exclude it.
What if the CFG instead looks as follows, which does not break any rules about convergence regions:
For the same execution traces, the heart rule again implies that the threads must be converged in B. The convergence of the first dynamic instances of A is technically implementation-defined, but we'd expect most implementations to be converged there.
The second dynamic instances of A cannot be converged due to the convergence of the dynamic instances of B. That's okay: the second dynamic instance of A in thread 2 is a re-entry into the dominance region of A, and so its convergence is unrelated to any convergence of earlier dynamic instances of A.
Unfortunately, we still cannot allow this second example. A program transform may find that the conditional branch in E is constant and the edge from E to B is dead. Removing that edge brings us back to the previous example which is ill-formed. However, a transform which removes the dead edge would not normally inspect the blocks A and B or their dominance relation in detail. The program becomes ill-formed by spooky action at a distance.
The following static rule forbids both example CFGs: if there is a closed path through a heart and an iterating anchor, but not through the definition of the token that the heart uses, then the heart must dominate the iterating anchor.
There is at least one other issue of spooky action at a distance. If the iterating anchor is not the first (non-phi) instruction of its basic block, then it may be preceded by a function call in the same block. The callee may contain control flow that ends up being inlined. Back edges that previously pointed at the block containing the iterating anchor will then point to a different block, which changes the behavior quite drastically. Essentially, the iterating anchor is reduced to a plain anchor.
What can we do about that? It's tempting to decree that an iterating anchor must always be the first (non-phi) instruction of a basic block. Unfortunately, this is not easily done in LLVM in the face of general transforms that might sink instructions or merge basic blocks.
We could chew through some other ideas for making iterating anchors work, but that turns out to be unnecessary. The desired behavior of iterating anchors can be obtained by inserting preheader blocks. The initial example of two natural loops contained in an irreducible loop becomes:
Place anchors in Ap and Cp and hearts in A and C that use the token defined by their respective dominating anchor. Convergence at the anchors is implementation-defined, but relative to this initial convergence at the anchor, convergence inside the natural loops headed by A and C behaves in the natural way, based on a virtual loop counter. The transform of inserting an anchor in the preheader is easily generalized.
To sum it up: We've concluded that defining an "iterating anchor" convergence control intrinsic is problematic, but luckily also unnecessary. The control intrinsics defined in the original proposal are sufficient. I hope that the discussion that led to those conclusions helps illustrate some aspects of the convergence control proposal for LLVM as well as the goals and principles that drove it.
![]() |
|
March 07, 2022 | |
![]() |
![]() |
|
March 04, 2022 | |
![]() |
A quick reminder: libei is the library for emulated input. It comes as a pair of C libraries, libei for the client side and libeis for the server side.
libei has been sitting mostly untouched since the last status update. There are two use-cases we need to solve for input emulation in Wayland - the ability to emulate input (think xdotool, or a Synergy/Barrier/InputLeap client) and the ability to capture input (think a Synergy/Barrier/InputLeap server). The latter effectively blocked development in libei [1]: until that use-case was sorted there wasn't much point investing too much into libei - after all, it might get thrown out as a bad idea. And epiphanies were as elusive as toilet paper and RATs, so nothing much got done. This changed about a week or two ago when the required lightbulb finally arrived, pre-lit from the factory.
So, the solution to the input capturing use-case is going to be a so-called "passive context" for libei. In the traditional [2] "active context" approach for libei we have the EIS implementation in the compositor and a client using libei to connect to that. The compositor sets up a seat or more, then some devices within that seat that typically represent the available screens. libei then sends events through these devices, causing input to appear in the compositor, which moves the cursor around. In a typical and simple use-case you'd get a 1920x1080 absolute pointer device and a keyboard with a $layout keymap; libei then sends events to position the cursor and/or happily type away on-screen.
In the "passive context" <deja-vu> approach for libei we have the EIS implementation in the compositor and a client using libei to connect to that. The compositor sets up a seat or more, then some devices within that seat </deja-vu> that typically represent the physical devices connected to the host computer. libei then receives events from these devices, causing input to be generated in the libei client. In a typical and simple use-case you'd get a relative pointer device and a keyboard device with a $layout keymap, the compositor then sends events matching the relative input of the connected mouse or touchpad.
The two notable differences are thus: events flow from EIS to libei and the devices don't represent the screen but rather the physical [3] input devices.
This changes libei from a library for emulated input to an input event transport layer between two processes. It operates on a much higher level than e.g. evdev or HID, and with more contextual information (seats, devices are logically abstracted, etc.). And of course, the EIS implementation is always in control of the events, regardless of which direction they flow. A compositor can implement an event filter or designate a key to break the connection to the libei client. In pseudocode, the compositor's input event processing function will look like this:
function handle_input_events():
    real_events = libinput.get_events()
    for e in real_events:
        if input_capture_active:
            send_event_to_passive_libei_client(e)
        else:
            process_event(e)
    emulated_events = eis.get_events_from_active_clients()
    for e in emulated_events:
        process_event(e)

Not shown here are the various appropriate filters and conversions in between (e.g. all relative events from libinput devices would likely be sent through the single relative device exposed on the EIS context). Again, the compositor is in control so it would be trivial to implement e.g. capturing of the touchpad only but not the mouse.
In the current design, a libei context can only be active or passive, not both. The EIS context is both, it's up to the implementation to disconnect active or passive clients if it doesn't support those.
Notably, the above only caters for the transport of input events; it doesn't actually make any decision on when to capture events. This is handled by the CaptureInput XDG Desktop Portal [4]. The idea here is that an application like a Synergy/Barrier/InputLeap server connects to the CaptureInput portal and requests a CaptureInput session. In that session it can define pointer barriers (left edge, right edge, etc.) and, in the future, maybe other triggers. In return it gets a libei socket that it can initialize a libei context from. When the compositor decides that the pointer barrier has been crossed, it re-routes the input events through the EIS context so they pop out in the application. Synergy/Barrier/InputLeap then converts that to the global position, passes it to the right remote Synergy/Barrier/InputLeap client and replays it there through an active libei context where it feeds into the local compositor.
Because the management of when to capture input is handled by the portal and the respective backends, it can be natively integrated into the UI. Because the actual input events are a direct flow between compositor and application, the latency should be minimal. Because it's a high-level event library, you don't need to care about hardware-specific details (unlike, say, the inputfd proposal from 2017). Because the negotiation of when to capture input is through the portal, the application itself can run inside a sandbox. And because libei only handles the transport layer, compositors that don't want to support sandboxes can set up their own negotiation protocol.
So overall, right now this seems like a workable solution.
[1] "blocked" is probably overstating it a bit but no-one else tried to push it forward, so..
[2] "traditional" is probably overstating it for a project that's barely out of alpha development
[3] "physical" is probably overstating it since it's likely to be a logical representation of the types of inputs, e.g. one relative device for all mice/touchpads/trackpoints
[4] "handled by" is probably overstating it since at the time of writing the portal is merely a draft of an XML file
Yeah, my b, I forgot this was a thing.
Fuck it though, I’m a professional, so I’m gonna pretend I didn’t just skip a month of blogs and get right back into it.
Gallivm is the nir/tgsi-to-llvm translation layer in Gallium that LLVMpipe (and thus Lavapipe) uses to generate the JIT functions which make triangles. It’s very old code in that it predates me knowing how triangles work, but that doesn’t mean it doesn’t have bugs.
And Gallivm bugs are the worst bugs.
For a long time, I’ve had SIGILL crashes on exactly one machine locally for the CTS glob dEQP-GLES31.functional.program_uniform.by*sampler2D_samplerCube*. These tests pass on everyone else’s machines, including CI.
Like I said, Gallivm bugs are the worst bugs.
How does one debug JIT code? GDB can’t be used, valgrind doesn’t work, and, despite what LLVM developers would tell you, building an assert-enabled LLVM doesn’t help at all in most cases here since that will only catch invalid behavior, not questionably valid behavior that very obviously produces invalid results.
So we enter the world of lp_build_print debugging. Much like standard printf debugging, the strategy here is to just lp_build_print_value or lp_build_printf("I hate this part of the shader too") our way to figuring out where in the shader the crash occurs.
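For illustration, a marker print is a one-liner like the following (a sketch only, assuming you are inside the gallivm sampling code where a struct gallivm_state *gallivm is in scope):

/* sketch: emit a runtime print into the JITed shader so we can see
 * how far execution gets before the SIGILL */
lp_build_printf(gallivm, "about to build the cube txl\n");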
Here’s an example shader from dEQP-GLES31.functional.program_uniform.by_pointer.render.basic_struct.sampler2D_samplerCube_vertex that crashes:
#version 310 es
in highp vec4 a_position;
out mediump float v_vtxOut;
struct structType
{
mediump sampler2D m0;
mediump samplerCube m1;
};
uniform structType u_var;
mediump float compare_float (mediump float a, mediump float b) { return abs(a - b) < 0.05 ? 1.0 : 0.0; }
mediump float compare_vec4 (mediump vec4 a, mediump vec4 b) { return compare_float(a.x, b.x)*compare_float(a.y, b.y)*compare_float(a.z, b.z)*compare_float(a.w, b.w); }
void main (void)
{
gl_Position = a_position;
v_vtxOut = 1.0;
v_vtxOut *= compare_vec4(texture(u_var.m0, vec2(0.0)), vec4(0.15, 0.52, 0.26, 0.35));
v_vtxOut *= compare_vec4(texture(u_var.m1, vec3(0.0)), vec4(0.88, 0.09, 0.30, 0.61));
}
Which, in llvmpipe NIR, is:
shader: MESA_SHADER_VERTEX
source_sha1: {0xcb00c93e, 0x64db3b0f, 0xf4764ad3, 0x12b69222, 0x7fb42437}
inputs: 1
outputs: 2
uniforms: 0
shared: 0
ray queries: 0
decl_var uniform INTERP_MODE_NONE sampler2D lower@u_var.m0 (0, 0, 0)
decl_var uniform INTERP_MODE_NONE samplerCube lower@u_var.m1 (0, 0, 1)
decl_function main (0 params)
impl main {
block block_0:
/* preds: */
vec1 32 ssa_0 = deref_var &a_position (shader_in vec4)
vec4 32 ssa_1 = intrinsic load_deref (ssa_0) (access=0)
vec1 16 ssa_2 = load_const (0xb0cd = -0.150024)
vec1 16 ssa_3 = load_const (0x2a66 = 0.049988)
vec1 16 ssa_4 = load_const (0xb829 = -0.520020)
vec1 16 ssa_5 = load_const (0xb429 = -0.260010)
vec1 16 ssa_6 = load_const (0xb59a = -0.350098)
vec1 16 ssa_7 = load_const (0xbb0a = -0.879883)
vec1 16 ssa_8 = load_const (0xadc3 = -0.090027)
vec1 16 ssa_9 = load_const (0xb4cd = -0.300049)
vec1 16 ssa_10 = load_const (0xb8e1 = -0.609863)
vec2 32 ssa_13 = load_const (0x00000000, 0x00000000) = (0.000000, 0.000000)
vec1 32 ssa_49 = load_const (0x00000000 = 0.000000)
vec4 16 ssa_14 = (float16)txl ssa_13 (coord), ssa_49 (lod), 0 (texture), 0 (sampler)
vec1 16 ssa_15 = fadd ssa_14.x, ssa_2
vec1 16 ssa_16 = fabs ssa_15
vec1 16 ssa_17 = fadd ssa_14.y, ssa_4
vec1 16 ssa_18 = fabs ssa_17
vec1 16 ssa_19 = fadd ssa_14.z, ssa_5
vec1 16 ssa_20 = fabs ssa_19
vec1 16 ssa_21 = fadd ssa_14.w, ssa_6
vec1 16 ssa_22 = fabs ssa_21
vec1 16 ssa_23 = fmax ssa_16, ssa_18
vec1 16 ssa_24 = fmax ssa_23, ssa_20
vec1 16 ssa_25 = fmax ssa_24, ssa_22
vec3 32 ssa_27 = load_const (0x00000000, 0x00000000, 0x00000000) = (0.000000, 0.000000, 0.000000)
vec1 32 ssa_50 = load_const (0x00000000 = 0.000000)
vec4 16 ssa_28 = (float16)txl ssa_27 (coord), ssa_50 (lod), 1 (texture), 1 (sampler)
vec1 16 ssa_29 = fadd ssa_28.x, ssa_7
vec1 16 ssa_30 = fabs ssa_29
vec1 16 ssa_31 = fadd ssa_28.y, ssa_8
vec1 16 ssa_32 = fabs ssa_31
vec1 16 ssa_33 = fadd ssa_28.z, ssa_9
vec1 16 ssa_34 = fabs ssa_33
vec1 16 ssa_35 = fadd ssa_28.w, ssa_10
vec1 16 ssa_36 = fabs ssa_35
vec1 16 ssa_37 = fmax ssa_30, ssa_32
vec1 16 ssa_38 = fmax ssa_37, ssa_34
vec1 16 ssa_39 = fmax ssa_38, ssa_36
vec1 16 ssa_40 = fmax ssa_25, ssa_39
vec1 32 ssa_41 = flt32 ssa_40, ssa_3
vec1 32 ssa_42 = b2f32 ssa_41
vec1 32 ssa_43 = deref_var &gl_Position (shader_out vec4)
intrinsic store_deref (ssa_43, ssa_1) (wrmask=xyzw /*15*/, access=0)
vec1 32 ssa_44 = deref_var &v_vtxOut (shader_out float)
intrinsic store_deref (ssa_44, ssa_42) (wrmask=x /*1*/, access=0)
/* succs: block_1 */
block block_1:
}
There’s two sample ops (txl), and since these tests only do simple texture() calls, it seems reasonable to assume that one of them is causing the crash. Sticking a lp_build_print_value on the texel values fetched by the sample operations will reveal whether the crash occurs before or after them.
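Something along these lines does the job (again just a sketch with hypothetical variable names; the real sampling code has the fetched texel LLVMValueRefs and the gallivm context in scope):

/* sketch: dump each fetched texel channel at shader runtime */
for (unsigned chan = 0; chan < 4; chan++)
   lp_build_print_value(gallivm, "texel", texel[chan]);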
What output does this yield?
Test case 'dEQP-GLES31.functional.program_uniform.by_pointer.render.basic_struct.sampler2D_samplerCube_vertex'..
texel 1.43279037e-322 6.95333598e-310 0 1.43279037e-322 1.08694442e-322 1.43279037e-322 1.08694442e-322 0
texel 1.43279037e-322 6.95333598e-310 0 1.43279037e-322 1.08694442e-322 1.43279037e-322 1.08694442e-322 0
texel 1.43279037e-322 6.95333598e-310 0 1.43279037e-322 1.08694442e-322 1.43279037e-322 1.08694442e-322 0
texel 1.43279037e-322 6.95333598e-310 0 1.43279037e-322 1.08694442e-322 1.43279037e-322 1.08694442e-322 0
[1] 3500332 illegal hardware instruction (core dumped)
Each txl op fetches four values, which means this is the result from the first instruction, but the second one isn’t reached before the crash. Unsurprisingly, this is also the cube sampling instruction, which makes sense given that all the crashes of this type I get are from cube sampling tests.
Now that it’s been determined the second txl is causing the crash, it’s reasonable to assume that the construction of that sampling op is the cause rather than the op itself, as proven by sticking some simple lp_build_printf("What am I doing with my life") calls in just before that op. Indeed, as the printfs confirm, I’m still questioning the life choices that led me to this point, so it’s now proven that the txl instruction itself is the problem.
Cube sampling has a lot of complex math involved for face selection, and I’ve spent a lot of time in there recently. My first guess was that the cube coordinates were bogus. Printing them yielded results:
Test case 'dEQP-GLES31.functional.program_uniform.by_pointer.render.basic_struct.sampler2D_samplerCube_vertex'..
texel 6.9008994e-310 0 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617
texel 6.9008994e-310 0 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617
texel 6.9008994e-310 0 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617
texel 6.9008994e-310 0 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617
cubecoords nan nan nan nan nan nan nan nan
cubecoords nan nan nan nan nan nan nan nan
These cube coords have more NaNs than a 1960s Batman TV series, so it looks like I was right in my hunch. Printing the cube S-face value next yields more NaNs. My printf search continued a couple more iterations until I wound up at this function:
static LLVMValueRef
lp_build_cube_imapos(struct lp_build_context *coord_bld, LLVMValueRef coord)
{
/* ima = +0.5 / abs(coord); */
LLVMValueRef posHalf = lp_build_const_vec(coord_bld->gallivm, coord_bld->type, 0.5);
LLVMValueRef absCoord = lp_build_abs(coord_bld, coord);
LLVMValueRef ima = lp_build_div(coord_bld, posHalf, absCoord);
return ima;
}
Immediately, all of us multiverse-brain engineers spot something suspicious: this has a division operation with a user-provided divisor. Printing absCoord here yielded all zeroes, which was about where my remaining energy was at this Friday morning, so I mangled the code slightly:
static LLVMValueRef
lp_build_cube_imapos(struct lp_build_context *coord_bld, LLVMValueRef coord)
{
/* ima = +0.5 / abs(coord); */
LLVMValueRef posHalf = lp_build_const_vec(coord_bld->gallivm, coord_bld->type, 0.5);
LLVMValueRef absCoord = lp_build_abs(coord_bld, coord);
/* avoid div by zero */
LLVMValueRef sel = lp_build_cmp(coord_bld, PIPE_FUNC_GREATER, absCoord, coord_bld->zero);
LLVMValueRef div = lp_build_div(coord_bld, posHalf, absCoord);
LLVMValueRef ima = lp_build_select(coord_bld, sel, div, coord_bld->zero);
return ima;
}
And blammo, now that Gallivm could no longer divide by zero, the test was now passing. And so were a lot of others.
There’s been some speculation about how close Zink really is to being “useful”, where “useful” is determined by the majesty of passing GL4.6 CTS.
So how close is it? The answer might shock you.
Remaining Lavapipe Fails: 17
Remaining ANV Fails (Icelake): 9
Big Triangle better keep a careful eye on us now.
![]() |
|
February 17, 2022 | |
![]() |
Around 2 years ago while I was working on tessellation support for llvmpipe, and running the heaven benchmark on my Ryzen, I noticed that heaven despite running slowly wasn't saturating all the cores. I dug in a bit, and found that llvmpipe despite threading rasterization, fragment shading and blending stages, never did anything else while those were happening.
I dug into the code as I clearly remembered seeing a concept of a "scene" where all the primitives were binned into and then dispatched. It turned out the "scene" was always executed synchronously.
At the time I wrote support to allow multiple scenes to exist, so while one scene was executing the vertex shading and binning for the next scene could execute, and it would be queued up. For heaven at the time I saw some places where it would build 36 scenes. However heaven was still 1fps with tess, and regressions in other areas were rampant, and I mostly left them in a branch.
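As a toy illustration of the idea (plain C with pthreads, nothing to do with the actual llvmpipe code): while a worker thread executes scene N, the producer can already go and bin scene N+1, and the CPU only waits on a per-scene fence when it actually needs the results.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct fence {
   pthread_mutex_t lock;
   pthread_cond_t cond;
   bool signalled;
};

static void fence_init(struct fence *f)
{
   pthread_mutex_init(&f->lock, NULL);
   pthread_cond_init(&f->cond, NULL);
   f->signalled = false;
}

static void fence_signal(struct fence *f)
{
   pthread_mutex_lock(&f->lock);
   f->signalled = true;
   pthread_cond_broadcast(&f->cond);
   pthread_mutex_unlock(&f->lock);
}

static void fence_wait(struct fence *f)
{
   pthread_mutex_lock(&f->lock);
   while (!f->signalled)
      pthread_cond_wait(&f->cond, &f->lock);
   pthread_mutex_unlock(&f->lock);
}

struct scene {
   int id;
   struct fence done;
};

/* stand-in for the threaded rasterization/fragment/blend stages */
static void *execute_scene(void *arg)
{
   struct scene *s = arg;
   printf("executing scene %d\n", s->id);
   fence_signal(&s->done);
   return NULL;
}

int main(void)
{
   struct scene scenes[4];
   pthread_t threads[4];

   for (int i = 0; i < 4; i++) {
      scenes[i].id = i;
      fence_init(&scenes[i].done);
      /* queue scene i for asynchronous execution... */
      pthread_create(&threads[i], NULL, execute_scene, &scenes[i]);
      /* ...and immediately carry on binning the next scene instead of blocking */
      printf("binning the next scene while scene %d executes\n", i);
   }

   /* only wait on a scene's fence once its results are actually needed */
   for (int i = 0; i < 4; i++) {
      fence_wait(&scenes[i].done);
      pthread_join(threads[i], NULL);
   }
   return 0;
}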
The reason so many things were broken by the patches was that large parts of llvmpipe, and also lavapipe, weren't ready for the async pipeline processing. The concept of a fence after the pipeline finished was there, but wasn't used properly everywhere. A lot of operations assumed there was nothing going on behind the scenes so never fenced. Lots of things like queries broke due to the fact that a query would always be ready in the old model, but now query availability could return unavailable like a real hw driver. Resource tracking existed but was incomplete, so knowing when to flush wasn't always accurate. Presentation was broken due to incorrect waiting for both GL and Lavapipe. Lavapipe needed semaphore support that actually did things, as apps used it between the render and present pipeline pieces.
Mesa CI recently got some paraview traces added to it, and I was doing some perf traces with them. Paraview is a data visualization tool, and it generates vertex heavy workloads, as opposed to compositors and even games. It turned out binning was most of the overhead, and I realized the overlapping series could help this sort of workload. I dusted off the patch series and nailed down all the issues.
Emma Anholt ran some benchmarks on the results with the paraview traces.
I've got it all lined up in a merge request and it doesn't break CI anymore, so hopefully get it landed in the next while, once I cleanup any misc bits.
![]() |
|
February 16, 2022 | |
![]() |
Earlier this week, Neil McGovern announced that he is due to be stepping down as the Executive Director of the GNOME Foundation later this year. As the President of the board and Neil’s effective manager together with the Executive Committee, I wanted to take a moment to reflect on his achievements in the past 5 years and explain a little about what the next steps would be.
Since joining in 2017, Neil has overseen a productive period of growth and maturity for the Foundation, increasing our influence both within the GNOME project and the wider Free and Open Source Software community. Here’s a few highlights of what he’s achieved together with the Foundation team and the community:
Recognizing and appreciating the amazing progress that GNOME has made with Neil’s support, the search for a new Executive Director provides the opportunity for the Foundation board to set the agenda and next high-level goals we’d like to achieve together with our new Executive Director.
In terms of the desktop, applications, technology, design and development processes, whilst there are always improvements to be made, the board’s general feeling is that thanks to the work of our amazing community of contributors, GNOME is doing very well in terms of what we produce and publish. Recent desktop releases have looked great, highly polished and well-received, and the application ecosystem is growing and improving through new developers and applications bringing great energy at the moment. From here, our largest opportunity in terms of growing the community and our user base is being able to articulate the benefits of what we’ve produced to a wider public audience, and deliver impact which allows us to secure and grow new and sustainable sources of funding.
For individuals, we are able to offer an exceedingly high quality desktop experience and a broad range of powerful applications which are affordable to all, backed by a nonprofit which can be trusted to look after your data, digital security and your best interests as an individual. From the perspective of being a public charity in the US, we also have the opportunity to establish programs that draw upon our community, technology and products to deliver impact such as developing employable skills, incubating new Open Source contributors, learning to program and more.
For our next Executive Director, we will be looking for an individual with existing experience in that nonprofit landscape, ideally with prior experience establishing and raising funds for programs that deliver impact through technology, and appreciation for the values that bring people to Free, Open Source and other Open Culture organizations. Working closely with the existing members, contributors, volunteers and whole GNOME community, and managing our relationships with the Advisory Board and other key partners, we hope to find a candidate that can build public awareness and help people learn about, use and benefit from what GNOME has built over the past two decades.
Neil has agreed to stay in his position for a 6 month transition period, during which he will support the board in our search for a new Executive Director and support a smooth hand-over. Over the coming weeks we will publish the job description for the new ED, and establish a search committee who will be responsible for sourcing and interviewing candidates to make a recommendation to the board for Neil’s successor – a hard act to follow!
I’m confident the community will join me and the board in personally thanking Neil for his 5 years of dedicated service in support of GNOME and the Foundation. Should you have any queries regarding the process, or offers of assistance in the coming hiring process, please don’t hesitate to join the discussion or reach out directly to the board.
![]() |
|
February 15, 2022 | |
![]() |
After roughly 20 years and counting up to 0.40 in release numbers, I've decided to call the next version of the xf86-input-wacom driver the 1.0 release. [1] This cycle has seen a bulk of development (>180 patches), which is roughly as much as the last 12 releases together. None of these patches actually added user-visible features, so let's talk about technical debt and what turned out to be an interesting way of reducing it.
The wacom driver's git history goes back to 2002 and the current batch of maintainers (Ping, Jason and I) have all been working on it for one to two decades. It used to be a Wacom-only driver but with the improvements made to the kernel over the years the driver should work with most tablets that have a kernel driver, albeit some of the more quirky niche features will be more limited (but your non-Wacom devices probably don't have those features anyway).
The one constant was always: the driver was extremely difficult to test, something common to all X input drivers. Development is a cycle of restarting the X server a billion times; testing is mostly plugging hardware in and moving things around in the hope that you can spot the bugs. On a driver that doesn't move much, this isn't necessarily a problem. Until a bug comes along that requires some core rework of the event handling - in the kernel, libinput and, yes, the wacom driver.
After years of libinput development, I wasn't really in the mood for the whole "plug every tablet in and test it, for every commit". In a rather caffeine-driven development cycle [2], the driver was separated into two logical entities: the core driver and the "frontend". The default frontend is the X11 one which is now a relatively thin layer around the core driver parts, primarily to translate events into the X Server's API. So, not unlike libinput + xf86-input-libinput in terms of architecture. In ascii-art:
                                              |
                     +--------------------+   |  big giant
/dev/input/event0 -> | core driver | x11  |-> |  X server
                     +--------------------+   |  process
                                              |
Now, that logical separation means we can have another frontend which I implemented as a relatively light GObject wrapper and is now a library creatively called libgwacom:
                     +-----------------------+  |
/dev/input/event0 -> | core driver | gwacom  |--|  tools or test suites
                     +-----------------------+  |

This isn't a public library or API and it's very much focused on the needs of the X driver so there are some peculiarities in there. What it allows us though is a new wacom-record tool that can hook onto event nodes and print the events as they come out of the driver. So instead of having to restart X and move and click things, you get this:
$ ./builddir/wacom-record
wacom-record:
version: 0.99.2
git: xf86-input-wacom-0.99.2-17-g404dfd5a
device:
path: /dev/input/event6
name: "Wacom Intuos Pro M Pen"
events:
- source: 0
event: new-device
name: "Wacom Intuos Pro M Pen"
type: stylus
capabilities:
keys: true
is-absolute: true
is-direct-touch: false
ntouches: 0
naxes: 6
axes:
- {type: x , range: [ 0, 44800], resolution: 200000}
- {type: y , range: [ 0, 29600], resolution: 200000}
- {type: pressure , range: [ 0, 65536], resolution: 0}
- {type: tilt_x , range: [ -64, 63], resolution: 57}
- {type: tilt_y , range: [ -64, 63], resolution: 57}
- {type: wheel , range: [ -900, 899], resolution: 0}
...
- source: 0
mode: absolute
event: motion
mask: [ "x", "y", "pressure", "tilt-x", "tilt-y", "wheel" ]
axes: { x: 28066, y: 17643, pressure: 0, tilt: [ -4, 56], rotation: 0, throttle: 0, wheel: -108, rings: [ 0, 0] }
This is YAML, which means we can process the output for comparison or just to search for things.

A tool to quickly analyse data makes for faster development iterations, but it's still a far cry from reliable regression testing (and writing a test suite is a daunting task at best). But one nice thing about GObject is that it's accessible from other languages, including Python. So our test suite can be in Python, using pytest and all its capabilities, plus all the advantages Python has over C. Most of driver testing comes down to: create a uinput device, set up the driver with some options, push events through that device and verify they come out of the driver in the right sequence and format. I don't need C for that. So there's a pull request sitting out there doing exactly that - adding a pytest test suite for a 20-year-old X driver written in C. That this is a) possible and b) a lot less work than expected got me quite unreasonably excited. If you do have to maintain an old C library, maybe consider whether it's possible to do the same, because there's nothing like the warm fuzzy feeling a green tick on a CI pipeline gives you.
[1] As scholars of version numbers know, they make as much sense as your stereotypical uncle's facebook opinion, so why not.
[2] The Colombian GDP probably went up a bit
![]() |
|
February 05, 2022 | |
![]() |
(I nearly went with clutterectomy, but that would be doing our old servant project a disservice.)
Yesterday, I finally merged the work-in-progress branch porting totem to GStreamer's GTK GL sink widget, undoing a lot of the work done in 2011 and 2014 to port the video widget and then to finally make use of its features.
But GTK has been modernised (in GTK3 but in GTK4 even more so), GStreamer grew a collection of GL plugins, Wayland and VA-API matured and clutter (and its siblings clutter-gtk, and clutter-gst) didn't get the resources they needed to follow.
A screenshot with practically no changes, as expected
The list of bug fixes and enhancements is substantial:
Until the port to GTK4, we expect an overall drop in performance on systems where there's no VA-API support, and the GTK4 port should bring it to par with the fastest of players available for GNOME.
You can install a Preview version right now by running:
$ flatpak install --user https://flathub.org/beta-repo/appstream/org.gnome.Totem.Devel.flatpakref
and filing bug in the GNOME GitLab.
Next stop, a GTK4 port!
![]() |
|
February 03, 2022 | |
![]() |
I always do one of these big roundups for each Mesa release, so here’s what you can expect to see from zink in the upcoming release:
--i-know-this-is-not-a-benchmark to see the real speed)

All around looking like another great release.
Yes, we’re here.
After literally years of awfulness, I’ve finally solved (for good) the debacle that is point size conversion from GL to Vulkan.
What’s so awful about it, you might be asking. How hard can it be to just add gl_PointSize to a shader, you follow up, as you push your glasses higher up your nose.
Allow me to explain.
In Vulkan, there is exactly one method for setting the size of points: the gl_PointSize shader output controls it, and that’s it.
14.4 Points If program point size mode is enabled, the derived point size is taken from the (potentially clipped) shader built-in gl_PointSize written by the last vertex processing stage and clamped to the implementation-dependent point size range. If the value written to gl_PointSize is less than or equal to zero, or if no value was written to gl_PointSize, results are undefined. If program point size mode is disabled, the derived point size is specified with the command
void PointSize( float size );
Tessellation evaluation shaders have a number of built-in output variables used to pass values to equivalent built-in input variables read by subsequent shader stages or to subsequent fixed functionality vertex processing pipeline stages. These variables are gl_Position, gl_PointSize, gl_ClipDistance, and gl_CullDistance, and all behave identically to equivalently named vertex shader outputs.
The built-in output gl_PointSize, if written, holds the size of the point to be rasterized, measured in pixels
In short, if PROGRAM_POINT_SIZE is enabled, then points are sized based on the gl_PointSize shader output of the last vertex stage.
In OpenGL ES (versions 2.0, 3.0, 3.1):
The point size is taken from the shader built-in gl_PointSize written by the vertex shader, and clamped to the implementation-dependent point size range.
In OpenGL ES (version 3.2):
The point size is determined by the last vertex processing stage. If the last vertex processing stage is not a vertex shader, the point size is 1.0. If the last vertex processing stage is a vertex shader, the point size is taken from the shader built-in gl_PointSize written by the vertex shader, and is clamped to the implementation-dependent point size range.
Thus for an ES context, the point size always comes from the last vertex stage, which means it can be anything it wants to be if that stage is a vertex shader, and it cannot be written at all in any other stage because gl_PointSize is not a valid output there (this last, bolded part is going to be really funny in a minute or two).
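Boiled down, the decision the GL side hands you looks roughly like this (a hedged sketch with made-up parameter names, not zink’s actual code):

#include <stdbool.h>

/* sketch: does the rasterized point size come from the shader's gl_PointSize,
 * or from the fixed-function PointSize() state? */
static bool
point_size_comes_from_shader(bool is_gles, bool program_point_size,
                             bool last_stage_is_vertex_shader)
{
   if (is_gles)
      return last_stage_is_vertex_shader; /* otherwise the point size is just 1.0 */
   return program_point_size;             /* desktop GL: PROGRAM_POINT_SIZE */
}

Vulkan, of course, has no such switch: whatever the last vertex stage writes to PointSize is what gets rasterized.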
What do the specs agree on?
gl_PointSize
Literally that’s it.
Awesome.
As we know, Vulkan has a very simple and clearly defined model for point size:
The point size is taken from the (potentially clipped) shader built-in PointSize written by:
• the geometry shader, if active;
• the tessellation evaluation shader, if active and no geometry shader is active;
• the vertex shader, otherwise
- 27.10. Points
It really can be that simple.
So one would think that we can just hook up some conditionals based on the GL rules and then export the correct value.
That would be easy.
Simple.
It would make sense.
hahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahaha
It gets worse (obviously).
gl_PointSize is a valid XFB varying, which means it must be exported correctly to the transform feedback buffer. For the ES case, it’s simple, but for desktop GL, there’s a little something called PROGRAM_POINT_SIZE state which totally fucks that up. Because, as we know, Vulkan has exactly one way of setting point size, and it’s the shader variable.
Thus, if there is a desktop GL context using a vertex shader as its last vertex stage for a draw, and if that shader has its own gl_PointSize value, this value must be exported for XFB.
But not used for point rasterization.
…Because in order to pass CTS for ES 3.2, your implementation also has to be able to violate spec.
Remember above when I said it was going to be funny that gl_PointSize is not a legal output for non-vertex stages in ES contexts?
CTS explicitly has “wide points” tests which verify illegal point sizes that are exported by the tessellation and geometry shader stages. Isn’t that cool?
Also, let’s be reasonable people for a moment, who actually wants a point that’s just one pixel? Nobody can see that on their 8k display.
I hate GL point size, and so should you.
![]() |
|
February 02, 2022 | |
![]() |
I keep meaning to blog, but then I get sidetracked by not blogging.
Truly a tough life.
So what’s new in zink-land?
Nothing too exciting. Mostly bug fixes. I managed to sneak ARB_sparse_texture_clamp in for zink just before the branchpoint, so all the sparse texturing features supported by Mesa will be supported by zink. But only on NVIDIA since they’re the only driver that fully supports Vulkan sparse texturing.
The past couple days I’ve been doing some truly awful things with gl_PointSize to try and make this conformant for all possible cases. It’s a real debacle, and I’ll probably post more in-depth about it so everyone can get a good chuckle.
The one unusual part of my daily routine is that I haven’t rebased my testing branch in at least a couple weeks now since I’ve been trying to iron out regressions. Will I find that everything crashes and fails as soon as I do?
Probably.
More posts to come.
![]() |
|
February 01, 2022 | |
![]() |
There was an article on Open for Everyone today about Nobara, a Fedora-based distribution optimized for gaming. Now, I have no beef with Tomas Crider or any other creator/maintainer of a distribution targeting a specific use case. In fact they are usually trying to solve or work around real problems and make things easier for people. That said, I have for years felt that the need for these things is a failing in itself, and it has been a goal for me in the context of Fedora Workstation to figure out what we can do to remove the need for ‘usecase distros’. So I thought it would be of interest if I talk a bit about how I have been viewing these things and the concrete efforts we have taken to reduce the need for usecase-oriented distributions. It is worth noting that the usecase distributions have of course proven useful for this too, in the sense that they to some degree also function as a very detailed ‘bug report’ for why the general-case OS is not enough.
Before I start, you might say, but isn’t Fedora Workstation a usecase OS too? You often talk about having a developer focus? Yes, developers are something we care deeply about, but for instance that doesn’t mean we pre-install 50 IDEs in Fedora Workstation. Fedora Workstation should be a great general purpose OS out of the box, and then we should have tools like GNOME Software and Toolbx available to let you quickly and easily tweak it into your ideal development system. But at the same time, by being a general purpose OS at heart, it should be equally easy to install Steam and Lutris to start gaming or install Carla and Ardour to start doing audio production. Or install OBS Studio to do video streaming.
Looking back over the years, one of the first conclusions I drew from looking at all the usecase distributions out there was that they were often mostly the standard distro, but with a carefully procured list of pre-installed software; for instance the old Fedora game spin was exactly that, a copy of Fedora with a lot of games pre-installed. So why was this valuable to people? Those of us who have been around for a while remember that the average Linux ‘app store’ was a very basic GUI which listed available software by name (usually quite cryptic names) and at best with a small icon. There was almost no other metadata available and search functionality was limited at best. So finding software was not simple, as it was usually more of a ‘search the internet and if you find something interesting see if it’s packaged for your distro’. So the usecase distros that focused on having procured pre-installed software, be that games, pro-audio software, graphics tools or whatever was their focus, were basically responding to the fact that finding software was non-trivial, and a lot of people maybe missed out on software that could be useful to them since they simply never learned about its existence.
So when we kicked off the creation of GNOME Software, one of the big focuses early on was to create a system for providing good metadata and displaying that metadata in a useful manner. As an end user the most obvious change was of course the richer UI of GNOME Software, but maybe just as important was the creation of AppStream, a specification for how applications should ship metadata, allowing GNOME Software and others to display much more in-depth information about the application, provide screenshots and so on.
So I do believe that between the work on GNOME Software as the actual UI and the work with many stakeholders in the Linux ecosystem to define metadata standards like AppStream, we made software a lot more discoverable on Linux and thus reduced the need for pre-loading significantly. This work also provided an important baseline for things like Flathub to thrive, as it then had a clear way to provide metadata about the applications it hosts.
We do continue to polish that user experience on an ongoing basis, but I do feel we reduced the need to pre-load a ton of software very significantly already with this.
Of course another aspect of this is application availability, which is why we worked to ensure things like Steam are available in GNOME Software on Fedora Workstation, and which we have now expanded on by starting to include more and more software listings from Flathub. These things make it easy for our users to find the software they want, but at the same time we are still staying true to our mission of only shipping free software by default in Fedora.
The second major reason for usecase distributions has been that the generic version of the OS didn’t really have the right settings or setup to handle an important usecase. I think pro-audio is the best example of this, where usecase distros like Fedora Jam or Ubuntu Studio popped up. Pre-installing a lot of relevant software was definitely part of their DNA too, but there were also other issues involved, like the need for a special audio setup with JACK and often also kernel real-time patches applied. When we decided to include pro-audio support in PipeWire, resolving these issues was a big part of it. I strongly believe that we should be able to provide a simple and good out-of-the-box experience for musicians and audio engineers on Linux without needing the OS to be specifically configured for the task. The strong and positive response we have gotten from the pro-audio community for PipeWire points, I believe, to us moving in the right direction there. I am not claiming things are 100% there yet, but we feel very confident that we will get there with PipeWire and make the Pro-Audio folks full-fledged members of the Fedora WS community. Interestingly, we also spent quite a bit of time trying to ensure the pro-audio tools in Fedora have proper AppStream metadata so that they would appear in GNOME Software as part of this. One area we are still looking at is the real-time kernel stuff; our current take is that we do believe the remaining unmerged patches are not strictly needed anymore, as most of the important stuff has already been merged, but we are monitoring it as we keep developing and benchmarking PipeWire for the pro-audio usecase.
Another reason I often saw driving the creation of a usecase distribution is special hardware support, and not necessarily all that special; the NVidia driver, for instance, has triggered a lot of these attempts. The NVidia driver is challenging on a lot of levels and has been something we have been constantly working on. There were technical issues, for instance the NVidia driver and Mesa fighting over who owned the OpenGL.so implementation, which we fixed with the introduction of glvnd a few years ago. But for a distro like Fedora that also cares deeply about free and open source software, it also provided us with a lot of philosophical challenges. We had to answer the question of how we could, on one hand, make sure our users had easy access to the driver without abandoning our principle of Fedora only shipping free software out of the box. I think we found a good compromise today where the NVidia driver is available in Fedora Workstation for easy install through GNOME Software, while still defaulting to Nouveau out of the box. That said, this is a part of the story where we are still hard at work to improve things further, and while I am not at liberty to mention any details, I think I can at least mention that we are meeting with our engineering counterparts at NVidia on almost a weekly basis to discuss how to improve things, not just for graphics, but around compute and other shared areas of interest. The most recent public result of that collaboration was of course the XWayland support in recent NVidia drivers, but I promise you that this is something we keep focusing on, and I expect that we will be able to share more cool news and important progress over the course of the year, both for users of the NVidia binary driver and for users of Nouveau.
What are we still looking at in terms of addressing issues like this? Well, one thing we are talking about is whether there is value in, or need for, a facility to install specific software based on detected hardware or software. For instance, if we detect a high-end gaming mouse connected to your system, should we install Piper/ratbag, or at least have GNOME Software suggest it? And if we detect that you installed Lutris and Steam, are there other tools we should recommend you install, like the gamemode GNOME Shell extension? It is a somewhat hard question to answer, which is why we are still pondering it; on one hand it seems like a nice addition, but such connections would mean that we need to maintain a big database, which isn’t trivial, and having something running on your system to, let’s say, check for those high-end mice does add a little overhead that might be a waste for many users.
Another area that we are looking at is the issue of codecs. We did a big effort a couple of years ago and got AC3, MP3, AAC and MPEG-2 video cleared for inclusion, and also got the OpenH264 implementation from Cisco made available. That solved a lot of issues, but today, with so many more people getting into media creation, I believe we need to take another stab at it and, for instance, try to get reliable hardware-accelerated video encoding and decoding. I am not ready to announce anything, but we have a few ideas and leads we are looking at for how to move the needle there in a significant way.
So to summarize, I am not criticizing anyone for putting together what I call usecase distros, but at the same time I really want to get to a point where they are rarely needed, because we should be able to cater to most needs within the context of a general purpose Linux operating system. That said, I do appreciate the effort of these distro makers, both in terms of trying to help users have a better experience on Linux and in indirectly helping us by showcasing potential solutions or highlighting the major pain points that still need addressing in a general purpose Linux desktop operating system.
![]() |
|
January 26, 2022 | |
![]() |
(This post was first published with Collabora on Jan 25, 2022.)
My work on Wayland and Weston color management and HDR support has been full of learning new concepts and terms. Many of them are crucial for understanding how color works. I started out so ignorant that I did not know how to blend two pixels together correctly. I did not even know that I did not know - I was just doing the obvious blend, and that was wrong. Now I think I know what I know and do not know, and I also feel that most developers around window systems and graphical applications are as uneducated as I was.
Color knowledge is surprisingly scarce in my field it seems. It is not enough that I educate myself. I need other people to talk to, to review my work, and to write patches that I will be reviewing. With the hope of making it even a little bit easier to understand what is going on with color I wrote the article: A Pixel's Color.
The article goes through most of the important concepts, trying to give you, a programmer, a vague idea of what they are. It does not explain everything too well, because I want you to be able to read through it, but it still got longer than I expected. My intention is to tell you about things you might not know about, so that you would at least know what you do not know.
A warm thank you to everyone who reviewed and commented on the article.
Originally, the Wayland CM&HDR extension merge request included documentation about how color management would work on Wayland. The actual protocol extension specification cannot even begin to explain all that.
To make that documentation easier to revise and contribute to, I proposed to move it into a new repository: color-and-hdr. That also allowed us to widen the scope of the documentation, so we can easily include things outside of Wayland: EGL, Vulkan WSI, DRM KMS, and more.
I hope that color-and-hdr documentation repository gains traction and becomes a community maintained effort in gathering information about color and HDR on Linux, and that we can eventually move it out of my personal namespace to become truly community owned.
![]() |
|
January 20, 2022 | |
![]() |
After weeks of hunting for the latest rumors of jekstrand’s future job prospects, I’ve finally done it: zink now supports more extensions than any other OpenGL driver in Mesa.
That’s right.
Check it on mesamatrix if you don’t believe me.
A couple days ago I merged support for the external memory extensions that I’d been putting off, and today we got sparse textures thanks to Qiang Yu at AMD doing 99% of the work to plumb the extensions through the rest of Mesa.
There’s even another sparse texture extension, which I’ve already landed all the support for in zink, that should be enabled for the upcoming release.
Zink (sometimes) has the performance, now it has the features, so naturally the focus now is going to shift to compatibility and correctness. Kopper is going to mostly take care of the former, which leaves the latter. There aren’t a ton of CTS cases failing.
Ideally, by the end of the year, there won’t be any.
![]() |
|
January 18, 2022 | |
![]() |
The last thing I remember Thursday was trying to get the truth out about Jason Ekstrand’s new role. Days have now passed, and I can’t remember what I was about to say or what I did over the extended weekend.
But Big Triangle sure has been busy. It’s clear I was on to something, because otherwise they wouldn’t have taken such drastic measures. Look at this: jekstrand is claiming Collabora has hired him. This is clearly part of a larger coverup, and the graphics news media are eating it up.
Congratulations to him, sure, but it’s obvious this is just another attempt to throw us off the trail. We may never find out what Jason’s real new job is, but that doesn’t mean we’re going to stop following the hints and clues as they accumulate. Sooner or later, Big Triangle is going to slip up, and then we’ll all know the truth.
In the meantime, zink goes on. I’ve spent quite a long while tinkering with NVIDIA and getting a solid baseline of CTS results. At present, I’m down to about 800 combined fails for GL 4.6 and ES 3.2. Given that lavapipe is at around 80 and RADV is just over 600, both excluding the confidential test suites, this is a pretty decent start.
This is probably going to be the last time I’m on nvidia for a while, and it hasn’t been too bad overall.
The (second) biggest news story for today is a rebrand.
Copper is being renamed.
It will, in fact, be named Kopper to match the zink/vulkan naming scheme.
I can’t overstate how significant this change is and how massive the ecosystem changes around it will be.
Just huge. Like the number of words in this blog post.
![]() |
|
January 13, 2022 | |
![]() |
It’s come to my attention that there’s a lot of rumors flying around about what exactly I’m doing aside from posting the latest info about where Jason Ekstrand, who coined the phrase, “If it compiles, we should ship it.” is going to end up.
Everyone knows that jekstrand’s next career move is big news—the kind of industry-shaking maneuvering that has every BigCo from Alphabet to Meta on tenterhooks. This post is going to debunk a number of the most common nonsense I’ve been hearing as well as give some updates about what else I’ve been doing besides scouring the internet for even the tiniest clue about what’s coming for this man’s career in 2022.
My sources were very keen on this rumor up until Tuesday, when, in an undisclosed IRC channel, Jason himself had the following to say:
<jekstrand> Sachiel: Contrary to popular belief, I can't work on every idea in the multiverse simultaneously. I'm limited to the same N dimensions as the rest of you.
This absolutely blew all the existing chatter out of the water. Until now, in the course of working on more sparse texturing extensions, I had the firm impression that we’d be seeing a return to form, likely with a Khronos member company, continuing to work on graphics. But now? With this? Clearly everyone was thinking too small.
Everyone except jekstrand himself, who will be taking up a position at CERN devising new display technology for particle accelerators.
Or at least, that’s what I thought until yesterday.
Unfortunately, this turned out to be bogus, no more than chaff deployed to stop us from getting to the truth because we were too close. Later, while I was pondering how buggy NVIDIA’s sparse image functionality was in the latest beta drivers and attempting to pass what few equally buggy CTS cases there were for ARB_sparse_texture2, I stumbled upon the obvious.
It’s so obvious, in fact, that everyone overlooked it because of how obvious it is.
Jason has left Intel and turned in his badge because he’s on vacation.
As everyone knows, he’s the kind of person who literally does not comprehend time in the same way that the rest of us do. It was his assessment of the HR policy that in order to take time off and leave the office, he had to quit. My latest intel (no pun intended) revealed that managers and executives alike were still scrambling, trying to figure out how to explain the company’s vacation policy using SSA-based compiler terminology, but optimizer passes left their attempts to engage him as no-ops.
Tragic.
I’ll be completely honest with you since you’ve read this far: I’ve just heard breaking news today. This is so fresh, so hot-off-the-presses that it’s almost as difficult to reveal as it is that I’ve implemented another 4 GL extensions. When the totality of all my MRs are landed, zink will become the GL driver in Mesa supporting the most extensions, and this is likely to be the case for the next release. Shocking, I know.
But not nearly as shocking as the fact that Jason is actually starting at Texas Instruments working on Vulkan for graphing calculators.
Think about it.
Anyone who knows jekstrand even the smallest amount knows how much sense this makes on both sides. He gets unlimited graphing calculators, and that’s all he had to hear before signing the contract. It’s that simple.
I know at least one person who does, and it’s not Jason Ekstrand. Because in the time that I was writing out the last (and now deprecated) information I had available, there’s been more, even later breaking news.
Copper now has a real MR open for it.
I realize it’s entirely off-topic now to be talking about some measly merge request, but it has the WSI tag on it, which means Jason has no choice but to read through the entire thing.
That’s because he’ll be working for Khronos as the Assistant Deputy Director of Presentation. If there’s presentations to be done by anyone in the graphics space, for any reason, they’ll have to go through jekstrand first. I don’t envy the responsibility and accountability that this sort of role demands; when it comes to shedsmanship, people in the presentation space are several levels above the rest.
We can only hope he’s up to the challenge.
Or at least, we would if that were actually where he was going, because I’ve just heard from
![]() |
|
January 10, 2022 | |
![]() |
This article is part of a series on how to setup a bare-metal CI system for Linux driver development. Here are the different articles so far:
In this article, we will further discuss the role of the CI gateway, and which steps we can take to simplify its deployment, maintenance, and disaster recovery.
This work is sponsored by the Valve Corporation.
As seen in the part 1 of this CI series, the testing gateway is sitting between the test machines and the public network/internet:
Internet / ------------------------------+
Public network |
+---------+--------+ USB
| +-----------------------------------+
| Testing | Private network |
Main power (120/240 V) -----+ | Gateway +-----------------+ |
| +------+--+--------+ | |
| | | Serial / | |
| Main | | Ethernet | |
| Power| | | |
+-----------+-----------------|--+--------------+ +-------+--------+ +----+----+
| Switchable PDU | | | RJ45 switch | | USB Hub |
| Port 0 Port 1 ...| Port N | | | | |
+----+------------------------+-----------------+ +---+------------+ +-+-------+
| | |
Main | | |
Power| | |
+--------|--------+ Ethernet | |
| +-----------------------------------------+ +----+----+ |
| Test Machine 1 | Serial (RS-232 / TTL) | Serial | |
| +---------------------------------------------+ 2 USB +----+ USB
+-----------------+ +---------+
The testing gateway's role is to expose the test machines to the users, either directly or via GitLab/Github. As such, it will likely require the following components:
Since the gateway is connected to the internet, both the OS and the different services need to be kept updated relatively often to prevent your CI farm from becoming part of a botnet. This creates interesting issues:
These issues can thankfully be addressed by running all the services in a container (as systemd units), started using boot2container. Updating the operating system and the services would simply be done by generating a new container, running tests to validate it, pushing it to a container registry, rebooting the gateway, then waiting while the gateway downloads and executes the new services.
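As a rough sketch of what that update flow could look like (the image name, tag, and registry below are made-up placeholders, not the actual valve-infra ones):

# Build and publish a new gateway-services container (names are placeholders)
$ podman build -t registry.example.com/ci/gateway-services:2022-01-10 .
$ podman push registry.example.com/ci/gateway-services:2022-01-10
# Then point the gateway's b2c.container= kernel argument at the new tag and
# reboot it (or power cycle it through the PDU) so boot2container pulls it.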
Using boot2container does not however fix the issue of how to update the kernel or boot configuration when the system fails to boot the current one. Indeed, if the kernel/boot2container/kernel command line are stored locally, they can only be modified via an SSH connection and thus require the machine to always be reachable; otherwise, the gateway will be bricked until an operator boots an alternative operating system.
The easiest way not to brick your gateway after a broken update is to power it through a switchable PDU (so that we can power cycle the machine), and to download the kernel, initramfs (boot2container), and the kernel command line from a remote server at boot time. This is fortunately possible even through the internet by using fancy bootloaders, such as iPXE, and this will be the focus of this article!
Tune in for part 4 to learn more about how to create the container.
iPXE is a tiny bootloader that packs a punch! Not only can it boot kernels from local partitions, but it can also connect to the internet, and download kernels/initramfs using HTTP(S). Even more impressive is the little scripting engine which executes boot scripts instead of declarative boot configurations like grub. This enables creating loops, endlessly trying to boot until one method finally succeeds!
Let's start with a basic example, and build towards a production-ready solution!
In this example, we will focus on netbooting the gateway from a local HTTP
server. Let's start by reviewing a simple script that makes iPXE acquire an IP
from the local DHCP server, then download and execute another iPXE script from
http://<ip of your dev machine>:8000/boot/ipxe
.
If any step failed, the script will be restarted from the start until a successful
boot is achieved.
#!ipxe
echo Welcome to Valve infra's iPXE boot script
:retry
echo Acquiring an IP
dhcp || goto retry # Keep retrying getting an IP, until we get one
echo Got the IP: $${netX/ip} / $${netX/netmask}
echo
echo Chainloading from the iPXE server...
chain http://<ip of your dev machine>:8000/boot.ipxe
# The boot failed, let's restart!
goto retry
Neat, right? Now, we need to generate a bootable ISO image that starts iPXE with the above script embedded as the default. We will then flash this ISO to a USB pendrive:
$ git clone git://git.ipxe.org/ipxe.git
$ make -C ipxe/src -j`nproc` bin/ipxe.iso EMBED=<boot script file>
$ sudo dd if=ipxe/src/bin/ipxe.iso of=/dev/sdX bs=1M conv=fsync status=progress
Once connected to the gateway, ensure that you boot from the pendrive, and you
should see the iPXE bootloader trying to boot, but failing to download the
script from
http://<ip of your dev machine>:8000/boot.ipxe
. So, let's write one:
#!ipxe
kernel /files/kernel b2c.container="docker://hello-world"
initrd /files/initrd
boot
This script specifies the following elements: the kernel, downloaded from
http://<ip of your dev machine>:8000/files/kernel
, with a kernel command line asking boot2container to start the hello-world container; and the initramfs (boot2container), downloaded from
http://<ip of your dev machine>:8000/files/initrd
Assuming your gateway has an architecture supported by boot2container, you may now download the kernel and initrd from boot2container's releases page. In case it is unsupported, create an issue, or a merge request to add support for it!
Now that you have created all the necessary files for the boot, start the web server on your development machine:
$ ls
boot.ipxe initrd kernel
$ python -m http.server 8000
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
<ip of your gateway> - - [09/Jan/2022 15:32:52] "GET /boot.ipxe HTTP/1.1" 200 -
<ip of your gateway> - - [09/Jan/2022 15:32:56] "GET /kernel HTTP/1.1" 200 -
<ip of your gateway> - - [09/Jan/2022 15:32:54] "GET /initrd HTTP/1.1" 200 -
If everything went well, the gateway should, after a couple of seconds, start downloading the boot script, then the kernel, and finally the initramfs. Once done, your gateway should boot Linux, run docker's hello-world container, then shut down.
Congratulations on netbooting your gateway! However, the current solution has one annoying constraint: it requires a trusted local network and server because we are using HTTP rather than HTTPS... On an untrusted network, a man in the middle could override your boot configuration and take over your CI...
If we were using HTTPS, we could download our boot script/kernel/initramfs directly from any public server, even GIT forges, without fear of any man in the middle! Let's try to achieve this!
In the previous section, we managed to netboot our gateway from the local network. In this section, we try to improve on it by netbooting using HTTPS. This enables booting from a public server hosted at places such as Linode for $5/month.
As I said earlier, iPXE supports HTTPS. However, if you are anything like me, you may be wondering how such a small bootloader could know which root certificates to trust. The answer is that iPXE generates an SSL certificate at compilation time which is then used to sign all of the root certificates trusted by Mozilla (the default), or any set of certificates you may want. See iPXE's crypto page for more information.
WARNING: iPXE currently does not like certificates exceeding 4096 bits. This can be a limiting factor when trying to connect to existing servers. We hope to one day fix this bug, but in the meantime, you may be forced to use a 2048-bit Let's Encrypt certificate on a self-hosted web server. See our issue for more information.
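For example, requesting such a certificate with certbot could look like this (the domain is a placeholder, and the exact invocation depends on how you installed certbot):

# Request a 2048-bit RSA certificate for the boot server (domain is a placeholder)
$ sudo certbot certonly --nginx --rsa-key-size 2048 -d ci-gateway.example.com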
WARNING 2: iPXE only supports a limited set of ciphers. You'll need to make
sure they are listed in nginx's ssl_ciphers
configuration:
AES-128-CBC:AES-256-CBC:AES256-SHA256 and AES128-SHA256:AES256-SHA:AES128-SHA
To get started, install NGINX + Let's encrypt on your server, following your
favourite tutorial, copy the boot.ipxe
, kernel
, and initrd
files to the root
of the web server, then make sure you can download them using your browser.
With this done, we just need to edit iPXE's general config C header to enable HTTPS support:
$ sed -i 's/#undef\tDOWNLOAD_PROTO_HTTPS/#define\tDOWNLOAD_PROTO_HTTPS/' ipxe/src/config/general.h
Then, let's update our boot script to point to the new server:
#!ipxe
echo Welcome to Valve infra's iPXE boot script
:retry
echo Acquiring an IP
dhcp || goto retry # Keep retrying getting an IP, until we get one
echo Got the IP: $${netX/ip} / $${netX/netmask}
echo
echo Chainloading from the iPXE server...
chain https://<your server>/boot.ipxe
# The boot failed, let's restart!
goto retry
And finally, let's re-compile iPXE, reflash the gateway pendrive, and boot the gateway!
$ make -C ipxe/src -j`nproc` bin/ipxe.iso EMBED=<boot script file>
$ sudo dd if=ipxe/src/bin/ipxe.iso of=/dev/sdX bs=1M conv=fsync status=progress
If all went well, the gateway should boot and run the hello world container once again! Let's continue our journey by provisioning and backing up the local storage of the gateway!
In the previous section, we managed to control the boot configuration of our gateway via a public HTTPS server. In this section, we will improve on that by provisioning and backing up any local files the gateway container may need.
Boot2container has a nice feature that enables you to create a volume, provision it from a bucket in an S3-compatible cloud storage, and sync back any local changes. This is done by adding the following arguments to the kernel command line:
b2c.minio="s3,${s3_endpoint},${s3_access_key_id},${s3_access_key}"
: URL and credentials to the S3 service
b2c.volume="perm,mirror=s3/${s3_bucket_name},pull_on=pipeline_start,push_on=changes,overwrite,delete"
: Create a perm podman volume, mirror it from the bucket ${s3_bucket_name} when booting the gateway, then push any local change back to the bucket. Delete or overwrite any existing file when mirroring.
b2c.container="-ti -v perm:/mnt/perm docker://alpine"
: Start an alpine container, and mount the perm container volume to /mnt/perm
Pretty, isn't it? Provided that your bucket is configured to save all the revisions of every file, this trick will kill three birds with one stone: initial provisioning, backup, and automatic recovery of the files in case the local disk fails and gets replaced with a new one!
The issue is that the boot configuration is currently open for everyone to see, if they know where to look. This means that anyone could tamper with your local storage or even use your bucket to store their files...
To prevent attackers from stealing our S3 credentials by simply pointing their web browser to the right URL, we can authenticate incoming HTTPS requests by using an SSL client certificate. A different certificate would be embedded in every gateway's iPXE bootloader and checked by NGINX before serving the boot configuration for this precise gateway. By limiting access to a machine's boot configuration to its associated client certificate fingerprint, we even prevent compromised machines from accessing the data of other machines.
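As a minimal sketch of the moving parts (file names and subject are placeholders; the valve-infra component mentioned below automates the real thing), creating a per-gateway client certificate and the fingerprint NGINX would match against could look like:

# Illustrative only: self-signed client certificate for one gateway
$ openssl req -x509 -newkey rsa:2048 -nodes -days 365 -subj "/CN=gateway-1" \
      -keyout gateway-1.key -out gateway-1.crt
# The fingerprint to allow on the NGINX side
$ openssl x509 -in gateway-1.crt -noout -sha256 -fingerprint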
Additionally, secrets should not be kept in the kernel command line, as any
process executed on the gateway could easily gain access to it by reading
/proc/cmdline
. To address this issue, boot2container has a
b2c.extra_args_url
argument to source additional parameters from this URL.
If this URL is generated every time the gateway is downloading its boot
configuration, can be accessed only once, and expires soon after being created,
then secrets can be kept private inside boot2container and not be exposed to
the containers it starts.
Implementing these suggestions in a blog post is a little tricky, so I suggest you check out valve-infra's ipxe-boot-server component for more details. It provides a Makefile that makes it super easy to generate working certificates and create bootable gateway ISOs, a small python-based web service that will serve the right configuration to every gateway (including one-time secrets), and step-by-step instructions to deploy everything!
Assuming you decided to use this component and followed the README, you should then configure the gateway in this way:
$ pwd
/home/ipxe/valve-infra/ipxe-boot-server/files/<fingerprint of your gateway>/
$ ls
boot.ipxe initrd kernel secrets
$ cat boot.ipxe
#!ipxe
kernel /files/kernel b2c.extra_args_url="${secrets_url}" b2c.container="-v perm:/mnt/perm docker://alpine" b2c.ntp_peer=auto b2c.cache_device=auto
initrd /files/initrd
boot
$ cat secrets
b2c.minio="bbz,${s3_endpoint},${s3_access_key_id},${s3_access_key}" b2c.volume="perm,mirror=bbz/${s3_bucket_name},pull_on=pipeline_start,push_on=changes,overwrite,delete"
And that's it! We finally made it to the end, and created a secure way to provision our CI gateways with the wanted kernel, Operating System, and even local files!
When Charlie Turner and I started designing this system, we felt it would be a clean and simple way to solve our problems with our CI gateways, but the implementation ended up being quite a bit trickier than the high-level view... especially the SSL certificates! However, the certainty that we can now deploy updates and fix our CI gateways even when they are physically inaccessible to us (provided the hardware and PDU are fine) definitely made it all worth it, and made the prospect of having users depending on our systems less scary!
Let us know how you feel about it!
In this post, we focused on provisioning the CI gateway with its boot configuration and local files via the internet. This drastically reduces the risk that updating the gateway's kernel would result in an extended loss of service, as the kernel configuration can quickly be reverted by changing the boot config files, which are served from a cloud service provider.
The local file provisioning system also doubles as a backup, and disaster recovery system which will automatically kick in in case of hardware failure thanks to the constant mirroring of the local files with an S3-compatible cloud storage bucket.
In the next post, we will be talking about how to create the infra container, and how we can minimize down time during updates by not needing to reboot the gateway.
That's all for now, thanks for making it to the end!
I posted some fun fluff pieces last week to kick off the new year, but now it’s time to get down to brass tacks.
Everyone knows adding features is just flipping on the enable button. Now it’s time to see some real work.
If you don’t like real work, stop reading. Stop right now. Now.
Alright, now that all the haters are gone, let’s put on our bisecting snorkels and dive in.
The dream of 2022 was that I’d come back and everything would work exactly how I left it. All the same tests would pass, all the perf would be there, and my driver would compile.
I got two of those things, which isn’t too bad.
After spending a while bisecting and debugging last week, I attributed a number of regressions to RADV problems which probably only affect me, since there are no Vulkan CTS cases for them (yet). But today I came to the last of the problem cases: dEQP-GLES31.functional.tessellation_geometry_interaction.feedback.tessellation_output_quads_geometry_output_points
.
There’s nothing too remarkable about the test. It’s XFB, so, according to Jason Ekstrand, future Head of Graphic Wows at Pixar, it’s terrible.
What is remarkable, however, is that the test passes fine when run in isolation.
Anyone who’s anyone knows what comes next.
Then it’s another X minutes (where X is usually between 5 and 180 depending on test runtimes) to slowly pare down the caselist to the sequence which actually triggers the failure. For those not in the know, this type of failure indicates a pathological driver bug where a sequence of commands triggers different results if tests are run in a different order.
There is, to my knowledge, no ‘automatic’ way to determine exactly which tests are required to trigger this type of failure from a caselist. It would be great if there was, and it would save me (and probably others who are similarly unaware) considerable time doing this type of caselist fuzzing.
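For the record, a single step of that fuzzing looks roughly like this (the binary name and paths depend on how you build and invoke the CTS):

# run a trimmed caselist and check whether the target test still fails
./deqp-gles31 --deqp-caselist-file=shortlist.txt --deqp-log-filename=run.qpa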
Finally, I was left with this shortened caselist:
dEQP-GLES31.functional.shaders.builtin_constants.tessellation_shader.max_tess_evaluation_texture_image_units
dEQP-GLES31.functional.tessellation_geometry_interaction.feedback.tessellation_output_quads_geometry_output_points
dEQP-GLES31.functional.ubo.random.all_per_block_buffers.25
Ideally, it would be great to be able to use something like gfxreconstruct for this. I could record two captures—one of the test failing in the caselist and one where it passes in isolation—and then compare them.
Here’s an excerpt from that attempt:
"[790]vkCreateShaderModule": {
"return": "VK_SUCCESS",
"device": "0x0x4",
"pCreateInfo": {
"sType": "VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO",
"pNext": null,
"flags": 0,
"codeSize": Unhandled VkFormatFeatureFlagBits2KHR,
"pCode": "0x0x285c8e0"
},
"pAllocator": null,
"[out]pShaderModule": "0x0xe0"
},
Why is it trying to print an enum value for codeSize
you might ask?
I’m not the only one to ask, and it’s still an unresolved mystery.
I was successful in doing the comparison with gfxreconstruct, but it yielded nothing of interest.
Puzzled, I decided to try the test out on lavapipe. Would it pass?
No.
It similarly fails on llvmpipe and IRIS.
But my lavapipe testing revealed an important clue. Given that there are no synchronization issues with lavapipe, this meant I could be certain this was a zink bug. Furthermore, the test failed both when the bug was exhibiting and when it wasn’t, meaning that I could actually see the “passing” values in addition to the failing ones for comparison.
Here’s the failing error output:
Verifying feedback results.
Element at index 0 (tessellation invocation 0) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.166663, 0.5, 1)
Element at index 1 (tessellation invocation 1) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.166663, 0.5, 1)
Element at index 2 (tessellation invocation 2) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.166663, 0.5, 1)
Element at index 3 (tessellation invocation 3) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0.5, 1)
Element at index 4 (tessellation invocation 4) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0.5, 1)
Element at index 5 (tessellation invocation 5) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0.5, 1)
Element at index 6 (tessellation invocation 6) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.4, -0.433337, 0.5, 1)
Element at index 7 (tessellation invocation 7) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.4, -0.433337, 0.5, 1)
Omitted 24 error(s).
And here’s the passing error output:
Verifying feedback results.
Element at index 3 (tessellation invocation 1) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0, 1)
Element at index 4 (tessellation invocation 2) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0, 1)
Element at index 5 (tessellation invocation 3) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0, 1)
Element at index 6 (tessellation invocation 4) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.4, -0.433337, 0, 1)
Element at index 7 (tessellation invocation 5) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.4, -0.433337, 0, 1)
Element at index 8 (tessellation invocation 6) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.4, -0.433337, 0, 1)
Element at index 9 (tessellation invocation 7) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0, 1)
Element at index 10 (tessellation invocation 8) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0, 1)
Omitted 18 error(s).
This might not look like much, but to any zinkologists, there’s an immediate red flag: the Z component of the vertex is 0.5 in the failing case.
What does this remind us of?
Naturally it reminds us of nir_lower_clip_halfz
, the compiler pass which converts OpenGL Z coordinate ranges ([-1, 1]
) to Vulkan ([0, 1]
). This pass is run on the last vertex stage, but if it gets run more than once, a value of -1
becomes 0.5
.
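A quick sanity check of that theory: assuming the lowering computes z' = (z + w) * 0.5, which matches the OpFAdd followed by the multiply-by-0.5 constant in the SPIR-V diff further down, then with w = 1 a single pass maps z = -1 to 0, and running it a second time maps that 0 to 0.5, which is exactly the bogus Z value in the failing output above.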
Thus, it looks like the pass is being run twice in this test. How can this be verified?
ZINK_DEBUG=spirv
will export all spirv shaders used by an app. Therefore, dumping all the shaders for passing and failing runs should confirm that the conversion pass is being run an extra time when they’re compared. The verdict?
@@ -1,7 +1,7 @@
; SPIR-V
; Version: 1.5
; Generator: Khronos; 0
-; Bound: 23
+; Bound: 38
; Schema: 0
OpCapability TransformFeedback
OpCapability Shader
@@ -36,13 +36,28 @@
%_ptr_Output_v4float = OpTypePointer Output %v4float
%gl_Position = OpVariable %_ptr_Output_v4float Output
%v4uint = OpTypeVector %uint 4
+%uint_1056964608 = OpConstant %uint 1056964608
%main = OpFunction %void None %3
%18 = OpLabel
OpBranch %17
%17 = OpLabel
%19 = OpLoad %v4float %a_position
%21 = OpBitcast %v4uint %19
- %22 = OpBitcast %v4float %21
- OpStore %gl_Position %22
+ %22 = OpCompositeExtract %uint %21 3
+ %23 = OpCompositeExtract %uint %21 3
+ %24 = OpCompositeExtract %uint %21 2
+ %25 = OpBitcast %float %24
+ %26 = OpBitcast %float %23
+ %27 = OpFAdd %float %25 %26
+ %28 = OpBitcast %uint %27
+ %30 = OpBitcast %float %28
+ %31 = OpBitcast %float %uint_1056964608
+ %32 = OpFMul %float %30 %31
+ %33 = OpBitcast %uint %32
+ %34 = OpCompositeExtract %uint %21 1
+ %35 = OpCompositeExtract %uint %21 0
+ %36 = OpCompositeConstruct %v4uint %35 %34 %33 %22
+ %37 = OpBitcast %v4float %36
+ OpStore %gl_Position %37
OpReturn
OpFunctionEnd
And, as is the rule for such things, the fix was a simple one-liner to unset values in the vertex shader key.
It wasn’t technically a regression, but it manifested as such, and fixing it yielded another dozen or so fixes for cases which were affected by the same issue.
Blammo.
![]() |
|
January 04, 2022 | |
![]() |
It’s a busy week here at SGC. There’s emails to read, tickets to catch up on, rumors to spread about jekstrand’s impending move to Principal Engineer of Bose’s headphone compiler team, code to unwrite. The usual. Except now I’m actually around to manage everything instead of ignoring it.
Let’s do a brief catchup of today’s work items.
I said this was done yesterday, but the main CTS case for the extension is broken, so I didn’t adequately test it. Fortunately, Qiang Yu from AMD is on the case in addition to doing the original Gallium implementations for these extensions, and I was able to use a WIP patch to fix the test. And run it. And then run it again. And then run it in gdb. And then… And then…
Anyway, it all passes now, and sparse texture support is good to go once Australia comes back from vacation to review patches.
Also I fixed sparse buffer support, which I accidentally broke 6+ months ago but never noticed since only RADV implements these features and I have no games in my test list that use them.
I hate queries. Everyone knows I hate queries. The query code is the worst code in the entire driver. If I never have to open zink_query.c
again, I will still have opened it too many times for a single lifetime.
But today I hucked myself back in yet again to try and stop a very legitimate and legal replay of a Switch game from crashing. Everyone knows that anime is the real primary driver of all technology, so as soon as anyone files an anime-related ticket, all driver developers drop everything they’re doing to solve it. Unless they’re on vacation.
In this case, the problem amounted to:
Rejoice, for you can now play all your weeb games on zink if for some reason that’s where you’re at in your life.
But I’m not judging.
Yes.
I came back to the gift of a new CS:GO version which adds DXVK support, so now there’s also Gallium Nine support. It works fine.
Does it work better than other engines?
I don’t know, and I have real work to do so I’m not going to test it, but surely someone will take an interest in benchmarking such things now that I’ve heroically git add
ed a 64bit wrapper to my repo that can be used for testing.
A quick reminder that all Gallium Nine blog post references and tests happen with RadeonSI.
![]() |
|
January 03, 2022 | |
![]() |
It appears that Google created a handy tool that helps find the command which causes a GPU hang/crash. It is called Graphics Flight Recorder (GFR) and was open-sourced a year ago but didn’t receive any attention. From the readme:
The Graphics Flight Recorder (GFR) is a Vulkan layer to help trackdown and identify the cause of GPU hangs and crashes. It works by instrumenting command buffers with completion tags. When an error is detected a log file containing incomplete command buffers is written. Often the last complete or incomplete commands are responsible for the crash.
It requires VK_AMD_buffer_marker
support; however, this extension is rather trivial to implement - I had only to copy-paste the code from our vkCmdSetEvent
implementation and that was it. Note, at the moment of writing, GFR unconditionally uses VK_AMD_device_coherent_memory
, which could be manually patched out for it to run on other GPUs.
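If you want to try it yourself, the standard explicit-layer mechanism of the Vulkan loader should be enough; something along these lines (the layer name and manifest path are placeholders, take the exact values from GFR's README and build output):

export VK_LAYER_PATH=/path/to/gfr/build/manifest
export VK_INSTANCE_LAYERS=VK_LAYER_GOOGLE_gfr   # placeholder name, check the README
./your-vulkan-app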
GFR already helped me to fix hangs in “Alien: Isolation” and “Digital Combat Simulator”. In both cases the hang was in a compute shader and the output from GFR looked like:
...
- # Command:
id: 6/9
markerValue: 0x000A0006
name: vkCmdBindPipeline
state: [SUBMITTED_EXECUTION_COMPLETE]
parameters:
- # parameter:
name: commandBuffer
value: 0x000000558CFD2A10
- # parameter:
name: pipelineBindPoint
value: 1
- # parameter:
name: pipeline
value: 0x000000558D3D6750
- # Command:
id: 6/9
message: '>>>>>>>>>>>>>> LAST COMPLETE COMMAND <<<<<<<<<<<<<<'
- # Command:
id: 7/9
markerValue: 0x000A0007
name: vkCmdDispatch
state: [SUBMITTED_EXECUTION_INCOMPLETE]
parameters:
- # parameter:
name: commandBuffer
value: 0x000000558CFD2A10
- # parameter:
name: groupCountX
value: 5
- # parameter:
name: groupCountY
value: 1
- # parameter:
name: groupCountZ
value: 1
internalState:
pipeline:
vkHandle: 0x000000558D3D6750
bindPoint: compute
shaderInfos:
- # shaderInfo:
stage: cs
module: (0x000000558F82B2A0)
entry: "main"
descriptorSets:
- # descriptorSet:
index: 0
set: 0x000000558E498728
- # Command:
id: 8/9
markerValue: 0x000A0008
name: vkCmdPipelineBarrier
state: [SUBMITTED_EXECUTION_NOT_STARTED]
...
After confirming that the corresponding vkCmdDispatch
is indeed the call which hangs, in both cases I made an Amber test which fully simulated the call. For a compute shader, this is relatively easy to do since all you need is to save the decompiled shader and buffers being used by it. Luckily in both cases these Amber tests reproduced the hangs.
With standalone reproducers, the problems were much easier to debug, and fixes were made shortly: MR#14044 for “Alien: Isolation” and MR#14110 for “Digital Combat Simulator”.
Unfortunately this tool is not a panacea:
Anyway, it’s easy to use so you should give it a try.
The blog is back. I know everyone’s been furiously spamming F5 to see if there were any secret new posts, but no. There were not.
Today’s the first day of the new year, so I had to dig deep to remember how to do basic stuff like shitpost on IRC. And then someone told me jekstrand was going to Broadcom to work on Windows network drivers?
I’m just gonna say it now:
2022 has gone too far.
I know it’s early, I know some people are seeing this as a hot take, but I’m throwing the statement down before things get worse.
Knock it off, 2022.
Somehow the driver is still in the tree, still builds, and still runs. It’s a miracle.
Thus, since there were obviously no other matters more pressing than not falling behind on MesaMatrix, I spent the morning figuring out how to implement ARB_sparse_texture.
Was this the best decision when I didn’t even remember how to make meson clear its dependency cache? No. No it wasn’t.
But I did it anyway because here at SGC, we take bad ideas and turn them into code.
Your move, 2022.
![]() |
|
December 09, 2021 | |
![]() |
One of the big issues I have when working on Turnip driver development is that compiling either Mesa or VK-GL-CTS takes a lot of time to complete, no matter how powerful the embedded board is. There are reasons for that: typically those boards have a limited amount of RAM (8 GB in the best case), a slow storage disk (typically UFS 2.1 on-board storage), and CPUs that are not so powerful compared with x86_64 desktop alternatives.
Photo of the Qualcomm® Robotics RB3 Platform embedded board that I use for Turnip development.
To fix this, it is recommended to do cross-compilation; however, installing the development environment for cross-compilation can be cumbersome and error-prone depending on the toolchain you use. One alternative is to use a distributed compilation system that allows cross-compilation, like Icecream.
Icecream is a distributed compilation system that is very useful when you have to compile big projects and/or on low-spec machines, while having powerful machines in the local network that can do that job instead. However, it is not perfect: the linking stage is still done in the machine that submits the job, which, depending on the available RAM, could be too much for it (you can alleviate this a bit by using ZRAM, for example).
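For example, a quick ZRAM-backed swap device on the submitting machine could be set up along these lines (the size is arbitrary, and most distros also ship ready-made zram services you may prefer):

$ sudo modprobe zram
$ sudo zramctl --find --size 8G     # prints the device it allocated, e.g. /dev/zram0
$ sudo mkswap /dev/zram0
$ sudo swapon --priority 100 /dev/zram0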
One of the features that icecream has over its alternatives is that there is no need to install the same toolchain in all the machines as it is able to share the toolchain among all of them. This is very useful as we will see below in this post.
$ sudo apt install icecc
$ sudo dnf install icecream
You can compile it from sources.
You need to have an icecc scheduler in the local network that will balance the load among all the available nodes connected to it.
It does not matter which machine is the scheduler; you can use any of them, as it is quite lightweight. To run the scheduler, execute the following command:
$ sudo icecc-scheduler
Notice that the machine running this command is going to be the scheduler, but it will not participate in the compilation process by default unless you run the iceccd daemon as well (see next step).
First you need to run the iceccd
daemon as root. This is not needed on Debian-based systems, as its systemd unit is enabled by default.
You can do that using systemd in the following way:
$ sudo systemctl start iceccd
Or you can enable the daemon at startup time:
$ sudo systemctl enable iceccd
The daemon will connect automatically to the scheduler that is running in the local network. If that’s not the case, or there is more than one scheduler, you can run it standalone and give the scheduler’s IP as a parameter:
sudo iceccd -s <ip_scheduler>
If you use ccache (recommended option), you just need to add the following in your .bashrc
:
export CCACHE_PREFIX=icecc
To use it without ccache, you need to add its path to $PATH
envvar so it is picked before the system compilers:
export PATH=/usr/lib/icecc/bin:$PATH
If you followed the previous steps, any time you compile anything in C/C++, it will distribute the work among the fastest nodes in the network. Notice that it will take into account system load, network connection, cores, among other variables, to decide which node will compile the object file.
Remember that the linking stage is always done in the machine that submits the job.
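For instance, a Mesa build could look like the following; the only icecream-specific detail is requesting more parallel jobs than you have local cores, so that remote nodes actually receive work (the job count here is just an example):

$ meson setup build/
$ ninja -C build/ -j 48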
Icemon showing my x86_64 desktop (maxwell) cross-compiling a job for my aarch64 board (rb3).
In one x86_64 machine, you need to create a toolchain. This is not automatically done by icecc as you can have different toolchains for cross-compilation.
For example, you can install the cross-compiler from the distribution repositories:
For Debian-based systems:
sudo apt install crossbuild-essential-arm64
For Fedora:
$ sudo dnf install gcc-aarch64-linux-gnu gcc-c++-aarch64-linux-gnu
Finally, to create the toolchain to share in icecc:
$ icecc-create-env --gcc /usr/bin/aarch64-linux-gnu-gcc /usr/bin/aarch64-linux-gnu-g++
This will create a <hash>.tar.gz
file. The <hash>
is used to identify the toolchain to distribute among the nodes in case there is more than one. But don’t worry, once it is copied to a node, it won’t be copied again as it detects it is already present.
Note: it is important that the toolchain is compatible with the target machine. For example, if my aarch64 board is using Debian 11 Bullseye, it is better if the cross-compilation toolchain is created from a Debian Bullseye x86_64 machine (a VM also works), because you avoid incompatibilities like having different glibc versions.
If you have installed Debian 11 Bullseye in your aarch64, you can use my own cross-compilation toolchain for x86_64 and skip this step.
scp <hash>.tar.gz aarch64-machine-hostname:
Once the toolchain (<hash>.tar.gz
) is copied to the aarch64 machine, you just need to export this on .bashrc
:
# Icecc setup for crosscompilation
export CCACHE_PREFIX=icecc
export ICECC_VERSION=x86_64:~/<hash>.tar.gz
Just compile on the aarch64 machine and the jobs will be distributed among your x86_64 machines as well. Take into account that the jobs will also be shared with other aarch64 machines if icecc decides so, therefore there is no need to do any extra steps.
It is important to remark that the cross-compilation toolchain creation is only needed once, as icecream will copy it to all the x86_64 machines that will execute any job launched by this aarch64 machine. However, you need to copy this toolchain to any aarch64 machines that will use icecream resources for cross-compiling.
This is an interesting graphical tool to see the status of the icecc nodes and the jobs under execution.
$ sudo apt install icecc-monitor
$ sudo dnf install icemon
You can compile it from sources.
Even though icecream has good cross-compilation documentation, it was the post written 8 years ago by my Igalia colleague Víctor Jáquez that convinced me to set up icecream as explained in this post.
Hope you find this info as useful as I did :-)
![]() |
|
December 06, 2021 | |
![]() |
On the road to AppStream 1.0, a lot of items from the long todo list have been done so far – only one major feature is remaining, external release descriptions, which is a tricky one to implement and specify. For AppStream 1.0 it needs to be present or be rejected though, as it would be a major change in how release data is handled in AppStream.
Besides 1.0 preparation work, the recent 0.15 release and the releases before it come with their very own large set of changes, that are worth a look and may be interesting for your application to support. But first, for a change that affects the implementation and not the XML format:
Keeping all AppStream data in memory is expensive, especially if the data is huge (as on Debian and Ubuntu with their large repositories generated from desktop-entry files as well) and if processes using AppStream are long-running. The latter is more and more the case, not only does GNOME Software run in the background, KDE uses AppStream in KRunner and Phosh will use it too for reading form factor information. Therefore, AppStream via libappstream provides an on-disk cache that is memory-mapped, so data is only consuming RAM if we are actually doing anything with it.
Previously, AppStream used an LMDB-based cache in the background, with indices for fulltext search and other common search operations. This was a very fast solution, but also came with limitations: LMDB’s maximum key size of 511 bytes became a problem quite often, adjusting the maximum database size (since it has to be set at opening time) was annoyingly tricky, and building dedicated indices for each search operation was very inflexible. In addition to that, the caching code was changed multiple times in the past to allow system-wide metadata to be cached per-user, as some distributions didn’t (want to) build a system-wide cache and therefore ran into performance issues when XML was parsed repeatedly for generation of a temporary cache. In addition to all that, the cache was designed around the concept of “one cache for data from all sources”, which meant that we had to rebuild it entirely if just a small aspect changed, like a MetaInfo file being added to /usr/share/metainfo
, which was very inefficient.
To shorten a long story, the old caching code was rewritten with the new concepts of caches not necessarily being system-wide and caches existing for more fine-grained groups of files in mind. The new caching code uses Richard Hughes’ excellent libxmlb internally for memory-mapped data storage. Unlike LMDB, libxmlb knows about the XML document model, so queries can be much more powerful and we do not need to build indices manually. The library is also already used by GNOME Software and fwupd for parsing of (refined) AppStream metadata, so it works quite well for that usecase. As a result, search queries via libappstream are now a bit slower (very much depends on the query, roughly 20% on average), but can be much more powerful. The caching code is a lot more robust, which should speed up startup time of applications. And in addition to all of that, the AsPool
class has gained a flag to allow it to monitor AppStream source data for changes and refresh the cache fully automatically and transparently in the background.
All software written against the previous version of the libappstream library should continue to work with the new caching code, but to make use of some of the new features, software using it may need adjustments. A lot of methods have been deprecated too now.
Compiling MetaInfo and other metadata into AppStream collection metadata, extracting icons, language information, refining data and caching media is an involved process. The appstream-generator tool does this very well for data from Linux distribution sources, but the tool is also pretty “heavyweight” with lots of knobs to adjust, an underlying database and a complex algorithm for icon extraction. Embedding it into other tools via anything else but its command-line API is also not easy (due to D’s GC initialization, and because it was never written with that feature in mind). Sometimes a simpler tool is all you need, so the libappstream-compose library as well as appstreamcli compose
are being developed at the moment. The library contains building blocks for developing a tool like appstream-generator while the cli tool allows you to simply extract metadata from any directory tree, which can be used by e.g. Flatpak. For this to work well, a lot of appstream-generator’s D code is translated into plain C, so the implementation stays identical but the language changes.
Ultimately, the generator tool will use libappstream-compose for any general data refinement, and only implement things necessary to extract data from the archive of distributions. New applications (e.g. for new bundling systems and other purposes) can then use the same building blocks to implement new data generators similar to appstream-generator with ease, sharing much of the code that would be identical between implementations anyway.
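As a small illustration of the CLI side (the options shown are indicative, check appstreamcli compose --help for the exact set supported by your version):

$ appstreamcli compose --origin=myrepo --result-root=./output ./install-tree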
Want to advertise that your application supports touch input? Keyboard input? Has support for graphics tablets? Gamepads? Sure, nothing is easier than that with the new control
relation item and supports
relation kind (since 0.12.11 / 0.15.0, details):
<supports>
<control>pointing</control>
<control>keyboard</control>
<control>touch</control>
<control>tablet</control>
</supports>
Some applications are unusable below a certain window size, so you do not want to display them in a software center that is running on a device with a small screen, like a phone. In order to encode this information in a flexible way, AppStream now contains a display_length
relation item to require or recommend a minimum (or maximum) display size that the described GUI application can work with. For example:
<requires>
<display_length compare="ge">360</display_length>
</requires>
This will make the application require a display length greater than or equal to 360 logical pixels. A logical pixel (also known as a device-independent pixel) is the amount of pixels that the application can draw in one direction. Since screens, especially phone screens but also screens on a desktop, can be rotated, the display_length
value will be checked against the longest edge of a display by default (by explicitly specifying the shorter edge, this can be changed).
This feature is available since 0.13.0, details. See also Tobias Bernard’s blog entry on this topic.
This is a feature that was originally requested for the LVFS/fwupd, but one of the great things about AppStream is that we can take very project-specific ideas and generalize them so something comes out of them that is useful for many. The new tags
tag allows people to tag components with an arbitrary namespaced string. This can be useful for project-internal organization of applications, as well as to convey certain additional properties to a software center, e.g. an application could mark itself as “featured” in a specific software center only. Metadata generators may also add their own tags to components to improve organization. AppStream gives no recommendations as to how these tags are to be interpreted except for them being a strictly optional feature. So any meaning is something clients and metadata authors need to negotiate. It therefore is a more specialized usecase of the already existing custom
tag, and I expect it to be primarily useful within larger organizations that produce a lot of software components that need sorting. For example:
<tags>
<tag namespace="lvfs">vendor-2021q1</tag>
<tag namespace="plasma">featured</tag>
</tags>
This feature is available since 0.15.0, details.
The MetaInfo Creator (source) tool is a very simple web application that provides you with a form to fill out and will then generate MetaInfo XML to add to your project after you have answered all of its questions. It is an easy way for developers to add the required metadata without having to read the specification or any guides at all.
Recently, I added support for the new control
and display_length
tags, resolved a few minor issues and also added a button to instantly copy the generated output to clipboard so people can paste it into their project. If you want to create a new MetaInfo file, this tool is the best way to do it!
The creator tool will also not transfer any data out of your webbrowser, it is strictly a client-side application.
And that is about it for the most notable changes in AppStream land! Of course there is a lot more, additional tags for the LVFS and content rating have been added, lots of bugs have been squashed, the documentation has been refined a lot and the library has gained a lot of new API to make building software centers easier. Still, there is a lot to do and quite a few open feature requests too. Onwards to 1.0!
![]() |
|
December 02, 2021 | |
![]() |
Khronos submission indicating Vulkan 1.1 conformance for Turnip on Adreno 618 GPU.
It is a great feat, especially for a driver which is created without hardware documentation. And we support features far beyond the bare minimum required for conformance.
But first of all, I want to thank and congratulate everyone working on the driver: Connor Abbott, Rob Clark, Emma Anholt, Jonathan Marek, Hyunjun Ko, Samuel Iglesias. And special thanks to Samuel Iglesias and Ricardo Garcia for tirelessly improving Khronos Vulkan Conformance Tests.
At the start of the year, when I started working on Turnip, I looked at the list of failing tests and thought “It wouldn’t take a lot to fix them!”, right, sure… And so I started fixing issues alongside looking for missing features.
In June there were even more failures than there were in January, how could it be? Of course we were adding new features and it accounted for some of them. However even this list was likely not exhaustive because for gitlab CI instead of running the whole Vulkan CTS suite - we ran 1/3 of it. We didn’t have enough devices to run the whole suite fast enough to make it usable in CI. So I just ran it locally from time to time.
1/3 of the tests doesn’t sound bad and for the most part it’s good enough since we have a huge amount of tests looking like this:
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_copy
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_copy_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_load
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_load_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_texture
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_texture_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_copy
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_copy_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_load
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_load_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_texture
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_texture_format_list
...
Every format, every operation, etc. Tens of thousands of them.
Unfortunately, the selection of tests for a fractional run is as straightforward as possible: just every third test. This bites us whenever there are single, unique tests, like:
dEQP-VK.fragment_operations.early_fragment.no_early_fragment_tests_depth
dEQP-VK.fragment_operations.early_fragment.no_early_fragment_tests_stencil
dEQP-VK.fragment_operations.early_fragment.early_fragment_tests_depth
dEQP-VK.fragment_operations.early_fragment.early_fragment_tests_stencil
dEQP-VK.fragment_operations.early_fragment.no_early_fragment_tests_depth_no_attachment
dEQP-VK.fragment_operations.early_fragment.no_early_fragment_tests_stencil_no_attachment
dEQP-VK.fragment_operations.early_fragment.early_fragment_tests_depth_no_attachment
dEQP-VK.fragment_operations.early_fragment.early_fragment_tests_stencil_no_attachment
...
Most of them test something unique and have a much higher probability of triggering a special path in the driver than the uncountable image tests. And they fell through the cracks. I even had to fix one test twice because the CI didn’t run it.
A possible solution is to skip tests only when there is a large swath of them and run smaller groups as-is. But it’s likely more productive to just throw more hardware at the issue =).
Another problem is that we had only one 6xx sub-generation present in CI: Adreno 630. We distinguish four sub-generations. Not only do they have different capabilities, there are also behavioral differences in the capabilities they share, which can cause the same test to pass in CI and yet be broken on another, newer GPU. Presently in CI we test only Adreno 618 and 630, which are “Gen 1” GPUs, and we claimed conformance only for Adreno 618.
Yet another issue is that we can render in either tiling or bypass (sysmem) mode. That’s because there are a few features we can support only when there is no tiling and we render directly into sysmem, and sometimes rendering directly into sysmem is simply faster. At the moment we use tiling rendering by default unless we hit an edge case, so by default CTS exercises only tiling rendering.
We do force sysmem mode for a subset of tests in CI; however, that’s not enough, because the difference between the modes is relevant for far more than a few tests. Ideally we should run twice as many tests, and even better thrice as many, to also account for tiling mode without a binning vertex shader.
That issue became apparent when I implemented a magical eight-ball to choose between tiling and bypass modes based on run-time information, in order to squeeze out more performance (it’s still work-in-progress). The basic idea is that a single draw call, or a few small draw calls, is faster to render directly into system memory than to load the framebuffer into tile memory and store it back. But almost every single CTS test does exactly that: a single draw call or a few draw calls per render pass, which causes all tests to run in bypass mode. Fun!
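The shape of such a heuristic might look something like the sketch below (purely hypothetical pseudo-logic to illustrate the idea; the actual work-in-progress code in Turnip is more involved and uses different names and thresholds):
#include <stdbool.h>
#include <stdint.h>
/* Purely hypothetical sketch of the idea: if a render pass only contains a
 * handful of small draws, loading the framebuffer into tile memory and
 * storing it back costs more than it saves, so render directly to system
 * memory (bypass) instead. Names and thresholds are made up. */
static bool prefer_sysmem_rendering(uint32_t draw_count,
                                    uint32_t estimated_vertices)
{
    const uint32_t few_draws = 4;          /* made-up threshold */
    const uint32_t small_workload = 1024;  /* made-up threshold */
    return draw_count <= few_draws && estimated_vertices <= small_workload;
}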
Now we would be forced to deal with this issue, since with the magic eight-ball games would run partly in tiling mode and partly in bypass mode, making both modes equally important for real-world workloads.
Unfortunately, no test suite can wholly reflect what game developers do in their games. However, the number of tests keeps growing, and new tests get contributed based on issues found in games and other applications.
When I ran my stash of D3D11 game traces through DXVK on Turnip for the first time, I found a bunch of new crashes and hangs, but it took fixing just a few of them for the majority of games to render correctly. This shows that the Khronos Vulkan Conformance Tests are doing their job, and we at Igalia are striving to make them even better.
One of the extensions released as part of Vulkan 1.2.199 was the VK_EXT_image_view_min_lod extension. I’m happy to see it published, as I participated in the release process of this extension: from reviewing the spec exhaustively (I even contributed a few things to improve it!) to developing CTS tests for it that will eventually be merged into the CTS repo.
This extension was proposed by Valve to mirror a feature present in Direct3D 12 (check ResourceMinLODClamp here) and Direct3D 11 (check SetResourceMinLOD here). In other words, this extension allows clamping the minimum LOD value accessed by an image view to a minLod value set at image view creation time.
That way, any library or API layer that translates Direct3D 11/12 calls to Vulkan can use the extension to mirror the behavior above on Vulkan directly without workarounds, facilitating the port of Direct3D applications such as games to Vulkan. For example, projects like Vkd3d, Vkd3d-proton and DXVK could benefit from it.
Going into more detail, this extension changes how the image level is selected and, when the feature is enabled, sets an additional minimum on the image level used for integer texel coordinate operations.
The way to use this feature in an application is very simple: first, check that the implementation supports it by querying the following feature structure (e.g., through vkGetPhysicalDeviceFeatures2):
// Provided by VK_EXT_image_view_min_lod
typedef struct VkPhysicalDeviceImageViewMinLodFeaturesEXT {
VkStructureType sType;
void* pNext;
VkBool32 minLod;
} VkPhysicalDeviceImageViewMinLodFeaturesEXT;
Once you know everything is working, enable both the extension and the feature when creating the device.
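For illustration, that could look roughly like this (a sketch, not code from the post; physical_device comes from earlier setup, and the queue create infos and error handling are omitted):
/* Hypothetical sketch: enable the extension and the feature at device
 * creation, assuming the query above reported support. Queue setup,
 * layers and error handling are omitted. */
VkPhysicalDeviceImageViewMinLodFeaturesEXT min_lod_features = {
    .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_IMAGE_VIEW_MIN_LOD_FEATURES_EXT,
    .minLod = VK_TRUE,
};
const char *extensions[] = { VK_EXT_IMAGE_VIEW_MIN_LOD_EXTENSION_NAME };
VkDeviceCreateInfo device_info = {
    .sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,
    .pNext = &min_lod_features,
    .enabledExtensionCount = 1,
    .ppEnabledExtensionNames = extensions,
    /* .pQueueCreateInfos / .queueCreateInfoCount as usual */
};
VkDevice device;
vkCreateDevice(physical_device, &device_info, NULL, &device);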
When you want to create a VkImageView that defines a minLod for image accesses, add the following structure, filled with the value you want, to VkImageViewCreateInfo’s pNext chain.
// Provided by VK_EXT_image_view_min_lod
typedef struct VkImageViewMinLodCreateInfoEXT {
VkStructureType sType;
const void* pNext;
float minLod;
} VkImageViewMinLodCreateInfoEXT;
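For example, something along these lines (a sketch with arbitrary values; device, image and the remaining view parameters are assumed to come from earlier code):
/* Sketch: clamp all accesses through this view to mip level 2.0 and up.
 * The 2.0 value is arbitrary, just for illustration. */
VkImageViewMinLodCreateInfoEXT min_lod_info = {
    .sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_MIN_LOD_CREATE_INFO_EXT,
    .minLod = 2.0f,
};
VkImageViewCreateInfo view_info = {
    .sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO,
    .pNext = &min_lod_info,
    .image = image,
    .viewType = VK_IMAGE_VIEW_TYPE_2D,
    .format = VK_FORMAT_R8G8B8A8_UNORM,
    .subresourceRange = {
        .aspectMask = VK_IMAGE_ASPECT_COLOR_BIT,
        .levelCount = VK_REMAINING_MIP_LEVELS,
        .layerCount = 1,
    },
};
VkImageView view;
vkCreateImageView(device, &view_info, NULL, &view);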
And that’s all! As you see, it is a very simple extension.
Happy hacking!
![]() |
|
November 24, 2021 | |
![]() |
I was interested in how much work a vaapi on top of vulkan video proof of concept would be.
My main reason for being interested is actually video encoding, there is no good vulkan video encoding demo yet, and I'm not experienced enough in the area to write one, but I can hack stuff. I think it is probably easier to hack a vaapi encode to vulkan video encode than write a demo app myself.
With that in mind I decided to see what decode would look like first. I talked to Mike B (most famous zink author) before he left for holidays, then I ignored everything he told me and wrote a super hack.
This morning I convinced zink vaapi on top of anv, with iris GL doing the presents in mpv, to show me some useful frames of video. However zink vaapi on anv with zink GL is failing miserably (well, green jellyfish).
I'm not sure how much more I'll push on the decode side at this stage; I really wanted it to validate the driver-side code, and I've found a few bugs in there already.
The WIP hacks are at [1]. I might push on to the encode side and see if I can work out what it entails, though the encode spec work is a lot more changeable at the moment.
[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/zink-video-wip
![]() |
|
November 19, 2021 | |
![]() |
Yes, we’ve finally reached that time. It’s mid-November, and I’ve been storing up all this random stuff to unveil now because I’m taking the rest of the year off.
This will be the final SGC post for 2021. As such, it has to be a good one, doesn’t it?
It’s been a wild year for zink. Does anybody even remember how many times I finished the project? I don’t, but it’s been at least a couple. Somehow there’s still more to do though.
I’ll be updating zink-wip one final time later today with the latest Copper snapshot. This is going to be crashier than the usual zink-wip, but that’s because zink-wip doesn’t have nearly as much cool future-zink stuff as it used to these days. Nearly everything is already merged into mainline, or at least everything that’s likely to help with general use, so just use that if you aren’t specifically trying to test out Copper.
One of those things that’s been a thorn in zink’s side for a long time is PBO handling, specifically for unsupported formats like ARGB/ABGR, ALPHA, LUMINANCE, and InTeNsItY. Vulkan has no analogs for any of these, and any app/game which tries to do texture upload or download from them with zink is going to have a very, very bad time, as has been the case with CS:GO, which would take literal days to reach menus due to performing fullscreen GL_LUMINANCE texture downloads.
This is now fixed in the course of landing compute PBO download support, which I blogged about forever ago since it also yields a 2-10x performance improvement for a number of other cases in all Gallium drivers. Or at least the ones that enable it.
CS:GO should now run out of the box in Mesa 22.0, and things like RPCS3 which do a lot of PBO downloading should also see huge improvements.
That’s all I’ve got here for zink, so now it’s time once again…
That’s right, it’s happening. Change your hats, we’re a Gallium blog again for the first time in nearly five months.
Everyone remembers when I promised that you’d be able to run native Linux D3D9 games on the Nine state tracker. Well, I suppose that’s a fancy way of saying Source Engine games, aka the ones Valve ships with native Linux ports, since probably nobody else has shipped any kind of native Linux app that uses the D3D9 API, but still!
That time is now.
Right now.
No more waiting, no new Mesa release required, you can just plug it in and test it out this second for instantly improved performance.
As long as you first acknowledge that this is not a Valve-official project, and it’s only to be used for educational purposes.
But also, please benchmark it lots and tell me your findings. Again, just for educational purposes. Wink.
This has been a long time in the making. After the original post, I knew that the goal here was to eventually be able to run these games without needing any kind of specialized Mesa build, since that’s annoying and also breaks compatibility with running Nine for other purposes.
Thus I enlisted the further help of Nine expert and image enthusiast, Axel Davy, to help smooth out the rough edges once I was done fingerpainting my way to victory.
The result is a simple wrapper which can be preloaded to run any DXVK-compatible Source Engine game (i.e., any of them that support -vulkan) on Nine. Obviously this won’t work on the NVIDIA blob at all, so don’t bother trying.
In short: open the Properties for e.g., Left 4 Dead 2 and set the launch options to:
LD_PRELOAD=/path/to/Xnine/nine_sdl.so %command% -vulkan
For Portal 2 (at present, though this won’t always be the case), you’ll also need to add NINE_VHACKS=1 to work around some frogs that were accidentally added to the latest version of the game as a developer-only easter egg.
Then just run the game normally, and if everything went right and you have Nine installed in one of the usual places, you should load up the game with Gallium Nine. More details on that in the repo’s README.
Yes. Very brrr.
Here’s your normal GL performance from a simple Portal 2 benchmark:
Around 400 FPS.
Here’s Gallium Nine:
Around 600 FPS.
A 50% improvement with the exact same backend GPU driver isn’t too bad for a simple preload shim.
You got it.
What about DXVK?
This isn’t an extensive benchmark, but here we go with that too:
Also around 600 FPS.
I say “around” here because the variation is quite extreme for both Nine and DXVK, based on slight changes in the variable clock speeds (which I didn’t pin): Nine ranges between 590-610 FPS, and DXVK between 590-620 FPS.
So now there’s two solid, open source methods for improving performance in these games over the normal GL version. But what if we go even deeper?
What if we check out some real performance numbers?
If you’ve never checked out PowerTOP, it’s a nice way to get an overview of what’s using up system resources and consuming power.
If you’ve never used it for benchmarking, don’t worry, I took care of that too.
Here’s some PowerTOP figures for the same Portal 2 timedemo:
What’s interesting here is that DXVK uses 90%+ CPU, while Nine is only using about 25%. This is a gap that’s consistent across runs, and it likely explains why a number of people find that DXVK doesn’t work on their systems: you still need some amount of CPU to run the actual game calculations, so if you’re on older hardware, you might end up using all of your available CPU just on DXVK internals.
Got you covered. Here’s a per-second poll (one row per second) from radeontop.
DXVK:
GPU Usage | VGT Usage | TA Usage | SX Usage | SH Usage | SPI Usage | SC Usage | PA Usage | DB Usage | CB Usage | VRAM Usage | GTT Usage | Memory Clock | Shader Clock |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
35.83% | 17.50% | 23.33% | 28.33% | 17.50% | 29.17% | 28.33% | 5.00% | 27.50% | 26.67% | 12.75% 1038.15mb | 7.82% 638.53mb | 52.19% 0.457ghz | 33.52% 0.704ghz |
35.83% | 17.50% | 23.33% | 28.33% | 17.50% | 29.17% | 28.33% | 5.00% | 27.50% | 26.67% | 12.75% 1038.15mb | 7.82% 638.53mb | 52.19% 0.457ghz | 33.52% 0.704ghz |
36.67% | 30.00% | 33.33% | 35.00% | 30.00% | 35.00% | 32.50% | 18.33% | 30.83% | 28.33% | 12.76% 1038.57mb | 7.82% 638.53mb | 48.88% 0.428ghz | 36.95% 0.776ghz |
75.83% | 63.33% | 62.50% | 66.67% | 63.33% | 68.33% | 65.00% | 27.50% | 60.83% | 53.33% | 12.76% 1038.73mb | 7.82% 638.53mb | 100.00% 0.875ghz | 95.82% 2.012ghz |
71.67% | 60.00% | 60.00% | 64.17% | 60.00% | 66.67% | 60.83% | 23.33% | 56.67% | 51.67% | 12.76% 1038.73mb | 7.82% 638.53mb | 100.00% 0.875ghz | 96.31% 2.023ghz |
75.00% | 62.50% | 66.67% | 66.67% | 62.50% | 69.17% | 68.33% | 23.33% | 65.83% | 59.17% | 12.76% 1038.73mb | 7.82% 638.53mb | 100.00% 0.875ghz | 96.71% 2.031ghz |
63.33% | 55.00% | 56.67% | 58.33% | 55.00% | 59.17% | 59.17% | 17.50% | 52.50% | 50.00% | 12.76% 1038.73mb | 7.82% 638.53mb | 100.00% 0.875ghz | 89.77% 1.885ghz |
78.33% | 64.17% | 64.17% | 65.00% | 64.17% | 69.17% | 70.83% | 30.00% | 63.33% | 58.33% | 12.76% 1038.73mb | 7.82% 638.53mb | 100.00% 0.875ghz | 97.33% 2.044ghz |
73.33% | 60.83% | 64.17% | 65.00% | 60.83% | 67.50% | 64.17% | 29.17% | 59.17% | 51.67% | 12.76% 1038.73mb | 7.82% 638.53mb | 100.00% 0.875ghz | 97.39% 2.045ghz |
60.83% | 50.83% | 50.00% | 53.33% | 50.83% | 55.00% | 50.83% | 25.83% | 48.33% | 45.00% | 12.76% 1038.73mb | 7.82% 638.53mb | 100.00% 0.875ghz | 95.35% 2.002ghz |
67.50% | 50.00% | 55.00% | 59.17% | 50.00% | 60.00% | 58.33% | 28.33% | 52.50% | 45.00% | 12.76% 1038.73mb | 7.82% 638.53mb | 100.00% 0.875ghz | 87.91% 1.846ghz |
Nine:
GPU Usage | VGT Usage | TA Usage | SX Usage | SH Usage | SPI Usage | SC Usage | PA Usage | DB Usage | CB Usage | VRAM Usage | GTT Usage | Memory Clock | Shader Clock |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
17.50% | 11.67% | 15.00% | 10.83% | 11.67% | 15.00% | 10.83% | 3.33% | 10.83% | 10.00% | 7.38% 600.56mb | 1.60% 130.48mb | 50.38% 0.441ghz | 15.76% 0.331ghz |
17.50% | 11.67% | 15.00% | 10.83% | 11.67% | 15.00% | 10.83% | 3.33% | 10.83% | 10.00% | 7.38% 600.56mb | 1.60% 130.48mb | 50.38% 0.441ghz | 15.76% 0.331ghz |
70.83% | 63.33% | 65.83% | 60.00% | 63.33% | 68.33% | 57.50% | 24.17% | 56.67% | 54.17% | 7.35% 598.43mb | 1.60% 130.48mb | 89.50% 0.783ghz | 77.09% 1.619ghz |
74.17% | 70.00% | 67.50% | 60.00% | 70.00% | 70.83% | 61.67% | 17.50% | 60.83% | 58.33% | 7.35% 598.42mb | 1.60% 130.47mb | 100.00% 0.875ghz | 91.03% 1.912ghz |
78.33% | 69.17% | 72.50% | 65.00% | 69.17% | 75.83% | 65.83% | 15.00% | 65.83% | 64.17% | 7.37% 599.80mb | 1.60% 130.47mb | 100.00% 0.875ghz | 93.92% 1.972ghz |
70.83% | 67.50% | 64.17% | 55.00% | 67.50% | 67.50% | 57.50% | 20.83% | 55.83% | 53.33% | 7.35% 598.42mb | 1.60% 130.47mb | 100.00% 0.875ghz | 91.93% 1.930ghz |
65.00% | 64.17% | 60.00% | 51.67% | 64.17% | 61.67% | 53.33% | 18.33% | 52.50% | 50.83% | 7.37% 599.80mb | 1.60% 130.47mb | 100.00% 0.875ghz | 89.95% 1.889ghz |
74.17% | 68.33% | 70.00% | 60.83% | 68.33% | 72.50% | 65.00% | 24.17% | 64.17% | 58.33% | 7.35% 598.42mb | 1.60% 130.47mb | 100.00% 0.875ghz | 92.53% 1.943ghz |
77.50% | 73.33% | 73.33% | 62.50% | 73.33% | 75.00% | 61.67% | 22.50% | 62.50% | 57.50% | 7.35% 598.42mb | 1.60% 130.47mb | 100.00% 0.875ghz | 91.21% 1.915ghz |
70.00% | 65.83% | 60.00% | 57.50% | 65.83% | 61.67% | 59.17% | 24.17% | 55.00% | 54.17% | 7.35% 598.42mb | 1.60% 130.47mb | 100.00% 0.875ghz | 92.69% 1.946ghz |
70.00% | 65.83% | 60.00% | 57.50% | 65.83% | 61.67% | 59.17% | 24.17% | 55.00% | 54.17% | 7.35% 598.42mb | 1.60% 130.47mb | 100.00% 0.875ghz | 92.69% 1.946ghz |
Again, here we see a number of interesting things. DXVK consistently provokes slightly higher clock speeds (because I didn’t pin them), which may explain why it skews slightly higher in the benchmark results. DXVK also uses nearly 2x more VRAM and nearly 5x more GTT. On more modern hardware it’s unlikely that this would matter since we all have more GPU memory than we can possibly use in an OpenGL game, but on older hardware—or in cases where memory usage might lead to power consumption that should be avoided because we’re running on battery—this could end up being significant.
Source Engine games run great on Linux. That’s what we all care about at the end of the day, isn’t it?
But also, if more Source Engine games get ported to DXVK, give them a try with Nine. Or just test the currently ported ones, Portal 2 and Left 4 Dead 2.
I want data.
Lots of data.
Post it here, email it to me, whatever.
Lots of cool projects still in the works, so stay tuned next year!
![]() |
|
November 18, 2021 | |
![]() |
If you own a laptop (Dell, HP, Lenovo) with a WWAN module, it is very likely that the module is FCC-locked on every boot, and that the special FCC unlock procedure needs to be run before it can be used.
Until ModemManager 1.18.2, the known FCC unlock procedures were run automatically, but this will no longer happen. Once 1.18.4 is out, the procedure will need to be explicitly enabled by each user, under their own responsibility, or otherwise implicitly enabled after installing an official FCC unlock tool provided by the manufacturer itself.
See a full description of the rationale behind this change in the ModemManager documentation site and the suggested code changes in the gitlab merge request.
If you want to enable the ModemManager provided unofficial FCC unlock tools once you have installed 1.18.4, run (assuming sysconfdir=/etc and datadir=/usr/share) this command (*):
sudo ln -sft /etc/ModemManager/fcc-unlock.d /usr/share/ModemManager/fcc-unlock.available.d/*
The user-enabled tools in /etc should not be removed during package upgrades, so this should be a one-time setup.
(*) Updated to have one single command instead of a for loop; thanks heftig!
![]() |
|
November 17, 2021 | |
![]() |
In an earlier post I talked about Copper and what it could do on the way to a zink future.
What I didn’t talk about was WSI, or the fact that I’ve already fully implemented it in the course of bashing Copper into a functional state.
…was the final step for zink to become truly usable.
At present, zink has a very hacky architecture where it loads through the regular driver path, but then for every image that is presented on the screen, it keeps a shadow copy which it blits to just before scanout, and this is the one that gets displayed.
Usually this works great other than the obvious (but minor) overhead that the blit incurs.
Where it doesn’t work great, however, is on non-Mesa drivers.
That’s right. I’m looking at you, NVIDIA.
As long-time blog enthusiasts will remember, I had NVIDIA running on zink some time ago, but there was a problem as it related to performance. Specifically, that single blit turned into a blit and then a full-frame CPU copy, which made getting any sort of game running with usable FPS a bit of a challenge.
WSI solves this by letting the Vulkan driver handle the scanout image entirely, removing all the copies to let zink render more like a normal driver (or game/app).
That’s what everyone’s probably wondering. I have zink. I have WSI. I have my RTX2070 with the NVIDIA blob driver.
How does NVIDIA’s Vulkan driver (with zink) stack up to NVIDIA’s GL driver?
Everything below is using the 495.44 beta driver, as that’s the latest one at the time of my testing, and the non-beta driver didn’t work at all.
But first, can NVIDIA’s GL driver even render the game I want to show?
Confusingly, the answer is no: this version of NVIDIA’s GL driver can’t correctly render Tomb Raider, which is my go-to for all things GL and benchmarking. I’m gonna let that slide though since it’s still pumping out those frames at a solid rate.
It’s frustrating, but sometimes just passing CTS isn’t enough to be able to run some games, or there are certain extensions (ARB_bindless_texture) which are under-covered.
I’ll say as a prelude that it was a bit challenging to get a AAA game running in this environment. There are some very strange issues happening with the NVIDIA Vulkan driver which prevented me from running quite a lot of things. Tomb Raider was the first one I got going after two full days of hacking at it, and that’s about what my time budget allowed for the excursion, so I stopped at that.
Up first: NVIDIA’s GL driver (495.44)
Second: NVIDIA’s Vulkan driver (495.44)
As we can see, zink with NVIDIA’s Vulkan driver is roughly 25-30% faster than NVIDIA’s GL driver for Tomb Raider.
I doubt that zink maintains this performance gap for all titles, but now we know that there are already at least some cases where it can pull ahead. Given that most vendors are shifting resources towards current-year graphics APIs like Vulkan and D3D12, it won’t be surprising if maintenance-mode GL drivers start to fall behind actively developed Vulkan drivers.
In short, there’s a real possibility that zink can provide tangible benefits to vendors who only want to ship Vulkan drivers, and those benefits might be more than (eventually) providing a conformant GL implementation.
Stay tuned for tomorrow when I close out the week strong with one final announcement for the year.
![]() |
|
November 15, 2021 | |
![]() |
Previously I mentioned having AMD VCN h264 support. Today I added initial support for the older UVD engine[1]. This is found on chips from Vega back to SI.
I've only tested it on my Vega so far.
I also worked out the "correct" answer to how to send the reset command properly; however the nvidia player I'm using as a demo doesn't do things that way yet, so I've forked it for now[2].
The answer is to use vkCmdControlVideoCodingKHR to send a reset the first time a session is used. However I can't see how the app is meant to know this is necessary, but I've asked the appropriate people.
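As a rough sketch of what that looks like in API terms (using the VK_KHR_video_queue names as they were later released; the spec was still provisional at the time, so details may differ, and the surrounding begin/end coding scope setup is elided):
/* Recorded inside the vkCmdBeginVideoCodingKHR / vkCmdEndVideoCodingKHR
 * scope, the first time the session is used. `cmd` is the command buffer
 * being recorded. */
VkVideoCodingControlInfoKHR reset_info = {
    .sType = VK_STRUCTURE_TYPE_VIDEO_CODING_CONTROL_INFO_KHR,
    .flags = VK_VIDEO_CODING_CONTROL_RESET_BIT_KHR,
};
vkCmdControlVideoCodingKHR(cmd, &reset_info);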
The initial anv branch I mentioned last week is now here[3].
[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-uvd-h264
[2] https://github.com/airlied/vk_video_samples/tree/radv-fixes
[3] https://gitlab.freedesktop.org/airlied/mesa/-/tree/anv-vulkan-video-prelim-decode
Over the past months, I’ve been working with Adam “X Does What I Say” Jackson to try and improve zink’s path through the arcane system of chutes and ladders that comprises Gallium’s loader architecture. The recent victory in getting a Wayland system compositor running is the culmination of those efforts.
I wanted to write at least a short blog post detailing some of the Gallium changes that were necessary to make this happen, if only so I have something to refer back to when I inevitably break things later, so let’s dig in.
It’s questionable to me whether anyone really knows how all the Gallium loader and DRI frontend interfaces work without first taking a deep read of the code and then having a nice, refreshing run around the block screaming to let all the crazy out. From what I understand of it, there’s the DRI (userspace) interface, which is used by EGL/GBM/GLX/SMH to manage buffers for scanout. DRI itself is split between software and platform; each DRI interface is a composite made of all the “extensions” which provide additional functionality to enable various API-level extensions.
It’s a real disaster to have to work with, and ideally the eventual removal of classic drivers will allow it to be simplified so that mere humans like me can comprehend its majesty.
Beyond all this, however, there’s the notion that the DRI frontend is responsible for determining the size of the scanout buffer as well as various other attributes. The software path through this is nice and simple since there’s no hardware to negotiate with, and the platform path exists.
Currently, zink runs on the platform path, which means that the DRI frontend is what “runs” zink. It chooses the framebuffer size, manages resizes, and handles multisample resolve blits as needed for every frame that gets rendered.
The problem with this methodology is that there are effectively two WSI systems active simultaneously: the Mesa DRI architecture, and the (eventual) Vulkan WSI infrastructure. Vulkan WSI isn’t going to work at all if it isn’t in charge of deciding things like window size, which means that the existing DRI architecture can’t work, neither in platform mode nor in software mode.
As we know, there can be only one.
Thus Adam has been toiling away behind the scenes, taking neither vacation nor lunch break for the past ten years in order to iterate on a more optimal solution.
The result?
If you’re a Mesa developer or just a metallurgist, you know why the name Copper was chosen.
The premise of Copper is that it’s a DRI interface extension which can be used exclusively by zink to avoid any of the problem areas previously mentioned. The application will create a window, create a GL context for it, and (eventually) Vulkan WSI can figure things out by just having the window/surface passed through. This shifts all the “driving” WSI code out of DRI and into Vulkan WSI, which is much more natural.
In addition to Copper, zink can now be bound to a slight variation of the Gallium software loader to skip all the driver querying bits. There’s no longer anything to query, as DRI doesn’t have to make decisions anymore. It just calls through to zink normally, and zink can handle everything using the Vulkan API.
Simple and clean.
This all requires a ton of code. Looking at the two largest commits:
29 files changed, 1413 insertions(+), 540 deletions(-)
23 files changed, 834 insertions(+), 206 deletions(-)
Is a big yikes.
I can say with certainty that these improvements won’t be landing before 2022, but eventually they will in one form or another, and then zink will become significantly more flexible.
![]() |
|
November 12, 2021 | |
![]() |
Last week I mentioned I had the basics of h264 decode using the proposed vulkan video on radv. This week I attempted to do the same thing with Intel's Mesa vulkan driver "anv".
Now I'd previously unsuccessfully tried to get vaapi on crocus working but got sidetracked back into other projects. The Intel h264 decoder hasn't changed a lot between ivb/hsw/gen8/gen9 era. I ported what I had from crocus to anv and started trying to get something to decode on my WhiskeyLake.
I wrote the code pretty early on, figured out all the things I had to send the hardware.
The first bridge to cross on the anv side was that Vulkan exposes an H264 picture-level decode API, which means you get handed the encoded slice data. However, to program the Intel hw you need to decode the slice header, so I wrote a slice header decoder in some common code. The other thing you need to give the Intel hw is the number of bits of slice header, which in some encoding schemes is rounded to bytes and in some isn't. Slice headers also have a 3-byte header on them, which the Intel hardware wants you to discard or skip before handing the data to it.
Once I'd fixed up that sort of thing in anv + crocus, I started getting grey I-frames decoded, with later B/P frames using the grey frames as references, so you'd see this kind of weird motion.
That was, I think, 3 days ago. I have stared at this intently for those 3 days, blaming everything from bitstream encoding to rechecking all my packets (not enough times though). I had someone else verify they could see grey frames.
Today, after a long discussion about possibilities, I was randomly comparing a frame from the intel-vaapi-driver and one from crocus, and I spotted a packet header that the docs say is 34 dwords long, but intel-vaapi was only encoding 16 dwords. I switched crocus to explicitly state a 16-dword length and started seeing my I-frames.
Now the B/P frames still have issues. I don't think I'm getting the ref frames logic right yet, but it felt like a decent win after 3 days of staring at it.
The crocus code is [1]. The anv code isn't cleaned up enough to post a pointer to yet, enterprising people might find it. Next week I'll clean it all up, and then start to ponder upstream paths and shared code for radv + anv. Then h265 maybe.
[1]https://gitlab.freedesktop.org/airlied/mesa/-/tree/crocus-media-wip
Zink can now run all display platform flavors of Weston (and possibly other compositors?). Expect it in zink-wip later today once it passes another round of my local CI.
Here it is in DRM running weston-simple-egl and weston-simple-dmabuf-egl, all on zink:
This has a lot of rough edges, mostly as it relates to X11. I’d go into details on this, but honestly it’s going to be like a week of posts to detail the sheer amount of chainsawing that’s gone into the project.
Stay tuned for that and more next week.
![]() |
|
November 11, 2021 | |
![]() |
That the one true benchmark for graphics is glxgears. It’s been the standard for 20+ years, and it’s going to remain the standard for a long time to come.
Zink has gone through a couple phases of glxgears performance.
Everyone remembers weird glxgears that was getting illegal amounts of frames due to its misrendering:
We salute you, old friend.
Now, however, some number of you have become aware of the new threat posed by heavy gears in the Mesa 21.3 release. Whereas glxgears is usually a lightweight, performant benchmarking tool, heavy gears is the opposite, chugging away at up to 20% of a single CPU core with none of the accompanying performance.
Terrifying.
The answer won’t surprise you: GL_QUADS.
Indeed, because zink is a driver relying on the Vulkan API, only the primitive types supported by Vulkan can be directly drawn. This means any app using GL_QUADS is going to have a very bad time.
glxgears is exactly one of these apps, and (now that there’s a ticket open) I was forced to take action.
The root of the problem here is that gears passes its vertices into GL to be drawn as a rectangle, but zink can only draw triangles. This (currently) results in doing a very non-performant readback of the index buffer before every draw call to convert the draw to a triangle-based one.
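Concretely, the conversion itself is simple index arithmetic: each quad (v0, v1, v2, v3) becomes the two triangles (v0, v1, v2) and (v0, v2, v3). A minimal sketch of the idea in C (an illustration only, not zink’s actual code):
#include <stdint.h>
/* Expand a direct (non-indexed) GL_QUADS draw of `vertex_count` vertices
 * into a triangle index list. Quad i, covering vertices 4i..4i+3, becomes
 * the triangles (4i, 4i+1, 4i+2) and (4i, 4i+2, 4i+3). `out` must have
 * room for (vertex_count / 4) * 6 indices. */
static void quads_to_triangles(uint32_t vertex_count, uint32_t *out)
{
    for (uint32_t i = 0; i < vertex_count / 4; i++) {
        uint32_t base = i * 4;
        *out++ = base + 0;
        *out++ = base + 1;
        *out++ = base + 2;
        *out++ = base + 0;
        *out++ = base + 2;
        *out++ = base + 3;
    }
}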
A smart person might say “Why not just convert the vertices to triangles as you get them instead of waiting until they’re in the buffer?”
Thankfully, a smart person did say that and then do the accompanying work. The result is that finally, after all these years, zink can actually perform well in a real benchmark:
Stay tuned for more exciting zink updates. You won’t want to miss them.
![]() |
|
November 10, 2021 | |
![]() |
In the course of working on more CI-related things for zink, I came across a series of troublesome tests (KHR-GL46.geometry_shader.rendering.rendering.triangles_*) that triggered a severe performance issue. Specifically, the LLVM optimizer spends absolute ages trying to optimize ubershaders like this one used in the tests:
#version 440
in vec4 position;
out vec4 vs_gs_color;
uniform bool is_lines_output;
uniform bool is_indexed_draw_call;
uniform bool is_instanced_draw_call;
uniform bool is_points_output;
uniform bool is_triangle_fan_input;
uniform bool is_triangle_strip_input;
uniform bool is_triangle_strip_adjacency_input;
uniform bool is_triangles_adjacency_input;
uniform bool is_triangles_input;
uniform bool is_triangles_output;
uniform ivec2 renderingTargetSize;
uniform ivec2 singleRenderingTargetSize;
void main()
{
gl_Position = position + vec4(float(gl_InstanceID) ) * vec4(0, float(singleRenderingTargetSize.y) / float(renderingTargetSize.y), 0, 0) * vec4(2.0);
vs_gs_color = vec4(1.0, 0.0, 0.0, 0.0);
if (is_lines_output)
{
if (!is_indexed_draw_call)
{
if (is_triangle_fan_input)
{
switch(gl_VertexID)
{
case 0: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 1:
case 5: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
case 2: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
case 3: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
case 4: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
}
}
else
if (is_triangle_strip_input)
{
switch(gl_VertexID)
{
case 1:
case 6: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
case 0:
case 4: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 2: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
case 3: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
case 5: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
}
}
else
if (is_triangle_strip_adjacency_input)
{
switch(gl_VertexID)
{
case 2:
case 12: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
case 0:
case 8: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 4: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
case 6: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
case 10: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
}
}
else
if (is_triangles_input)
{
switch(gl_VertexID)
{
case 0: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 1: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
case 2: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
case 3: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 4: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
case 5: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
case 6: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 7: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
case 8: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
case 9: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 10: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
case 11: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
}
}
else
if (is_triangles_adjacency_input)
{
vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0);
switch(gl_VertexID)
{
case 0: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 2: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
case 4: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
case 6: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 8: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
case 10: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
case 12: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 14: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
case 16: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
case 18: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 20: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
case 22: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
}
}
}
else
{
if (is_triangles_input)
{
switch(gl_VertexID)
{
case 11: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 10: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
case 9: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
case 8: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 7: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
case 6: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
case 5: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 4: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
case 3: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
case 2: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 1: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
case 0: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
}
}
else
if (is_triangle_fan_input)
{
switch(gl_VertexID)
{
case 5: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 4:
case 0: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
case 3: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
case 2: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
case 1: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
}
}
else
if (is_triangle_strip_input)
{
switch (gl_VertexID)
{
case 5:
case 0: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
case 6:
case 2: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 4: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
case 3: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
case 1: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
}
}
else
if (is_triangle_strip_adjacency_input)
{
switch(gl_VertexID)
{
case 11:
case 1: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
case 13:
case 5: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 9: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
case 7: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
case 3: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
}
}
else
if (is_triangles_adjacency_input)
{
vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0);
switch(gl_VertexID)
{
case 23: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 21: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
case 19: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
case 17: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 15: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
case 13: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
case 11: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 9: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
case 7: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
case 5: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 3: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
case 1: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
}
}
}
}
else
if (is_points_output)
{
if (!is_indexed_draw_call)
{
if (is_triangles_adjacency_input)
{
vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0);
switch (gl_VertexID)
{
case 0:
case 6:
case 12:
case 18: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
case 2:
case 22: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
case 4:
case 8: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 10:
case 14: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
case 16:
case 20: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
}
}
else
if (is_triangle_fan_input)
{
switch(gl_VertexID)
{
case 0: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
case 1:
case 5: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
case 2: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 3: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
case 4: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
}
}
else
if (is_triangle_strip_input)
{
switch (gl_VertexID)
{
case 1:
case 4: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
case 0:
case 6: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
case 2: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 3: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
case 5: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
}
}
else
if (is_triangle_strip_adjacency_input)
{
switch (gl_VertexID)
{
case 2:
case 8: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
case 0:
case 12: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
case 4: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 6: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
case 10: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
}
}
else
if (is_triangles_input)
{
switch (gl_VertexID)
{
case 0:
case 3:
case 6:
case 9: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
case 1:
case 11: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
case 2:
case 4: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 5:
case 7: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
case 8:
case 10: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
}
}
}
else
{
if (is_triangle_fan_input)
{
switch (gl_VertexID)
{
case 5: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
case 4:
case 0: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
case 3: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 2: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
case 1: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
}
}
else
if (is_triangle_strip_input)
{
switch (gl_VertexID)
{
case 5:
case 2: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
case 6:
case 0: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
case 4: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 3: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
case 1: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
}
}
else
if (is_triangle_strip_adjacency_input)
{
switch (gl_VertexID)
{
case 11:
case 5: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
case 13:
case 1: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
case 9: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 7: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
case 3: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
}
}
else
if (is_triangles_adjacency_input)
{
vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0);
switch (gl_VertexID)
{
case 23:
case 17:
case 11:
case 5: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
case 21:
case 1: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
case 19:
case 15: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 13:
case 9: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
case 7:
case 3: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
}
}
else
if (is_triangles_input)
{
switch (gl_VertexID)
{
case 11:
case 8:
case 5:
case 2: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
case 10:
case 0: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
case 9:
case 7: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
case 6:
case 4: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
case 3:
case 1: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
}
}
}
}
else
if (is_triangles_output)
{
int vertex_id = 0;
if (!is_indexed_draw_call && is_triangles_adjacency_input && (gl_VertexID % 2 == 0) )
{
vertex_id = gl_VertexID / 2 + 1;
}
else
{
vertex_id = gl_VertexID + 1;
}
vs_gs_color = vec4(float(vertex_id) / 48.0, float(vertex_id % 3) / 2.0, float(vertex_id % 4) / 3.0, float(vertex_id % 5) / 4.0);
}
}
By ages I mean upwards of 10 minutes per test.
Yikes.
Fortunately, zink already has tools to combat exactly this problem: ZINK_INLINE_UNIFORMS.
This feature analyzes shaders to determine if inlining uniform values will be beneficial, and if so, it rewrites the shader with the uniform values as constants rather than loads. This brings the resulting NIR for the shader from 4000+ lines down to just under 300. The tests all become near-instant to run as well.
Uniform inlining has been in zink for a while, but it’s been disabled by default (except on zink-wip for testing) as this isn’t a feature that’s typically desirable when running apps/games; every time the uniforms are updated, a new shader must be compiled, and this causes (even more) stuttering, making games on zink (even more) unplayable.
On CPU-based drivers like lavapipe, however, the time to compile a shader is usually less than the time to actually run a shader, so the trade-off becomes worth doing.
Stay tuned for exciting announcements in the next few days.
![]() |
|
November 05, 2021 | |
![]() |
A few weeks ago I watched Victor's excellent talk on Vulkan Video. This made me question my skills in this area. I'm pretty vague on video processing hardware, and I really have no understanding of H264 or any of the other standards. I've been loosely following the Vulkan video group inside Khronos, but I can't say I've understood it or been useful.
radeonsi has a gallium vaapi driver that talks to a firmware encode/decode engine on the hardware; surely copying what it programs can't be that hard. I got an mpv/vaapi setup running and tested some videos on that setup just to get comfortable, and looked at what sort of data was being pushed about.
The thing is, the firmware is doing all the work here; the driver is mostly just responsible for taking semi-parsed h264 bitstream data structures and handing them, in memory buffers, to the firmware API. The resulting decoded image should then magically appear in a buffer.
I then got the demo nvidia video decoder application mentioned in Victor's talk.
I ported the code to radv in a couple of days, but then began a long journey into the unknown. The firmware is quite particular about exactly what it wants and when it wants it. After fixing some interactions with the video player, I started to dig.
Now, vaapi and DXVA (on Windows) are context-based APIs. This means they are like OpenGL: you create a context, do a bunch of work, and tear it down, and the driver does all the hw queuing of commands internally. All the state is held in the context. Vulkan is a command-buffer-based API: the application records command buffers and then enqueues those command buffers to the hardware itself.
So, for a video, the vaapi driver works like this:
create hw ctx, flush, decode, flush, decode, flush, decode, flush, decode, flush, destroy hw ctx, flush
However Vulkan wants things to be more like
Create Session, record command buffer with (begin, decode, end), send to hw, (begin, decode, end), send to hw, End Session
There is no way at the Create/End session time to submit things to the hardware.
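To make the contrast concrete, a decode pass in this model looks roughly like the sketch below (a hand-wavy illustration using the VK_KHR_video_queue / VK_KHR_video_decode_queue names as they were eventually released; the provisional spec this was written against differed in details, and session setup, DPB/reference slots, codec-specific pNext structs and buffer offsets are all elided):
/* Record one decode operation into a regular command buffer `cmd`.
 * `session` and `bitstream_buffer` are assumed to have been created
 * earlier. */
VkVideoBeginCodingInfoKHR begin_info = {
    .sType = VK_STRUCTURE_TYPE_VIDEO_BEGIN_CODING_INFO_KHR,
    .videoSession = session,
    /* ... session parameters and reference slot state ... */
};
vkCmdBeginVideoCodingKHR(cmd, &begin_info);

VkVideoDecodeInfoKHR decode_info = {
    .sType = VK_STRUCTURE_TYPE_VIDEO_DECODE_INFO_KHR,
    .srcBuffer = bitstream_buffer,
    /* ... output picture resource and the codec-specific (h264) pNext ... */
};
vkCmdDecodeVideoKHR(cmd, &decode_info);

VkVideoEndCodingInfoKHR end_info = {
    .sType = VK_STRUCTURE_TYPE_VIDEO_END_CODING_INFO_KHR,
};
vkCmdEndVideoCodingKHR(cmd, &end_info);

/* The recorded command buffer is then submitted with vkQueueSubmit like any
 * other work; nothing is flushed implicitly at session create/destroy time. */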
After a week or two of hair removal and insightful irc chats, I stumbled over a decent enough workaround to avoid the hw dying and managed to decode an H264 video of some jellyfish.
The work is based on a bunch of other stuff and is in no way suitable for upstreaming yet, not to mention the Vulkan specification is only beta/provisional, so it can't be used anywhere outside of development.
The preliminary code is in my gitlab repo here[1]. It has a start on h265 decode, but that's not working at all yet, and I think the h264 code is a bit hangy randomly.
I'm not sure where this is going yet, but it was definitely an interesting experiment.
[1]: https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-prelim-decode
![]() |
|
November 04, 2021 | |
![]() |
A basic example of the git alias function syntax looks like this.
[alias]
shortcut = "!f() \
{\
echo Hello world!; \
}; f"
This syntax defines a function f and then calls it. These aliases are executed in a sh shell, which means there's no access to Bash / Zsh specific functionality.
Every command is ended with a ; and each line ended with a \. This is easy enough to grok. But when we try to clean up the above snippet and add some quotes to "Hello world!", we hit this obtuse error message.
}; f: 1: Syntax error: end of file unexpected (expecting "}")
This syntax error is caused by quotes needing to be escaped. The reason for this comes down to how git tokenizes and executes these functions. If you're curious …
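For reference, one way to make the quoted version parse (a sketch based on the explanation above; the original post continues with the full details) is to escape the inner quotes so git's config parser passes them through to the shell:
[alias]
shortcut = "!f() \
{\
echo \"Hello world!\"; \
}; f"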