planet.freedesktop.org
November 24, 2021

I was interested in how much work a vaapi-on-top-of-vulkan-video proof of concept would take.

My main reason for being interested is actually video encoding: there is no good vulkan video encoding demo yet, and I'm not experienced enough in the area to write one, but I can hack stuff. I think it is probably easier to hack a vaapi encode to vulkan video encode path than to write a demo app myself.

With that in mind I decided to see what decode would look like first. I talked to Mike B (most famous zink author) before he left for holidays, then I ignored everything he told me and wrote a super hack.

This morning I convinced zink vaapi on top of anv, with iris GL doing the presents in mpv, to show me some useful frames of video. However, zink vaapi on anv with zink GL is failing miserably (well, green jellyfish).

I'm not sure how much more I'll push on the decode side at this stage; I really wanted it to validate the driver-side code, and I've found a few bugs in there already.

The WIP hacks are at [1]. I might push on to the encode side and see if I can work out what it entails, though the encode spec work is a lot more changeable at the moment.

[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/zink-video-wip

November 19, 2021

Last Post Of The Year

Yes, we’ve finally reached that time. It’s mid-November, and I’ve been storing up all this random stuff to unveil now because I’m taking the rest of the year off.

This will be the final SGC post for 2021. As such, it has to be a good one, doesn’t it?

Zink Roundup

It’s been a wild year for zink. Does anybody even remember how many times I finished the project? I don’t, but it’s been at least a couple. Somehow there’s still more to do though.

I’ll be updating zink-wip one final time later today with the latest Copper snapshot. This is going to be crashier than the usual zink-wip, but that’s because zink-wip doesn’t have nearly as much cool future-zink stuff as it used to these days. Nearly everything is already merged into mainline, or at least everything that’s likely to help with general use, so just use that if you aren’t specifically trying to test out Copper.

One of those things that’s been a thorn in zink’s side for a long time is PBO handling, specifically for unsupported formats like ARGB/ABGR, ALPHA, LUMINANCE, and InTeNsItY. Vulkan has no analogs for any of these, and any app/game which tries to do texture upload or download from them with zink is going to have a very, very bad time, as has been the case with CS:GO, which would take literal days to reach menus due to performing fullscreen GL_LUMINANCE texture downloads.

This is now fixed in the course of landing compute PBO download support, which I blogged about forever ago since it also yields a 2-10x performance improvement for a number of other cases in all Gallium drivers. Or at least the ones that enable it.

CS:GO should now run out of the box in Mesa 22.0, and things like RPCS3 which do a lot of PBO downloading should also see huge improvements.

That’s all I’ve got here for zink, so now it’s time once again…

THIS IS NO LONGER A ZINK BLOG

That’s right, it’s happening. Change your hats, we’re a Gallium blog again for the first time in nearly five months.

Everyone remembers when I promised that you’d be able to run native Linux D3D9 games on the Nine state tracker. Well, I suppose that’s a fancy way of saying Source Engine games, aka the ones Valve ships with native Linux ports, since probably nobody else has shipped any kind of native Linux app that uses the D3D9 API, but still!

That time is now.

Right now.

No more waiting, no new Mesa release required, you can just plug it in and test it out this second for instantly improved performance.

As long as you first acknowledge that this is not a Valve-official project, and it’s only to be used for educational purposes.

But also, please benchmark it lots and tell me your findings. Again, just for educational purposes. Wink.

How?

This has been a long time in the making. After the original post, I knew that the goal here was to eventually be able to run these games without needing any kind of specialized Mesa build, since that’s annoying and also breaks compatibility with running Nine for other purposes.

Thus I enlisted the further help of Nine expert and image enthusiast, Axel Davy, to help smooth out the rough edges once I was done fingerpainting my way to victory.

The result is a simple wrapper which can be preloaded to run any DXVK-compatible (i.e., any of them that support -vulkan) Source Engine game on Nine—and obviously this won’t work on NVIDIA blob at all so don’t bother trying.

In short:

  • clone that repo
  • right click on Properties for e.g., Left 4 Dead 2
  • change the command line to LD_PRELOAD=/path/to/Xnine/nine_sdl.so %command% -vulkan

For Portal 2 (at present, though this won’t always be the case), you’ll also need to add NINE_VHACKS=1 to work around some frogs that were accidentally added to the latest version of the game as a developer-only easter egg.

Then just run the game normally, and if everything went right and you have Nine installed in one of the usual places, you should load up the game with Gallium Nine. More details on that in the repo’s README.
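
If you're wondering how a preload shim like this works at all, the general mechanism is symbol interposition: the shim defines a function the app will call, does whatever extra work it wants, then forwards to the real implementation via dlsym(RTLD_NEXT, ...). Here's a minimal, hypothetical illustration of that mechanism; it is not the actual nine_sdl.so code.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical LD_PRELOAD shim: intercept puts(), log the call, then forward
 * to the real libc implementation looked up with RTLD_NEXT. */
int puts(const char *s)
{
    static int (*real_puts)(const char *);
    if (!real_puts)
        real_puts = (int (*)(const char *))dlsym(RTLD_NEXT, "puts");

    fprintf(stderr, "[shim] app called puts()\n");
    return real_puts(s);
}

Build it with gcc -shared -fPIC shim.c -o shim.so (plus -ldl on older glibc) and LD_PRELOAD it in front of any dynamically linked program.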

GPU Goes Brrr?

Yes. Very brrr.

Here’s your normal GL performance from a simple Portal 2 benchmark:

p2-togl.png

Around 400 FPS.

Here’s Gallium Nine:

p2-nine.png

Around 600 FPS.

A 50% improvement with the exact same backend GPU driver isn’t too bad for a simple preload shim.

Can I Get A Side Of SHOTS FIRED With That?

You got it.

What about DXVK?

This isn’t an extensive benchmark, but here we go with that too:

p2-dxvk.png

Also around 600 FPS.

I say “around” here because the variation is quite extreme for both Nine and DXVK based on slight changes in variable clock speeds because I didn’t pin them: Nine ranges between 590-610 FPS, and DXVK is 590-620 FPS.

So now there’s two solid, open source methods for improving performance in these games over the normal GL version. But what if we go even deeper?

What if we check out some real performance numbers?

Power Consumption

If you’ve never checked out PowerTOP, it’s a nice way to get an overview of what’s using up system resources and consuming power.

If you’ve never used it for benchmarking, don’t worry, I took care of that too.

Here’s some PowerTOP figures for the same Portal 2 timedemo:

What’s interesting here is that DXVK uses 90%+ CPU, while Nine is only using about 25%. This is a gap that’s consistent across runs, and it likely explains why a number of people find that DXVK doesn’t work on their systems: you still need some amount of CPU to run the actual game calculations, so if you’re on older hardware, you might end up using all of your available CPU just on DXVK internals.

GPU Usage?

Got you covered. Here’s a per-second poll (one row per second) from radeontop.

DXVK:

GPU Usage VGT Usage TA Usage SX Usage SH Usage SPI Usage SC Usage PA Usage DB Usage CB Usage VRAM Usage GTT Usage Memory Clock Shader Clock
35.83% 17.50% 23.33% 28.33% 17.50% 29.17% 28.33% 5.00% 27.50% 26.67% 12.75% 1038.15mb 7.82% 638.53mb 52.19% 0.457ghz 33.52% 0.704ghz
35.83% 17.50% 23.33% 28.33% 17.50% 29.17% 28.33% 5.00% 27.50% 26.67% 12.75% 1038.15mb 7.82% 638.53mb 52.19% 0.457ghz 33.52% 0.704ghz
36.67% 30.00% 33.33% 35.00% 30.00% 35.00% 32.50% 18.33% 30.83% 28.33% 12.76% 1038.57mb 7.82% 638.53mb 48.88% 0.428ghz 36.95% 0.776ghz
75.83% 63.33% 62.50% 66.67% 63.33% 68.33% 65.00% 27.50% 60.83% 53.33% 12.76% 1038.73mb 7.82% 638.53mb 100.00% 0.875ghz 95.82% 2.012ghz
71.67% 60.00% 60.00% 64.17% 60.00% 66.67% 60.83% 23.33% 56.67% 51.67% 12.76% 1038.73mb 7.82% 638.53mb 100.00% 0.875ghz 96.31% 2.023ghz
75.00% 62.50% 66.67% 66.67% 62.50% 69.17% 68.33% 23.33% 65.83% 59.17% 12.76% 1038.73mb 7.82% 638.53mb 100.00% 0.875ghz 96.71% 2.031ghz
63.33% 55.00% 56.67% 58.33% 55.00% 59.17% 59.17% 17.50% 52.50% 50.00% 12.76% 1038.73mb 7.82% 638.53mb 100.00% 0.875ghz 89.77% 1.885ghz
78.33% 64.17% 64.17% 65.00% 64.17% 69.17% 70.83% 30.00% 63.33% 58.33% 12.76% 1038.73mb 7.82% 638.53mb 100.00% 0.875ghz 97.33% 2.044ghz
73.33% 60.83% 64.17% 65.00% 60.83% 67.50% 64.17% 29.17% 59.17% 51.67% 12.76% 1038.73mb 7.82% 638.53mb 100.00% 0.875ghz 97.39% 2.045ghz
60.83% 50.83% 50.00% 53.33% 50.83% 55.00% 50.83% 25.83% 48.33% 45.00% 12.76% 1038.73mb 7.82% 638.53mb 100.00% 0.875ghz 95.35% 2.002ghz
67.50% 50.00% 55.00% 59.17% 50.00% 60.00% 58.33% 28.33% 52.50% 45.00% 12.76% 1038.73mb 7.82% 638.53mb 100.00% 0.875ghz 87.91% 1.846ghz

Nine:

GPU Usage VGT Usage TA Usage SX Usage SH Usage SPI Usage SC Usage PA Usage DB Usage CB Usage VRAM Usage GTT Usage Memory Clock Shader Clock
17.50% 11.67% 15.00% 10.83% 11.67% 15.00% 10.83% 3.33% 10.83% 10.00% 7.38% 600.56mb 1.60% 130.48mb 50.38% 0.441ghz 15.76% 0.331ghz
17.50% 11.67% 15.00% 10.83% 11.67% 15.00% 10.83% 3.33% 10.83% 10.00% 7.38% 600.56mb 1.60% 130.48mb 50.38% 0.441ghz 15.76% 0.331ghz
70.83% 63.33% 65.83% 60.00% 63.33% 68.33% 57.50% 24.17% 56.67% 54.17% 7.35% 598.43mb 1.60% 130.48mb 89.50% 0.783ghz 77.09% 1.619ghz
74.17% 70.00% 67.50% 60.00% 70.00% 70.83% 61.67% 17.50% 60.83% 58.33% 7.35% 598.42mb 1.60% 130.47mb 100.00% 0.875ghz 91.03% 1.912ghz
78.33% 69.17% 72.50% 65.00% 69.17% 75.83% 65.83% 15.00% 65.83% 64.17% 7.37% 599.80mb 1.60% 130.47mb 100.00% 0.875ghz 93.92% 1.972ghz
70.83% 67.50% 64.17% 55.00% 67.50% 67.50% 57.50% 20.83% 55.83% 53.33% 7.35% 598.42mb 1.60% 130.47mb 100.00% 0.875ghz 91.93% 1.930ghz
65.00% 64.17% 60.00% 51.67% 64.17% 61.67% 53.33% 18.33% 52.50% 50.83% 7.37% 599.80mb 1.60% 130.47mb 100.00% 0.875ghz 89.95% 1.889ghz
74.17% 68.33% 70.00% 60.83% 68.33% 72.50% 65.00% 24.17% 64.17% 58.33% 7.35% 598.42mb 1.60% 130.47mb 100.00% 0.875ghz 92.53% 1.943ghz
77.50% 73.33% 73.33% 62.50% 73.33% 75.00% 61.67% 22.50% 62.50% 57.50% 7.35% 598.42mb 1.60% 130.47mb 100.00% 0.875ghz 91.21% 1.915ghz
70.00% 65.83% 60.00% 57.50% 65.83% 61.67% 59.17% 24.17% 55.00% 54.17% 7.35% 598.42mb 1.60% 130.47mb 100.00% 0.875ghz 92.69% 1.946ghz
70.00% 65.83% 60.00% 57.50% 65.83% 61.67% 59.17% 24.17% 55.00% 54.17% 7.35% 598.42mb 1.60% 130.47mb 100.00% 0.875ghz 92.69% 1.946ghz

Again, here we see a number of interesting things. DXVK consistently provokes slightly higher clock speeds (because I didn’t pin them), which may explain why it skews slightly higher in the benchmark results. DXVK also uses nearly 2x more VRAM and nearly 5x more GTT. On more modern hardware it’s unlikely that this would matter since we all have more GPU memory than we can possibly use in an OpenGL game, but on older hardware—or in cases where memory usage might lead to power consumption that should be avoided because we’re running on battery—this could end up being significant.

Conclusion

Source Engine games run great on Linux. That’s what we all care about at the end of the day, isn’t it?

But also, if more Source Engine games get ported to DXVK, give them a try with Nine. Or just test the currently ported ones, Portal 2 and Left 4 Dead 2.

I want data.

Lots of data.

Post it here, email it to me, whatever.

Until 2022

Lots of cool projects still in the works, so stay tuned next year!

November 18, 2021

If you own a laptop (Dell, HP, Lenovo) with a WWAN module, it is very likely that the module is FCC-locked on every boot, and that a special FCC unlock procedure needs to be run before it can be used.

Until ModemManager 1.18.2, the procedure was run automatically for the FCC unlock procedures we knew about, but this will no longer happen. Once 1.18.4 is out, the procedure will need to be explicitly enabled by each user, under their own responsibility, or otherwise implicitly enabled after installing an official FCC unlock tool provided by the manufacturer itself.

See a full description of the rationale behind this change in the ModemManager documentation site and the suggested code changes in the gitlab merge request.

If you want to enable the ModemManager-provided unofficial FCC unlock tools once you have installed 1.18.4, run this command (assuming sysconfdir=/etc and datadir=/usr/share) (*):

sudo ln -sft /etc/ModemManager/fcc-unlock.d /usr/share/ModemManager/fcc-unlock.available.d/*

The user-enabled tools in /etc should not be removed during package upgrades, so this should be a one-time setup.

(*) Updated to have one single command instead of a for loop; thanks heftig!

November 17, 2021

What If Zink Was Actually The Fastest GL Driver?

In an earlier post I talked about Copper and what it could do on the way to a zink future.

What I didn’t talk about was WSI, or the fact that I’ve already fully implemented it in the course of bashing Copper into a functional state.

Window System Integration

…was the final step for zink to become truly usable.

At present, zink has a very hacky architecture where it loads through the regular driver path, but then for every image that is presented on the screen, it keeps a shadow copy which it blits to just before scanout, and this is the one that gets displayed.

Usually this works great other than the obvious (but minor) overhead that the blit incurs.

Where it doesn’t work great, however, is on non-Mesa drivers.

That’s right. I’m looking at you, NVIDIA.

As long-time blog enthusiasts will remember, I had NVIDIA running on zink some time ago, but there was a problem as it related to performance. Specifically, that single blit turned into a blit and then a full-frame CPU copy, which made getting any sort of game running with usable FPS a bit of a challenge.

WSI solves this by letting the Vulkan driver handle the scanout image entirely, removing all the copies to let zink render more like a normal driver (or game/app).

So How Is it?

That’s what everyone’s probably wondering. I have zink. I have WSI. I have my RTX2070 with the NVIDIA blob driver.

How does NVIDIA’s Vulkan driver (with zink) stack up to NVIDIA’s GL driver?

Everything below is using the 495.44 beta driver, as that’s the latest one at the time of my testing, and the non-beta driver didn’t work at all.

But first, can NVIDIA’s GL driver even render the game I want to show?

nvtrbug.png

Confusingly, the answer is no, this version of NVIDIA’s GL driver can’t correctly render Tomb Raider, which is my go-to for all things GL and benchmarking. I’m gonna let that slide though since it’s still pumping out those frames at a solid rate.

It’s frustrating, but sometimes just passing CTS isn’t enough to be able to run some games, or there’s certain extensions (ARB_bindless_texture) which are under-covered.

The Numbers Don’t Lie

I’ll say as a prelude that it was a bit challenging to get a AAA game running in this environment. There’s some very strange issues happening with the NVIDIA Vulkan driver which prevented me from running quite a lot of things. Tomb Raider was the first one I got going after two full days of hacking at it, and that’s about what my time budget allowed for the excursion, so I stopped at that.

Up first: NVIDIA’s GL driver (495.44) nvtr-gl.png

Second: NVIDIA’s Vulkan driver (495.44) nvtr.png

As we can see, zink with NVIDIA’s Vulkan driver is roughly 25-30% faster than NVIDIA’s GL driver for Tomb Raider.

In Closing

I doubt that zink maintains this performance gap for all titles, but now we know that there are already at least some cases where it can pull ahead. Given that most vendors are shifting resources towards current-year graphics APIs like Vulkan and D3D12, it won’t be surprising if maintenance-mode GL drivers start to fall behind actively developed Vulkan drivers.

In short, there’s a real possibility that zink can provide tangible benefits to vendors who only want to ship Vulkan drivers, and those benefits might be more than (eventually) providing a conformant GL implementation.

Stay tuned for tomorrow when I close out the week strong with one final announcement for the year.

November 15, 2021

Previously I mentioned having AMD VCN h264 support. Today I added initial support for the older UVD engine[1]. This is found on chips from Vega back to SI.

I've only tested it on my Vega so far.

I also worked out the "correct" answer to how to send the reset command properly; however, the nvidia player I'm using as a demo doesn't do things that way yet, so I've forked it for now[2].

The answer is to use vkCmdControlVideoCodingKHR to send a reset the first time a session is used. However, I can't see how the app is meant to know this is necessary, so I've asked the appropriate people.
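
For reference, the recording order ends up looking roughly like the sketch below. This is a hypothetical example using the structure and entry point names from the released headers; the spec was still provisional at the time, so the details differed, and most fields are left unfilled here. A real application would also fetch these vkCmd* pointers via vkGetDeviceProcAddr and fill in the session, bitstream buffer, output picture, and reference slots.

#define VK_ENABLE_BETA_EXTENSIONS /* the video extensions shipped as provisional/beta */
#include <vulkan/vulkan.h>

/* Record a reset followed by a single decode into an already-begun command buffer. */
static void record_first_decode(VkCommandBuffer cmd)
{
    VkVideoBeginCodingInfoKHR begin_info = {
        .sType = VK_STRUCTURE_TYPE_VIDEO_BEGIN_CODING_INFO_KHR,
        /* .videoSession = ..., .pReferenceSlots = ... */
    };
    VkVideoCodingControlInfoKHR reset_info = {
        .sType = VK_STRUCTURE_TYPE_VIDEO_CODING_CONTROL_INFO_KHR,
        .flags = VK_VIDEO_CODING_CONTROL_RESET_BIT_KHR,
    };
    VkVideoDecodeInfoKHR decode_info = {
        .sType = VK_STRUCTURE_TYPE_VIDEO_DECODE_INFO_KHR,
        /* .srcBuffer = ..., .dstPictureResource = ... */
    };
    VkVideoEndCodingInfoKHR end_info = {
        .sType = VK_STRUCTURE_TYPE_VIDEO_END_CODING_INFO_KHR,
    };

    vkCmdBeginVideoCodingKHR(cmd, &begin_info);
    vkCmdControlVideoCodingKHR(cmd, &reset_info); /* reset on first use of the session */
    vkCmdDecodeVideoKHR(cmd, &decode_info);
    vkCmdEndVideoCodingKHR(cmd, &end_info);
}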

The initial anv branch I mentioned last week is now here[3].


[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-uvd-h264

[2] https://github.com/airlied/vk_video_samples/tree/radv-fixes

[3] https://gitlab.freedesktop.org/airlied/mesa/-/tree/anv-vulkan-video-prelim-decode

Copper: It’s A Thing (Sort of)

Over the past months, I’ve been working with Adam “X Does What I Say” Jackson to try and improve zink’s path through the arcane system of chutes and ladders that comprises Gallium’s loader architecture. The recent victory in getting a Wayland system compositor running is the culmination of those efforts.

I wanted to write at least a short blog post detailing some of the Gallium changes that were necessary to make this happen, if only so I have something to refer back to when I inevitably break things later, so let's dig in.

Pipes: How Do They Work?

It’s questionable to me whether anyone really knows how all the Gallium loader and DRI frontend interfaces work without first taking a deep read of the code and then having a nice, refreshing run around the block screaming to let all the crazy out. From what I understand of it, there’s the DRI (userspace) interface, which is used by EGL/GBM/GLX/SMH to manage buffers for scanout. DRI itself is split between software and platform; each DRI interface is a composite made of all the “extensions” which provide additional functionality to enable various API-level extensions.

It’s a real disaster to have to work with, and ideally the eventual removal of classic drivers will allow it to be simplified so that mere humans like me can comprehend its majesty.

Beyond all this, however, there’s the notion that the DRI frontend is responsible for determining the size of the scanout buffer as well as various other attributes. The software path through this is nice and simple since there’s no hardware to negotiate with, and the platform path exists.

Currently, zink runs on the platform path, which means that the DRI frontend is what “runs” zink. It chooses the framebuffer size, manages resizes, and handles multisample resolve blits as needed for every frame that gets rendered.

Too Many CooksPipes

The problem with this methodology is that there are effectively two WSI systems active simultaneously: the Mesa DRI architecture, and the (eventual) Vulkan WSI infrastructure. Vulkan WSI isn't going to work at all if it isn't in charge of deciding things like window size, which means that the existing DRI architecture can't work, neither in the platform mode nor the software mode.

As we know, there can be only one.

Thus Adam has been toiling away behind the scenes, taking neither vacation nor lunch break for the past ten years in order to iterate on a more optimal solution.

The result?

Copper

If you’re a Mesa developer or just a metallurgist, you know why the name Copper was chosen.

The premise of Copper is that it’s a DRI interface extension which can be used exclusively by zink to avoid any of the problem areas previously mentioned. The application will create a window, create a GL context for it, and (eventually) Vulkan WSI can figure things out by just having the window/surface passed through. This shifts all the “driving” WSI code out of DRI and into Vulkan WSI, which is much more natural.

In addition to Copper, zink can now be bound to a slight variation of the Gallium software loader to skip all the driver querying bits. There’s no longer anything to query, as DRI doesn’t have to make decisions anymore. It just calls through to zink normally, and zink can handle everything using the Vulkan API.

Simple and clean.

Unfortunately

This all requires a ton of code. Looking at the two largest commits:

  • 29 files changed, 1413 insertions(+), 540 deletions(-)
  • 23 files changed, 834 insertions(+), 206 deletions(-)

Is a big yikes.

I can say with certainty that these improvements won’t be landing before 2022, but eventually they will in one form or another, and then zink will become significantly more flexible.

November 12, 2021

Last week I mentioned I had the basics of h264 decode using the proposed vulkan video on radv. This week I attempted to do the same thing with Intel's Mesa vulkan driver "anv".

Now, I'd previously tried unsuccessfully to get vaapi on crocus working but got sidetracked back into other projects. The Intel h264 decoder hasn't changed a lot across the ivb/hsw/gen8/gen9 era. I ported what I had from crocus to anv and started trying to get something to decode on my WhiskeyLake.

I wrote the code pretty early on, figured out all the things I had to send the hardware.

The first anv-side bridge to cross was that Vulkan exposes an H264 picture-level decode API, which means you get handed the encoded slice data. However, to program the Intel hw you need to decode the slice header, so I wrote a slice header decoder in some common code. The other thing you need to give the Intel hw is the number of bits in the slice header, which in some encoding schemes is rounded to bytes and in some isn't. Slice headers also have a 3-byte header on them, which the Intel hardware wants you to discard or skip before handing the data to it.

Once I'd fixed up that sort of thing in anv + crocus, I started getting grey I-frames decoded, with later B/P frames using the grey frames as references, so you'd see this kind of weird motion.

That was, I think, 3 days ago. I have stared at this intently for those 3 days, blaming everything from the bitstream encoding to rechecking all my packets (not enough times, though). I had someone else verify they could see grey frames.

Today, after a long discussion about possibilities, I was randomly comparing a frame from the intel-vaapi-driver and from crocus, and I spotted a packet header which the docs say is 34 dwords long but intel-vaapi was only encoding as 16 dwords. I switched crocus to explicitly state a 16-dword length and I started seeing my I-frames.

Now the B/P frames still have issues. I don't think I'm getting the ref frames logic right yet, but it felt like a decent win after 3 days of staring at it.

The crocus code is [1]. The anv code isn't cleaned up enough to post a pointer to yet; enterprising people might find it. Next week I'll clean it all up, and then start to ponder upstream paths and shared code for radv + anv. Then h265, maybe.

[1] https://gitlab.freedesktop.org/airlied/mesa/-/tree/crocus-media-wip

A Long Time Coming

Zink can now run all display platform flavors of Weston (and possibly other compositors?). Expect it in zink-wip later today once it passes another round of my local CI.

Here it is in DRM running weston-simple-egl and weston-simple-dmabuf-egl all on zink:

wayland-screenshot-2021-11-12_13-25-25.png

Under Construction

This has a lot of rough edges, mostly as it relates to X11. In particular:

  • xservers (including xwayland) can’t run because GLAMOR is hard
  • some apps (e.g., Unigine Heaven) randomly get killed by the xserver for unknown reasons
  • if you’re very lucky, you can hit a Vulkan WSI deadlock

How?

I’d go into details on this, but honestly it’s going to be like a week of posts to detail the sheer amount of chainsawing that’s gone into the project.

Stay tuned for that and more next week.

November 11, 2021

Everyone Knows…

That the one true benchmark for graphics is glxgears. It’s been the standard for 20+ years, and it’s going to remain the standard for a long time to come.

Gears Through Time

Zink has gone through a couple phases of glxgears performance.

Everyone remembers weird glxgears that was getting illegal amounts of frames due to its misrendering:

glxgears.png

We salute you, old friend.

Now, however, some number of you have become aware of the new threat posed by heavy gears in the Mesa 21.3 release. Whereas glxgears is usually a lightweight, performant benchmarking tool, heavy gears is the opposite, chugging away at up to 20% of a single CPU core with none of the accompanying performance.

sadgears.png

Terrifying.

What Creates Such A Monster?

The answer won’t surprise you: GL_QUADS.

Indeed, because zink is a driver relying on the Vulkan API, only the primitive types supported by Vulkan can be directly drawn. This means any app using GL_QUADS is going to have a very bad time.

glxgears is exactly one of these apps, and (now that there’s a ticket open) I was forced to take action.

Transquadmation

The root of the problem here is that gears passes its vertices into GL to be drawn as a rectangle, but zink can only draw triangles. This (currently) results in doing a very non-performant readback of the index buffer before every draw call to convert the draw to a triangle-based one.
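
The decomposition itself is simple. Here's a standalone sketch of the idea (not zink's actual code, which does this as part of the draw path): each quad (v0, v1, v2, v3) becomes the two triangles (v0, v1, v2) and (v0, v2, v3).

#include <stddef.h>
#include <stdint.h>

/* Expand quad indices into triangle indices; tri_indices must hold
 * num_quads * 6 entries.  Returns the number of indices written. */
static size_t quads_to_triangles(const uint32_t *quad_indices, size_t num_quads,
                                 uint32_t *tri_indices)
{
    size_t out = 0;
    for (size_t q = 0; q < num_quads; q++) {
        const uint32_t *v = &quad_indices[q * 4];
        tri_indices[out++] = v[0]; /* first triangle */
        tri_indices[out++] = v[1];
        tri_indices[out++] = v[2];
        tri_indices[out++] = v[0]; /* second triangle */
        tri_indices[out++] = v[2];
        tri_indices[out++] = v[3];
    }
    return out;
}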

A smart person might say “Why not just convert the vertices to triangles as you get them instead of waiting until they’re in the buffer?”

Thankfully, a smart person did say that and then do the accompanying work. The result is that finally, after all these years, zink can actually perform well in a real benchmark:

happygears.png

Stay Tuned

For more exciting zink updates. You won’t want to miss them.

November 10, 2021

UberCTS

In the course of working on more CI-related things for zink, I came across a series of troublesome tests (KHR-GL46.geometry_shader.rendering.rendering.triangles_*) that triggered a severe performance issue. Specifically, the LLVM optimizer spends absolute ages trying to optimize ubershaders like this one used in the tests:

#version 440

in  vec4 position;
out vec4 vs_gs_color;

uniform bool  is_lines_output;
uniform bool  is_indexed_draw_call;
uniform bool  is_instanced_draw_call;
uniform bool  is_points_output;
uniform bool  is_triangle_fan_input;
uniform bool  is_triangle_strip_input;
uniform bool  is_triangle_strip_adjacency_input;
uniform bool  is_triangles_adjacency_input;
uniform bool  is_triangles_input;
uniform bool  is_triangles_output;
uniform ivec2 renderingTargetSize;
uniform ivec2 singleRenderingTargetSize;

void main()
{
    gl_Position = position + vec4(float(gl_InstanceID) ) * vec4(0, float(singleRenderingTargetSize.y) / float(renderingTargetSize.y), 0, 0) * vec4(2.0);
    vs_gs_color = vec4(1.0, 0.0, 0.0, 0.0);

    if (is_lines_output)
    {
        if (!is_indexed_draw_call)
        {
            if (is_triangle_fan_input)
            {
                switch(gl_VertexID)
                {
                   case 0:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 1:
                   case 5:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 2:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   case 3:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 4:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
                }
            }
            else
            if (is_triangle_strip_input)
            {
                switch(gl_VertexID)
                {
                   case 1:
                   case 6: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 0:
                   case 4:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 2:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   case 3:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 5:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
                }
            }
            else
            if (is_triangle_strip_adjacency_input)
            {
                switch(gl_VertexID)
                {
                   case 2:
                   case 12: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 0:
                   case 8:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 4:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   case 6:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 10: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
                }
            }
            else
            if (is_triangles_input)
            {
                switch(gl_VertexID)
                {
                   case 0: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 1: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 2: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   case 3: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 4: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   case 5: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 6: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 7: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 8: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   case 9:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 10: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   case 11: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                }
            }
            else
            if (is_triangles_adjacency_input)
            {
                vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0);

                switch(gl_VertexID)
                {
                    case 0: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 2: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                    case 4: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                    case 6: vs_gs_color  = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 8: vs_gs_color  = vec4(0.3, 0.4, 0.5, 0.6); break;
                    case 10: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 12: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 14: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 16: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 18: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 20: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 22: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                }
            }
        }
        else
        {
            if (is_triangles_input)
            {
                switch(gl_VertexID)
                {
                    case 11: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 10: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                    case 9:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                    case 8:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 7:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                    case 6:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 5:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 4:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 3:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 2:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 1:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 0:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                }
            }
            else
            if (is_triangle_fan_input)
            {
                switch(gl_VertexID)
                {
                   case 5:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 4:
                   case 0:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 3:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   case 2:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 1:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
                }
            }
            else
            if (is_triangle_strip_input)
            {
                switch (gl_VertexID)
                {
                   case 5:
                   case 0: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 6:
                   case 2:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 4:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   case 3:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 1:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
                }
            }
            else
            if (is_triangle_strip_adjacency_input)
            {
                switch(gl_VertexID)
                {
                   case 11:
                   case 1:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 13:
                   case 5:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 9:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   case 7:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 3:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
                }
            }
            else
            if (is_triangles_adjacency_input)
            {
                vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0);

                switch(gl_VertexID)
                {
                    case 23: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 21: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                    case 19: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                    case 17: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 15: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                    case 13: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 11: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 9:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 7:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 5: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 3: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 1: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                }
            }
        }
    }
    else
    if (is_points_output)
    {
        if (!is_indexed_draw_call)
        {
            if (is_triangles_adjacency_input)
            {
                vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0);

                switch (gl_VertexID)
                {
                    case 0:
                    case 6:
                    case 12:
                    case 18: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 2:
                    case 22: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 4:
                    case 8: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 10:
                    case 14: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                    case 16:
                    case 20: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                }
            }
            else
            if (is_triangle_fan_input)
            {
                switch(gl_VertexID)
                {
                   case 0:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 1:
                   case 5:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   case 2:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 3:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 4:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
                }
            }
            else
            if (is_triangle_strip_input)
            {
                switch (gl_VertexID)
                {
                   case 1:
                   case 4:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 0:
                   case 6:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   case 2:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 3:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 5:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
                }
            }
            else
            if (is_triangle_strip_adjacency_input)
            {
                switch (gl_VertexID)
                {
                   case 2:
                   case 8:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 0:
                   case 12: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   case 4:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 6:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 10: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
                }
            }
            else
            if (is_triangles_input)
            {
                switch (gl_VertexID)
                {
                    case 0:
                    case 3:
                    case 6:
                    case 9: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 1:
                    case 11: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 2:
                    case 4: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 5:
                    case 7: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                    case 8:
                    case 10: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                }
            }
        }
        else
        {
            if (is_triangle_fan_input)
            {
                switch (gl_VertexID)
                {
                   case 5:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 4:
                   case 0:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   case 3:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 2:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 1:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
                }
            }
            else
            if (is_triangle_strip_input)
            {
                switch (gl_VertexID)
                {
                   case 5:
                   case 2:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 6:
                   case 0:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   case 4:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 3:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 1:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
                }
            }
            else
            if (is_triangle_strip_adjacency_input)
            {
                switch (gl_VertexID)
                {
                   case 11:
                   case 5:  vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                   case 13:
                   case 1:  vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                   case 9:  vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                   case 7:  vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                   case 3:  vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                   default: vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0); break;
                }
            }
            else
            if (is_triangles_adjacency_input)
            {
                vs_gs_color = vec4(1.0, 1.0, 1.0, 1.0);
                switch (gl_VertexID)
                {
                    case 23:
                    case 17:
                    case 11:
                    case 5: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 21:
                    case 1: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 19:
                    case 15: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 13:
                    case 9: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                    case 7:
                    case 3: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                }
            }
            else
            if (is_triangles_input)
            {
                switch (gl_VertexID)
                {
                    case 11:
                    case 8:
                    case 5:
                    case 2: vs_gs_color = vec4(0.4, 0.5, 0.6, 0.7); break;
                    case 10:
                    case 0: vs_gs_color = vec4(0.5, 0.6, 0.7, 0.8); break;
                    case 9:
                    case 7: vs_gs_color = vec4(0.1, 0.2, 0.3, 0.4); break;
                    case 6:
                    case 4: vs_gs_color = vec4(0.2, 0.3, 0.4, 0.5); break;
                    case 3:
                    case 1: vs_gs_color = vec4(0.3, 0.4, 0.5, 0.6); break;
                }
            }
        }
    }
    else
    if (is_triangles_output)
    {
        int vertex_id = 0;

        if (!is_indexed_draw_call && is_triangles_adjacency_input && (gl_VertexID % 2 == 0) )
        {
            vertex_id = gl_VertexID / 2 + 1;
        }
        else
        {
            vertex_id = gl_VertexID + 1;
        }

        vs_gs_color = vec4(float(vertex_id) / 48.0, float(vertex_id % 3) / 2.0, float(vertex_id % 4) / 3.0, float(vertex_id % 5) / 4.0);
    }
}

By ages I mean upwards of 10 minutes per test.

Yikes.

When In Doubt, Inline

Fortunately, zink already has tools to combat exactly this problem: ZINK_INLINE_UNIFORMS.

This feature analyzes shaders to determine if inlining uniform values will be beneficial, and if so, it rewrites the shader with the uniform values as constants rather than loads. This brings the resulting NIR for the shader from 4000+ lines down to just under 300. The tests all become near-instant to run as well.

Uniform inlining has been in zink for a while, but it’s been disabled by default (except on zink-wip for testing) as this isn’t a feature that’s typically desirable when running apps/games; every time the uniforms are updated, a new shader must be compiled, and this causes (even more) stuttering, making games on zink (even more) unplayable.

On CPU-based drivers like lavapipe, however, the time to compile a shader is usually less than the time to actually run a shader, so the trade-off becomes worth doing.

Stay tuned for exciting announcements in the next few days.

November 05, 2021

A few weeks ago I watched Victor's excellent talk on Vulkan Video. This made me question my skills in this area. I'm pretty vague on video processing hardware, and I really have no understanding of H264 or any of the standards. I've been loosely following the Vulkan video group inside of Khronos, but I can't say I've understood it or been useful.

radeonsi has a gallium vaapi driver that talks to the firmware decoder on the hardware; surely copying what it is programming can't be that hard. I got an mpv/vaapi setup running and tested some videos on that setup just to get comfortable. I looked at what sort of data was being pushed about.

The thing is, the firmware is doing all the work here; the driver is mostly just responsible for taking semi-parsed h264 bitstream data structures and giving them to the fw API in memory buffers. Then the resulting decoded image should magically be in a buffer.

I then got the demo nvidia video decoder application mentioned in Victor's talk.

I ported the code to radv in a couple of days, but then began a long journey into the unknown. The firmware is quite particular about exactly what it wants and when it wants it. After fixing some interactions with the video player, I started to dig.

Now vaapi and DXVA (Windows) are context-based APIs. This means they are like OpenGL, where you create a context, do a bunch of work, and tear it down; the driver does all the hw queuing of commands internally. All the state is held in the context. Vulkan is a command-buffer-based API. The application records command buffers and then enqueues those command buffers to the hardware itself.

So the vaapi driver works like this for a video

create hw ctx, flush, decode, flush, decode, flush, decode, flush, decode, flush, destroy hw ctx, flush

However Vulkan wants things to be more like

Create Session, record command buffer with (begin, decode, end), send to hw, (begin, decode, end), send to hw, End Session

There is no way at the Create/End session time to submit things to the hardware.

After a week or two of hair removal and insightful irc chats I stumbled over a decent enough workaround to avoid the hw dying and managed to decode a H264 video of some jellyfish.

The work is based on a bunch of other stuff, and is in no way suitable for upstreaming yet, not to mention the Vulkan specification is only beta/provisional so it can't be used anywhere outside of development.

The preliminary code is in my gitlab repo here[1]. It has a start on h265 decode, but it's not working at all yet, and I think the h264 code is a bit hangy randomly.

I'm not sure where this is going yet, but it was definitely an interesting experiment.

[1]: https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-prelim-decode

November 04, 2021

A basic example of the git alias function syntax looks like this.

[alias]
    shortcut = "!f() \
    {\
        echo Hello world!; \
    }; f"

This syntax defines a function f and then calls it. These aliases are executed in a sh shell, which means there's no access to Bash / Zsh specific functionality.

Every command is ended with a ; and each line is ended with a \. This is easy enough to grok. But when we try to clean up the above snippet and add some quotes to "Hello world!", we hit this obtuse error message.

}; f: 1: Syntax error: end of file unexpected (expecting "}")

This syntax error is caused by quotes needing to be escaped. The reason for this comes down to how git tokenizes and executes these functions. If you're curious …
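
For completeness, the fixed version of the snippet above just backslash-escapes the inner quotes so they survive git's config parsing and reach sh intact:

[alias]
    shortcut = "!f() \
    {\
        echo \"Hello world!\"; \
    }; f"

After this, running git shortcut prints Hello world! as expected.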

What’s Even Happening

Where does the time go?

I’m not sure, but I’m making the most of it. As we all know, the Mesa 21.3 release cycle is well underway, primarily to enable me to jam an increasingly ludicrous number of bug fixes in before the final tarball ships.

But is it possible that I’m doing other things too?

Why yes. Yes it is.

CI

We all like CI. It helps prevent us from footgunning, even when we’re totally sure that our patches are exactly what the codebase needs.

That’s why I decided to add GL4.6 CI runs over the past day or so. No more will tests be randomly fixed or failed with my commits!

Unless they’re part of the Khronos Confidential GL Test Suite, of course, but we don’t talk about that one.

Intrepid readers will note that there’s now a file in the repo which lists exactly how many failures there are on lavapipe, so now everyone knows how many thousands of tests are doing the opposite.

Rendering

A new Vulkan spec was released this week and it included something I’ve been excited about for a while: KHR_dynamic_rendering. This extension is going to let me cut down on some CPU overhead, and so some time ago I wrote the lavapipe implementation to get a feel for it.

Surprisingly, this means that lavapipe is now the only mesa driver implementing it, though I don’t expect that to be the case for long.

I’m looking forward to seeing more projects switch to this, since let’s be honest, nobody ever liked renderpasses anyway.
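
For context, the draw-time shape of the extension looks roughly like the sketch below: attachments are described directly at record time, with no VkRenderPass or VkFramebuffer objects involved. This is a minimal, hypothetical example using the KHR entry points, not lavapipe's implementation; a real app would fetch the vkCmd*RenderingKHR pointers via vkGetDeviceProcAddr.

#include <vulkan/vulkan.h>

/* Begin and end a dynamic rendering pass with a single color attachment. */
static void draw_with_dynamic_rendering(VkCommandBuffer cmd, VkImageView color_view,
                                        uint32_t width, uint32_t height)
{
    VkRenderingAttachmentInfoKHR color = {
        .sType = VK_STRUCTURE_TYPE_RENDERING_ATTACHMENT_INFO_KHR,
        .imageView = color_view,
        .imageLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
        .loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR,
        .storeOp = VK_ATTACHMENT_STORE_OP_STORE,
    };
    VkRenderingInfoKHR info = {
        .sType = VK_STRUCTURE_TYPE_RENDERING_INFO_KHR,
        .renderArea = { .extent = { width, height } },
        .layerCount = 1,
        .colorAttachmentCount = 1,
        .pColorAttachments = &color,
    };

    vkCmdBeginRenderingKHR(cmd, &info);
    /* ... bind pipeline/descriptors and record draws here ... */
    vkCmdEndRenderingKHR(cmd);
}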

October 28, 2021

…For 2021

Zink is done. It’s finished. I’ve said it before, I’ve said it again after that, and I even said I meant it that one time, but now I’m serious.

Super serious.

We’re done here. There’s no need to check this blog anymore, and you don’t have to update zink ever again if you’ve pulled in the last week.

Mesa 21.3

This is it. This is the release everyone’s been waiting for.

Why is that?

Because zink is Pretty Fast™ now. And it can run lots of stuff.

Blog enthusiasts will recall all that time ago when zink was over that I noted a number of Big Name™ games that zink could now run, including Metro: Last Light Redux and HITMAN (Agent 47’s Deluxe Psychedelic Trip Edition).

True blog connoisseurs will recall when zink started to pull ahead of real GL drivers in features in order to run Warhammer 40k: Dawn of War.

But how many die-hard blog fans are here from the future and can remember when I posted that Bioshock Infinite now actually runs on RADV instead of hanging?

It’s hard to overstate the amount of work that’s gone into zink for this release. Over 400 patches amounted to ES 3.2, a suballocator, and a slew of random extensions to improve compatibility and performance across the board.

If you find a game or app* which doesn’t run at all on zink 21.3, play the lottery. It’s your day.

  • Except using it for your Xorg session or Wayland compositor. Don’t try this without supervision.

Bioshock Infinite Now Runs On Zink

bioshock.png

Zink+RADV: New BFFs

As part of a cross-training exercise, I’ve been hanging around with the hangmaster himself, the Baron of Black Screens, the Liege of Lost Sessions, Samuel Pitoiset. Together we’ve (but mostly him while I watch) been stamping out a colossal number of pathological hangs and crashes with zink on RADV.

At present, zink on RADV has only around 200 failures in the GL 4.6 conformance suite. It’s not quite there yet, but considering the number was well over 1000 just a week ago, there’s a lot of easy cleanup work that can be done here in both drivers to knock that down further.

Will 2022 Be The Year Of Zink Conformance?

Maybe.

What’s Next?

Bug fixes.

Lots of bug fixes.

Seriously, there’s so, so many bug fixes coming.

I have some fun surprises to unveil in the near future too.

For example: any guesses which Vulkan driver I’ll be showing zink off on next? Hint: B I G F P S.

October 20, 2021

 In 2007, Jan Arne Petersen added a D-Bus API to what was still pretty much an import into gnome-control-center of the "acme" utility I wrote to have all the keys on my iBook working.

It switched the code away from remapping keyboard keys to "XF86Audio*", to expecting players to contact the D-Bus daemon and ask to be forwarded key events.

 

Multimedia keys circa 2003

In 2013, we added support for controlling media players using MPRIS, as another interface. Fast-forward to 2021, and MPRIS support is ubiquitous, whether in free software, proprietary applications or even browsers. So we'll be parting with the "org.gnome.SettingsDaemon.MediaKeys" D-Bus API. If your application still wants to work with older versions of GNOME, it is recommended to at least quiet the MediaKeys API's unavailability.

 

Multimedia keys in 2021
 

TL;DR: Remove code that relies on gnome-settings-daemon's MediaKeys API, make sure to add MPRIS support to your app.

October 04, 2021

I’m Bad At Blogging

I’m responsible enough to admit that I’m bad at blogging.

I’ve said repeatedly that I’m going to blog more often, and then I go and do the complete opposite.

I don’t know why I’m like this, but here we are, and now it’s time for another blog post.

What’s Been Happening

In short: not a lot.

All the features I’ve previously blogged about have landed, and zink is once again in “release mode” until the branchpoint next week to avoid having to rush patches in at the last second. This means there probably won’t be any interesting patches to zink at all until then.

We’re in a good spot though, and I’m pleased with the state of the driver for this release. You probably still won’t be using it to play any OpenGL games you pick up from the Winter Steam Sale, but potentially those days aren’t too far off.

With that said, I do have to actually blog about something technical for once, so let’s roll the dice and see what it’s going to be

ARB_bindless_texture

We did it. We got a good roll.

This is actually a cool extension for an implementation deep dive because of how (relatively) simple Vulkan makes it to handle.

First, an overview: What is ARB_bindless_texture?

This is an extension used by only the most elite GL apps to enable texture streaming, namely the ability to continually add more images into the rendering pipeline either for sampling or shader write operations. An image is bound to a “handle”, and from there, it can be made “resident” at any time to use it in shaders. This is different from the general GL methodology where an image must be explicitly bound to a specific slot (instead each image has its own slot), and it allows for both greater flexibility and more images to be in use at any given time.

At the implementation level, this actually amounts to three distinct features:

  • the ability to track and manage unique “handles” for each image that can be made resident
  • the ability to access these images from shaders
  • the ability to pass these images between shader stages as normal I/O

In zink, I tackled these in the order I’ve listed.

Handle Management

This wouldn’t have been (as) possible without one very special, very awful Vulkan extension.

You knew this was coming.

VK_EXT_descriptor_indexing.

That’s right, it’s a requirement for this, but not for the reason you might think. Zink has no need for the impossibly large descriptor sets enabled by this extension, but I did need the other features it provides:

  • VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT - enables binding a “bindless” descriptor set once and then performing updates on it without needing to have multiple sets
  • VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT - enables invalidating deleted members of an active set and leaving them as garbage values in the descriptor set as long as they won’t be accessed in shaders (don’t worry, this is totally safe)
  • VK_DESCRIPTOR_BINDING_UPDATE_UNUSED_WHILE_PENDING_BIT - enables updating members of an active set that aren’t currently in use by any shaders

With these, it becomes possible to implement bindless textures using the existing Gallium convention:

  • create u_idalloc instance to track and generate integer handle IDs
  • map these handle IDs to slots in a large-ish (1024) sized descriptor array
  • dynamically update the slots in the set as textures are made resident/not-resident
  • return handle IDs to the u_idalloc pool once they are destroyed and the image is no longer in use

This creates a cycle where a handle ID is allocated, an image is bound to that slot in the descriptor array, the image can be unbound, the handle ID is deleted, and then finally the ID is recycled, all while only binding and updating a single descriptor set as draws continue.
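
To make the descriptor side of that cycle concrete, here's a minimal, hypothetical sketch of creating a 1024-slot layout with those three binding flags, written against the core Vulkan 1.2 names; zink's actual descriptor manager is considerably more involved.

#include <vulkan/vulkan.h>

/* One binding containing a large array of combined image samplers that can be
 * updated while the set is bound, left partially bound, or updated while other
 * (unused) members are pending on the GPU. */
static VkDescriptorSetLayout create_bindless_layout(VkDevice dev)
{
    const VkDescriptorBindingFlags binding_flags =
        VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT |
        VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT |
        VK_DESCRIPTOR_BINDING_UPDATE_UNUSED_WHILE_PENDING_BIT;

    const VkDescriptorSetLayoutBindingFlagsCreateInfo flags_info = {
        .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_BINDING_FLAGS_CREATE_INFO,
        .bindingCount = 1,
        .pBindingFlags = &binding_flags,
    };

    const VkDescriptorSetLayoutBinding binding = {
        .binding = 0,
        .descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER,
        .descriptorCount = 1024, /* the "large-ish" slot array */
        .stageFlags = VK_SHADER_STAGE_ALL,
    };

    const VkDescriptorSetLayoutCreateInfo info = {
        .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO,
        .pNext = &flags_info,
        .flags = VK_DESCRIPTOR_SET_LAYOUT_CREATE_UPDATE_AFTER_BIND_POOL_BIT,
        .bindingCount = 1,
        .pBindings = &binding,
    };

    VkDescriptorSetLayout layout = VK_NULL_HANDLE;
    vkCreateDescriptorSetLayout(dev, &info, NULL, &layout);
    return layout;
}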

Shader Access

Now that the images are accessible to the GPU in the bindless descriptor array, shaders will have to be updated to read them.

In NIR, bindless instructions come in two variants:

  • nir_intrinsic_bindless_image_*
  • nir_instr_type_tex with nir_tex_src_texture_handle

These have their own unique semantics that I didn’t bother to look into; I only need to do completely normal array derefs, so what I actually needed was just to rewrite them back into normal-er instructions.

For the image intrinsics, that ended up being the following snippet:

nir_intrinsic_instr *instr = nir_instr_as_intrinsic(in);

nir_intrinsic_op op;
#define OP_SWAP(OP) \
case nir_intrinsic_bindless_image_##OP: \
   op = nir_intrinsic_image_deref_##OP; \
   break;


/* convert bindless intrinsics to deref intrinsics */
switch (instr->intrinsic) {
OP_SWAP(atomic_add)
OP_SWAP(atomic_and)
OP_SWAP(atomic_comp_swap)
OP_SWAP(atomic_dec_wrap)
OP_SWAP(atomic_exchange)
OP_SWAP(atomic_fadd)
OP_SWAP(atomic_fmax)
OP_SWAP(atomic_fmin)
OP_SWAP(atomic_imax)
OP_SWAP(atomic_imin)
OP_SWAP(atomic_inc_wrap)
OP_SWAP(atomic_or)
OP_SWAP(atomic_umax)
OP_SWAP(atomic_umin)
OP_SWAP(atomic_xor)
OP_SWAP(format)
OP_SWAP(load)
OP_SWAP(order)
OP_SWAP(samples)
OP_SWAP(size)
OP_SWAP(store)
default:
   return false;
}

enum glsl_sampler_dim dim = nir_intrinsic_image_dim(instr);
nir_variable *var = dim == GLSL_SAMPLER_DIM_BUF ? bindless_buffer_array : bindless_image_array;
if (!var)
   var = create_bindless_image(b->shader, dim);
instr->intrinsic = op;
b->cursor = nir_before_instr(in);
nir_deref_instr *deref = nir_build_deref_var(b, var);
if (glsl_type_is_array(var->type))
   deref = nir_build_deref_array(b, deref, nir_u2uN(b, instr->src[0].ssa, 32));
nir_instr_rewrite_src_ssa(in, &instr->src[0], &deref->dest.ssa);

In short, swap the intrinsic back to a regular image one, then rewrite the image src as a deref of a bindless image variable (which is just image[1024]). In long…it’s the same thing. It’s actually that simple.

The tex instruction is where things get trickier.

nir_variable *var = tex->sampler_dim == GLSL_SAMPLER_DIM_BUF ? bindless_buffer_array : bindless_texture_array;
if (!var)
   var = create_bindless_texture(b->shader, tex);
b->cursor = nir_before_instr(in);
nir_deref_instr *deref = nir_build_deref_var(b, var);
if (glsl_type_is_array(var->type))
   deref = nir_build_deref_array(b, deref, nir_u2uN(b, tex->src[idx].src.ssa, 32));
nir_instr_rewrite_src_ssa(in, &tex->src[idx].src, &deref->dest.ssa);

This part is the same as the image rewrite: just rewriting the instruction as a deref.

This part, however, is different:

unsigned needed_components = glsl_get_sampler_coordinate_components(glsl_without_array(var->type));
unsigned c = nir_tex_instr_src_index(tex, nir_tex_src_coord);
unsigned coord_components = nir_src_num_components(tex->src[c].src);
if (coord_components < needed_components) {
   nir_ssa_def *def = nir_pad_vector(b, tex->src[c].src.ssa, needed_components);
   nir_instr_rewrite_src_ssa(in, &tex->src[c].src, def);
   tex->coord_components = needed_components;
}

The thing about bindless textures is that by the time zink sees them, they have no dimensionality. They’re just textures in an array, regardless of whether they’re 1D, 2D, 3D, or arrayed. This means the variables used for derefs might not have the right number of coordinate components, or the instructions using them might not have the right number. To fix this, an extra cleanup is needed here to match up the number of components with the variable being used.

With all of that in place, basic bindless operations are working.

But wait…

Shader I/O

This was the tricky part. According to the spec, it now becomes legal to have an image or a sampler as an input or an output in a shader.

But is it really, truly necessary to pass images between the shaders?

No. No it isn’t.

nir_deref_instr *src_deref = nir_src_as_deref(instr->src[0]);
nir_variable *var = nir_deref_instr_get_variable(src_deref);
if (var->data.bindless)
   return false;
if (var->data.mode != nir_var_shader_in && var->data.mode != nir_var_shader_out)
   return false;
if (!glsl_type_is_image(var->type) && !glsl_type_is_sampler(var->type))
   return false;

var->type = glsl_int64_t_type();
var->data.bindless = 1;
b->cursor = nir_before_instr(in);
nir_deref_instr *deref = nir_build_deref_var(b, var);
if (instr->intrinsic == nir_intrinsic_load_deref) {
    nir_ssa_def *def = nir_load_deref(b, deref);
    nir_instr_rewrite_src_ssa(in, &instr->src[0], def);
    nir_ssa_def_rewrite_uses(&instr->dest.ssa, def);
} else {
   nir_store_deref(b, deref, instr->src[1].ssa, nir_intrinsic_write_mask(instr));
}
nir_instr_remove(in);
nir_instr_remove(&src_deref->instr);

Bindless shader i/o is really just passing array indices that masquerade as images. If they’re rewritten back to integer types, that all goes away, and they become regular i/o that needs no additional handling.

Just This Once

The translation to Vulkan made everything incredibly easy. I didn’t need any special hacks or corner case behavior, and I didn’t have to spend time reading code from other drivers to figure out what the hell I was doing wrong. Validation even works for it!

Truly miraculous.

October 01, 2021

Wim Taymans

Wim Taymans laying out the vision for the future of Linux multimedia


PipeWire has already made great strides forward in terms of improving the audio handling situation on Linux, but one of the original goals was to also bring along the video side of the house. In fact in the first few releases of Fedora Workstation where we shipped PipeWire we solely enabled it as a tool to handle screen sharing for Wayland and Flatpaks. So with PipeWire having stabilized a lot for audio, we now feel the time has come to go back to the video side of PipeWire and work to improve the state of the art for video capture handling under Linux. Wim Taymans did a presentation to our team inside Red Hat on the 30th of September talking about the current state of the world and where we need to go to move forward. I thought the information and ideas in his presentation deserved wider distribution so this blog post is building on that presentation to share it more widely and also hopefully rally the community to support us in this endeavour.

The current state of video capture handling on Linux (usually webcams) is basically the v4l2 kernel API. It has served us well for a lot of years, but we believe that just like you don't write audio applications directly to the ALSA API anymore, you shouldn't write video applications directly to the v4l2 kernel API anymore either. With PipeWire we can offer a lot more flexibility, security and power for video handling, just like it does for audio. The v4l2 API is an open/ioctl/mmap/read/write/close based API, meant for a single application to access at a time. There is a library called libv4l2, but nobody uses it because it causes more problems than it solves (no mmap, slow conversions, quirks). But there is no need to rely on the kernel API anymore as there are GStreamer and PipeWire plugins for v4l2 allowing you to access it using the GStreamer or PipeWire API instead. So our goal is not to replace v4l2, just as it is not our goal to replace ALSA; v4l2 and ALSA are still the kernel driver layer for video and audio.
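
To illustrate what writing directly to the v4l2 kernel API looks like in practice, here is a minimal, illustrative sketch of that open/ioctl style of access (the /dev/video0 path is just the typical example, and format negotiation, buffer queueing and error handling are all omitted; this is exactly the kind of code we would like applications to stop writing themselves):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/videodev2.h>

int main(void)
{
   /* Direct, device-node based access: one application at a time,
    * talking straight to the kernel driver. */
   int fd = open("/dev/video0", O_RDWR);
   if (fd < 0) {
      perror("open /dev/video0");
      return 1;
   }

   struct v4l2_capability cap;
   if (ioctl(fd, VIDIOC_QUERYCAP, &cap) == 0)
      printf("card: %s, driver: %s\n", cap.card, cap.driver);

   /* ... VIDIOC_S_FMT, VIDIOC_REQBUFS, mmap(), VIDIOC_STREAMON, capture loop ... */

   close(fd);
   return 0;
}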

It is also worth considering that new cameras are getting more and more complicated and thus configuring them is getting more complicated. Driving this change is a new set of cameras on the way often called MIPI cameras, as they adhere to the API standards set by the MIPI Alliance. Partly driven by this, v4l2 is in active development with a Codec API addition, stateful/stateless, DMABUF, request API and also adding a Media Controller (MC) Graph with nodes, ports, links of processing blocks. This means that the threshold for an application developer to use these APIs directly is getting very high, in addition to the aforementioned issues of single application access, the security issues of direct kernel access and so on.

libcamera logo


Libcamera is meant to be the userland library for v4l2.


Of course we are not the only ones seeing the growing complexity of cameras as a challenge for developers, and thus libcamera has been developed to make interacting with these cameras easier. Libcamera provides a unified API for setup and capture for cameras: it hides the complexity of modern camera devices and is supported on ChromeOS, Android and Linux.
One way to describe libcamera is as the MESA of cameras. Libcamera provides hooks to run (out-of-process) vendor extensions like for image processing or enhancement. Using libcamera is considered pretty much a requirement for embedded systems these days, and newer Intel chips will also have IPUs configurable with media controllers.

Libcamera is still under heavy development upstream and does not yet have a stable ABI, but they did add a .so version very recently which will make packaging in Fedora and elsewhere a lot simpler. In fact we have builds in Fedora ready now. Libcamera also ships with a set of GStreamer plugins which means you should be able to get for instance Cheese working through libcamera in theory (although as we will go into, we think this is the wrong approach).

Before I go further an important thing to be aware of here is that unlike on ALSA, where PipeWire can provide a virtual ALSA device to provide backwards compatibility with older applications using the ALSA API directly, there is no such option possible for v4l2. So application developers will need to port to something new here, be that libcamera or PipeWire. So what do we feel is the right way forward?

Ideal Linux Multimedia Stack

How we envision the Linux multimedia stack going forward


Above you see an illustration of how we believe the stack should look going forward. If you made this drawing of what the current state is, then thanks to our backwards compatibility with ALSA, PulseAudio and Jack, all the applications would be pointing at PipeWire for their audio handling like they are in the illustration you see above, but all the video handling from most applications would be pointing directly at v4l2 in this diagram. At the same time we don't want applications to port to libcamera either, as it doesn't offer a lot of the flexibility that using PipeWire will; instead what we propose is that all applications target PipeWire in combination with the video camera portal API. Be aware that the video portal is not an alternative to or an abstraction of the PipeWire API, it is just a way to set up the connection to PipeWire that has the added bonus of working if your application is shipped as a Flatpak or another type of desktop container. PipeWire would then be in charge of talking to libcamera or v4l2 for video, just like PipeWire is in charge of talking with ALSA on the audio side. Having PipeWire be the central hub means we get a lot of the same advantages for video that we get for audio. For instance, as the application developer you interact with PipeWire regardless of whether what you want is a screen capture, a camera feed or a video being played back. Multiple applications can share the same camera, and at the same time there is security in place to avoid the camera being used without your knowledge to spy on you. And we can also have patchbay applications that support video pipelines and not just audio, like Carla provides for Jack applications. To be clear this feature will not come for ‘free’ from Jack patchbays since Jack only does audio, but hopefully new PipeWire patchbays like Helvum can add video support.

So what about GStreamer, you might ask. Well, GStreamer is a great way to write multimedia applications and we strongly recommend it, but we do not recommend that your GStreamer application use the v4l2 or libcamera plugins; instead we recommend that you use the PipeWire plugins. This is of course a little different from the audio side, where PipeWire supports the PulseAudio and Jack APIs and thus you don't need to port, but by targeting the PipeWire plugins in GStreamer your GStreamer application will get the full PipeWire featureset.
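
As a rough illustration of how small that change is in a GStreamer application, here is a minimal, hypothetical snippet that captures from a camera via pipewiresrc instead of v4l2src (it assumes the PipeWire GStreamer plugins are installed, and the rest of the pipeline is just an example):

#include <gst/gst.h>

int main(int argc, char *argv[])
{
   gst_init(&argc, &argv);

   /* The only difference from a classic v4l2 pipeline is the source
    * element: pipewiresrc instead of v4l2src. */
   GError *error = NULL;
   GstElement *pipeline = gst_parse_launch(
      "pipewiresrc ! videoconvert ! autovideosink", &error);
   if (!pipeline) {
      g_printerr("failed to build pipeline: %s\n", error->message);
      g_clear_error(&error);
      return 1;
   }

   gst_element_set_state(pipeline, GST_STATE_PLAYING);

   /* Run until an error or end-of-stream comes in. */
   GstBus *bus = gst_element_get_bus(pipeline);
   GstMessage *msg = gst_bus_timed_pop_filtered(bus, GST_CLOCK_TIME_NONE,
      GST_MESSAGE_ERROR | GST_MESSAGE_EOS);

   if (msg)
      gst_message_unref(msg);
   gst_object_unref(bus);
   gst_element_set_state(pipeline, GST_STATE_NULL);
   gst_object_unref(pipeline);
   return 0;
}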

So what is our plan of action?
So we will start putting the pieces in place for this step by step in Fedora Workstation. We have already started on this by working on the libcamera support in PipeWire and packaging libcamera for Fedora. We will set it up so that PipeWire has the option to switch between v4l2 and libcamera, so that most users can keep using v4l2 through PipeWire for the time being, while we work with upstream and the community to mature libcamera and its PipeWire backend. We will also enable device discovery for PipeWire.

We are also working on maturing the GStreamer elements for PipeWire for the video capture use case, as we expect a lot of application developers will just be using GStreamer as opposed to targeting PipeWire directly. We will start with Cheese as our initial testbed for this work as it is a fairly simple application, and use it as a proof of concept for PipeWire camera access. We are still trying to decide if we will make Cheese speak directly with PipeWire, or have it talk to PipeWire through the pipewiresrc GStreamer plugin, but both have their pros and cons in the context of testing and verifying this.

We will also start working with the Chromium and Firefox projects to have them use the Camera portal and PipeWire for camera support just like we did work with them through WebRTC for the screen sharing support using PipeWire.

There are a few major items we are still trying to decide upon in terms of the interaction between PipeWire and the Camera portal API. It would be tempting to see if we can hide the Camera portal API behind the PipeWire API, or failing that at least hide it for people using the GStreamer plugin. That way all applications get the portal support for free when porting to GStreamer instead of having to use the Camera portal API as a second step. On the other hand you need to set up the screen sharing portal yourself, so it would probably make things more consistent if we left it to application developers to do for camera access too.

What do we want from the community here?
The first step is just to help us with testing as we roll this out in Fedora Workstation and Cheese. While libcamera was written with MIPI cameras as the motivation, all webcams are meant to work through it, and thus all webcams are meant to work with PipeWire using the libcamera backend. At the moment that is not the case, and thus community testing and feedback is critical for helping us and the libcamera community to mature libcamera. We hope that by allowing you to easily configure PipeWire to use the libcamera backend (and switch back after you are done testing) we can get a lot of you to test and let us know which cameras are not working well yet.

A little further down the road please start planning to move any application you maintain or contribute to away from the v4l2 API and towards PipeWire. If your application is a GStreamer application the transition should be fairly simple, going from the v4l2 plugins to the PipeWire plugins, but beyond that you should familiarize yourself with the Camera portal API and the PipeWire API for accessing cameras.

For further news and information on PipeWire follow our @PipeWireP twitter account and for general news and information about what we are doing in Fedora Workstation make sure to follow me on twitter @cfkschaller.

September 27, 2021
For the hw-enablement for Bay- and Cherry-Trail devices which I do as a side project, sometimes it is useful to play with the Android which comes pre-installed on some of these devices.

Sometimes the Android-X86 boot-loader (kernelflinger) is locked and the standard "Developer-Options" -> "Enable OEM Unlock" -> "Run 'fastboot oem unlock'" sequence does not work (e.g. I got the unlock yes/no dialog, and could move between yes and no, but I could not actually confirm the choice).

Luckily there is an alternative: kernelflinger checks an "OEMLock" EFI variable to see if the device is locked or not. Like with some of my previous adventures changing hidden BIOS settings, this EFI variable is hidden from the OS as soon as the OS calls ExitBootServices, but we can use the same modified grub to change this EFI variable. After booting from a USB stick with the relevant grub binary installed as "EFI/BOOT/BOOTX64.EFI" or "BOOTIA32.EFI", entering the
following command on the grub cmdline will unlock the bootloader:

setup_var_cv OEMLock 0 1 1

Disabling dm-verity support is pretty easy on these devices because they can just boot a regular Linux distro from a USB drive. Note that booting a regular Linux distro may cause the Android "system" partition to get auto-mounted, after which dm-verity checks will fail! Once we have a regular Linux distro running, step 1 is to find out which partition is the android_boot partition. To do this, run as root:

blkid /dev/mmcblk?p#

Replace the ? with the mmcblk number of the internal eMMC and try 1 to n for #, until one of the partitions is reported as having 'PARTLABEL="android_boot"'. Usually "mmcblk?p3" is the one you want, so you could try that first.

Now make an image of the partition by running e.g.:

dd if=/dev/mmcblk1p3" of=android_boot.img

And then copy the "android_boot.img" file to another computer. On this computer extract the file and then the initrd like this:

abootimg -x android_boot.img
mkdir initrd
cd initrd
zcat ../initrd.img | cpio -i


Now edit the fstab file and remove "verify" from the line for the system partition. After this, update android_boot.img like this:

find . | cpio -o -H newc -R 0.0 | gzip -9 > ../initrd.img
cd ..
abootimg -u android_boot.img -r initrd.img


The easiest way to test the new image is using fastboot, boot the tablet into Android and connect it to the PC, then run:

adb reboot bootloader
fastboot boot android_boot.img


And then from an "adb shell" do "cat /fstab" verify that the "verify" option is gone now. After this you can (optionally) dd the new android_boot.img back to the android_boot partition to make the change permanent.

Note if Android is not booting you can force the bootloader to enter fastboot mode on the next boot by downloading this file and then under regular Linux running the following command as root:

cat LoaderEntryOneShot > /sys/firmware/efi/efivars/LoaderEntryOneShot-4a67b082-0a4c-41cf-b6c7-440b29bb8c4f
September 24, 2021

Fedora Workstation
So I have spoken about our vision for Fedora Workstation quite a few times before, but I feel it is often useful to get back to it as we progress with our overall effort. So if you read some of my blog posts about Fedora Workstation over the last 5 years, be aware that there is probably little new in here for you. If you haven't read them, however, this is hopefully a useful primer on what we are trying to achieve with Fedora Workstation.

The first few years after we launched Fedora Workstation in 2014 we focused a lot on establishing a good culture around what we were doing with Fedora, making sure that it was a good day-to-day desktop driver for people, and not just a great place to develop the operating system itself. I think it was Fedora Project Lead Matthew Miller who phrased it very well when he said that we want to be Leading Edge, not Bleeding Edge. We also took a good look at the operating system from an overall stance and tried to map out where Linux tended to fall short as a desktop operating system and also tried to ask ourselves what our core audience would and should be. We refocused our efforts on being a great Operating System for all kinds of developers, but I think it is fair to say that we decided that was too narrow a wording, as our efforts are truly meant to reach makers of all kinds, like graphics artists and musicians, in addition to coders. So I thought I would go through our key pillar efforts and talk about where they are at and where they are going.

Flatpak

Flatpak logo
One of the first things we concluded was that our story for people who wanted to deploy applications to our platform was really bad. The main challenge was that the platform was moving very fast and it was a big overhead for application developers to keep on top of the changes. In addition to that, since the Linux desktop is so fragmented, the application developers would have to deal with the fact that there were 20 different variants of this platform, all moving at a different pace. The way Linux applications were packaged, with each dependency being packaged independently of the application, created pain on both sides: for the application developer it meant the world kept moving underneath them with limited control, and for the distributions it meant packaging pains as different applications that all depended on the same library might work or fail with different versions of a given library. So we concluded we needed a system which allowed us to decouple the application from the host OS, to let application developers update their platform at a pace of their own choosing and at the same time unify the platform in the sense that the application should be able to run without problems on the latest Fedora releases, the latest RHEL releases or the latest versions of any other distribution out there. As we looked at it we realized there were some security downsides compared to the existing model, since the OS vendor would not be in charge of keeping all libraries up to date and secure, so sandboxing the applications ended up being a critical requirement. At the time Alexander Larsson was working on bringing Docker to RHEL and Fedora so we tasked him with designing the new application model. The initial idea was to see if we could adjust Docker containers to the desktop usecase, but Docker containers as they stood at that time were very unsuited for the purpose of hosting desktop applications, and our experience working with the Docker upstream at the time was that they were not very welcoming to our contributions. So in light of how major the changes we would need to implement were, and the unlikelihood of getting them accepted upstream, Alex started on what would become Flatpak. Another major technology that was coincidentally being developed at the same time was OSTree by Colin Walters. To this day I think the best description of OSTree is that it functions as a git for binaries, meaning it allows you a simple way to maintain and update your binary applications with minimally sized updates. It also provides some disk deduplication which we felt was important due to the duplication of libraries and so on that containers bring with them. Finally, another major design decision Alex made was that the runtime/baseimage should be hosted outside the container, making it possible to update the runtime independently of the application with relevant security updates etc.

Today there is a thriving community around Flatpaks, with the center of activity being flathub, the Flatpak application repository. In Fedora Workstation 35 you should start seeing Flatpak from Flathub being offered as long as you have 3rd party repositories enabled. Also underway is Owen Taylor leading our efforts of integrating Flatpak building into the internal tools we use at Red Hat for putting RHEL together, with the goal of switching over to Flatpaks as our primary application delivery method for desktop applications in RHEL and to help us bridge the Fedora and RHEL application ecosystem.

You can follow the latest news from Flatpak through the official Flatpak twitter account.

Silverblue

So another major issue we decided needed improvement was that of OS upgrades (as opposed to application updates). The model pursued by Linux distros since their inception is one of shipping their OS as a large collection of independently packaged libraries. This setup is inherently fragile and requires a lot of quality engineering and testing to avoid problems, but even then things sometimes fail, especially in a fast moving OS like Fedora. A lot of configuration changes and updates have traditionally been done through scripts and similar, making rollback to an older version in cases where there is a problem also very challenging. Adventurous developers could also have made changes to their own copy of the OS that would break the upgrade later on. So thanks to all the great efforts to test and verify upgrades they usually go well for most users, but we wanted something even more sturdy. So the idea came up to move to an image based OS model, similar to what people had gotten used to on their phones. And OSTree once again became the technology we chose to do this, especially considering it was being used in Red Hat's first foray into image based operating systems for servers (the server effort later got rolled into CoreOS as part of Red Hat acquiring CoreOS). The idea is that you ship the core operating system as a singular image and then to upgrade you just replace that image with a new image, and thus the risks of problems are greatly reduced. On top of that each of those images can be tested and verified as a whole by your QE and test teams. Of course we realized that a subset of people would still want to be able to tweak their OS, but once again OSTree came to our rescue as it allows developers to layer further RPMS on top of the OS image, including replacing current system libraries with for instance newer ones. The great thing about OSTree layering is that once you are done testing/using the layered RPMS you can with a very simple command just drop them again and go back to the upstream image. So combined with applications being shipped as Flatpaks this would create an OS that is a lot more sturdy, secure and simple to update and with a lot lower chance of an OS update breaking any of your applications. On top of that OSTree allows us to do easy OS rollbacks, so if the latest update somehow doesn't work for you, you can quickly roll back while waiting for the issue you are having to be fixed upstream. And hence Fedora Silverblue was born as the vehicle for us to develop and evolve an image based desktop operating system.

You can follow our efforts around Silverblue through the offical Silverblue twitter account.

Toolbx

Toolbox with RHEL

Toolbox pet container with RHEL UBI


So Flatpak helped us address a lot of the gaps for making a better desktop OS on the application side and Silverblue was the vehicle for our vision on the OS side, but we realized that we also needed some way for all kinds of developers to be able to easily take advantage of the great resource that is the Fedora RPM package universe and the wider tools universe out there. We needed something that provided people with a great terminal experience. We had already been working on various smaller improvements to the terminal for a while, but we realized we needed something a lot more substantial. Accessing an immutable OS like Silverblue through a terminal window tends to be quite limiting. So that is usually not what you want to do, and you also don't want to rely on the OSTree layering for running all your development tools and so on as that is going to be potentially painful when you upgrade your OS.
Luckily the container revolution happening in the Linux world pointed us to the solution here too, as while containers were rolled out the concept of ‘pet containers’ was also born. The idea of a pet container is that unlike general containers (sometimes referred to as cattle containers) pet containers are containers that you care about on an individual level, like your personal development environment. In fact pet containers even improve on how we used to do things as they allow you to very easily maintain different environments for different projects. So for instance if you have two projects, hosted in two separate pet containers, where the two projects depend on two different versions of python, then containers make that simple as they ensure that there is no risk of one of your projects ‘contaminating’ the others with its dependencies, yet at the same time allow you to grab RPMS or other kinds of packages from upstream resources and install them in your container. In fact while inside your pet container the world feels a lot like it always has when on the Linux command line. Thanks to the great effort of Dan Walsh and his team we had a growing number of easy to use container tools available to us, like podman. Podman is developed with the primary usecase being running and deploying your containers at scale, managed by OpenShift and Kubernetes. But it also gave us the foundation we needed for Debarshi Ray to kick off the Toolbx project to ensure that we had an easy to use tool for creating and managing pet containers. As a bonus Toolbx allows us to achieve another important goal, to allow Fedora Workstation users to develop applications against RHEL in a simple and straightforward manner, because Toolbx allows you to create RHEL containers just as easily as it allows you to create Fedora containers.

You can follow our efforts around Toolbox on the official Toolbox twitter account

Wayland

Ok, so between Flatpak, Silverblue and Toolbox we have the vision clear for how to create a robust OS, with a great story for application developers to maintain and deliver applications for it, and Toolbox providing a great developer story on top of this OS. But we also looked at the technical state of the Linux desktop and realized that there were some serious deficits we needed to address. One of the first ones we saw was the state of graphics, where X.org had served us well for many decades, but its age was showing and adding new features as they came in was becoming more and more painful. Kristian Høgsberg had started work on an alternative to X while still at Red Hat called Wayland, an effort he and a team of engineers were pushing forward at Intel. There was a general agreement in the wider community that Wayland was the way forward, but apart from Intel there was little serious development effort being put into moving it forward. On top of that, Canonical at the time had decided to go off on their own and develop their own alternative architecture in competition with X.org and Wayland. So as we were seeing a lot of things on the horizon in the graphics space, like HiDPI, and we were also getting requests to come up with a way to make Linux desktops more secure, we decided to team up with Intel and get Wayland into a truly usable state on the desktop. So we put many of our top developers, like Olivier Fourdan, Adam Jackson and Jonas Ådahl, on working on maturing Wayland as quickly as possible.
As things would have it we also ended up getting a lot of collaboration and development help coming in from the embedded sector, where companies such as Collabora were helping to deploy systems with Wayland onto various kinds of embedded devices and contributing fixes and improvements back up to Wayland (and Weston). To be honest I have to admit we did not fully appreciate what a herculean task it would end up being getting Wayland production ready for the desktop, and it took us quite a few Fedora releases before we decided it was ready to go. As you might imagine dealing with 30 years of technical debt is no easy thing to pay down, and while we kept moving forward at a steady pace there always seemed to be a new batch of issues to be resolved, but we managed to do so, not just by maturing Wayland, but also by porting major applications, with Martin Stransky porting Firefox and Caolan McNamara porting LibreOffice over to Wayland. At the end of the day I think what saw us through to success was the incredible collaboration happening upstream between a large host of individual contributors and companies, and having the support of the X.org community. And even when we had the whole thing put together there were still practical issues to overcome, like how we had to keep defaulting to X.org in Fedora when people installed the binary NVidia driver, because that driver did not work with XWayland, the X backwards compatibility layer in Wayland. Luckily that is now in the process of becoming a thing of the past, with the latest NVidia driver updates supporting XWayland and us working closely with NVidia to ensure driver and windowing stack work well together.

PipeWire

Pipewire in action

Example of PipeWire running


So now we had a clear vision for the OS and a much improved and much more secure graphics stack in the form of Wayland, but we realized that all the new security features brought in by Flatpak and Wayland also made certain things like desktop capturing/remoting and web camera access a lot harder. Security is great and critical, but just like the old joke about the most secure computer being the one that is turned off, we realized that we needed to make sure these things kept working, but in a secure and better manner. Thankfully we have GStreamer co-creator Wim Taymans on the team and he thought he could come up with a pulseaudio equivalent for video that would allow us to offer screen capture and webcam access in a convenient and secure manner.
As Wim was prototyping what we called PulseVideo at the time we also started discussing the state of audio on Linux. Wim had contributed to PulseAudio to add a security layer to it, to make it harder, for instance, for a rogue application to eavesdrop on you using your microphone, but since it was not part of the original design it wasn't a great solution. At the same time we talked about how our vision for Fedora Workstation was to make it the natural home for all kinds of makers, which included musicians, but how the separateness of the pro-audio community was getting in the way of that, especially due to the uneasy co-existence of PulseAudio on the consumer side and Jack for the pro-audio side. As part of his development effort Wim came to the conclusion that he could make the core logic of his new project so fast and versatile that it should be able to deal with the low latency requirements of the pro-audio community and also serve its purpose well on the consumer audio and video side. Having audio and video in one shared system would also be an improvement for us in terms of dealing with combined audio and video sources, as guaranteeing audio/video sync for instance had often been a challenge in the past. So Wim's effort evolved into what we today call PipeWire, which I am going to be brave enough to say has been one of the most successful launches of a major new Linux system component we have ever done. Replacing two old sound servers while at the same time adding video support is no small feat, but Wim is working very hard on fixing bugs as quickly as they come in and ensuring users have a great experience with PipeWire. And at the same time we are very happy that PipeWire now provides us with the ability to offer musicians and sound engineers a new home in Fedora Workstation.

You can follow our efforts on PipeWire on the PipeWire twitter account.

Hardware support and firmware

In parallel with everything mentioned above we were looking at the hardware landscape surrounding desktop Linux. One of the first things we realized was horribly broken was firmware support under Linux. More and more of the hardware smarts was being found in the firmware, yet firmware access under Linux and the firmware update story was basically non-existent. As we were discussing this problem internally, Peter Jones, who is our representative on the UEFI standards committee, pointed out that we were probably better poised to actually do something about this problem than ever, since UEFI was causing the firmware update process on most laptops and workstations to become standardized. So we teamed Peter up with Richard Hughes and out of that collaboration fwupd and LVFS were born. And in the years since we launched that we have gone from having next to no firmware available on Linux (and the little we had only available through painful processes like burning bootable CDs etc.) to now having a lot of hardware getting firmware update support and more getting added almost on a weekly basis.
For the latest and greatest news around LVFS the best source of information is Richard Hughes twitter account.

In parallel to this Adam Jackson worked on glvnd, which provided us with a way to have multiple OpenGL implementations on the same system. For those who have been using Linux for a while, I am sure you remember the pain of the NVidia driver and Mesa fighting over who provided OpenGL on your system as it was all tied to a specific .so name. There were a lot of hacks being used out there to deal with that situation, of varying degrees of fragility, but with the advent of glvnd nobody has to care about that problem anymore.

We also decided that we needed to have a part of the team dedicated to looking at what was happening in the market and work on covering important gaps. And with gaps I mean fixing the things that keep the hardware vendors from being able to properly support Linux, not writing drivers for them. Instead we have been working closely with Dell and Lenovo to ensure that their suppliers provide drivers for their hardware, and when needed we work to provide a framework for them to plug their hardware into. This has led to a series of small but important improvements, like getting the fingerprint reader stack on Linux to a state where hardware vendors can actually support it, bringing Thunderbolt support to Linux through Bolt, support for high definition and gaming mice through the libratbag project, support in the Linux kernel for the new laptop privacy screen feature, improved power management support through the power profiles daemon and now recently hiring a dedicated engineer to get HDR support fully in place in Linux.

Summary

So to summarize. We are of course not over the finish line with our vision yet. Silverblue is a fantastic project, but we are not yet ready to declare it the official version of Fedora Workstation, mostly because we want to give the community more time to embrace the Flatpak application model and for developers to embrace the pet container model. Especially applications like IDEs that cross the boundary between being in their own Flatpak sandbox and interacting with things in your pet container and calling out to system tools like gdb need more work, but Christian Hergert has already done great work solving the problem in GNOME Builder while Owen Taylor has put together support for using Visual Studio Code with pet containers. So hopefully the wider universe of IDEs will follow suit; in the meantime one would need to call them from the command line from inside the pet container.

The good thing here is that Flatpaks and Toolbox also work great on traditional Fedora Workstation: you can get the full benefit of both technologies even on a traditional distribution, so we can allow for a soft and easy transition.

So for anyone who made it this far, apologies for this becoming a little novel; that was not my intention when I started writing it :)

Feel free to follow my personal twitter account for more general news and updates on what we are doing around Fedora Workstation.
Christian F.K. Schaller photo

Last week we had our most loved annual conference: X.Org Developers Conference 2021. As a reminder, due to the COVID-19 situation in Europe (and its respective restrictions on travel and events), we kept it virtual again this year… which is a pity as the planned venue was Gdańsk, a very beautiful city (see picture below if you don't believe me!) in Poland. Let's see if we can finally have an XDC there!

XDC 2021

This year we had a very strong program. There were talks covering all aspects of the open-source graphics stack: from the kernel (including an Outreachy talk about VKMS) and Mesa drivers of all kinds, inputs, libraries, X.org security and Wayland robustness… we had talks about testing drivers, debugging them, our infra at freedesktop.org, and even Vulkan specs (such as Vulkan Video and VK_EXT_multi_draw) and their support in the open-source graphics stack. Definitely a very complete program that is very interesting to all open-source developers working in this area. You can watch all the talks here or here and the slides were already uploaded in the program.

On behalf of the Call For Papers Committee, I would like to thank all speakers for their talks… this conference won’t make sense without you!

Big shout-out to the XDC 2021 organizers (Intel) represented by Radosław Szwichtenberg, Ryszard Knop and Maciej Ramotowski. They did an awesome job on having a very smooth conference. I can tell you that they promptly fixed any issue that happened, all of that behind the scenes so that the attendees didn't even notice anything most of the time! That is what good conference organizers do!

XDC 2021 Organizers Can I invite you to a drink at least? You really deserve it!

If you want to know more details about what this virtual conference entailed, just watch Ryszard’s talk at XDC (info, video) or you can reuse their materials for future conferences. That’s very useful info for future conference organizers!

Talking about our streaming platforms, the big novelty this year was the use of media.ccc.de as a privacy-friendly alternative to our traditional Youtube setup (last year we got feedback about this). Media.ccc.de is an open-source platform that respects your privacy and we hope it worked fine for all attendees. Our stats indicate that ~50% of our audience connected to it during the three days of the conference. That’s awesome!

Last but not least, we couldn't make this conference without our sponsors. We are very lucky to have on board Intel as our Platinum sponsor and organizer, our Gold sponsors (Google, NVIDIA, ARM, Microsoft and AMD), our Silver sponsors (Igalia, Collabora, The Linux Foundation), our Bronze sponsors (Gitlab and Khronos Group) and our Supporters (C3VOC). Big thank you from the X.Org community!

XDC 2021 Sponsors

Feedback

We would like to hear from you and learn about what worked and what needs to be improved for future editions of XDC! Share your experience with us!

We have sent an email asking for feedback to different mailing lists (for example this). Don’t hesitate to send an email to X.Org Foundation board with all your feedback!

XDC 2022 announced!

X.Org Developers Conference 2022 has been announced! Jeremy White, from Codeweavers, gave a lightning talk presenting next year's edition! Next year the XDC will not be alone… WineConf 2022 is going to be organized by Codeweavers as well and co-located with XDC!

Save the dates! October 4-5-6, 2022 in Minneapolis, Minnesota, USA.

XDC 2022: Minneapolis, Minnesota, USA Image from Wikipedia. License CC BY-SA 4.0.

XDC 2023 hosting proposals

Have you enjoyed XDC 2021? Do you think you can do it better? ;-) We are looking for organizers for XDC 2023 (most likely in Europe but we are open to other places).

We know this is a decision that takes time (triggering internal discussion, looking for volunteers, budget, a venue suitable for the event, etc). Therefore, we encourage potential interested parties to start the internal discussions now, so any questions they have can be answered before we open the call for proposals for XDC 2023 at some point next year. Please read what is required to organize this conference and feel free to contact me or the X.Org Foundation board for more info if needed.

Final acknowledgment

I would like to thank Igalia for all the support I got when I decided to run for re-election this year in the X.Org Foundation board and to allow me to participate in XDC organization during my work hours. It’s amazing that our Free Software and collaboration values are still present after 20 years rocking in the free world!

Igalia 20th anniversary Igalia

September 23, 2021

After a nine year hiatus, a new version of the X Input Protocol is out. Credit for the work goes to Povilas Kanapickas, who also implemented support for XI 2.4 in the various pieces of the stack [0]. So let's have a look.

X has had touch events since XI 2.2 (2012) but those were only really useful for direct touch devices (read: touchscreens). There were accommodations for indirect touch devices like touchpads but they were never used. The synaptics driver set the required bits for a while but it was dropped in 2015 because ... it was complicated to make use of and no-one seemed to actually use it anyway. Meanwhile, the rest of the world moved on and touchpad gestures are now prevalent. They've been standard in MacOS for ages, in Windows for almost ages and - with recent GNOME releases - now feature prominently on the Linux desktop as well. They have been part of libinput and the Wayland protocol for years (and even recently gained a new set of "hold" gestures). Meanwhile, X was left behind in the dust or mud, depending on your local climate.

XI 2.4 fixes this, it adds pinch and swipe gestures to the XI2 protocol and makes those available to supporting clients [2]. Notably here is that the interpretation of gestures is left to the driver [1]. The server takes the gestures and does the required state handling but otherwise has no decision into what constitutes a gesture. This is of course no different to e.g. 2-finger scrolling on a touchpad where the server just receives scroll events and passes them on accordingly.

XI 2.4 gesture events are quite similar to touch events in that they are processed as a sequence of begin/update/end with both types having their own event types. So the events you will receive are e.g. XIGesturePinchBegin or XIGestureSwipeUpdate. As with touch events, a client must select for all three (begin/update/end) on a window. Only one gesture can exist at any time, so if you are a multi-tasking octopus prepare to be disappointed.

Because gestures are tied to an indirect-touch device, the location they apply at is wherever the cursor is currently positioned. In that, they work similar to button presses, and passive grabs apply as expected too. So long-term the window manager will likely want a passive grab on the root window for swipe gestures while applications will implement pinch-to-zoom as you'd expect.

In terms of API there are no surprises. libXi 1.8 is the version to implement the new features and there we have a new XIGestureClassInfo returned by XIQueryDevice and of course the two events: XIGesturePinchEvent and XIGestureSwipeEvent. Grabbing is done via e.g. XIGrabSwipeGestureBegin, so for those of you with XI2 experience this will all look familiar. For those of you without - it's probably no longer worth investing time into becoming an XI2 expert.
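
For illustration, selecting for the new gesture events with libXi 1.8 looks roughly like the sketch below. It follows the same pattern as touch events; the constant names come from the new xorgproto, so double-check them against your headers:

#include <X11/Xlib.h>
#include <X11/extensions/XInput2.h>

/* Select for swipe and pinch gestures on `win`. As with touch events,
 * begin/update/end must all be selected together. */
static void select_gestures(Display *dpy, Window win)
{
   int major = 2, minor = 4;
   XIQueryVersion(dpy, &major, &minor);   /* announce XI 2.4 support */

   unsigned char mask[XIMaskLen(XI_LASTEVENT)] = {0};
   XISetMask(mask, XI_GestureSwipeBegin);
   XISetMask(mask, XI_GestureSwipeUpdate);
   XISetMask(mask, XI_GestureSwipeEnd);
   XISetMask(mask, XI_GesturePinchBegin);
   XISetMask(mask, XI_GesturePinchUpdate);
   XISetMask(mask, XI_GesturePinchEnd);

   XIEventMask evmask = {
      .deviceid = XIAllMasterDevices,
      .mask_len = sizeof(mask),
      .mask = mask,
   };
   XISelectEvents(dpy, win, &evmask, 1);
}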

Overall, it's a nice addition to the protocol and it will help getting the X server slightly closer to Wayland for a widely-used feature. Once GTK, mutter and all the other pieces in the stack are in place, it will just work for any (GTK) application that supports gestures under Wayland already. The same will be true for Qt I expect.

X server 21.1 will be out in a few weeks, xf86-input-libinput 1.2.0 is already out and so are xorgproto 2021.5 and libXi 1.8.

[0] In addition to taking on the Xorg release, so clearly there are no limits here
[1] More specifically: it's done by libinput since neither xf86-input-evdev nor xf86-input-synaptics will ever see gestures being implemented
[2] Hold gestures missed out on the various deadlines

September 22, 2021

Xorg is about to be released.

And it's a release without Xwayland.

And... wait, what?

Let's unwind this a bit, and ideally you should come away with a better understanding of Xorg vs Xwayland, and possibly even Wayland itself.

Heads up: if you are familiar with X, the below is simplified to the point it hurts. Sorry about that, but as an X developer you're probably good at coping with pain.

Let's go back to the 1980s, when fashion was weird and there were still reasons to be optimistic about the future. Because this is a thought exercise, we go back with full hindsight 20/20 vision and, ideally, the winning Lotto numbers in case we have some time for some self-indulgence.

If we were to implement an X server from scratch, we'd come away with a set of components. libxprotocol that handles the actual protocol wire format parsing and provides a C api to access that (quite like libxcb, actually). That one will just be the protocol-to-code conversion layer.

We'd have a libxserver component which handles all the state management required for an X server to actually behave like an X server (nothing in the X protocol requires an X server to display anything). That library has a few entry points for abstract input events (pointer and keyboard, because this is the 80s after all) and a few exit points for rendered output.
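
If we indulge the thought exercise a bit further, the public surface of that imaginary libxserver might look something like this (entirely hypothetical, of course; neither this library nor this API exists):

/* Hypothetical libxserver API - just a sketch of the "entry points for
 * input, exit points for output" idea, nothing more. */

typedef struct xserver xserver;

/* Input entry points: the wrapper feeds abstract events in. */
void xserver_post_pointer_motion(xserver *xs, int x, int y);
void xserver_post_key(xserver *xs, unsigned keycode, int pressed);

/* Output exit point: the wrapper decides what to do with rendered pixels
 * (scream at the hardware, write() them to a buffer, hand them to a
 * Wayland compositor, ...). */
typedef void (*xserver_present_cb)(void *user_data,
                                   const void *pixels,
                                   int width, int height, int stride);

xserver *xserver_create(xserver_present_cb present, void *user_data);
void xserver_run(xserver *xs);   /* handle clients, never returns */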

libxserver uses libxprotocol but that's an implementation detail, we can ignore the protocol for the rest of the post.

Let's create a github organisation and host those two libraries. We now have: http://github.com/x/libxserver and http://github.com/x/libxprotocol [1].

Now, to actually implement a working functional X server, our new project would link against libxserver and hook into this library's API points. For input, you'd use libinput and pass those events through; for output you'd use the modesetting driver that knows how to scream at the hardware until something finally shows up. This is somewhere between outrageously simplified and unacceptably wrong but it'll do for this post.

Your X server has to handle a lot of the hardware-specifics but other than that it's a wrapper around libxserver which does the work of ... well, being an X server.

Our stack looks like this:


+------------------------+
|  xserver  [libxserver] |--------[ X client ]
|                        |
| [libinput] [modesetting]|
+------------------------+
|         kernel         |
+------------------------+
Hooray, we have re-implemented Xorg. Or rather, XFree86 because we're 20 years from all the pent-up frustration that caused the Xorg fork. Let's host this project on http://github.com/x/xorg

Now, let's say instead of physical display devices, we want to render into a framebuffer, and we have no input devices.


+------------------------+
|  xserver  [libxserver] |--------[ X client ]
|                        |
|        [write()]       |
+------------------------+
|       some buffer      |
+------------------------+
This is basically Xvfb or, if you are writing out PostScript, Xprint. Let's host those on github too, we're accumulating quite a set of projects here.

Now, let's say those buffers are allocated elsewhere and we're just rendering to them. And those buffers are passed to us via an IPC protocol, like... Wayland!


+------------------------+
|  xserver  [libxserver] |--------[ X client ]
|                        |
| input events  [render] |
+------------------------+
|                        |
+------------------------+
|   Wayland compositor   |
+------------------------+
And voila, we have Xwayland. If you swap out the protocol you can have Xquartz (X on MacOS) or Xwin (X on Windows) or Xnest/Xephyr (X on X) or Xvnc (X over VNC). The principle is always the same.

Fun fact: the Wayland compositor doesn't need to run on the hardware, you can play display server matryoshka until you run out of turtles.

In our glorious revisioned past all these are distinct projects, re-using libxserver and some external libraries where needed. Depending on the projects things may be very simple or get very complex, it depends on how we render things.

But in the end, we have several independent projects all providing us with an X server process - the specific X bits are done in libxserver though. We can release Xwayland without having to release Xorg or Xvfb.

libxserver won't need a lot of releases, the behaviour is largely specified by the protocol requirements and once you're done implementing it, it'll be quite a slow-moving project.

Ok, now, fast forward to 2021, lose some hindsight, hope, and attitude and - oh, we have exactly the above structure. Except that it's not spread across multiple independent repos on github, it's all sitting in the same git directory: our Xorg, Xwayland, Xvfb, etc. are all sitting in hw/$name, and libxserver is basically the rest of the repo.

A traditional X server release was a tag in that git directory. An XWayland-only release is basically an rm -rf hw/*-but-not-xwayland followed by a tag, an Xorg-only release is basically an rm -rf hw/*-but-not-xfree86 [2].

In theory, we could've moved all these out into separate projects a while ago but the benefits are small and no-one has the time for that anyway.

So there you have it - you can have Xorg-only or XWayland-only releases without the world coming to an end.

Now, for the "Xorg is dead" claims - it's very likely that the current release will be the last Xorg release. [3] There is little interest in an X server that runs on hardware, or rather: there's little interest in the effort required to push out releases. Povilas did a great job in getting this one out but again, it's likely this is the last release. [4]

Xwayland - very different, it'll hang around for a long time because it's "just" a protocol translation layer. And of course the interest is there, so we have volunteers to do the releases.

So basically: expect Xwayland releases, be surprised (but not confused) by Xorg releases.

[1] Github of course doesn't exist yet because we're in the 80s. Time-travelling is complicated.
[2] Historical directory name, just accept it.
[3] Just like the previous release...
[4] At least until the next volunteer steps up. Turns out the problem "no-one wants to work on this" is easily fixed by "me! me! I want to work on this". A concept that is apparently quite hard to understand in the peanut gallery.

The Strange State of Authenticated Boot and Disk Encryption on Generic Linux Distributions

TL;DR: Linux has been supporting Full Disk Encryption (FDE) and technologies such as UEFI SecureBoot and TPMs for a long time. However, the way they are set up by most distributions is not as secure as they should be, and in some ways quite frankly weird. In fact, right now, your data is probably more secure if stored on current ChromeOS, Android, Windows or MacOS devices, than it is on typical Linux distributions.

Generic Linux distributions (i.e. Debian, Fedora, Ubuntu, …) adopted Full Disk Encryption (FDE) more than 15 years ago, with the LUKS/cryptsetup infrastructure. It was a big step forward to a more secure environment. Almost ten years ago the big distributions started adding UEFI SecureBoot to their boot process. Support for Trusted Platform Modules (TPMs) has been added to the distributions a long time ago as well — but even though many PCs/laptops these days have TPM chips on-board it's generally not used in the default setup of generic Linux distributions.

How these technologies currently fit together on generic Linux distributions doesn't really make too much sense to me — and falls short of what they could actually deliver. In this story I'd like to have a closer look at why I think that, and what I propose to do about it.

The Basic Technologies

Let's have a closer look what these technologies actually deliver:

  1. LUKS/dm-crypt/cryptsetup provide disk encryption, and optionally data authentication. Disk encryption means that reading the data in clear-text form is only possible if you possess a secret of some form, usually a password/passphrase. Data authentication means that no one can make changes to the data on disk unless they possess a secret of some form. Most distributions only enable the former though — the latter is a more recent addition to LUKS/cryptsetup, and is not used by default on most distributions (though it probably should be). Closely related to LUKS/dm-crypt is dm-verity (which can authenticate immutable volumes) and dm-integrity (which can authenticate writable volumes, among other things).

  2. UEFI SecureBoot provides mechanisms for authenticating boot loaders and other pre-OS binaries before they are invoked. If those boot loaders then authenticate the next step of booting in a similar fashion there's a chain of trust which can ensure that only code that has some level of trust associated with it will run on the system. Authentication of boot loaders is done via cryptographic signatures: the OS/boot loader vendors cryptographically sign their boot loader binaries. The cryptographic certificates that may be used to validate these signatures are then signed by Microsoft, and since Microsoft's certificates are basically built into all of today's PCs and laptops this will provide some basic trust chain: if you want to modify the boot loader of a system you must have access to the private key used to sign the code (or to the private keys further up the certificate chain).

  3. TPMs do many things. For this text we'll focus on one facet: they can be used to protect secrets (for example for use in disk encryption, see above), that are released only if the code that booted the host can be authenticated in some form. This works roughly like this: every component that is used during the boot process (i.e. code, certificates, configuration, …) is hashed with a cryptographic hash function before it is used. The resulting hash is written to some small volatile memory the TPM maintains that is write-only (the so-called Platform Configuration Registers, "PCRs"): each step of the boot process will write hashes of the resources needed by the next part of the boot process into these PCRs. The PCRs cannot be written freely: the hashes written are combined with what is already stored in the PCRs — also through hashing and the result of that then replaces the previous value. Effectively this means: only if every component involved in the boot matches expectations will the hash values exposed in the TPM PCRs match the expected values too. And if you then use those values to unlock the secrets you want to protect you can guarantee that the key is only released to the OS if the expected OS and configuration is booted. The process of hashing the components of the boot process and writing that to the TPM PCRs is called "measuring". What's also important to mention is that the secrets are not only protected by these PCR values but encrypted with a "seed key" that is generated on the TPM chip itself, and cannot leave the TPM (at least so goes the theory). The idea is that you cannot read out a TPM's seed key, and thus you cannot duplicate the chip: unless you possess the original, physical chip you cannot retrieve the secret it might be able to unlock for you. Finally, TPMs can enforce a limit on unlock attempts per time ("anti-hammering"): this makes it hard to brute force things: if you can only execute a certain number of unlock attempts within some specific time then brute forcing will be prohibitively slow.
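
To make the "measuring" rule in item 3 concrete, here is a tiny illustrative sketch of a software-only PCR, using OpenSSL's SHA-256. A real TPM does this in hardware, typically supports several hash banks, and the exact way boot components are hashed follows the TCG event log formats; the variable names in the usage comment are purely hypothetical.

#include <string.h>
#include <openssl/sha.h>

#define PCR_SIZE SHA256_DIGEST_LENGTH   /* 32 bytes for the SHA-256 bank */

/* PCR extend rule: new_pcr = SHA256(old_pcr || SHA256(component)).
 * The register can only ever be extended, never set directly, which is
 * why the final value depends on every component measured along the way. */
static void pcr_extend(unsigned char pcr[PCR_SIZE],
                       const void *component, size_t len)
{
   unsigned char digest[PCR_SIZE];
   SHA256(component, len, digest);

   unsigned char buf[2 * PCR_SIZE];
   memcpy(buf, pcr, PCR_SIZE);
   memcpy(buf + PCR_SIZE, digest, PCR_SIZE);
   SHA256(buf, sizeof(buf), pcr);
}

/* Usage sketch: PCRs start at all-zeros, then each boot component is
 * measured in order before it runs:
 *
 *   unsigned char pcr[PCR_SIZE] = {0};
 *   pcr_extend(pcr, shim_image, shim_len);
 *   pcr_extend(pcr, bootloader_image, bootloader_len);
 *   pcr_extend(pcr, kernel_image, kernel_len);
 */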

How Linux Distributions use these Technologies

As mentioned already, Linux distributions adopted the first two of these technologies widely, the third one not so much.

So typically, here's how the boot process of Linux distributions works these days:

  1. The UEFI firmware invokes a piece of code called "shim" (which is stored in the EFI System Partition — the "ESP" — of your system), that more or less is just a list of certificates compiled into code form. The shim is signed with the aforementioned Microsoft key, that is built into all PCs/laptops. This list of certificates then can be used to validate the next step of the boot process. The shim is measured by the firmware into the TPM. (Well, the shim can do a bit more than what I describe here, but this is outside of the focus of this article.)

  2. The shim then invokes a boot loader (often Grub) that is signed by a private key owned by the distribution vendor. The boot loader is stored in the ESP as well, plus some other places (i.e. possibly a separate boot partition). The corresponding certificate is included in the list of certificates built into the shim. The boot loader components are also measured into the TPM.

  3. The boot loader then invokes the kernel and passes it an initial RAM disk image (initrd), which contains initial userspace code. The kernel itself is signed by the distribution vendor too. It's also validated via the shim. The initrd is not validated, though (!). The kernel is measured into the TPM, the initrd sometimes too.

  4. The kernel unpacks the initrd image, and invokes what is contained in it. Typically, the initrd then asks the user for a password for the encrypted root file system. The initrd then uses that to set up the encrypted volume. No code authentication or TPM measurements take place.

  5. The initrd then transitions into the root file system. No code authentication or TPM measurements take place.

  6. When the OS itself is up the user is prompted for their user name, and their password. If correct, this will unlock the user account: the system is now ready to use. At this point no code authentication, no TPM measurements take place. Moreover, the user's password is not used to unlock any data, it's used only to allow or deny the login attempt — the user's data has already been decrypted a long time ago, by the initrd, as mentioned above.

What you'll notice here of course is that code validation happens for the shim, the boot loader and the kernel, but not for the initrd or the main OS code anymore. TPM measurements might go one step further: the initrd is measured sometimes too, if you are lucky. Moreover, you might notice that the disk encryption password and the user password are inquired by code that is not validated, and is thus not safe from external manipulation. You might also notice that even though TPM measurements of boot loader/OS components are done nothing actually ever makes use of the resulting PCRs in the typical setup.
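Incidentally, you can see for yourself what your firmware and boot loader measured, and into which PCRs, by dumping the TPM event log — a useful sanity check even on systems where nothing consumes these values. A hedged example using the tpm2-tools package (requires a TPM 2.0 and a kernel that exposes the event log):

    # Dump the measurement log recorded by the firmware/boot loader
    tpm2_eventlog /sys/kernel/security/tpm0/binary_bios_measurements

    # Show the current SHA256 PCR values themselves
    tpm2_pcrread sha256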

Attack Scenarios

Of course, before determining whether the setup described above makes sense or not, one should have an idea what one actually intends to protect against.

The most basic attack scenario to focus on is probably that you want to be reasonably sure that if someone steals your laptop that contains all your data then this data remains confidential. The model described above probably delivers that to some degree: the full disk encryption when used with a reasonably strong password should make it hard for the laptop thief to access the data. The data is as secure as the password used is strong. The attacker might attempt to brute force the password, thus if the password is not chosen carefully the attacker might be successful.

Two more interesting attack scenarios go something like this:

  1. Instead of stealing your laptop the attacker takes the harddisk from your laptop while you aren't watching (e.g. while you went for a walk and left it at home or in your hotel room), makes a copy of it, and then puts it back. You'll never notice they did that. The attacker then analyzes the data in their lab, maybe trying to brute force the password. In this scenario you won't even know that your data is at risk, because for you nothing changed — unlike in the basic scenario above. If the attacker manages to break your password they have full access to the data included on it, i.e. everything you have stored on it so far, but not necessarily to what you are going to store on it later. This scenario is worse than the basic one mentioned above, for the simple fact that you won't know that you might be attacked. (This scenario could be extended further: maybe the attacker has a chance to watch you type in your password or so, effectively lowering the password strength.)

  2. Instead of stealing your laptop the attacker takes the harddisk from your laptop while you aren't watching, inserts backdoor code on it, and puts it back. In this scenario you won't know your data is at risk, because physically everything is as before. What's really bad though is that the attacker gets access to anything you do on your laptop, both the data already on it, and whatever you will do in the future.

I think in particular this backdoor attack scenario is something we should be concerned about. We know for a fact that attacks like that happen all the time (Pegasus, industry espionage, …), hence we should make them hard.

Are we Safe?

So, does the scheme so far implemented by generic Linux distributions protect us against the latter two scenarios? Unfortunately not at all. Because distributions set up disk encryption the way they do, and only bind it to a user password, an attacker can easily duplicate the disk, and then attempt to brute force your password. What's worse: since code authentication ends at the kernel — and the initrd is not authenticated anymore — backdooring is trivially easy: an attacker can change the initrd any way they want, without having to fight any kind of protections. And given that FDE unlocking is implemented in the initrd, and it's the initrd that asks for the encryption password, things are just too easy: an attacker could trivially insert some code that picks up the FDE password as you type it in and sends it wherever they want. And not just that: since once they are in they are in, they can do anything they like for the rest of the system's lifecycle, with full privileges — including installing backdoors for versions of the OS or kernel that are installed on the device in the future, so that their backdoor remains open for as long as they like.

That is sad of course. It's particularly sad given that the other popular OSes all address this much better. ChromeOS, Android, Windows and MacOS all have way better built-in protections against attacks like this. And it's why one can certainly claim that your data is probably better protected right now if you store it on those OSes than it is on generic Linux distributions.

(Yeah, I know that there are some niche distros which do this better, and some hackers hack their own. But I care about general purpose distros here, i.e. the big ones, that most people base their work on.)

Note that there are more problems with the current setup. For example, it's really weird that during boot the user is queried for an FDE password which actually protects their data, and then once the system is up they are queried again – now asking for a username, and another password. And the weird thing is that this second authentication that appears to be user-focused doesn't really protect the user's data anymore — at that moment the data is already unlocked and accessible. The username/password query is supposed to be useful in multi-user scenarios of course, but how does that make any sense, given that these multiple users would all have to know a disk encryption password that unlocks the whole thing during the FDE step, and thus they have access to every user's data anyway if they make an offline copy of the harddisk?

Can we do better?

Of course we can, and that is what this story is actually supposed to be about.

Let's first figure out what the minimal issues we should fix are (at least in my humble opinion):

  1. The initrd must be authenticated before being booted into. (And measured unconditionally.)

  2. The OS binary resources (i.e. /usr/) must be authenticated before being booted into. (But don't need to be encrypted, since everyone has the same anyway, there's nothing to hide here.)

  3. The OS configuration and state (i.e. /etc/ and /var/) must be encrypted, and authenticated before they are used. The encryption key should be bound to the TPM device; i.e. system data should be locked to a security concept belonging to the system, not the user.

  4. The user's home directory (i.e. /home/lennart/ and similar) must be encrypted and authenticated. The unlocking key should be bound to a user password or user security token (FIDO2 or PKCS#11 token); i.e. user data should be locked to a security concept belonging to the user, not the system.

Or to summarize this differently:

  1. Every single component of the boot process and OS needs to be authenticated, i.e. all of shim (done), boot loader (done), kernel (done), initrd (missing so far), OS binary resources (missing so far), OS configuration and state (missing so far), the user's home (missing so far).

  2. Encryption is necessary for the OS configuration and state (bound to TPM), and for the user's home directory (bound to a user password or user security token).

In Detail

Let's see how we can achieve the above in more detail.

How to Authenticate the initrd

At the moment initrds are generated on the installed host via scripts (dracut and similar) that try to figure out a minimal set of binaries and configuration data to build an initrd that contains just enough to be able to find and set up the root file system. What is included in the initrd hence depends highly on the individual installation and its configuration. Pretty likely no two initrds generated that way will be fully identical due to this. This model clearly has benefits: the initrds generated this way are very small and minimal, and support exactly what is necessary for the system to boot, and not less or more. It comes with serious drawbacks too though: the generation process is fragile and sometimes more akin to black magic than following clear rules: the generator script has to natively understand a myriad of storage stacks to determine what needs to be included and what not. It also means that authenticating the image is hard: given that each individual host gets a different specialized initrd, we cannot just sign the initrd with the vendor key like we sign the kernel. If we want to keep this design we'd have to figure out some other mechanism (e.g. a per-host signature key that is generated locally; or authenticating it with a message authentication code bound to the TPM). While these approaches are certainly conceivable, I am not convinced they actually are a good idea: locally and dynamically generated per-host initrds are something we should probably move away from.

If we move away from locally generated initrds, things become a lot simpler. If the distribution vendor generates the initrds on their build systems then it can be attached to the kernel image itself, and thus be signed and measured along with the kernel image, without any further work. This simplicity is simply lovely. Besides robustness and reproducibility this gives us an easy route to authenticated initrds.
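To illustrate what such "gluing" can look like in practice, here is a rough sketch of building and signing a unified kernel image with systemd's EFI stub; paths, section offsets and key file names are illustrative, and a distribution would run this on its build infrastructure rather than on the installed host:

    # Attach kernel command line, kernel and vendor-built initrd to the systemd EFI stub ...
    objcopy \
        --add-section .cmdline=cmdline.txt    --change-section-vma .cmdline=0x30000 \
        --add-section .linux=vmlinuz          --change-section-vma .linux=0x2000000 \
        --add-section .initrd=initrd.cpio.zst --change-section-vma .initrd=0x3000000 \
        /usr/lib/systemd/boot/efi/linuxx64.efi.stub unified-kernel.efi

    # ... and sign the combined image once with the distribution vendor's SecureBoot key
    sbsign --key vendor.key --cert vendor.crt --output unified-kernel.signed.efi unified-kernel.efi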

But of course, nothing is really that simple: working with vendor-generated initrds means that we can't adjust them anymore to the specifics of the individual host: if we pre-build the initrds and include them in the kernel image in immutable fashion then it becomes harder to support complex, more exotic storage or to parameterize it with local network server information, credentials, passwords, and so on. Now, for my simple laptop use-case these things don't matter, there's no need to extend/parameterize things, laptops and their setups are not that wildly different. But what to do about the cases where we want both: extensibility to cover for less common storage subsystems (iscsi, LVM, multipath, drivers for exotic hardware…) and parameterization?

Here's a proposal how to achieve that: let's build a basic initrd into the kernel as suggested, but then do two things to make this scheme both extensible and parameterizable, without compromising security.

  1. Let's define a way how the basic initrd can be extended with additional files, which are stored in separate "extension images". The basic initrd should be able to discover these extension images, authenticate them and then activate them, thus extending the initrd with additional resources on-the-fly.

  2. Let's define a way how we can safely pass additional parameters to the kernel/initrd (and actually the rest of the OS, too) in an authenticated (and possibly encrypted) fashion. Parameters in this context can be anything specific to the local installation, i.e. server information, security credentials, certificates, SSH server keys, or even just the root password that shall be able to unlock the root account in the initrd …

In such a scheme we should be able to deliver everything we are looking for:

  1. We'll have a full trust chain for the code: the boot loader will authenticate and measure the kernel and basic initrd. The initrd extension images will then be authenticated by the basic initrd image.

  2. We'll have authentication for all the parameters passed to the initrd.

This so far sounds very unspecific? Let's make it more specific by looking closer at the components I'd suggest to be used for this logic:

  1. The systemd suite has contained a subsystem implementing system extensions for a few months now (v248). System extensions are ultimately just disk images (for example a squashfs file system in a GPT envelope) that can extend an underlying OS tree. Extending in this regard means they simply add additional files and directories into the OS tree, i.e. below /usr/. For a longer explanation see systemd-sysext(8). When a system extension is activated it is simply mounted and then merged into the main /usr/ tree via a read-only overlayfs mount. What's particularly nice about them in the context we are talking about here is that the extension images may carry dm-verity authentication data, and PKCS#7 signatures (once this is merged, that is, i.e. v250).

  2. The systemd suite also contains a concept called service "credentials". These are small pieces of information passed to services in a secure way. One key feature of these credentials is that they can be encrypted and authenticated in a very simple way with a key bound to the TPM (v250). See LoadCredentialEncrypted= and systemd-creds(1) for details. They are great for safely storing SSL private keys and similar on your system, but they also come in handy for parameterizing initrds: an encrypted credential is just a file that can only be decoded if the right TPM is around with the right PCR values set.

  3. The systemd suite contains a component called systemd-stub(7). It's an EFI stub, i.e. a small piece of code that is attached to a kernel image, and turns the kernel image into a regular EFI binary that can be directly executed by the firmware (or a boot loader). This stub has a number of nice features (for example, it can show a boot splash before invoking the Linux kernel itself and such). Once this work is merged (v250) the stub will support one more feature: it will automatically search for system extension image files and credential files next to the kernel image file, measure them and pass them on to the main initrd of the host.

Putting this together we have a nice way to provide fully authenticated kernel images, initrd images and initrd extension images, as well as encrypted and authenticated parameters via the credentials logic.

How would a distribution actually make use of this? A distribution vendor would pre-build the basic initrd, glue it into the kernel image, and sign that as a whole. Then, for each supported extension of the basic initrd (e.g. one for iscsi support, one for LVM, one for multipath, …), the vendor would use a tool such as mkosi to build an extension image, i.e. a GPT disk image containing the files in squashfs format, a Verity partition that authenticates it, plus a PKCS#7 signature partition that validates the root hash for the dm-verity partition, and that can be checked against a key provided by the boot loader or main initrd. Then, any parameters for the initrd will be encrypted using systemd-creds encrypt -T. The resulting encrypted credentials and the initrd extension images are then simply placed next to the kernel image in the ESP (or boot partition). Done.
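As a hedged sketch of what the per-host side of this might look like (file names and drop-in locations are purely illustrative; the exact companion-file naming is defined by systemd-stub(7) once the mentioned work lands):

    # Encrypt a host-specific parameter — here a root password for the initrd — sealed to the local TPM
    echo -n 'hunter2' | systemd-creds encrypt -T --name=rootpw - rootpw.cred

    # Drop the encrypted credential and any vendor-built, signed initrd extension images
    # next to the unified kernel image in the ESP (hypothetical paths)
    install -D -m 0600 rootpw.cred          /efi/EFI/Linux/unified-kernel.efi.extra.d/rootpw.cred
    install -D -m 0644 iscsi-extension.raw  /efi/EFI/Linux/unified-kernel.efi.extra.d/iscsi.raw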

This checks all boxes: everything is authenticated and measured, the credentials also encrypted. Things remain extensible and modular, can be pre-built by the vendor, and installation is as simple as dropping in one file for each extension and/or credential.

How to Authenticate the Binary OS Resources

Let's now have a look how to authenticate the Binary OS resources, i.e. the stuff you find in /usr/, i.e. the stuff traditionally shipped to the user's system via RPMs or DEBs.

I think there are three relevant ways how to authenticate this:

  1. Make /usr/ a dm-verity volume. dm-verity is a concept implemented in the Linux kernel that provides authenticity to read-only block devices: every read access is cryptographically verified against a top-level hash value. This top-level hash is typically a 256-bit value that you can either encode in the kernel image you are using, or cryptographically sign (which is particularly nice once this is merged). I think this is actually the best approach since it makes the /usr/ tree entirely immutable in a very simple way. However, this also means that the whole of /usr/ needs to be updated at once, i.e. the traditional rpm/apt based update logic cannot work in this mode.

  2. Make /usr/ a dm-integrity volume. dm-integrity is a concept provided by the Linux kernel that offers integrity guarantees to writable block devices, i.e. in some ways it can be considered to be a bit like dm-verity while permitting write access. It can be used in three ways, one of which I think is particularly relevant here. The first way is with a simple hash function in "stand-alone" mode: this is not too interesting here, it just provides greater data safety for file systems that don't hash check their files' data on their own. The second way is in combination with dm-crypt, i.e. with disk encryption. In this case it adds authenticity to confidentiality: only if you know the right secret can you read and make changes to the data, and any attempt to make changes without knowing this secret key will be detected as an IO error on next read by those in possession of the secret (more about this below). The third way is the one I think is most interesting here: in "stand-alone" mode, but with a keyed hash function (e.g. HMAC). What's this good for? This provides authenticity without encryption: if you make changes to the disk without knowing the secret this will be noticed on the next read attempt of the data and result in IO errors. This mode provides what we want (authenticity) and doesn't do what we don't need (encryption). Of course, the secret key for the HMAC must be provided somehow, I think ideally by the TPM.

  3. Make /usr/ a dm-crypt (LUKS) + dm-integrity volume. This provides both authenticity and encryption. The latter isn't typically needed for /usr/ given that it generally contains no secret data: anyone can download the binaries off the Internet anyway, and the sources too. By encrypting this you'll waste CPU cycles, but beyond that it doesn't hurt much. (Admittedly, some people might want to hide the precise set of packages they have installed, since it of course does reveal a bit of information about you: i.e. what you are working on, maybe what your job is – think: if you are a hacker you have hacking tools installed – and similar). Going this way might simplify things in some cases, as it means you don't have to distinguish "OS binary resources" (i.e /usr/) and "OS configuration and state" (i.e. /etc/ + /var/, see below), and just make it the same volume. Here too, the secret key must be provided somehow, I think ideally by the TPM.

All three approaches are valid. The first approach has my primary sympathies, but for distributions not willing to abandon client-side updates via RPM/dpkg it is not an option, in which case I would propose one of the other two approaches.
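To make the first two options a little more tangible, here is a minimal sketch using the cryptsetup tool family; the partition labels are hypothetical, and in the dm-integrity case the HMAC key would ideally be supplied by the TPM rather than a key file on disk:

    # Option 1: dm-verity — generate the hash data for an immutable /usr/ partition.
    # veritysetup prints the root hash, which would then be linked into the kernel image or signed.
    veritysetup format /dev/disk/by-partlabel/usr /dev/disk/by-partlabel/usr-verity

    # Option 2: dm-integrity in stand-alone mode with a keyed hash (HMAC)
    integritysetup format /dev/disk/by-partlabel/usr \
        --integrity hmac-sha256 --integrity-key-file usr-hmac.key --integrity-key-size 64
    integritysetup open /dev/disk/by-partlabel/usr usr-authenticated \
        --integrity hmac-sha256 --integrity-key-file usr-hmac.key --integrity-key-size 64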

The LUKS encryption key (and in case of dm-integrity stand-alone mode the key for the keyed hash function) should be bound to the TPM. Why the TPM for this? You could also use a user password, a FIDO2 or PKCS#11 security token — but I think the TPM is the right choice. Why? To reduce the requirement for repeated authentication, i.e. so that you don't first have to provide the disk encryption password and then log in with another password. It should be possible that the system boots up unattended and then only one authentication prompt is needed to unlock the user's data properly. The TPM provides a way to do this in a reasonably safe and fully unattended way. Also, when we stop considering just the laptop use-case for a moment: on servers interactive disk encryption prompts don't make much sense — the fact that TPMs can provide secrets without requiring user interaction, and thus the ability to work in entirely unattended environments, is quite desirable. Note that crypttab(5) as implemented by systemd (v248) provides native support for authentication via password, via TPM2, via PKCS#11 or via FIDO2, so the choice is ultimately all yours.
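For example, with systemd v248 or newer, unlocking a volume via the TPM2 chip at boot is just a matter of a crypttab line along these lines (the UUID is a placeholder):

    # /etc/crypttab — unlock "root" via the local TPM2 chip, no interactive prompt required
    root  UUID=aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee  none  tpm2-device=auto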

How to Encrypt/Authenticate OS Configuration and State

Let's now look at the OS configuration and state, i.e. the stuff in /etc/ and /var/. It probably makes sense to not consider these two hierarchies independently but instead just consider this to be the root file system. If the OS binary resources are in a separate file system it is then mounted onto the /usr/ sub-directory of the root file system.

The OS configuration and state (or: root file system) should be both encrypted and authenticated: it might contain secret keys, user passwords, privileged logs and similar. This data matters, and plenty of it should remain confidential.

The encryption of choice here is dm-crypt (LUKS) + dm-integrity, similar to what was discussed above, again with the key bound to the TPM.

If the OS binary resources are protected the same way it is safe to merge these two volumes and have a single partition for both (see above).
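A minimal sketch of what setting up such a volume could look like with current tooling; the partition label is hypothetical, and systemd-cryptenroll requires systemd v248 or newer:

    # Create a LUKS2 volume with encryption plus authentication (dm-crypt + dm-integrity)
    cryptsetup luksFormat --type luks2 --integrity hmac-sha256 /dev/disk/by-partlabel/root

    # Additionally enroll the local TPM2 chip, so the volume can be unlocked unattended at boot
    systemd-cryptenroll --tpm2-device=auto /dev/disk/by-partlabel/root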

How to Encrypt/Authenticate the User's Home Directory

The data in the user's home directory should be encrypted, and bound to the user's preferred token of authentication (i.e. a password or FIDO2/PKCS#11 security token). As mentioned, in the traditional mode of operation the user's home directory is not individually encrypted, but only encrypted because FDE is in use. The encryption key for that is a system-wide key though, not a per-user key. And I think that's a problem, as mentioned (and probably not even generally understood by our users). We should correct that and ensure that the user's password is what unlocks the user's data.

In the systemd suite we provide a service systemd-homed(8) (v245) that implements this in a safe way: each user gets their own LUKS volume stored in a loopback file in /home/, and this is enough to synthesize a user account. The encryption password for this volume is the user's account password, thus it's really the password provided at login time that unlocks the user's data. systemd-homed also supports other mechanisms of authentication, in particular PKCS#11/FIDO2 security tokens. It also provides support for other storage back-ends (such as fscrypt), but I'd always suggest using the LUKS back-end since it's the only one providing the comprehensive confidentiality guarantees one wants for a UNIX-style home directory.
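With systemd-homed in place, creating such a user is a one-liner; a hypothetical example (user name and options purely illustrative) that additionally binds the home directory to a FIDO2 token:

    # Create a user whose home directory is a per-user LUKS volume below /home/,
    # unlockable with the account password and, optionally, a FIDO2 security token
    homectl create lennart --storage=luks --fido2-device=auto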

Note that there's one special caveat here: if the user's home directory (e.g. /home/lennart/) is encrypted and authenticated, what about the file system this data is stored on, i.e. /home/ itself? If that directory is part of the root file system this would result in double encryption: first the data is encrypted with the TPM root file system key, and then again with the per-user key. Such double encryption is a waste of resources, and unnecessary. I'd thus suggest making /home/ its own dm-integrity volume with a HMAC, keyed by the TPM. This means the data stored directly in /home/ will be authenticated but not encrypted. That's good not only for performance, but also has practical benefits: it allows extracting the encrypted volumes of the various users in case the TPM key is lost, as a way to recover from dead laptops or similar.

Why authenticate /home/, if it only contains per-user home directories that are authenticated on their own anyway? That's a valid question: it's because the kernel file system maintainers made clear that Linux file system code is not considered safe against rogue disk images, and is not tested for that; this means before you mount anything you need to establish trust in some way because otherwise there's a risk that the act of mounting might exploit your kernel.

Summary of Resources and their Protections

So, let's now put this all together. Here's a table showing the various resources we deal with, and how I think they should be protected (in my idealized world).

Resource | Needs Authentication | Needs Encryption | Suggested Technology | Validation/Encryption Keys/Certificates acquired via | Stored where
Shim | yes | no | SecureBoot signature verification | firmware certificate database | ESP
Boot loader | yes | no | ditto | firmware certificate database/shim | ESP/boot partition
Kernel | yes | no | ditto | ditto | ditto
initrd | yes | no | ditto | ditto | ditto
initrd parameters | yes | yes | systemd TPM encrypted credentials | TPM | ditto
initrd extensions | yes | no | systemd-sysext with Verity+PKCS#7 signatures | firmware/initrd certificate database | ditto
OS binary resources | yes | no | dm-verity | root hash linked into kernel image, or firmware/initrd certificate database | top-level partition
OS configuration and state | yes | yes | dm-crypt (LUKS) + dm-integrity | TPM | top-level partition
/home/ itself | yes | no | dm-integrity with HMAC | TPM | top-level partition
User home directories | yes | yes | dm-crypt (LUKS) + dm-integrity in loopback files | user password/FIDO2/PKCS#11 security token | loopback file inside /home partition

This should provide all the desired guarantees: everything is authenticated, and the individualized per-host or per-user data is also encrypted. No double encryption takes place. The encryption keys/verification certificates are stored/bound to the most appropriate infrastructure.

Does this address the three attack scenarios mentioned earlier? I think so, yes. The basic attack scenario I described is addressed by the fact that /var/, /etc/ and /home/*/ are encrypted. Brute forcing the former two is harder than in the status quo ante model, since a high-entropy key is used instead of one derived from a user-provided password. Moreover, the "anti-hammering" logic of the TPM will make brute forcing prohibitively slow. The home directories are protected by the user's password or ideally a personal FIDO2/PKCS#11 security token in this model. Of course, a password isn't better security-wise than the status quo ante. But given the FIDO2/PKCS#11 support built into systemd-homed it should be easier to lock down the home directories securely.

Binding encryption of /var/ and /etc/ to the TPM also addresses the first of the two more advanced attack scenarios: a copy of the harddisk is useless without the physical TPM chip, since the seed key is sealed into that. (And even if the attacker had the chance to watch you type in your password, it won't help unless they have access to the TPM chip.) For the home directory this attack is not addressed as long as a plain password is used. However, since binding home directories to FIDO2/PKCS#11 tokens is built into systemd-homed things should be safe here too — provided the user actually possesses and uses such a device.

The backdoor attack scenario is addressed by the fact that every resource in play now is authenticated: it's hard to backdoor the OS if there's no component that isn't verified by signature keys or TPM secrets the attacker hopefully doesn't know.

For general purpose distributions that focus on updating the OS per RPM/dpkg the idealized model above won't work out, since (as mentioned) this implies an immutable /usr/, and thus requires updating /usr/ via an atomic update operation. For such distros a setup like the following is probably more realistic, but see above.

Resource | Needs Authentication | Needs Encryption | Suggested Technology | Validation/Encryption Keys/Certificates acquired via | Stored where
Shim | yes | no | SecureBoot signature verification | firmware certificate database | ESP
Boot loader | yes | no | ditto | firmware certificate database/shim | ESP/boot partition
Kernel | yes | no | ditto | ditto | ditto
initrd | yes | no | ditto | ditto | ditto
initrd parameters | yes | yes | systemd TPM encrypted credentials | TPM | ditto
initrd extensions | yes | no | systemd-sysext with Verity+PKCS#7 signatures | firmware/initrd certificate database | ditto
OS binary resources, configuration and state | yes | yes | dm-crypt (LUKS) + dm-integrity | TPM | top-level partition
/home/ itself | yes | no | dm-integrity with HMAC | TPM | top-level partition
User home directories | yes | yes | dm-crypt (LUKS) + dm-integrity in loopback files | user password/FIDO2/PKCS#11 security token | loopback file inside /home partition

This means there's only one root file system that contains all of /etc/, /var/ and /usr/.

Recovery Keys

When binding encryption to TPMs one problem that arises is what strategy to adopt if the TPM is lost, due to hardware failure: if I need the TPM to unlock my encrypted volume, what do I do if I need the data but lost the TPM?

The answer here is supporting recovery keys (this is similar to how other OSes approach this). Recovery keys are pretty much the same concept as passwords. The main difference is that they are computer generated rather than user-chosen. Because of that they typically have much higher entropy (which makes them more annoying to type in, i.e. you want to use them only when you must, not day-to-day). By having higher entropy they are useful in combination with TPM, FIDO2 or PKCS#11 based unlocking: unlike a combination with passwords they do not compromise the higher strength of protection that TPM/FIDO2/PKCS#11 based unlocking is supposed to provide.

Current versions of systemd-cryptenroll(1) implement a recovery key concept in an attempt to address this problem. You may enroll any combination of TPM chips, PKCS#11 tokens, FIDO2 tokens, recovery keys and passwords on the same LUKS volume. When enrolling a recovery key it is generated and shown on screen both in text form and as a QR code you can scan off the screen if you like. The idea is to write down/store this recovery key in a safe place so that you can use it when you need it. Note that such recovery keys can be entered wherever a LUKS password is requested, i.e. after generation they behave pretty much the same as a regular password.
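Enrolling such a recovery key on an existing LUKS volume is a single command (device path hypothetical):

    # Generate and enroll a high-entropy recovery key; it is shown once as text and as a QR code
    systemd-cryptenroll --recovery-key /dev/disk/by-partlabel/root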

TPM PCR Brittleness

Locking devices to TPMs and enforcing a PCR policy with this (i.e. configuring the TPM key to be unlockable only if certain PCRs match certain values, and thus requiring the OS to be in a certain state) brings a problem with it: TPM PCR brittleness. If the key you want to unlock with the TPM requires the OS to be in a specific state (i.e. that all OS components' hashes match certain expectations or similar) then doing OS updates might have the effect of making your key inaccessible: the OS updates will cause the code to change, and thus the hashes of the code, and thus certain PCRs. (Thankfully, you enrolled a recovery key, as described above, so this doesn't mean you lost your data, right?)

To address this I'd suggest three strategies:

  1. Most importantly: don't actually use the TPM PCRs that contain code hashes. There are actually multiple PCRs defined, each containing measurements of different aspects of the boot process. My recommendation is to bind keys to PCR 7 only, a PCR that contains measurements of the UEFI SecureBoot certificate databases. Thus, the keys will remain accessible as long as these databases remain the same, and updates to code will not affect them (updates to the certificate databases will, and they do happen too, though hopefully much less frequently than code updates). Does this reduce security? Not much, no, because the code that's run is after all not just measured but also validated via code signatures, and those signatures are validated with the aforementioned certificate databases. Thus binding an encrypted TPM key to PCR 7 should enforce a similar level of trust in the boot/OS code as binding it to a PCR with hashes of specific versions of that code. I.e. using PCR 7 means you say "every code signed by these vendors is allowed to unlock my key" while using a PCR that contains code hashes means "only this exact version of my code may access my key". (A minimal enrollment sketch follows this list.)

  2. Use LUKS key management to enroll multiple versions of the TPM keys in the relevant volumes, to support multiple versions of the OS code (or multiple versions of the certificate database, as discussed above). Specifically: whenever an update is done that might result in changing the relevant PCRs, pre-calculate the new PCRs, and enroll them in an additional LUKS slot on the relevant volumes. This means that the unlocking keys tied to the TPM remain accessible in both states of the system. Eventually, once rebooted after the update, remove the old slots.

  3. If these two strategies didn't work out (maybe because the OS/firmware was updated outside of OS control, or the update mechanism was aborted at the wrong time) and the TPM PCRs changed unexpectedly, and the user now needs to use their recovery key to get access to the OS back, let's handle this gracefully and automatically reenroll the current TPM PCRs at boot, after the recovery key checked out, so that for future boots everything is in order again.
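As a hedged sketch of what strategy 1 — and the re-enrollment that strategies 2 and 3 would automate — can look like with systemd-cryptenroll (device path hypothetical):

    # Strategy 1: bind the TPM2 enrollment to PCR 7 only, i.e. the SecureBoot certificate databases
    systemd-cryptenroll --tpm2-device=auto --tpm2-pcrs=7 /dev/disk/by-partlabel/root

    # If the relevant PCRs do change (e.g. after a certificate database update), drop the stale
    # TPM2 slot and enroll the new state
    systemd-cryptenroll --wipe-slot=tpm2 --tpm2-device=auto --tpm2-pcrs=7 /dev/disk/by-partlabel/root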

Other approaches can work too: for example, some OSes simply remove TPM PCR policy protection of disk encryption keys altogether immediately before OS or firmware updates, and then reenable it right after. Of course, this opens a time window where the key bound to the TPM is much less protected than people might assume. I'd try to avoid such a scheme if possible.

Anything Else?

So, given that we are talking about idealized systems: I personally actually think the ideal OS would be much simpler, and thus more secure than this:

I'd try to ditch the Shim, and instead focus on enrolling the distribution vendor keys directly in the UEFI firmware certificate list. This is actually supported by all firmwares too. This has various benefits: it's no longer necessary to bind everything to Microsoft's root key; you can just enroll your own stuff and thus make sure only what you want to trust is trusted and nothing else. To make an approach like this easier, we have been working on doing automatic enrollment of these keys from the systemd-boot boot loader, see this work in progress for details. This way the firmware will authenticate the boot loader/kernel/initrd without any further component for this in place.

I'd also not bother with a separate boot partition, and just use the ESP for everything. The ESP is required anyway by the firmware, and is good enough for storing the few files we need.

FAQ

Can I implement all of this in my distribution today?

Probably not. While the big issues have mostly been addressed there's a lot of integration work still missing. As you might have seen I linked some PRs that haven't even been merged into our tree yet, and that definitely haven't been released yet or entered the distributions.

Will this show up in Fedora/Debian/Ubuntu soon?

I don't know. I am making a proposal for how these things might work, and am working on getting various building blocks for this into shape. What the distributions do is up to them. But even if they don't follow the recommendations I make 100%, or don't want to use the building blocks I propose, I think it's important they start thinking about this, and yes, I think they should be thinking about defaulting to setups like this.

Work for measuring/signing initrds on Fedora has been started, here's a slide deck with some information about it.

But isn't a TPM evil?

Some corners of the community tried (unfortunately successfully to some degree) to paint TPMs/Trusted Computing/SecureBoot as generally evil technologies that stop us from using our systems the way we want. That idea is rubbish though, I think. We should focus on what it can deliver for us (and that's a lot I think, see above), and appreciate the fact we can actually use it to kick out perceived evil empires from our devices instead of being subjected to them. Yes, the way SecureBoot/TPMs are defined puts you in the driver seat if you want — and you may enroll your own certificates to keep out everything you don't like.

What if my system doesn't have a TPM?

TPMs are becoming quite ubiquitous, in particular as the upcoming Windows versions will require them. In general I think we should focus on modern, fully equipped systems when designing all this, and then find fall-backs for more limited systems. Frankly it feels as if so far the design approach for all this was the other way round: try to make the new stuff work like the old rather than the old like the new (I mean, to me it appears this thinking is the main raison d'être for the Grub boot loader).

More specifically, on the systems where we have no TPM we ultimately cannot provide the same security guarantees as for those which have. So depending on the resource to protect we should fall back to different TPM-less mechanisms. For example, if we have no TPM then the root file system should probably be encrypted with a user provided password, typed in at boot as before. And for the encrypted boot credentials we probably should simply not encrypt them, and place them in the ESP unencrypted.

Effectively this means: without TPM you'll still get protection regarding the basic attack scenario, as before, but not the other two.

What if my system doesn't have UEFI?

Many of the mechanisms explained above taken individually do not require UEFI. But of course the chain of trust suggested above requires something like UEFI SecureBoot. If your system lacks UEFI it's probably best to find work-alikes to the technologies suggested above, but I doubt I'll be able to help you there.

rpm/dpkg already cryptographically validates all packages at installation time (gpg), why would I need more than that?

This type of package validation happens once: at the moment of installation (or update) of the package, but not anymore when the data installed is actually used. Thus when an attacker manages to modify the package data after installation and before use they can make any change they like without this ever being noticed. Such package download validation does address certain attack scenarios (i.e. man-in-the-middle attacks on network downloads), but it doesn't protect you from attackers with physical access, as described in the attack scenarios above.

Systems such as ostree aren't better than rpm/dpkg regarding this BTW, their data is not validated on use either, but only during download or when processing tree checkouts.

The key point is that the scheme explained here provides offline protection for the data "at rest" — even someone with physical access to your device cannot easily make changes that aren't noticed on next use. rpm/dpkg/ostree provide online protection only: as long as the system remains up, and all OS changes are done through the intended program code paths, and no one has physical access, everything should be good. In today's world I am sure this is not good enough though. As mentioned, most modern OSes provide offline protection for the data at rest in one way or another. Generic Linux distributions are terribly behind on this.

This is all so desktop/laptop focused, what about servers?

I am pretty sure servers should provide similar security guarantees as outlined above. In a way servers are a much simpler case: there are no users and no interactivity. Thus the discussion of /home/ and what it contains and of user passwords doesn't matter. However, the authenticated initrd and the unattended TPM-based encryption I think are very important for servers too, in a trusted data center environment. It provides security guarantees so far not given by Linux server OSes.

I'd like to help with this, or discuss/comment on this

Submit patches or reviews through GitHub. General discussion about this is best done on the systemd mailing list.

September 20, 2021

ES: But Why

I got a request recently to fix up the WebGL Aquarium demo. I’ve had this bookmarked for a while since it’s one of the only test cases for GL_EXT_multisampled_render_to_texture I’m aware of, at least when running Chrome in EGL mode.

Naturally, I decided to do both at once since this would be yet another extension that no native desktop driver in Mesa currently supports.

Transient Multisampling

The idea behind this extension is that on tilers, a single-sample image can be temporarily treated as multisampled without needing the extra memory bandwidth that multisampled images require. The multisampled rendering image is transient, meaning it isn’t loaded or written to, so it can be lazily allocated and then discarded.

Vulkan has similar mechanisms: loadOp set to VK_ATTACHMENT_LOAD_OP_DONT_CARE and storeOp set to VK_ATTACHMENT_STORE_OP_DONT_CARE, with an image allocated using ZINK_HEAP_DEVICE_LOCAL_LAZY and VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT, then adding a resolve attachment for writeout. Simple enough.

Unfortunately, this doesn’t quite work.

As Danylo Piliaiev, RenderDoc master and Finder Of Bad Pixels, was quick to point out, this approach will discard any previous data the texture had, resulting in bad pixels. Lots of bad pixels, in fact.

The solution, which I hate, is that the “transient” multisampled image now gets a full-texture draw before its “transient” renderpass to initialize it, then is loaded with VK_ATTACHMENT_LOAD_OP_LOAD, effectively making it not transient at all.

Sorry tilers.

No frames for you.

aquarium.gif

But it does work and give solid performance (-20% or so while recording smh screen recorders git gud), so I can’t be too mad.

September 16, 2021

I am back with another status update on raytracing in RADV. And the good news is that things are finally starting to come together. After ~9 months of on and off work we now have games working with raytracing. Working on the first try after getting all the required functionality in place was Control:

Control with raytracing on RADV

After poking for a long time at CTS and demos it is really nice to see the fruits of one’s work.

The piece that I added recently was copy/compaction and serialization of acceleration structures, which involved a bunch of shader writing, handling another type of query and dealing with indirect dispatches. (Since of course the API doesn’t give the input size on the CPU. No idea how this API should be secured …)

What games?

I did try 5 games:

  1. Quake 2 RTX (Vulkan): works. This was working already on my previous update.
  2. Control (D3D): works. Pretty much just works. Runs at maybe 30-50% of RT performance on Windows.
  3. Metro Exodus (Vulkan): works. Needs one workaround and is very finicky in WSI but otherwise works fine. Runs at 20-25% of RT performance on Windows.
  4. Ghostrunner (D3D): Does not work. This really needs per shadergroup compilation instead of just mashing all the shaders together as I get shaders now with 1 million NIR instructions, which is a pain to debug.
  5. Doom Eternal (Vulkan): Does not work. The raytracing option in the menu stays grayed out and at this point I’m at a loss what is required to make the game allow enabling RT.

If anybody could tell me how to get Doom Eternal to allow RT I’d appreciate it.

What is next?

Of course the support is far from done. Some things to still make progress on:

  1. Upstreaming what I have. Samuel has been busy reviewing my MRs and I think there is a good chance that what I have now will make it into 21.3.
  2. Improve the pipeline compilation model to hopefully make ghostrunner work.
  3. Improved BVH building. The current BVH is really naive, which is likely one of the big performance factors.
  4. Improve traversal.
  5. Move on to stuff needed for DXR 1.1 like VK_KHR_ray_query.

P.S. If you haven’t seen it yet, Jason Ekstrand from Intel recently gave a talk about how Intel implements raytracing. Nice showcase of how you can provide some more involved hardware implementation than RDNA2 does.

Been some time since my last update, so I felt it was time to flex my blog writing muscles again and provide some updates on some of the things we are working on in Fedora in preparation for Fedora Workstation 35. This is not meant to be a comprehensive what’s new article about Fedora Workstation 35, more of a listing of some of the things we are doing as part of the Red Hat desktop team.

NVidia support for Wayland
One thing we have spent a lot of effort on for a long time now is getting full support for the NVidia binary driver under Wayland. It has been a recurring topic in our bi-weekly calls with the NVidia engineering team ever since we started looking at moving to Wayland. There has been basic binary driver support for some time, meaning you could run a native Wayland session on top of the binary driver, but the critical missing piece was that you could not get support for accelerated graphics when running applications through XWayland, our X.org compatibility layer. Which basically meant that any application requiring 3D support and which wasn’t a native Wayland application yet wouldn’t work. So over the last months we have had a great collaboration with NVidia around closing this gap, with them working closely with us in fixing issues in their driver while we have been fixing bugs and missing pieces in the rest of the stack. We have been reporting and discussing issues back and forth, allowing a very quick turnaround on issues as we find them, which of course all resulted in the NVidia 470.42.01 driver with XWayland support. I am sure we will find new corner cases that need to be resolved in the coming months, but I am equally sure we will be able to quickly resolve them due to the close collaboration we have now established with NVidia. And I know some people will wonder why we spent so much time working with NVidia around their binary driver, but the reality is that NVidia is the market leader, especially in the professional Linux workstation space, and there are a lot of people who either would end up not using Linux or using Linux with X without it, including a lot of Red Hat customers and Fedora users. And that is what I and my team are here for at the end of the day, to make sure Red Hat customers are able to get their job done using their Linux systems.

Lightweight kiosk mode
One of the wonderful things about open source is the constant flow of code and innovation between all the different parts of the ecosystem. For instance, one thing we on the RHEL side have often been asked about over the last few years is a lightweight and simple to use solution for people wanting to run single application setups, like information boards, ATMs, cash registers, information kiosks and so on. For many use cases people felt that running a full GNOME 3 desktop underneath their application was either too resource hungry or created a risk that people would accidentally end up in the desktop session. At the same time, from our viewpoint as a development team we didn’t want a completely separate stack for this use case, as that would just increase our maintenance burden by having to do a lot of things twice. So to solve this problem Ray Strode spent some time writing what we call GNOME Kiosk mode, which makes setting up a simple session running a single application easy, without running things like the GNOME shell, tracker, evolution etc. This gives you a window manager with full support for the latest technologies such as compositing, libinput and Wayland, but coming in at about 18MB, which is about 71MB less than a minimal GNOME 3 desktop session. You can read more about the new Kiosk mode and how to use it in this great blog post from our savvy Edge Computing Product Manager Ben Breard. The kiosk mode session described in Ben’s article about RHEL will be available with Fedora Workstation 35.

High-definition mouse wheel support
A major part of what we do is making sure that Red Hat Enterprise Linux customers and Fedora users get hardware support on par with what you find on other operating systems. We try our best to work with our hardware partners, like Lenovo, to ensure that such hardware support comes day and date with when those features are enabled on other systems, but some things end up taking longer for various reasons. Support for high-definition mouse wheels was one of those. Peter Hutterer, our resident input expert, put together a great blog post explaining the history and status of high-definition mouse wheel support. As Peter points out in his blog post the feature is not yet fully supported under Wayland, but we hope to close that gap in time for Fedora Workstation 35.

Mouse with HiRes scroll wheel

PipeWire
I feel I can’t do one of these posts without talking about the latest developments in PipeWire, our unified audio and video server. Wim Taymans keeps working with the rapidly growing PipeWire community to fix issues as they are reported and add new features to PipeWire. Most recently Wim’s focus has been on implementing S/PDIF passthrough over both S/PDIF and HDMI connections. This will allow us to send undecoded data over such connections, which is critical for working well with surround sound systems and soundbars. The PipeWire community has also been working hard on further improving the Bluetooth support, with battery status support for the headset profile using Apple extensions. aptX-LL and FastStream codec support was also added. And of course a huge amount of bug fixes; it turns out that when you replace two different sound servers that have been around for close to two decades there are a lot of corner cases to cover :). Make sure to check out the two latest release notes for 0.3.35 and 0.3.36 for details.

EasyEffects is a great example of a cool new application built with PipeWire

Privacy screen
Another feature that we have been working on as a result of our Lenovo partnership is Privacy screen support. For those not familiar with this technology, it basically allows you to reduce the readability of your screen when viewed from the side, so that if you are using your laptop at a coffee shop, for instance, then a person sitting close by will have a much harder time trying to read what is on your screen. Hans de Goede has been shepherding the kernel side of this forward, working with Marco Trevisan from Canonical on the userspace part of it (which also makes this a nice example of cross-company collaboration), allowing you to turn this feature on or off. This feature is not likely to fully land in time for Fedora Workstation 35 though, so we are looking at whether we will bring this in as an update to Fedora Workstation 35 or if it will be a Fedora Workstation 36 feature.

Penny

Zink inside the penny


As most of you know, the future of 3D graphics on Linux is the Vulkan API from the Khronos Group. This doesn’t mean that OpenGL is going away anytime soon though, as there is a large host of applications out there using this API, and for certain types of 3D graphics development developers might still choose to use OpenGL over Vulkan. Of course for us that creates a little bit of a challenge, because maintaining two 3D graphics interfaces is a lot of work, even with the great help and contributions from the hardware makers themselves. So we have been eyeing the Zink project for a while, which aims at re-implementing OpenGL on top of Vulkan, as a potential candidate for solving our long term need to support the OpenGL API, but without drowning us in work while doing so. The big advantage to Zink is that it allows us to support one shared OpenGL implementation across all hardware and then focus our HW support efforts on the Vulkan drivers. As part of this effort Adam Jackson has been working on a project called Penny.

Zink implements OpenGL in terms of Vulkan, as far as the drawing itself is concerned, but presenting that drawing to the rest of the system is currently system-specific (GLX). For hardware that already has a Mesa driver, we use GBM. On NVIDIA’s Vulkan (and probably any other binary stack on Linux, and probably also WSL or macOS + MoltenVK) we download the image from the GPU back to the CPU and then use the same software upload/display path as llvmpipe, which as you can imagine is Not Fast.

Penny aims to extend Zink by replacing both of those paths, and instead using the various Vulkan WSI extensions to manage presentation. Even for the GBM case this should enable higher performance since zink will have more information about the rendering pipeline (multisampling in particular is poorly handled atm). Future window system integration work can focus on Vulkan, with EGL and GLX getting features “for free” once they’re enabled in Vulkan.

3rd party software cleanup
Over time we have been working on adding more and more 3rd party software for easy consumption in Fedora Workstation. The problem we discovered though was that, due to this being done over time, with changing requirements and expectations, the functionality was not behaving in a very intuitive way and there were also new questions that needed to be answered. So Allan Day and Owen Taylor spent some time this cycle reviewing all the bits and pieces of this functionality and worked to clean it up. The goal is that when you enable third-party repositories in Fedora Workstation 35 it behaves in a much more predictable and understandable way and also includes a lot of applications from Flathub. Yes, that is correct: you should be able to install a lot of applications from Flathub in Fedora Workstation 35 without having to first visit the Flathub website to enable it; instead they will show up once you have turned the knob for general 3rd party application support.

Power profiles
Another item we spent quite a bit of time on for Fedora Workstation 35 is making sure we integrate the Power Profiles work that Bastien Nocera has been doing as part of our collaboration with Lenovo. Power Profiles is basically a feature that allows your system to behave in a smarter way when it comes to power consumption and thus prolongs your battery life. So for instance when we notice you are getting low on battery we can offer to put you into a strong power saving mode to prolong how long you can use the system until you can recharge. There is a more in-depth explanation of Power Profiles in the official README.

Wayland
I usually also end up talking about Wayland in my posts, but I expect to be doing so less going forward as we have now covered all the major gaps we saw between Wayland and X.org. Jonas Ådahl got the headless support merged, which was one of our big missing pieces, and as mentioned above Olivier Fourdan, Jonas and others worked with NVidia on getting the binary driver with XWayland support working with GNOME Shell. Of course this being software we are never truly done; there will of course be new issues discovered, random bugs that need to be fixed, and of course also new features that need to be implemented. We already have our next big team focus in place, HDR support, which will need work from the graphics drivers, up through Mesa, into the window manager and the GUI toolkits and in the applications themselves. We have been investigating and trying out some things for a while already, but we are now ready to make this a main focus for the team. In fact we will soon be posting a new job listing for a full-time engineer to work on HDR vertically through the stack, so keep an eye out for that if you are interested in working on this. The job will be open to candidates who wish to work remotely, so as long as Red Hat has a business presence in the country you live in, we should be able to offer you the job if you are the right candidate for us. Update: The job listing is now online for our HDR engineer.

BTW, if you want to see future updates and keep on top of other happenings from Fedora and Red Hat in the desktop space, make sure to follow me on twitter.

September 13, 2021

Zink Is Over: This Time I’m Serious.

Look.

I know what you’re gonna say, and maybe I did just say zink was done a week or two ago.

I’m not saying I didn’t.

But that was practically last year at the speed with which zink’s codebase moves and its developer community sits in my office eating cookies between Mesa builds, and it was also before I set off on my journey to make the rest of those zany Phoronix benchmark games run instead of crashing or whatever.

What do we got left on that list anyway?

Metro: Last Light Redux

Oh you want some Metro? We got Metro at home.

metro.png

HITMAN

hitman.png

Agent 47, I’m gonna pretend I didn’t see that. Pull yourself together.

Basemark: High Settings

basemark.png

It’s uh… Mangohud’s slowing me down.

Bioshock Infinite

I bet you’re wondering where this one is, huh.

Warhammer 40,000: Dawn of War

dow3-fail.png

Easy as that, ju—Wait, what?

This game requires ARB_bindless_texture just to run? Is this a joke? Even fucking DOOM 2016, the final boss of OpenGL, doesn’t require bindless textures.

Fine.

Totally fine.

Not at all a problem, and I’m sure it’ll be easy to do.

Definitely no reason why only two Mesa drivers total implement it other than it being some trivial switch that everyone forgot to flip, right?

Probably just a config value here, or maybe a couple lines of code there…

Ignore all the validation errors because descriptor indexing isn’t accurately supported…

Add some null checks…

Fire up ASAN to fix a random stack explosion…

File a piglit ticket because two of the eighty unit tests for the extension are bugged and these are quite literally the only unit tests available…

dow3.png

Kapow, first try, expect it in zink-wip later today-ish.

It’s just that easy.

If you disagree, you are nitpicking and biased.

September 01, 2021

I've been working on portals recently and one of the issues for me was that the documentation just didn't quite hit the sweet spot. At least the bits I found were either too high-level or too implementation-specific. So here's a set of notes on how a portal works, in the hope that this is actually correct.

First, Portals are supposed to be a way for sandboxed applications (flatpaks) to trigger functionality they don't have direct access to. The prime example: opening a file without the application having access to $HOME. This is done by the applications talking to portals instead of doing the functionality themselves.

There is really only one portal process: /usr/libexec/xdg-desktop-portal, started as a systemd user service. That process owns a DBus bus name (org.freedesktop.portal.Desktop) and an object on that name (/org/freedesktop/portal/desktop). You can see that bus name and object with D-Feet; from DBus' POV there's nothing special about it. What makes it the portal is simply that the application running inside the sandbox can talk to that DBus name and thus call the various methods. Obviously the xdg-desktop-portal needs to run outside the sandbox to do its thing.

There are multiple portal interfaces, all available on that one object. Those interfaces have names like org.freedesktop.portal.FileChooser (to open/save files). The xdg-desktop-portal implements those interfaces and thus handles any method calls on those interfaces. So where an application is sandboxed, it doesn't implement the functionality itself, it instead calls e.g. the OpenFile() method on the org.freedesktop.portal.FileChooser interface. Then it gets an fd back and can read the content of that file without needing full access to the file system.

Some interfaces are fully handled within xdg-desktop-portal. For example, the Camera portal checks a few things internally and pops up a dialog for the user to confirm access if needed [1], but otherwise there's nothing else involved with this specific method call.

Other interfaces have a backend "implementation" DBus interface. For example, the org.freedesktop.portal.FileChooser interface has a org.freedesktop.impl.portal.FileChooser (notice the "impl") counterpart. xdg-desktop-portal does not implement those impl.portals. xdg-desktop-portal instead routes the DBus calls to the respective "impl.portal". Your sandboxed application calls OpenFile(), xdg-desktop-portal now calls OpenFile() on org.freedesktop.impl.portal.FileChooser. That interface returns a value, xdg-desktop-portal extracts it and returns it back to the application in response to the original OpenFile() call.

What provides those impl.portals doesn't matter to xdg-desktop-portal, and this is where things are hot-swappable [2]. There are GTK- and Qt-specific portals provided by xdg-desktop-portal-gtk and xdg-desktop-portal-kde, and another one is provided by GNOME Shell directly. You can check the files in /usr/share/xdg-desktop-portal/portals/ and see which impl portal is provided on which bus name. The reason those impl.portals exist is so they can be native to the desktop environment - regardless of what application you're running and with a generic xdg-desktop-portal, you see the native file chooser dialog for your desktop environment.

So the full call sequence is:

  • At startup, xdg-desktop-portal parses the /usr/share/xdg-desktop-portal/portals/*.portal files to know which impl.portal interface is provided on which bus name
  • The application calls OpenFile() on the org.freedesktop.portal.FileChooser interface on the object path /org/freedesktop/portal/desktop. It can do so because the bus name this object sits on is not restricted by the sandbox
  • xdg-desktop-portal receives that call. This is a portal with an impl.portal, so xdg-desktop-portal calls OpenFile() on the bus name that provides the org.freedesktop.impl.portal.FileChooser interface (as previously established by reading the *.portal files)
  • Assuming xdg-desktop-portal-gtk provides that portal at the moment, that process now pops up a GTK FileChooser dialog that runs outside the sandbox. User selects a file
  • xdg-desktop-portal-gtk sends back the fd for the file to the xdg-desktop-portal, and the impl.portal parts are done
  • xdg-desktop-portal receives that fd and sends it back as reply to the OpenFile() method in the normal portal
  • The application receives the fd and can read the file now
A few details here aren't fully correct, but it's correct enough to understand the sequence - the exact details depend on the method call anyway.

Finally: because of DBus restrictions, the various methods in the portal interfaces don't just reply with values. Instead, the xdg-desktop-portal creates a new org.freedesktop.portal.Request object and returns the object path for that. Once that's done the method is complete from DBus' POV. When the actual return value arrives (e.g. the fd), that value is passed via a signal on that Request object, which is then destroyed. This roundabout way is done for purely technical reasons, regular DBus methods would time out while the user picks a file path.
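To make that sequence a little more concrete, here is a rough sketch of the application side using plain GDBus. This is illustrative only: error handling is omitted, and real clients pass a handle_token option and subscribe before calling to avoid racing the Response signal.

#include <gio/gio.h>

/* Rough sketch of the application side of a portal call, assuming the
 * OpenFile(s parent_window, s title, a{sv} options) signature and the
 * Request's Response(u, a{sv}) signal described above. */
static void
response_cb(GDBusConnection *bus, const char *sender, const char *path,
            const char *interface, const char *signal, GVariant *params,
            gpointer user_data)
{
    guint32 response;
    GVariant *results;

    /* 0 means success; the selected file shows up in the results dict */
    g_variant_get(params, "(u@a{sv})", &response, &results);
    g_print("portal replied with %u\n", response);
    g_variant_unref(results);
}

int
main(void)
{
    GDBusConnection *bus = g_bus_get_sync(G_BUS_TYPE_SESSION, NULL, NULL);
    GVariantBuilder options;
    GVariant *reply;
    const char *handle;

    g_variant_builder_init(&options, G_VARIANT_TYPE_VARDICT);

    /* The immediate reply only contains the object path of the Request object */
    reply = g_dbus_connection_call_sync(bus,
                                        "org.freedesktop.portal.Desktop",
                                        "/org/freedesktop/portal/desktop",
                                        "org.freedesktop.portal.FileChooser",
                                        "OpenFile",
                                        g_variant_new("(ssa{sv})", "", "Open a file", &options),
                                        G_VARIANT_TYPE("(o)"),
                                        G_DBUS_CALL_FLAGS_NONE, -1, NULL, NULL);
    g_variant_get(reply, "(&o)", &handle);

    /* The actual result arrives later as a Response signal on that Request.
     * Real code subscribes before calling (using a handle_token) to avoid a race. */
    g_dbus_connection_signal_subscribe(bus, "org.freedesktop.portal.Desktop",
                                       "org.freedesktop.portal.Request", "Response",
                                       handle, NULL, G_DBUS_SIGNAL_FLAGS_NONE,
                                       response_cb, NULL, NULL);

    g_main_loop_run(g_main_loop_new(NULL, FALSE));
    return 0;
}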

Anyway. Maybe this helps someone understanding how the portal bits fit together.

[1] it does so using another portal but let's ignore that
[2] not really hot-swappable though. You need to restart xdg-desktop-portal but not your host. So luke-warm-swappable only

Edit Sep 01: clarify that it's not GTK/Qt providing the portals, but xdg-desktop-portal-gtk and -kde

August 31, 2021

Gut Ding braucht Weile. Almost three years ago, we added high-resolution wheel scrolling to the kernel (v5.0). The desktop stack however was first lagging and eventually left behind (except for an update a year ago or so, see here). However, I'm happy to announce that thanks to José Expósito's efforts, we now pushed it across the line. So - in a socially distanced manner and masked up to your eyebrows - gather round children, for it is storytime.

Historical History

In the beginning, there was the wheel detent. Or rather there were 24 of them, dividing a 360 degree [1] rotation of a wheel into a neat set of 15-degree clicks. libinput exposed those wheel clicks as part of the "pointer axis" namespace and you could get the click count with libinput_event_pointer_get_axis_discrete() (announced here). The degree value is exposed as libinput_event_pointer_get_axis_value(). Other scroll backends (finger-scrolling or button-based scrolling) expose the pixel-precise value via that same function.

In a "recent" Microsoft Windows version (Vista!), MS added the ability for wheels to trigger more than 24 clicks per rotation. The MS Windows API now treats one "traditional" wheel click as a value of 120, anything finer-grained will be a fraction thereof. You may have a mouse that triggers quarter-wheel clicks, each sending a value of 30. This makes for smoother scrolling and is supported(-ish) by a lot of mice introduced in the last 10 years [2]. Obviously, three small scrolls are nicer than one large scroll, so the UX is less bad than before.

Now it's time for libinput to catch up with Windows Vista! For $reasons, the existing pointer axis API could not be changed to accommodate the high-res values, so a new API was added for scroll events. Read on for the details, you will believe what happens next.

Out with the old, in with the new

As of libinput 1.19, libinput has three new events: LIBINPUT_EVENT_POINTER_SCROLL_WHEEL, LIBINPUT_EVENT_POINTER_SCROLL_FINGER, and LIBINPUT_EVENT_POINTER_SCROLL_CONTINUOUS. These events reflect, perhaps unsurprisingly, scroll movements of a wheel, a finger or along a continuous axis (e.g. button scrolling). And they replace the old event LIBINPUT_EVENT_POINTER_AXIS. Those familiar with libinput will notice that the new event names encode the scroll source in the event name now. This makes them slightly more flexible and saves callers an extra call.

In terms of actual API, the new events have two new functions. The first is libinput_event_pointer_get_scroll_value(): for the FINGER and CONTINUOUS events, the value returned is in "pixels" [3]; for the new WHEEL events, the value is in degrees. IOW this is a drop-in replacement for the old libinput_event_pointer_get_axis_value() function. The second call is libinput_event_pointer_get_scroll_value_v120() which, for WHEEL events, returns the same 120-based logical units the kernel uses. libinput_event_pointer_has_axis() returns true if the given axis has a value, just as before. With those three calls you now get the data for the new events.

Backwards compatibility

To ensure backwards compatibility, libinput generates both old and new events so the rule for callers is: if you want to support the new events, just ignore the old ones completely. libinput also guarantees new events even on pre-5.0 kernels. This makes the old and new code easy to ifdef out, and once you get past the immediate event handling the code paths are virtually identical.
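As an illustrative sketch of what a caller handling the new events might look like (not code from any particular compositor; the old events are simply ignored, as described above):

#include <libinput.h>
#include <stdio.h>

/* Illustrative sketch of a caller handling the new scroll events while
 * ignoring the old LIBINPUT_EVENT_POINTER_AXIS events entirely. */
static void
handle_scroll_event(struct libinput_event *event)
{
   struct libinput_event_pointer *p;

   switch (libinput_event_get_type(event)) {
   case LIBINPUT_EVENT_POINTER_SCROLL_WHEEL:
      p = libinput_event_get_pointer_event(event);
      if (libinput_event_pointer_has_axis(p, LIBINPUT_POINTER_AXIS_SCROLL_VERTICAL)) {
         /* degrees, same as the old get_axis_value() */
         double degrees = libinput_event_pointer_get_scroll_value(p,
                              LIBINPUT_POINTER_AXIS_SCROLL_VERTICAL);
         /* 120 per "traditional" wheel click, fractions for high-res wheels */
         double v120 = libinput_event_pointer_get_scroll_value_v120(p,
                              LIBINPUT_POINTER_AXIS_SCROLL_VERTICAL);
         printf("wheel: %.2f degrees (v120: %.1f)\n", degrees, v120);
      }
      break;
   case LIBINPUT_EVENT_POINTER_SCROLL_FINGER:
   case LIBINPUT_EVENT_POINTER_SCROLL_CONTINUOUS:
      p = libinput_event_get_pointer_event(event);
      if (libinput_event_pointer_has_axis(p, LIBINPUT_POINTER_AXIS_SCROLL_VERTICAL))
         printf("scroll: %.2f pixels\n",
                libinput_event_pointer_get_scroll_value(p,
                    LIBINPUT_POINTER_AXIS_SCROLL_VERTICAL));
      break;
   case LIBINPUT_EVENT_POINTER_AXIS:
      /* old-style event, ignored when handling the new ones */
      break;
   default:
      break;
   }
}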

When, oh when?

These changes have been merged into the libinput main branch and will be part of libinput 1.19. Which is due to be released over the next month or so, so feel free to work backwards from that for your favourite distribution.

Having said that, libinput is merely the lowest block in the Jenga tower that is the desktop stack. José linked to the various MRs in the upstream libinput MR, so if you're on the edge of your seat waiting for e.g. GTK to get this, well, there's an MR for that.

[1] That's degrees of an angle, not Fahrenheit
[2] As usual, on a significant number of those you'll need to know whatever proprietary protocol the vendor deemed to be important IP. Older MS mice stand out here because they use straight HID.
[3] libinput doesn't really have a concept of pixels, but it has a normalized pixel that movements are defined as. Most callers take that as real pixels except for the high-resolution displays where it's appropriately scaled.

Zink Is Over

A while ago I blogged about finishing up ES 3.2. Then I didn’t mention it again because…well, I suppose I’ve only blogged four times since then, but I’m going to pretend this was part of my master plan to make everyone forget so I could build hype again.

The hype is here: Zink can now* run ES 3.2 apps.

  • “now” is a variable unit of time subject to CI not trying to drown itself at the nearest pub, literally melt itself to slag, or hurl itself off a cliff the instant daniels takes his eyes off it

What Does This Mean For The Future?

I know I’ve said this a few times previously, and we all had a good laugh, but this time I mean it.

Zink is done.

The final boss has been beaten, there’s no more versions to support, no extensions left on my todo list, definitely no bugs remaining, and performance can’t possibly improve further.

If you think you’ve found a zink bug, report it to whoever wrote the test or app you’re running, because the only thing I plan on doing for the rest of 2021 is playing Cyberpunk 2077 on Lavapipe.

Right after it finishes loading.

August 30, 2021

The Struggle Continues

Everyone’s seen the Phoronix benchmark numbers by now, and though there’s a lot of confusion over how to calculate the percentage increase between “game did not run a year ago” and “game runs”, it seems like a couple people out there at Big Triangle are starting to take us seriously.

With that said, even my parents are asking me what the deal is with this one result in particular:

ohno.png

Performance isn’t supposed to go down. Everyone knows this. The version numbers go up and so does the performance as long as it’s not Javascript-based.

Enraged, I sprinted to my computer and searched for tesseract game, which gave me the entirely wrong result, but I eventually did manage to find the right one. I fired up zink-wip, certain that this would end up being some bug I’d already fixed.

Unfortunately, this was not the case.

tesseract-bad.png

I vowed not to sleep, rebase, leave my office, or even run another application until this was resolved, so you can imagine how pleased I am to be writing this post after spending way too much time getting to the bottom of everything.

Speculation Interlude

Full disclosure: I didn’t actually check why performance went down. I’m pretty sure it’s just the result of having improved buffer mapping to be better in most cases, which ended up hurting this case.

But Why

…is the performance so bad?

A quick round of profiling revealed that this was down to a Gallium component called vbuf, used for translating vertex buffers and attributes from the ones specified by the application to ones that drivers can actually support. The component itself is fine; the problem is that, ideally, it’s not something you ever want to be hitting when you care about performance.

Consider the usual sequence of drawing a frame:

  • generate and upload vertex data
  • bind some descriptors
  • maybe throw in a query or two if you need some spice
  • draw
  • repeat until frame is done

This is all great and normal, but what would happen—just hypothetically of course—if instead it looked like this:

  • generate and upload vertex data
  • stall and read vertex data
  • rewrite vertex data in another format and reupload
  • bind some descriptors
  • maybe throw in a query or two if you need some spice
  • draw
  • repeat until frame is done

Suddenly the driver is now stalling multiple times per frame on top of doing lots of CPU work!

Incidentally, this is (almost certainly) why performance appeared to have regressed: the vertex buffer is now device-local and can’t be mapped directly, so it has to be copied to a new buffer before it can be read, which is even slower.

Just AMD Problems

DISCLAIMER: We’re going deep into meme territory now, so let’s all dial down the seriousness about a thousand notches before posting about how much I hate AMD or whatever.

vertexattribmeme.png

Unlike cool hardware, AMD opts to not support features which might be less performant. I assume this is in the hopes that developers will Make The Right Choice and not use those features, but obviously developers are gonna develop, and so it is that Tesseract-The-Game-But-Not-The-One-On-Steam uses 3-component vertex attributes that aren’t supported by AMD hardware, necessitating the use of vbuf to translate them to 4-component attributes that can be safely used.

Decomposition

The vertex buffer format at work here was R8G8B8_SNORM, which is a perfectly cromulent format as long as you hate yourself. A shader would read this as a vec4, which, by the power of buffer robustness, gets translated to vec4(x, y, z, 1.0) because the w component is missing.

The approach I took to solving this was to decompose the vertex attribute into three separate R8_SNORM attributes, as this single-component format is wimpy enough for AMD to handle. Thus, a vertex input state containing three separate attributes including this one would now contain five, as the original R8G8B8_SNORM one is split into three, each reading a single component at an offset to simulate the original attribute.

The tricky part to this is that it requires a vertex shader prolog and variant in order to successfully split the shader’s input in such a way that the read value is the same. It also requires a NIR pass. Let’s check out the NIR pass since this blog has gone for way too long without seeing any real work:

struct decompose_state {
  nir_variable **split;
  bool needs_w;
};

static bool
decompose_attribs(nir_shader *nir, uint32_t decomposed_attrs, uint32_t decomposed_attrs_without_w)
{
   uint32_t bits = 0;
   nir_foreach_variable_with_modes(var, nir, nir_var_shader_in)
      bits |= BITFIELD_BIT(var->data.driver_location);
   bits = ~bits;
   u_foreach_bit(location, decomposed_attrs | decomposed_attrs_without_w) {
      nir_variable *split[5];
      struct decompose_state state;
      state.split = split;
      nir_variable *var = nir_find_variable_with_driver_location(nir, nir_var_shader_in, location);
      assert(var);
      split[0] = var;
      bits |= BITFIELD_BIT(var->data.driver_location);
      const struct glsl_type *new_type = glsl_type_is_scalar(var->type) ? var->type : glsl_get_array_element(var->type);
      unsigned num_components = glsl_get_vector_elements(var->type);
      state.needs_w = (decomposed_attrs_without_w & BITFIELD_BIT(location)) != 0 && num_components == 4;
      for (unsigned i = 0; i < (state.needs_w ? num_components - 1 : num_components); i++) {
         split[i+1] = nir_variable_clone(var, nir);
         split[i+1]->name = ralloc_asprintf(nir, "%s_split%u", var->name, i);
         if (decomposed_attrs_without_w & BITFIELD_BIT(location))
            split[i+1]->type = !i && num_components == 4 ? var->type : new_type;
         else
            split[i+1]->type = new_type;
         split[i+1]->data.driver_location = ffs(bits) - 1;
         bits &= ~BITFIELD_BIT(split[i+1]->data.driver_location);
         nir_shader_add_variable(nir, split[i+1]);
      }
      var->data.mode = nir_var_shader_temp;
      nir_shader_instructions_pass(nir, lower_attrib, nir_metadata_dominance, &state);
   }
   nir_fixup_deref_modes(nir);
   NIR_PASS_V(nir, nir_remove_dead_variables, nir_var_shader_temp, NULL);
   optimize_nir(nir);
   return true;
}

First, the base of the pass; two masks are provided, one for attributes that are being fully split (i.e., four components) and one for attributes that have fewer than four components and thus need to have a w component added, as in the Tesseract case. Each variable in the mask is split into four, with slightly different behavior for the ones needing a w and the ones that don’t.

The new variables are all given new driver locations matching the ones given to the split attributes for the vertex input pipeline state, and the decompose_state is passed along to the per-instruction part of the pass:

static bool
lower_attrib(nir_builder *b, nir_instr *instr, void *data)
{
   struct decompose_state *state = data;
   nir_variable **split = state->split;
   if (instr->type != nir_instr_type_intrinsic)
      return false;
   nir_intrinsic_instr *intr = nir_instr_as_intrinsic(instr);
   if (intr->intrinsic != nir_intrinsic_load_deref)
      return false;
   nir_deref_instr *deref = nir_src_as_deref(intr->src[0]);
   nir_variable *var = nir_deref_instr_get_variable(deref);
   if (var != split[0])
      return false;
   unsigned num_components = glsl_get_vector_elements(split[0]->type);
   b->cursor = nir_after_instr(instr);
   nir_ssa_def *loads[4];
   for (unsigned i = 0; i < (state->needs_w ? num_components - 1 : num_components); i++)
      loads[i] = nir_load_deref(b, nir_build_deref_var(b, split[i+1]));
   if (state->needs_w) {
      loads[3] = nir_channel(b, loads[0], 3);
      loads[0] = nir_channel(b, loads[0], 0);
   }
   nir_ssa_def *new_load = nir_vec(b, loads, num_components);
   nir_ssa_def_rewrite_uses(&intr->dest.ssa, new_load);
   nir_instr_remove_v(instr);
   return true;
}

The existing variable is passed along with the new variable array. Where the original is loaded, instead the new variables are all loaded in sequence and assembled into a vec matching the length of the original one. For attributes needing a w component, the first new variable is loaded as a vec4 so that the w component can be reused naturally. Then the original load instruction is removed, and with it, the original variable and its brokenness.

Immediate Results

Sort of.

tesseract-semifixed.png

The frames were definitely there, but the graphics…

Occlusion Queries

It turns out there’s almost zero coverage for occlusion queries in Vulkan’s CTS. There’s surprisingly little coverage for most query-related things, in fact, which means it wasn’t too surprising when it turned out that there were RADV query bugs at play. What was surprising was how they manifested, but that was about par for anything that reads garbage memory.

A simple one-liner later (just kidding, this fucken thing took like 4 days to find) and, magically, things were happening:

tesseract-fixed.png

We Did It.

A big thanks to Bas Nieuwenhuizen for consulting along the way even despite being so busy getting a RADV raytracing MR up and, as always, preparing his next blog post.

August 25, 2021

A year ago, I first announced libei - a library to support emulated input. After an initial spurt of development, it was left mostly untouched until a few weeks ago. Since then, another flurry of changes have been added, including some initial integration into GNOME's mutter. So, let's see what has changed.

A Recap

First, a short recap of what libei is: it's a transport layer for emulated input events to allow for any application to control the pointer, type, etc. But, unlike the XTEST extension in X, libei allows the compositor to be in control over clients, the devices they can emulate and the input events as well. So it's safer than XTEST but also a lot more flexible. libei already supports touch and smooth scrolling events, something XTest doesn't have or is struggling with.

Terminology refresher: libei is the client library (used by an application wanting to emulate input), EIS is the Emulated Input Server, i.e. the part that typically runs in the compositor.

Server-side Devices

So what has changed recently: first, the whole approach has flipped on its head - now a libei client connects to the EIS implementation and "binds" to the seats the EIS implementation provides. The EIS implementation then provides input devices to the client. In the simplest case, that's just a relative pointer but we have capabilities for absolute pointers, keyboards and touch as well. Plans for the future are to add gestures and tablet support too. Possibly joysticks, but I haven't really thought about that in detail yet.

So basically, the initial conversation with an EIS implementation goes like this:

  • Client: Hello, I am $NAME
  • Server: Hello, I have "seat0" and "seat1"
  • Client: Bind to "seat0" for pointer, keyboard and touch
  • Server: Here is a pointer device
  • Server: Here is a keyboard device
  • Client: Send relative motion event 10/2 through the pointer device
Notice how the touch device is missing? The capabilities the client binds to are just what the client wants; the server doesn't need to actually give the client a device for that capability.

One of the design choices for libei is that devices are effectively static. If something changes on the EIS side, the device is removed and a new device is created with the new data. This applies for example to regions and keymaps (see below), so libei clients need to be able to re-create their internal states whenever the screen or the keymap changes.

Device Regions

Devices can now have regions attached to them, also provided by the EIS implementation. These regions define areas reachable by the device and are required for clients such as Barrier. On a dual-monitor setup you may have one device with two regions or two devices with one region (representing one monitor), it depends on the EIS implementation. But either way, as libei client you will know that there is an area and you will know how to reach any given pixel on that area. Since the EIS implementation decides the regions, it's possible to have areas that are unreachable by emulated input (though I'm struggling a bit for a real-world use-case).

So basically, the conversation with an EIS implementation goes like this:

  • Client: Hello, I am $NAME
  • Server: Hello, I have "seat0" and "seat1"
  • Client: Bind to "seat0" for absolute pointer
  • Server: Here is an abs pointer device with regions 1920x1080@0,0, 1080x1920@1920,0
  • Server: Here is an abs pointer device with regions 1920x1080@0,0
  • Server: Here is an abs pointer device with regions 1080x1920@1920,0
  • Client: Send abs position 100/100 through the second device
Notice how we have three absolute devices? A client emulating a tablet that is mapped to a screen could just use the third device. As with everything, the server decides what devices are created and the clients have to figure out what they want to do and how to do it.

Perhaps unsurprisingly, the use of regions makes libei clients windowing-system independent. The Barrier EI support WIP no longer has any Wayland-specific code in it. In theory, we could implement EIS in the X server and libei clients would work against that unmodified.

Keymap handling

The keymap handling has been changed so the keymap too is provided by the EIS implementation now, effectively in the same way as the Wayland compositor provides the keymap to Wayland clients. This means a client knows what keycodes to send, it can handle the state to keep track of things, etc. Using Barrier as an example again - if you want to generate an "a", you need to look up the keymap to figure out which keycode generates an A, then you can send that through libei to actually press the key.

Admittedly, this is quite messy. XKB (and specifically libxkbcommon) does not make it easy to go from a keysym to a keycode. The existing Barrier X code is full of corner-cases with XKB already, and I expect those to be necessary for the EI support as well.
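To illustrate the awkwardness, going from a keysym to a keycode with libxkbcommon more or less means iterating over the keymap. A simplified sketch that ignores shift levels, groups and modifiers - which is exactly where the corner cases live:

#include <xkbcommon/xkbcommon.h>

/* Simplified sketch: find a keycode that produces the given keysym in the
 * current state. Ignores shift levels, groups and modifiers, which is where
 * the real-world corner cases are. */
static xkb_keycode_t
keycode_for_keysym(struct xkb_keymap *keymap, struct xkb_state *state,
                   xkb_keysym_t wanted)
{
    xkb_keycode_t min = xkb_keymap_min_keycode(keymap);
    xkb_keycode_t max = xkb_keymap_max_keycode(keymap);

    for (xkb_keycode_t kc = min; kc <= max; kc++) {
        if (xkb_state_key_get_one_sym(state, kc) == wanted)
            return kc;
    }
    return XKB_KEYCODE_INVALID;
}

int main(void)
{
    struct xkb_context *ctx = xkb_context_new(XKB_CONTEXT_NO_FLAGS);
    /* NULL names: use the defaults/environment for the keymap; in the libei
     * case the keymap would instead come from the EIS implementation */
    struct xkb_keymap *keymap =
        xkb_keymap_new_from_names(ctx, NULL, XKB_KEYMAP_COMPILE_NO_FLAGS);
    struct xkb_state *state = xkb_state_new(keymap);

    xkb_keycode_t kc = keycode_for_keysym(keymap, state, XKB_KEY_a);
    /* this keycode is what a client would send through libei to type an "a" */
    return kc == XKB_KEYCODE_INVALID;
}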

Scrolling

Scroll events have four types: pixel-based scrolling, discrete scrolling, and scroll stop/cancel events. The first should be obvious, discrete scrolling is for mouse wheels. It uses the same 120-based API that Windows (and the kernel) use, so it's compatible with high-resolution wheel mice. The scroll stop event notifies an EIS implementation that the scroll interaction has stopped (e.g. lifting fingers off) which in turn may start kinetic scrolling - just like the libinput/Wayland scroll stop events. The scroll cancel event notifies the EIS implementation that scrolling really has stopped and no kinetic scrolling should be triggered. There's no equivalent in libinput/Wayland for this yet but it helps to get the hook in place.

Emulation "Transactions"

This has fairly little functional effect, but interactions with an EIS server are now sandwiched in a start/stop emulating pair. While this doesn't matter for one-shot tools like xdotool, it does matter for things like Barrier which can send the start emulating event when the pointer enters the local window. This again allows the EIS implementation to provide some visual feedback to the user. To correct the example from above, the sequence is actually:

  • ...
  • Server: Here is a pointer device
  • Client: Start emulating
  • Client: Send relative motion event 10/2 through the pointer device
  • Client: Send relative motion event 1/4 through the pointer device
  • Client: Stop emulating

Properties

Finally, there is now a generic property API, something copied from PipeWire. Properties are simple key/value string pairs and cover those things that aren't in the immediate API. One example here: the portal can set things like "ei.application.appid" to the Flatpak's appid. Properties can be locked down and only libei itself can set properties before the initial connection. This makes them reliable enough for the EIS implementation to make decisions based on their values. Just like with PipeWire, the list of useful properties will grow over time; it's too early to tell what is really needed.

Repositories

Now, for the actual demo bits: I've added enough support to Barrier, XWayland, Mutter and GNOME Shell that I can control a GNOME on Wayland session through Barrier (note: the controlling host still needs to run X since we don't have the ability to capture input events under Wayland yet). The keymap handling in Barrier is nasty but it's enough to show that it can work.

GNOME Shell has a rudimentary UI, again just to show what works:

The status icon shows ... if libei clients are connected, it changes to !!! while the clients are emulating events. Clients are listed by name and can be disconnected at will. I am not a designer, this is just a PoC to test the hooks.

Note how xdotool is listed in this screenshot: that tool is unmodified, it's the XWayland libei implementation that allows it to work and show up correctly.

The various repositories are in the "wip/ei" branch of:

And of course libei itself.

Where to go from here? The last weeks were driven by rapid development, so there's plenty of test cases to be written to make sure the new code actually works as intended. That's easy enough. Looking at the Flatpak integration is another big ticket item, once the portal details are sorted all the pieces are (at least theoretically) in place. That aside, improving the integrations into the various systems above is obviously what's needed to get this working OOTB on the various distributions. Right now it's all very much in alpha stage and I could use help with all of those (unless you're happy to wait another year or so...). Do ping me if you're interested to work on any of this.

August 18, 2021

We Back

Just a quick update today while I dip my toes back into the blogosphere to remind myself that it’s not so scary.

Remember when I blogged about how nice it would be to have a suballocator all those months ago?

Now it’s landed, and it’s nice indeed to have a suballocator.

Remember when everyone wanted GL 4.6 compatibility contexts so they could play Feral ports of their favorite games? zink-wip did that 6 months ago.

What this all means is that it’s more or less open testing season on zink. I’ve already got a sizable number of tickets open for various Steam games based on zink-wip testing, but this is hardly conclusive.

What games work for you?

What games don’t work?

I’m not gonna test them all myself, so get out there and start crashing.

August 11, 2021

Here I’m playing “Spelunky 2” on my laptop and simultaneously replaying the same Vulkan calls on an ARM board with Adreno GPU running the open source Turnip Vulkan driver. Hint: it’s an x64 Windows game that doesn’t run on ARM.

The bottom right is the game I’m playing on my laptop, the top left is GFXReconstruct immediately replaying Vulkan calls from the game on ARM board.

How is it done? And why would it be useful for debugging? Read below!


Debugging issues a driver faces with real-world applications requires the ability to capture and replay graphics API calls. However, for mobile GPUs it becomes even more challenging since for a Vulkan driver the main “source” of real-world workloads is x86-64 apps that run via Wine + DXVK, mainly games which were made for desktop x86-64 Windows and do not run on ARM. Efforts are being made to run these apps on ARM but it is still a work in progress. And we want to test the drivers NOW.

The obvious solution would be to run those applications on an x86-64 machine, capturing all Vulkan calls, and then replay those calls on a second machine where we cannot run the app. This way it would be possible to test the driver even without running the application directly on it.

The main trouble is that Vulkan calls made on one GPU + Driver combo are not generally compatible with other GPU + Driver combo, sometimes even for one GPU vendor. There are different memory capabilities (VkPhysicalDeviceMemoryProperties), different memory requirements for buffer and images, different extensions available, and different optional features supported. It is easier with OpenGL but there are also some incompatibilities there.

There are two open-source vendor-agnostic tools for capturing Vulkan calls: RenderDoc (captures single frame) and GFXReconstruct (captures multiple frames). RenderDoc at the moment isn’t suitable for the task of capturing applications on desktop GPUs and replaying on mobile because it doesn’t translate memory type and requirements (see issue #814). GFXReconstruct on the other hand has the necessary features for this.

I’ll show a couple of tricks with GFXReconstruct I’m using to test things on Turnip.


Capturing with GFXReconstruct

At this point you either have the application itself or, if it doesn’t use Vulkan, a trace of its calls that could be translated to Vulkan. There are detailed instructions on how to use GFXReconstruct to capture a trace on a desktop OS. However there is no clear instruction for how to do this on Android (see issue #534); fortunately there is one in Android’s documentation:

Android how-to (click me)
For Android 9 you should copy layers to the application which will be traced
For Android 10+ it's easier to copy them to com.lunarg.gfxreconstruct.replay
You should have userdebug build of Android or probably rooted Android

# Push GFXReconstruct layer to the device
adb push libVkLayer_gfxreconstruct.so /sdcard/

# Since there is no APK for the capture layer,
# copy the layer to e.g. folder of com.lunarg.gfxreconstruct.replay
adb shell run-as com.lunarg.gfxreconstruct.replay cp /sdcard/libVkLayer_gfxreconstruct.so .

# Enable layers
adb shell settings put global enable_gpu_debug_layers 1

# Specify target application
adb shell settings put global gpu_debug_app <package_name>

# Specify layer list (from top to bottom)
adb shell settings put global gpu_debug_layers VK_LAYER_LUNARG_gfxreconstruct

# Specify packages to search for layers
adb shell settings put global gpu_debug_layer_app com.lunarg.gfxreconstruct.replay

If the target application doesn’t have rights to write into external storage - you should change where the capture file is created:

adb shell "setprop debug.gfxrecon.capture_file '/data/data/<target_app_folder>/files/'"


However, when trying to replay the trace you captured on another GPU - most likely it will result in an error:

[gfxrecon] FATAL - API call vkCreateDevice returned error value VK_ERROR_EXTENSION_NOT_PRESENT that does not match the result from the capture file: VK_SUCCESS.  Replay cannot continue.
Replay has encountered a fatal error and cannot continue: the specified extension does not exist

Or other errors/crashes. Fortunately we can limit the capabilities of the desktop GPU with VK_LAYER_LUNARG_device_simulation.

When simulating another GPU, VK_LAYER_LUNARG_device_simulation should be told to intersect the capabilities of both GPUs, making the capture compatible with both of them. This can be achieved with the recently added environment variables:

VK_DEVSIM_MODIFY_EXTENSION_LIST=whitelist
VK_DEVSIM_MODIFY_FORMAT_LIST=whitelist
VK_DEVSIM_MODIFY_FORMAT_PROPERTIES=whitelist

The whitelist name is rather confusing because it essentially means “intersection”.

One also needs a JSON file describing the target GPU’s capabilities, which can be obtained by running:

vulkaninfo -j &> <device_name>.json

The final command to capture a trace would be:

VK_LAYER_PATH=<path/to/device-simulation-layer>:<path/to/gfxreconstruct-layer> \
VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_gfxreconstruct:VK_LAYER_LUNARG_device_simulation \
VK_DEVSIM_FILENAME=<device_name>.json \
VK_DEVSIM_MODIFY_EXTENSION_LIST=whitelist \
VK_DEVSIM_MODIFY_FORMAT_LIST=whitelist \
VK_DEVSIM_MODIFY_FORMAT_PROPERTIES=whitelist \
<the_app>

Replaying with GFXReconstruct

gfxrecon-replay -m rebind --skip-failed-allocations <trace_name>.gfxr
  • -m Enable memory translation for replay on GPUs with memory types that are not compatible with the capture GPU’s
    • rebind Change memory allocation behavior based on resource usage and replay memory properties. Resources may be bound to different allocations with different offsets.
  • --skip-failed-allocations skip vkAllocateMemory, vkAllocateCommandBuffers, and vkAllocateDescriptorSets calls that failed during capture

Without these options replay would fail.

Now you could easily test any app/game on your ARM board, if you have enough RAM =) I even successfully ran a capture of “Metro Exodus” on Turnip.

But what if I want to test something that requires interactivity?

Or maybe you don’t want to save a huge trace to disk, which could grow to tens of gigabytes if the application runs for a considerable amount of time.

During recording GFXReconstruct just appends calls to a file; there are no additional post-processing steps. Given that, the next logical step is to just skip writing to disk and send the Vulkan calls over the network!

This would allow us to interact with the application and immediately see the results on another device with different GPU. And so I hacked together a crude support of over-the-network replay.

The only difference with ordinary tracing is that now instead of file we have to specify a network address of the target device:

VK_LAYER_PATH=<path/to/device-simulation-layer>:<path/to/gfxreconstruct-layer> \
    ...
GFXRECON_CAPTURE_FILE="<ip>:<port>" \
<the_app>

And on the target device:

while true; do gfxrecon-replay -m rebind --sfa ":<port>"; done

Why while true? It is common for DXVK to call vkCreateInstance several times, leading to the creation of several traces. When replaying over the network we therefore want gfxrecon-replay to immediately restart when one trace ends, to be ready for another.

You may want to bring the FPS down to match the capabilities of the lower-power GPU in order to prevent constant hiccups. This could be done either with libstrangle or with mangohud:

  • stranglevk -f 10
  • MANGOHUD_CONFIG=fps_limit=10 mangohud

You have seen the result at the start of the post.

August 10, 2021

I’ve been silent here for quite some time, so here is a quick summary of some of the new functionality we have been exposing in V3DV, the Vulkan driver for Raspberry PI 4, over the last few months:

  • VK_KHR_bind_memory2
  • VK_KHR_copy_commands2
  • VK_KHR_dedicated_allocation
  • VK_KHR_descriptor_update_template
  • VK_KHR_device_group
  • VK_KHR_device_group_creation
  • VK_KHR_external_fence
  • VK_KHR_external_fence_capabilities
  • VK_KHR_external_fence_fd
  • VK_KHR_external_semaphore
  • VK_KHR_external_semaphore_capabilities
  • VK_KHR_external_semaphore_fd
  • VK_KHR_get_display_properties2
  • VK_KHR_get_memory_requirements2
  • VK_KHR_get_surface_capabilities2
  • VK_KHR_image_format_list
  • VK_KHR_incremental_present
  • VK_KHR_maintenance2
  • VK_KHR_maintenance3
  • VK_KHR_multiview
  • VK_KHR_relaxed_block_layout
  • VK_KHR_sampler_mirror_clamp_to_edge
  • VK_KHR_storage_buffer_storage_class
  • VK_KHR_uniform_buffer_standard_layout
  • VK_KHR_variable_pointers
  • VK_EXT_custom_border_color
  • VK_EXT_external_memory_dma_buf
  • VK_EXT_index_type_uint8
  • VK_EXT_physical_device_drm

Besides that list of extensions, we have also added basic support for Vulkan subgroups (this is a Vulkan 1.1 feature) and Geometry Shaders (we use this to implement multiview).

I think we now meet most (if not all) of the Vulkan 1.1 mandatory feature requirements, but we still need to check this properly and we also need to start doing Vulkan 1.1 CTS runs and fix test failures. In any case, the bottom line is that Vulkan 1.1 should be fairly close now.

August 05, 2021

Just about a year after the original announcement, I think it's time to see the progress on power-profiles-daemon.

Note that I would still recommend you read the up-to-date project README if you have questions about why this project was necessary, and why a new project was started rather than building on an existing one.

 The project was born out of the need to make a firmware feature available to end users on a number of Lenovo laptop lines so that they could be fully usable on Fedora. For that, I worked with Mark Pearson from Lenovo, who wrote the initial kernel support for the feature and served as our link to the Lenovo firmware team, and Hans de Goede, who worked on making the kernel interfaces more generic.

More generic, but in a good way

 With the initial kernel support written for (select) Lenovo laptops, Hans implemented a more generic interface called platform_profile. This interface is now the one that power-profiles-daemon will integrate with, and means that it also supports a number of Microsoft Surface, HP, Lenovo's own Ideapad laptops, and maybe Razer laptops soon.
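For the curious, that kernel interface is a small sysfs API; a minimal sketch of reading it (paths as documented in the kernel's platform_profile ABI, no real error handling):

#include <stdio.h>

/* Minimal sketch: read the generic platform_profile sysfs interface that
 * power-profiles-daemon integrates with. */
int main(void)
{
    char current[64] = "", choices[256] = "";
    FILE *f;

    if ((f = fopen("/sys/firmware/acpi/platform_profile", "r"))) {
        fgets(current, sizeof(current), f);
        fclose(f);
    }
    if ((f = fopen("/sys/firmware/acpi/platform_profile_choices", "r"))) {
        fgets(choices, sizeof(choices), f);
        fclose(f);
    }
    printf("current profile: %s", current);
    printf("available profiles: %s", choices);
    /* writing e.g. "performance" to platform_profile (as root) switches profile */
    return 0;
}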

 The next item to make more generic is Lenovo's "lap detection" which still relies on a custom driver interface. This should soon be transformed into a generic proximity sensor, which will mean I get to work some more on iio-sensor-proxy.

Working those interactions

 power-profiles-daemon landed in a number of distributions, sometimes enabled by default, sometimes not enabled by default (sigh, the less said about that the better), which fortunately meant that we had some early feedback available.

 The goal was always to have the user in control, but we still needed to think carefully about how the UI would look and how users would interact with it when a profile was temporarily unavailable, or when the system started a "power saver" mode because the battery was running low.

 The latter is something that David Redondo's work on the "HoldProfile" API made possible. Software can programmatically switch to the power-saver or performance profile for the duration of a command. This is useful to switch to the Performance profile when running a compilation (eg. powerprofilesctl jhbuild --no-interact build gnome-shell), or for gnome-settings-daemon to set the power-saver profile when low on battery.

 The aforementioned David Redondo and Kai Uwe Broulik also worked on the KDE interface to power-profiles-daemon, as Florian Müllner implemented the gnome-shell equivalent.

Promised by me, delivered by somebody else :)

 I took this opportunity to update the Power panel in Settings, which shows off the temporary switch to the performance mode, and the setting to automatically switch to power-saver when low on battery.

Low-Power, everywhere

 Talking of which, while it's important for the system to know that it's targeting power-saving behaviour, it's also pretty useful for applications to try and behave better.
 
 Maybe you've already integrated with "low memory" events using GLib, but thanks to Patrick Griffis you can be an even better ecosystem citizen and monitor whether the system is in "Power Saver" mode and adjust your application's behaviour.
 
 This feature will be available in GLib 2.70 along with documentation of useful steps to take. GNOME Software will already be using this functionality to avoid large automated downloads when energy saving is needed.
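 A rough sketch of what using it looks like (GPowerProfileMonitor, new in GLib 2.70; what your application does in response is of course up to you):

#include <gio/gio.h>

/* Sketch: watch GLib's power saver state and adjust behaviour accordingly. */
static void
power_saver_changed(GObject *monitor, GParamSpec *pspec, gpointer user_data)
{
    gboolean saving =
        g_power_profile_monitor_get_power_saver_enabled(G_POWER_PROFILE_MONITOR(monitor));

    /* e.g. pause large automated downloads, reduce polling, etc. */
    g_print("power saver %s\n", saving ? "enabled" : "disabled");
}

int main(void)
{
    GPowerProfileMonitor *monitor = g_power_profile_monitor_dup_default();

    g_signal_connect(monitor, "notify::power-saver-enabled",
                     G_CALLBACK(power_saver_changed), NULL);

    g_main_loop_run(g_main_loop_new(NULL, FALSE));
    return 0;
}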

Availability

 The majority of the above features are available in the GNOME 41 development branches and should get to your favourite GNOME-friendly distribution for their next release, such as Fedora 35.
August 04, 2021

 I've been chasing a crocus misrendering bug shown in a Qt trace.


The bottom image is crocus vs 965 on top. This only happened on Gen4->5, so Ironlake and GM45 were my test machines. I burned a lot of time trying to work this out. I trimmed the traces down, dumped a stupendous amount of batchbuffers, turned off UBO push constants, dumped all the index and vertex buffers, tried some RGBx changes, but nothing was rushing to hit me, except that the vertex shaders produced were different.

However they were different for many reasons, due to the optimization pipelines the mesa state tracker runs vs the 965 driver. Inputs and UBO loads were in different places so there was a lot of noise in the shaders.

I ported the trace to a piglit GL application so I could hack more easily on the shaders and GL, and with that I trimmed it down even further (even if I did burn some time on a misplaced */+ typo).

Using the ported app, I removed all uniform buffer loads and then split the vertex shader in half (it was quite large, but had two chunks). I could then finally spot the difference in the NIR shaders.

What stood out was that the 965 shader had an if which the crocus shader had converted to a bcsel. This is part of a peephole optimization pass that mesa/st calls, and sure enough removing that call fixed the rendering. But why? It is a valid optimization.

In a parallel thread on another part of the planet, Ian Romanick filed a MR to mesa https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12191 fixing a bug in the gen4/5 fs backend with conditional selects. This was something he noticed while debugging elsewhere. However his fix was for the fragment shader backend, and my bug was in the vec4 vertex shader backend. I tracked down where the same changes were needed in the vec4 backend and tested a fix on top of his branch, and the misrendering disappeared.

It's a strange coincidence we both started hitting the same bug in different backends in the same week via different tests, but he's definitely saved me a lot of pain in working this out! Hopefully we can combine them and get it merged this week.

Also thanks to Angelo on the initial MR for testing crocus with some real workloads.

July 30, 2021

Deeper Into Software

I don’t feel like blogging about zink today, so here’s more about everyone’s favorite software implementation of Vulkan.

The existing LLVMpipe architecture works like this from a top-down view:

  • mesa / st - this is the GL/Gallium state tracker
  • llvmpipe - this is the Gallium driver
  • gallivm - this is the LLVM program compiler
  • llvm - this is where the fragment shader runs

In short, everything is for the purpose of compiling LLVM programs which will draw/compute the desired result.

Lavapipe makes a slight change:

  • lavapipe - this is the Vulkan state tracker
  • llvmpipe - this is the Gallium driver
  • gallivm - this is the LLVM program compiler
  • llvm - this is where the fragment shader runs

It’s that simple.

Thus, any time a new feature is added to Lavapipe, what’s actually being done is plumbing that Vulkan feature through some number of layers to change how LLVM is executed. Some features, like samplerAnisotropy, require significant work at the gallivm layer just to toggle a boolean flag at the lavapipe level.

Other changes, like KHR_timeline_semaphores are entirely contained in Lavapipe.

What Are Timeline Semaphores?

Vulkan has a number of mechanisms for synchronization including fences, events, and binary semaphores, all of which serve a specific purpose. For more concrete details on all of them, please read the blog of an actual expert.

The best and most awesome (don’t @ me, it’s not debatable) of these synchronization methods, however, is the timeline semaphore.

A timeline semaphore is an object that can be used to signal and wait on specific integer-assigned points in command execution, also known as timelines. Each queue submission can be accompanied by an array of timeline semaphores to wait on and an array to signal; command buffers in a given submission will wait before executing, then signal after they’re done. This enables parallel code design where one thread can assemble command buffers and submit them, and the GPU can be made to pause at certain points until the buffers/images it references have been populated by another thread before continuing with execution.

Typically, semaphores are managed through signals which pass through the kernel and hardware, meaning that “waiting” on a timeline is really just waiting on an ioctl (DRM_IOCTL_SYNCOBJ_TIMELINE_WAIT) to signal that the specified timeline id has occurred, which requires no additional host-side synchronization. Things get a bit trickier in software, however, as the kernel is not involved, so everything must be managed in the driver.
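For reference, this is roughly what the API looks like from the application side (core Vulkan 1.2 / VK_KHR_timeline_semaphore; a minimal sketch with error handling omitted):

#include <vulkan/vulkan.h>

/* Minimal sketch: create a timeline semaphore and host-wait for it to reach
 * value 8. Assumes a valid VkDevice; all error checking omitted. */
static void
timeline_example(VkDevice device)
{
    VkSemaphoreTypeCreateInfo type_info = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO,
        .semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE,
        .initialValue = 0,
    };
    VkSemaphoreCreateInfo create_info = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
        .pNext = &type_info,
    };
    VkSemaphore semaphore;
    vkCreateSemaphore(device, &create_info, NULL, &semaphore);

    /* queue submissions chain a VkTimelineSemaphoreSubmitInfo to their
     * VkSubmitInfo to pick the values to wait on / signal */

    uint64_t wait_value = 8;
    VkSemaphoreWaitInfo wait_info = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO,
        .semaphoreCount = 1,
        .pSemaphores = &semaphore,
        .pValues = &wait_value,
    };
    /* blocks until the timeline reaches 8 (or the timeout expires) */
    vkWaitSemaphores(device, &wait_info, UINT64_MAX);
}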

Lavapipe And Timelines

This was a todo item sitting on the list for a while because it was tricky to handle. The most visible problems here were:

  • connecting timeline identifiers with queue submissions; timelines only need to be monotonic, not sequential, meaning that using something like a sliding array wouldn’t be very efficient
  • the actual synchronization when threads are involved

After some thought and deliberation about my life choices up to this point, I decided to tackle this implementation. The methodology I selected was to add a monotonic counter to the internal command buffer submission and then create a series of per-object timeline “links” which would serve to match the counter to the timeline identifier. This would enable each timeline semaphore to maintain a singly-linked list of links each time they were submitted, and the list could then be pruned at any given time—referenced against the internal counter—to update the “current” timeline id and then evaluate whether a specified wait condition had passed. In the case where the condition had not passed, the timeline link could also store a handle to the fence from the llvmpipe queue submission that could be waited on directly.
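In very rough terms, the bookkeeping looks something like the sketch below. This is only an illustration of the idea described above, not the actual lavapipe code:

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Sketch of the idea only - not the actual lavapipe data structures.
 * Each queue submission gets a monotonically increasing internal counter;
 * each submitted timeline value is recorded as a "link" on the semaphore. */
struct timeline_link {
    uint64_t timeline_id;   /* value the app asked to signal */
    uint64_t counter;       /* internal submission counter */
    void *fence;            /* llvmpipe fence to wait on if not yet done */
    struct timeline_link *next;
};

struct timeline_semaphore {
    uint64_t current;            /* highest timeline value known signaled */
    struct timeline_link *links; /* singly-linked, in submission order */
};

/* Prune links whose submission counter has completed, bumping "current". */
static void
prune(struct timeline_semaphore *sem, uint64_t completed_counter)
{
    while (sem->links && sem->links->counter <= completed_counter) {
        struct timeline_link *done = sem->links;
        if (done->timeline_id > sem->current)
            sem->current = done->timeline_id;
        sem->links = done->next;
        free(done);
    }
}

/* Has the wait condition passed? If not, the caller can wait on the fence
 * stored in the first unsatisfied link instead of spinning. */
static bool
wait_satisfied(const struct timeline_semaphore *sem, uint64_t wait_value)
{
    return sem->current >= wait_value;
}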

Did it work?

Almost on the first try, actually.

But then I ran into a wall in CI while running piglit tests through zink.

It turns out that the CTS tests are considerably less aggressive than the piglit ones for things like this: specifically, there don’t appear to be any cases where a single timeline has 16 threads all trying to wait on it at different values, iterating thousands of times over the course of a couple seconds.

Oops.

But that’s now taken care of, and conformance never felt so good.

The road to Vulkan 1.2 continues!

July 29, 2021

This is a title

I’m back.

Where did I go?

My birthday passed recently, so I gifted myself a couple weeks off from blogging. Feels good.

For today, this is a Lavapipe blog.

What’s New With Lavapipe?

Lots.

Let’s check out what conformant features were added just in July:

  • EXT_line_rasterization
  • EXT_vertex_input_dynamic_state
  • EXT_extended_dynamic_state2
  • EXT_color_write_enable
  • features.strictLines
  • features.shaderStorageImageExtendedFormats
  • features.shaderStorageImageReadWithoutFormat
  • features.samplerAnisotropy
  • KHR_timeline_semaphores

Also under the hood now is a new 2D rasterizer from VMWare which yields “a 2x to 3x performance improvement for 2D workloads”.

Why Aren’t You Using Lavapipe Yet?

Have a big Vulkan-using project? Do you constantly have to worry about breakages from all manner of patches being merged without testing? Can’t afford or too lazy to set up and maintain actual hardware for testing?

Why not Lavapipe?

Seriously, why not? If there’s features missing that you need for your project, open tickets so we know what to work on.

July 28, 2021

Part 1, Part 2, Part 3

After getting thoroughly nerd-sniped a few weeks back, we now have FreeBSD support through qemu in the freedesktop.org ci-templates. This is possible through the qemu image generation we have had for quite a while now. So let's see how we can easily add a FreeBSD VM (or other distributions) to our gitlab CI pipeline:


.freebsd:
  variables:
    FDO_DISTRIBUTION_VERSION: '13.0'
    FDO_DISTRIBUTION_TAG: 'freebsd.0' # some value for humans to read

build-image:
  extends:
    - .freebsd
    - .fdo.qemu-build@freebsd
  variables:
    FDO_DISTRIBUTION_PACKAGES: "curl wget"
Now, so far this may all seem quite familiar. And indeed, this is almost exactly the same process as for normal containers (see Part 1), the only difference is the .fdo.qemu-build base template. Using this template means we build an image babushka: our desired BSD image is actually a QEMU RAW image sitting inside another generic container image. That latter image only exists to start the QEMU image and set up the environment if need be; you don't need to care which distribution it is (Fedora for now).

Because of the nesting, we need to handle this accordingly in the script: tag for the actual test job - we need to start the image and make sure our jobs actually run within it. The templates set up an ssh alias "vm" for this, and the vmctl script helps to do things on the VM:


test-build:
  extends:
    - .freebsd
    - .fdo.distribution-image@freebsd
  script:
    # start our QEMU image
    - /app/vmctl start

    # copy our current working directory to the VM
    # (this is a yaml multiline command to work around the colon)
    - |
      scp -r $PWD vm:

    # Run the build commands on the VM and if they succeed, create a .success file
    - /app/vmctl exec "cd $CI_PROJECT_NAME; meson builddir; ninja -C builddir" && touch .success || true

    # Copy results back to our run container so we can include them in artifacts:
    - |
      scp -r vm:$CI_PROJECT_NAME/builddir .

    # kill the VM
    - /app/vmctl stop

    # Now that we have cleaned up: if our build job before
    # failed, exit with an error
    - [[ -e .success ]] || exit 1
Now, there's a bit to unpack but with the comments above it should be fairly obvious what is happening. We start the VM, copy our working directory over and then run a command on the VM before cleaning up. The reason we use touch .success is simple: it allows us to copy things out and clean up before actually failing the job.

Obviously, if you want to build any other distribution you just swap the freebsd out for fedora or whatever - the process is the same. libinput has been using fedora qemu images for ages now.

July 27, 2021

Thanks to the work done by Josè Expòsito, libinput 1.19 will ship with a new type of gesture: Hold Gestures. So far libinput supported swipe (moving multiple fingers in the same direction) and pinch (moving fingers towards each other or away from each other). These gestures are well-known, commonly used, and familiar to most users. For example, GNOME 40 recently increased its use of touchpad gestures to switch between workspaces, etc. Swipe and pinch gestures require movement, so it was not possible (for callers) to detect fingers on the touchpad that don't move.

This gap is now filled by Hold gestures. These are triggered when a user puts fingers down on the touchpad, without moving the fingers. This allows for some new interactions and we had two specific ones in mind: hold-to-click, a common interaction on older touchscreen interfaces where holding a finger in place eventually triggers the context menu. On a touchpad, a three-finger hold could zoom in, or do dictionary lookups, or kill a kitten. Whatever matches your user interface most, I guess.

The second interaction was the ability to stop kinetic scrolling. libinput does not actually provide kinetic scrolling, it merely provides the information needed in the client to do it there: specifically, it tells the caller when a finger was lifted off a touchpad at the end of a scroll movement. It's up to the caller (usually: the toolkit) to implement the kinetic scrolling effects. One missing piece was that while libinput provided information about lifting the fingers, it didn't provide information about putting fingers down again later - a common way to stop scrolling on other systems.

Hold gestures are intended to address this: a hold gesture triggered after a flick with two fingers can now be used by callers (read: toolkits) to stop scrolling.

Now, one important thing about hold gestures is that they will generate a lot of false positives, so be careful how you implement them. The vast majority of interactions with the touchpad will trigger some movement - once that movement hits a certain threshold the hold gesture will be cancelled and libinput sends out the movement events. Those events may be tiny (depending on touchpad sensitivity) so getting the balance right for the aforementioned hold-to-click gesture is up to the caller.
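For callers, the new events follow the shape of the existing gesture events. An illustrative sketch of the consumer side (what to actually do on begin/end is up to the toolkit):

#include <libinput.h>
#include <stdio.h>

/* Illustrative consumer-side handling of the new hold gestures in
 * libinput 1.19. What to do on begin/end is entirely up to the caller. */
static void
handle_gesture(struct libinput_event *event)
{
    struct libinput_event_gesture *gesture;

    switch (libinput_event_get_type(event)) {
    case LIBINPUT_EVENT_GESTURE_HOLD_BEGIN:
        gesture = libinput_event_get_gesture_event(event);
        /* e.g. stop any ongoing kinetic scrolling, or arm a hold-to-click timer */
        printf("hold begin with %d fingers\n",
               libinput_event_gesture_get_finger_count(gesture));
        break;
    case LIBINPUT_EVENT_GESTURE_HOLD_END:
        gesture = libinput_event_get_gesture_event(event);
        /* a cancelled hold means the fingers moved past the threshold, so
         * movement/scroll events follow - don't treat it as a hold-to-click */
        printf("hold end (%s)\n",
               libinput_event_gesture_get_cancelled(gesture) ? "cancelled" : "finished");
        break;
    default:
        break;
    }
}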

As usual, the required bits to get hold gestures into the Wayland protocol are either in the works, mid-flight, or merge-ready, so expect this to hit the various repositories in the medium-term future.

July 26, 2021

I have not talked about raytracing in RADV for a while, but after some procrastination caused by being focused on other things, I recently got back to it and achieved my next milestone.

In particular, I have been hacking away at CTS and got to a point where CTS on dEQP-VK.ray_tracing.* runs to completion without crashes or hangs. Furthermore, I got the pass rate to 90% of non-skipped tests. So we're finally getting somewhere close to usable.

As a further sign that it is usable, my fixes for CTS also fixed the corruption issues in Quake 2 RTX (GitHub version), delivering this image:

Q2RTX on RADV

Of course not everything is perfect yet. Besides the less-than-100% CTS pass rate, it has about half the Windows performance at 4k right now, and we still have some feature gaps to close before it is really usable for most games.

Why is it slow?

TL;DR Because I haven’t optimized it yet and implemented every shortcut imaginable.

AMD raytracing primer

Raytracing with Vulkan works with two steps:

  1. You build a giant acceleration structure that contains all your geometry. This typically ends up being some kind of tree, usually a Bounding Volume Hierarchy (BVH).
  2. Then you trace rays using some traversal shader through the acceleration structure you just built.

With RDNA2, AMD started accelerating this by adding an instruction that does an intersection test between a ray and a single BVH node, where the BVH node can be either

  • A triangle
  • A box node specifying 4 AABB boxes

Of course this isn’t quite enough to deal with all geometry types in Vulkan so we also add two more:

  • an AABB box
  • an instance of another BVH combined with a transformation matrix

Building the BVH

With a search tree like a BVH it is very possible to make trees that are very useless. As an example, consider a binary search tree that is very unbalanced. We can have similarly bad things with a BVH, including making it unbalanced or having overlapping bounding volumes.

And my implementation is the simplest thing possible: the input geometry becomes the leaves in exactly the same order and then internal nodes are created just as you’d draw them. That is probably decently fast in building the BVH but surely results in a terrible BVH to actually use.
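
To make that concrete, here is a toy sketch of such a builder. This is not the actual RADV code; emit_leaf_node() and emit_box_node() are hypothetical helpers that append a node to the BVH and return its id, and MAX_NODES is an assumed bound.

static uint32_t
build_naive_bvh(struct bvh *bvh, const struct primitive *prims, uint32_t count)
{
   uint32_t ids[MAX_NODES];

   /* Leaves keep exactly the input order: no sorting, no clustering. */
   for (uint32_t i = 0; i < count; i++)
      ids[i] = emit_leaf_node(bvh, &prims[i]);

   /* Each pass groups 4 consecutive children under a box node until a
    * single root remains - fast to build, but the resulting bounding
    * volumes can overlap badly. */
   while (count > 1) {
      uint32_t out = 0;
      for (uint32_t i = 0; i < count; i += 4) {
         uint32_t n = (count - i < 4) ? count - i : 4;
         ids[out++] = emit_box_node(bvh, &ids[i], n);
      }
      count = out;
   }
   return ids[0]; /* root node id */
}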

BVH traversal

After we built a BVH we can start tracing some rays. In rough pseudocode the current implementation is

stack = empty
insert root node into stack
while stack is not empty:

   node = pop a node from the stack

   if we left the bottom level BVH:
      reset ray origin/direction to initial origin/direction

   result = amd_intersect(ray, node)
   switch node type:
      triangle:
         if result is a hit:
            load some node data
            process hit
      box node:
         for each box hit:
            push child node on stack
      custom node 1 (instance):
         load node data
         push the root node of the bottom BVH on the stack
         apply transformation matrix to ray origin/direction
      custom node 2 (AABB geometry):
         load node data
         process hit

We already knew there were inherently going to be some difficulties:

  • We have a poor BVH so we’re going to do way more iterations than needed.
  • Calling shaders as a result of hits is going to result in some divergence.

Furthermore, this also clearly shows some difficulties with how we approached the intersection instruction. Some advantages of the intersection instruction are that it avoids divergence when computing intersections if we have different node types in a subgroup, and that it is cheaper when only a few lanes are active. (A single CU can process one ray/node intersection per cycle, modulo memory latency, while it can process an ALU instruction on 64 lanes per cycle.)

However, even if it avoids the divergence in the intersection computation, we still introduce a ton of divergence when processing the results of the intersection. So we are still doing pretty badly here.

A fast GPU traversal stack needs some work too

Another thing to note is our traversal stack size. According to the Vulkan specification, a bottom level acceleration structure should support 2^24 - 1 triangles and a top level acceleration structure should support 2^24 - 1 bottom level structures. Combined with a tree with 4 children in each internal node, we can end up with a tree depth of about 24 levels.

In each internal node iteration of our loop we pop one element and push up to 4 elements, so at the deepest level of traversal we could end up with a 72-entry stack. Assuming these are 32-bit node identifiers, that ends up at 288 bytes of stack per lane, or ~18 KiB per 64-lane workgroup (the minimum which could possibly keep a CU busy with an ALU-only workload). Given that we have 64 KiB of LDS per CU (yes, I am using LDS since there is no divergent dynamic register addressing), that leaves only 3 workgroups per CU, leaving very few options for parallelism between different hardware execution units (e.g. the ALU and the texture unit that executes the ray intersections) or for latency hiding of memory operations.

So ideally we get this stack size down significantly.

Where do we go next?

The first step is to get CTS passing and to get an initial merge request into upstream Mesa. As a follow-on, I'd like to get a minimal prototype going for some DXR 1.0 games with vkd3d-proton, just to make sure we have the right feature coverage.

After that we’ll have to do all the traversal optimizations. I’ll probably implement a bunch of instrumentation so I actually have a clue about what to optimize. This is where having some runnable games really helps get the right idea about performance bottlenecks.

Finally, with some luck better shaders to build a BVH will materialize as well.

July 22, 2021

If you want to write an X application, you need to use some library that speaks the X11 protocol. For a long time this meant libX11, often called xlib, which - like most things about X - is a fantastic bit of engineering that is very much a product of its time with some confusing baroque bits. Overall it does a very nice job of hiding the icky details of the protocol from the application developer.

One of the details it hides has to do with how resource IDs are allocated in X. A resource ID (an XID, in the jargon) is a 29-bit integer (carried in a 32-bit value) that names a resource - window, colormap, what have you. Those 29 bits are split up netmask/hostmask style, where the top 8 or so uniquely identify the client, and the rest identify the resource belonging to that client. When you create a window in X, what you really tell the server is "I want a window that's initially this size, this background color (etc.) and from now on when I say (my client id + 17) I mean that window." This is great for performance because it means resource allocation is assumed to succeed and you don't have to wait for a reply from the server.

Key to all this is that in xlib the XID is the return value from the call that issues the resource creation request. Internally the request gets queued into the protocol's write buffer, but the client can march ahead and issue the next few commands as if creation had succeeded - because it probably did, and if it didn't you're probably going to crash anyway.

So to allocate XIDs the client just marches forward through its XID range. What happens when you hit the end of the range? Before X11R4, you'd crash, because xlib doesn't keep track of which XIDs it's allocated, just the lowest one it hasn't allocated yet. Starting in R4 the server added an extension called XC-MISC that lets the client ask the server for a list of unused XIDs, so when xlib hits the end of the range it can request a new range from the server.

But. UI programming tends to want threads, and xlib is perhaps not the most thread-friendly. So XCB was invented, which sacrifices some of xlib's ease of use for a more direct binding to the protocol and (in theory) an explicitly thread-safe design. We then modified xlib and XCB to coexist in the same process, using the same I/O buffers, reply and event management, etc.

This literal reflection of the protocol into the API has consequences. In XCB, unlike xlib, XID generation is an explicit step. The client first calls into XCB to allocate the XID, and then passes that XID to the creation request in order to give the resource a name.

Which... sorta ruins that whole thread-safety thing.

Let's say you call xcb_generate_id in thread A and the XID it returns is the last one in your range. Then thread B schedules in and tries to allocate another XID. You'll ask the server for a new range, but since thread A hasn't called its resource creation request yet, from the server's perspective that "allocated" XID looks like it's still free! So now, whichever thread issues their resource creation request second will get BadIDChoice thrown at them if the other thread's resource hasn't been destroyed in the interim.

A library that was supposed to be about thread safety baked a thread safety hazard into the API. Good work, team.

How do you fix this without changing the API? Maybe you could keep a bitmap on the client side that tracks XID allocation, that's only like 256KB worst case, you can grow it dynamically and most clients don't create more than a few dozen resources anyway. Make xcb_generate_id consult that bitmap for the first unallocated ID, and mark it used when it returns. Then track every resource destruction request and zero it back out of the bitmap. You'd only need XC-MISC if some other client destroyed one of your resources and you were completely out of XIDs otherwise.
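
As a rough illustration of that idea (a toy sketch, not actual libxcb code; the 2^21-entry range, the globals, and the lack of locking are simplifying assumptions):

#include <stdint.h>

#define XID_COUNT (1u << 21)            /* ~2M XIDs per client -> 256KB bitmap */

static uint32_t xid_base;               /* the client's resource-id base */
static uint8_t  xid_used[XID_COUNT / 8];

/* what a bitmap-backed xcb_generate_id could do internally */
static uint32_t alloc_xid(void)
{
    for (uint32_t i = 0; i < XID_COUNT; i++) {
        if (!(xid_used[i / 8] & (1u << (i % 8)))) {
            xid_used[i / 8] |= 1u << (i % 8);
            return xid_base | i;
        }
    }
    return UINT32_MAX;                  /* range exhausted: fall back to XC-MISC */
}

/* called whenever a destroy request for one of our resources is seen */
static void free_xid(uint32_t xid)
{
    uint32_t i = xid & (XID_COUNT - 1);
    xid_used[i / 8] &= ~(1u << (i % 8));
}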

And you can implement this, except. One, XCB has zero idea what a resource destruction request is, that's simply not in the protocol description. Not a big deal, you can fix that, there's only like forty destructors you'd need to annotate. But then two, that would only catch resource destruction calls that flow through XCB's protocol binding API, which xlib does not, xlib instead pushes raw data through xcb_writev. So now you need to modify every client library (libXext, libGL, ...) to inform XCB about resource destruction.

Which is doable. Tedious. But doable.

I think.

I feel a little weird writing about this because: surely I can't be the first person to notice this.

July 21, 2021

The last feature missing for llvmpipe to expose OpenGL 4.6 is anisotropic texture filtering. Adding support for this also allows lavapipe to expose the Vulkan samplerAnisotropy feature.

I started writing anisotropic support > 6 months ago. At the time we were trying to deprecate the classic swrast driver, and someone pointed out it had support for anisotropic filtering. This support had also been ported to the softpipe driver, but never to llvmpipe.

I had also considered porting swiftshader's anisotropic support, but since I was told the softpipe code was functional and had users, I based my llvmpipe port on that.

Porting the code to llvmpipe means rewriting it to generate LLVM IR using the llvmpipe vector processing code. This is a lot messier than just writing linear processing code, and when I thought I had it working it passed the GL CTS but failed the VK CTS. The results also looked, to my eye, worse than I'd have thought acceptable, and softpipe seemed just as bad.

Once I swung back around to this, I decided to port the VK CTS test to GL and run it on the softpipe and llvmpipe code. Initially llvmpipe had some more bugs to solve, especially where the mipmap levels were being chosen, but once I'd finished aligning softpipe and llvmpipe I started digging into why the softpipe code wasn't as nice as I expected.

The softpipe code was based on an implementation of an Elliptical Weighted Average Filter (EWA). The paper "Creating Raster Omnimax Images from Multiple Perspective Views Using the Elliptical Weighted Average Filter" described this. I sat down with the paper and softpipe code and eventually found the one line where they diverged.[1] This turned out to be a bug introduced in a refactoring 5 years ago, and nobody had noticed or tracked it down.

I then ported the same fix to my llvmpipe code, and VK CTS passes. I also optimized the llvmpipe code a bit to avoid doing pointless sampling and cleaned things up. This code landed in [2] today.

For GL 4.6 there are still some fixes needed in other areas.

[1] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11917

[2] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/8804

July 20, 2021

After a month of reverse-engineering, we’re excited to release documentation on the Valhall instruction set, available as a PDF. The findings are summarized in an XML architecture description for machine consumption. In tandem with the documentation, we’ve developed a Valhall assembler and disassembler as a reverse-engineering aid.

Valhall is the fourth Arm® Mali™ architecture and the fifth Mali instruction set. It is implemented in the Arm® Mali™-G78, the most recently released Mali hardware, and Valhall will continue to be implemented in Mali products yet to come.

Each architecture represents a paradigm shift from the last. Midgard generalizes the Utgard pixel processor to support compute shaders by unifying the shader stages, adding general purpose memory access, and supporting integers of various bit sizes. Bifrost scalarizes Midgard, transitioning away from the fixed 4-channel vector (vec4) architecture of Utgard and Midgard to instead rely on warp-based execution for parallelism, better using the hardware on modern workloads. Valhall linearizes Bifrost, removing the Very Long Instruction Word mechanisms of its predecessors. Valhall replaces the compiler’s static scheduling with hardware dynamic scheduling, trading additional control hardware for higher average performance. That means padding with “no operation” instructions is no longer required, which may decrease code size, promising better instruction cache use.

All information in this post and the linked PDF and XML is published in good faith and for general information purposes only. We do not make any warranties about the completeness, reliability and accuracy of this information. Any action you take upon the information you find here is strictly at your own risk. We will not be liable for any losses and/or damages in connection with the use of this information.

While we strive to make the information as accurate as possible, we make no claims, promises, or guarantees about its accuracy, completeness, or adequacy. We expressly disclaim liability for content, errors and omissions in this information.

Let’s dig in.

Getting started

In June, Collabora procured an International edition of the Samsung Galaxy S21 phone, powered by a system-on-chip with Mali G78. Although Arm announced Valhall with the Mali G77 in May 2019, roll out has been slow due to the COVID-19 pandemic. At the time of writing, there are not yet Linux friendly devices with a Valhall chip, forcing use of a locked down Android device. There’s a silver lining: we have a head start on the reverse-engineering, so by the time hacker-friendly devices arrive with Valhall GPUs, we can have open source drivers ready.

Android complicates reverse-engineering (though not as much as macOS). On Linux, we can compile a library on the device to intercept data sent to the GPU. On Android, we must cross-compile from a desktop with the Android Native Development Kit, ironically software that doesn’t run on Arm processors. Further, where on Linux we can track the standard system calls, Android device drivers replace the standard open() system call with a complicated Android-only “binder” interface. Adapting the library to support binder would be gnarly, but do we have to? We can sprinkle in one little hack to recover the file name anywhere we see a file descriptor.

#define MALI0 "/dev/mali0"

bool is_mali(int fd)
{
    char in[128] = { 0 }, out[128] = { 0 };
    snprintf(in, sizeof(in), "/proc/self/fd/%d", fd);

    int count = readlink(in, out, sizeof(out) - 1);
    return count == strlen(MALI0) && strncmp(out, MALI0, count) == 0;
}

Now we can hook the Mali ioctl() calls without tracing binder and easily dump graphics memory.

We’re interested in the new instruction set, so we’re looking for the compiled shader binaries in memory. There’s a chicken-and-egg problem: we need to find the shaders to reverse-engineer them, but we need to reverse-engineer the shaders to know what to look for. Fortunately, there’s an escape hatch. The proprietary Mali drivers allow an OpenGL application to query the compiled binary with the ARM_mali_program_binary extension, returning a file in the Mali Binary Shader format. That format was reverse-engineered years ago by Connor Abbott for earlier Mali architectures, and the basic structure is unchanged in Valhall. Our task is simple: compile a test shader, dump both GPU memory and the Mali Binary Shader, and find the common section. Searching for the common bytes produces an address in executable graphics memory, in this case 0x7f0002de00. Searching for that address in turn finds the “shader program descriptor” which references it.

18 00 00 80 00 10 00 00 00 DE 02 00 7F 00 00 00

Another search shows this descriptor’s address in the payload of an index-driven vertex shading job for graphics or a compute job for OpenCL. Those jobs contain the Job Manager header introduced a decade ago for Midgard, so we understand them well: they form a linked list of jobs, and only the first job is passed to the kernel. The kernel interface has a “job chain” parameter on the submit system call taking a GPU address. We understand the kernel interface well as it is open source due to kernel licensing requirements.

With each layer identified, we teach the wrapper library to chase the pointers and dump every shader executed, enabling us to reverse-engineer the new instruction set and develop a disassembler.

Instruction set reconnaissance

Reverse-engineering in the dark is possible, but it’s easier to have some light. While waiting for the Valhall phone to arrive, I read everything Arm made public about the instruction set, particularly this article from Anandtech. Without lifting a finger, that article tells us Valhall is…

  • Warp-based, like Bifrost, but with 16 threads per warp instead of Bifrost’s 4/8.
  • Isomorphic to Bifrost on the instruction level (“operational equivalence”).
  • Regularly encoded.
  • Flat, lacking Bifrost’s clause and tuple packaging.

It also says that Valhall has a 16KB instruction cache, holding 2048 instructions. Since Valhall has a regular encoding, we divide 16384 bytes by 2048 instructions to find a Valhall instruction is 8 bytes. Our first attempt at a “disassembler” can print hex dumps of every 8 bytes on a line; our calculation ensures that is the correct segmentation.

From here on, reverse-engineering is iterative. We have a baseline level of knowledge, and we want to grow that knowledge. To do so, we input test programs into the proprietary driver to observe the output, then perturb the input program to see how the output changes.

As we discover new facts about the architecture, we update our disassembler, demonstrating new knowledge and separating the known from the unknown. Ideally, we encode these facts in a machine-readable file forming a single reference for the architecture. From this file, we can generate a disassembler, an assembler, an instruction encoder, and documentation. For Valhall, I use an XML file, resembling Bifrost’s equivalent XML.

Filling out this file is usually straightforward though tedious. Modern APIs are large, so there is a great deal of effort required to map the API requirements to the hardware features.

However, some hardware features do not map to any API. Here are subtler tales from reversing Valhall.

Dependency slots

Arithmetic is faster than memory access, so modern processors execute arithmetic in parallel with pending memory accesses. Modern GPU architectures require the compiler to manage this mechanism by analyzing the program and instructing the hardware to wait for the results before they’re needed.

For this purpose, Bifrost uses an explicit scoreboarding system. Bifrost groups up to 16 instructions together in a clause, and each clause has a fixed header. The compiler assigns a “dependency slot” between 0 and 7 to each clause, specified in the header. Each clause can wait on any set of slots, specified with another 8-bits in the clause header. Specifying dependencies per-clause is a compromise between precision and code size.

We expect Valhall to feature a similar scheme, but Valhall doesn’t have clauses or clause headers, so where does it specify this info?

Studying compiled shaders, we see the last byte of every instruction is usually zero. But when the result of a memory access is first read, the previous instruction has a bit set in the last byte. Which bit is set depends on the number of memory accesses in flight, so it seems the last byte encodes a dependency wait. The memory access instructions themselves are often zero in their last bytes, so it doesn’t look like the last byte is used to encode the dependency slot – but executing many memory access instructions at once and comparing the bits, we see a single 2-bit field stands out as differing. The dependency slot is specified inside the instruction, not in the metadata.

What makes this design practical? Two factors.

One, only the waits need to be specified in general. Arithmetic instructions don’t need a dependency slot, since they complete immediately. The longest message-passing instructions are shorter than the longest arithmetic instructions, so there is space in the instruction itself to specify a dependency slot only where it is needed.

Two, the performance gain from adding extra slots levels off quickly. Valhall cuts back from Bifrost’s 8 slots (6 general purpose) to only 4 or 5 slots, with just 3 general purpose, saving 4 bits in every instruction.

This story exemplifies a general pattern: Valhall is a flattening of Bifrost. Alternatively, Bifrost is “Valhall with clauses”, although that description is an anachronism. Why does Bifrost have clauses, and why does Valhall remove them? The pattern in this story of dependency waits generalizes to answer the question: grouping many instructions into Bifrost clauses allows the hardware to amortize operations like dependency waits and reduce the hardware gate count of the shader core. However, clauses add substantial encoding overhead, compiler complexity, and imprecision. Bifrost optimizes for die space; Valhall optimizes for performance.

The missing modifier

Hardware features that are unused by the proprietary driver are a perennial challenge for reverse-engineering. However, we have a complete Bifrost reference at our disposal, and Valhall instructions are usually equivalent to Bifrost. Special instructions and modes from Bifrost cast a shadow on Valhall, showing where there are gaps in our knowledge. Sometimes these gaps are impractical to close, short of brute-forcing the encoding space. Other times we can transfer knowledge and make good guesses.

Consider the Cross Lane PERmute instruction, CLPER, which takes a register and the index of another lane in the warp, and returns the value of the register in the specified lane. CLPER is a “subgroup operation”, required for Vulkan and used to implement screen-space derivatives in fragment shaders. On Bifrost, the CLPER instruction is defined as:

<ins name="+CLPER.i32" mask="0xfc000" exact="0x7c000">
  <src start="0" mask="0x7"/>
  <src start="3"/>
  <mod name="lane_op" start="6" size="2">
    <opt>none</opt>
    <opt>xor</opt>
    <opt>accumulate</opt>
    <opt>shift</opt>
  </mod>
  <mod name="subgroup" start="8" size="2">
    <opt>subgroup2</opt>
    <opt>subgroup4</opt>
    <opt>subgroup8</opt>
  </mod>
  <mod name="inactive_result" start="10" size="4">
    <opt>zero</opt>
    <opt>umax</opt>
    ....
    <opt>v2infn</opt>
    <opt>v2inf</opt>
  </mod>
</ins>

We expect a similar definition for Valhall. One modification is needed: Valhall warps contain 16 threads, so there should be a subgroup16 option after subgroup8, with the natural binary encoding 11. Looking at a binary Valhall CLPER instruction, we see the bit pair 11 in the position corresponding to the subgroup field. Similarly, experimenting with different subgroup operations in OpenCL lets us figure out the lane_op field. We end up with an instruction definition like:

<ins name="CLPER.u32" title="Cross-lane permute" dests="1" opcode="0xA0" opcode2="0xF">
  <src/>
  <src widen="true"/>
  <subgroup/>
  <lane_op/>
</ins>

Notice we do not specify the encoding in the Valhall XML, since Valhall encoding is regular. Also notice we lack the inactive_result modifier. On Bifrost, inactive_result specifies the value returned if the program attempts to access an inactive lane. We may guess Valhall has the same mechanism, but that modifier is not directly controllable by current APIs. How do we proceed?

If we can run code on the device, we can experiment with the instruction. Inactive lanes may be caused by divergent control flow, where one lane in the warp branches but another lane does not, forcing the hardware to execute only part of the warp. After reverse-engineering Valhall’s branch instructions, we can construct a situation where a single lane is active and the rest are inactive. Then we insert a CLPER instruction with extra bits set, store the result to main memory, and print the result. This assembly program does the trick:

# Elect a single lane
BRANCHZ.reconverge.id lane_id, offset:3

# Try to read a value from an inactive thread
CLPER.u32 r0, r0, 0x01000000.b3, inactive_result:VALUE

# Store the value
STORE.i32.slot0.reconverge @r0, u0, offset:0

# End shader
NOP.return

With the assembler we’re writing, we can assemble this compute kernel. How do we run it on the device without knowing the GPU data structures required to dispatch compute shaders? We make use of another classic reverse-engineering technique: instead of writing the initialization code ourselves, piggyback off the proprietary driver. Our wrapper library allows us to access graphics memory before the driver submits work to the hardware. We use this to read the memory, but we may also modify it. We already identified the shader program descriptor, so we can inject our own shaders. From here, we can jury-rig a script to execute arbitrary shader binaries on the device in the context of an OpenCL application running under the proprietary driver.

Putting it together, we find the inactive_result bits in the CLPER encoding and write one more script to dump all values.

for ((i = 0 ; i < 16 ; i++)); do
  sed -e "s/VALUE/$i/" shader.asm | python3 asm.py shader.bin
  adb push shader.bin /data/local/tmp/
  adb shell 'REPLACE=/data/local/tmp/shader.bin '\
    'LD_PRELOAD=/data/local/tmp/panwrap.so '\
    '/data/local/tmp/test-opencl'
done

The script’s output contains sixteen possibilities – and they line up perfectly with Bifrost’s sixteen options. Success.

Next steps

There’s more to learn about Valhall, but we’ve reverse-engineered enough to develop a Valhall compiler. As Valhall is a simplification of Bifrost, and we’ve already developed a free and open source compiler for Bifrost, this task is within reach. Indeed, adapting the Bifrost compiler to Valhall will require refactoring but little new development.

Mali G78 does bring changes beyond the instruction set. The data structures are changed to reduce Vulkan driver overhead. For example, the monolithic “Renderer State Descriptor” on Bifrost is split into a “Shader Program Descriptor” and a “Depth Stencil Descriptor”, so changes to the depth/stencil state no longer require the driver to re-emit shader state. True, the changes require more reverse-engineering. Fortunately, many data structures are adapted from Bifrost requiring few changes to the Mesa driver.

Overall, supporting Valhall in Mesa is within reach. If you’re designing a Linux-friendly device with Valhall and looking for open source drivers, please reach out!

Originally posted on Collabora’s blog

July 09, 2021

It Happened.

ablend.png

That’s right.

Zink(-wip) now fully supports GL_KHR_blend_equation_advanced, which means ES 3.2 is a go (once my local CI clears me to push today’s snapshot).

And all it took was one brief exchange with a top Mesa reviewer who is incidentally rumored to be undergoing training deep in the mountains to become an expert BBQ master on the extremely professional #zink channel on OFTC:

That's the thing. You can totally do it in Zink.

My mind was blown.

Why hadn’t I thought of that sooner?

I could just…do it? Just like that? And then it’d be done?

Truly the experts are on a different level from us mortals.

So now it’s done, and that means zink is finished. I don’t expect there will be any more work to do now that the final boss has been defeated. Don’t even bother trying to file bug reports.

You may not like it, but this is what peak Friday looks like.

July 07, 2021

The Unsung Heroes

This is going to be less of a technical post and more of a have you thought about post from me personally (usual disclaimer: this post represents only my views). With that said, I think this is more important than the average post here, meaning that expectations should be set somewhere between I need to stop everything else I’m doing until I finish reading and this is the most important event in my life.

Let’s talk about open source. No, Open Source. The idea of it.

How Does Open Source Work?

Those of you who are veterans are rolling your eyes. Another post about the glory of Open Source.

The thing about Open Source is that it’s sort of whatever you make of it. At its core, it’s about getting people together to solve a problem—community building. Whether that community is large or small, the goal is the same: write some quality software.

To that end, you’ve got your usual corporate powerpoint slide of community roles:

  • maintainers
  • developers
  • reviewers
  • whatever other buzzwords are currently relevant

In Mesa, the maintainer and developer roles are mostly the same among core contributors: these are the people who write the code that gets posted about on all the news sites.

The reviewer is a bit more mysterious though. Who are reviewers, and what separates them from the others?

WD-40

Reviewers are the grease that makes the project work. There’s really no other way of saying it.

Outside of a few components of Mesa that are effectively the wild west, without any form of oversight or approval needed for changes to be landed, every driver and utility in the tree requires that changes undergo review before they land. This means that each and every patch which affects code or build has to have a person stop everything else they’re doing and physically scroll through each patch, line-by-line, then add a Reviewed-by or Acked-by tag.

If you’re unclear as to the meanings of these tags, consider it like you’re going skydiving with someone you’ve never met before who has been in charge of preparing your parachute:

  • Reviewed-by means “I triple-checked your parachute as well as your reserve, and I’m as certain as a human is capable of being that everything is how it should be”
  • Acked-by means “Hey, I grabbed this already-packed parachute off the hanger and gave it a once-over; you’ll probably be fine”

It’s then up to the developer to decide whether to merge the code based on the feedback given to them by the reviewer.

This, of course, assumes they get feedback at all.

Balance

Too often on news sites (and in certain corporate metrics) you’ll see something like “Patches McCodesAlot, working for GreatCodingCompany, authored the most code changes for this release cycle (9001 patches), which is over 100x more than the next highest contributor.”

The manager at a company sees this and thinks “I’ll send this up the chain. We should poach Patches so we can have greater control over this project which underpins our entire business strategy. Also it’ll make my powerpoint pie charts look rad.”

The casual reader sees this and says “Wow, Patches is awesome! Without Patches, I probably couldn’t even play Fororantwatch on my Linux gaming desktop!”

But how do the patches that Patches writes get merged into the release? Unless Patches works exclusively in one of the undermaintained areas of the project, in which case it’s unlikely that their work is being widely used, the odds are that someone’s pulling a huge lift on the review side to enable all of those patches landing into the repository.

This is the job of the reviewer.

A Thanks

As this Mesa release cycle starts to wind down, I hope that readers of this blog and news sites can take a moment to look past Patches McCodesAlot and see the people who make it possible for Patches to land so many damn patches.

At the time of this post, this is what the top 10 reviewers managed to accomplish over the past few months:

Number of Reviews   Reviewer Name        Corporate Affiliation
               91   Erik Faye-Lund       Collabora
               94   Samuel Pitoiset      Valve
               99   Alejandro Piñeiro    Igalia
              115   Kenneth Graunke      Intel
              116   Bas Nieuwenhuizen    Blogger
              121   Lionel Landwerlin    Intel
              128   Adam Jackson         Red Hat
              140   Marek Olšák          AMD
              176   Jason Ekstrand       Intel
              300   Dave Airlie          Red Hat

Summed up, that’s over 1300 patches reviewed! For perspective, that’s around 30% of all the patches in this release, and it’s about 70% of the total number of patches that zink has received in the course of its existence.

Looking at it another way though, this is over 1300 patches that other people wrote which were able to land because these people took the time to look over the proposed changes—to triple-check the parachutes, as it were.

So thanks, Mesa reviewers. The project wouldn’t exist without all of you (and your generous employers, who should be blasting these metrics in the press when they talk about being good Open Source citizens).

But Also

I’d be remiss if I didn’t also mention the people working on Mesa CI. There’s no patch counts or review counts or anything to recognize everyone hard at work here, but CI is what keeps the triangles blasting out of your GPUs looking how they should.

Thanks, CI team. You’re awesome.

According to a recent metric, the Mesa CI infrastructure only had a 0.6% accidental failure rate. That’s pretty good considering how many thousands of jobs run every day.

June 28, 2021

This year, I decided to participate as a speaker in the esLibre 2021 conference. esLibre is a Spanish free software conference that covers a lot of different topics related to open-source projects: from the technical point of view to their social impact.

This year the conference had talks about game development with Godot, KDE, LibreOffice, Free Software in Universities among many others. Check out the program.

esLibre 2021

This is my first time participating in this conference and I enjoyed it a lot. Huge applause to the organization team for the huge amount of work they put into this edition, for helping out the speakers with different testing days, and for their kindness in replying to any question from me and other attendees. They did a superb job!

My talk was an introduction to Mesa where I covered things like where Mesa sits in the open-source graphics stack, a summary of what it does, the drivers implemented in Mesa, how our community is organized, and how to contribute to it. If you know Spanish, you can check it out here (PDF). But in case you want an English version of it, this talk is very similar to the one I gave at Ubucon Europe 2018.

My esLibre talk was recorded as well! Edit: the recording is already available here.

Enjoy it!

Introduction to Mesa

June 23, 2021

BREAKING: THIS IS NO LONGER A ZINK BLOG

For today, at least.

Today, this blog is a Gallium blog. And it’s a momentous day indeed.

We all know what this is:

portal2-title.png

It’s a screenshot of Portal 2 with the Gallium HUD activated and VSync disabled.

But what driver is that underneath?

Well, for today’s blog it’s RadeonSI, the reference implementation of a Gallium driver.

And why is this, I can hear you all asking.

What if I told you that this screenshot with 10% higher FPS is also Portal 2 with VSync disabled on RadeonSI using one trick that graphics developers WON’T TELL YOU:

portal2-nine-title.png

Interested?

Coming Soon (Maybe, And Also Maybe Requiring Some Early 2000s-era Elbow Grease From Anyone Wanting To Try): Native Linux Source Games On Gallium Nine

We did it.

By assembling an elite team of individuals with a few minutes to spare here and there over the past week, including:

  • Josh Ashton, expert spammer of 🐸 emojis
  • Axel Davy, expert struct packer
  • Me, expert blogger

Is Such A Thing Even Possible? Why Yes, Yes It Is.

it is now (technically) possible to run DXVK-compatible Source Engine games through Gallium’s Nine state tracker, providing a native D3D9 runtime.

Is your Portal 2 in-game FPS sad and barely even 500 like this screenshot?

portal2-ingame.png

Why not jack it up to more than TWICE THAT NUMBER* with riced out, GPU-fan-shredding technology that Mesa Gallium drivers have been shipping for years?

portal2-nine-ingame.png

Disclaimer*

This post does not represent any form of official statement or address from Valve; it describes only a small project that was started out of boredom while I waited for CTS runs to finish.

This post also does not make any claims or statements regarding performance on other drivers, or performance comparisons using alternative graphics emulation layers, though whew, it sure would be interesting to see what those kinds of numbers look like!

June 21, 2021

We Did It

After months and months of the construction crews hammering away, VK_EXT_multi_draw has now been released for general use.

Will this suddenly make zink the fastest GPU driver in history?

Obviously.

Long-time readers will recall that I memed about this extension some time ago, and the numbers in a synthetic benchmark targeted at exactly this feature are phenomenal.

For more on the topic, we go to our Senior Multidraw Correspondent and my personal Khronos BFF, Ricardo Garcia, who has been following this story since the beginning.

June 18, 2021

Fast Friday

In short, an issue was filed recently about getting the Nine state tracker working with zink.

Was it the first? No..

Was it the first one this year? Yes.

Thus began a minutes-long, helter-skelter sequence of events to get Nine up and running, spread out over the course of a day or two. In need of a skilled finagler knowledgeable in the mysterium of Gallium state trackers, I contacted the only developer I know with a rockstar name, Axel Davy. We set out at dawn, and I strapped on my parachute. It was almost immediately that I heard a familiar call: there’s a build issue.

Next stop was crashing in unimplemented interface methods, with a stopover in flailing about wildly in TGSI (what even is this), before I arrived at my target:

nine.png

Ah, glorious triangles.

June 16, 2021

I Said I Would

A long, long time ago in a month far, far away I said I was going to blog about some improvements I’d been working on for zink. I blogged about some of them, but one was conspicuously absent from the original list:

  • make zink usable for gaming

There’s a lot that goes into this item. The post you’re reading now isn’t about to go so far as to claim that zink(-wip) is usable for gaming. No, that day is still far, far away. But this post is going to be the first step.

To begin with, a riddle: what change was made to zink between these two screenshots?

tr-slow.png

tr-zoom.png

That’s right, I put the punchline in the title.

A suballocator.

What Is A Suballocator?

A suballocator is a mechanism by which small blocks of memory can be suballocated out of a larger one. For example, if I want to allocate a 64-byte chunk of memory, I could allocate it directly and get my block, or I could allocate a 4096-byte chunk of memory and then take 64 bytes out of it.
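
As a toy illustration of the idea (this is not zink's actual code, which reuses Gallium's aux/pipebuffer as described below): carve fixed 64-byte slots out of a single 4096-byte slab that is allocated and mapped only once.

#include <stdint.h>

struct slab {
    uint8_t *mem;        /* one 4096-byte allocation, mapped a single time */
    uint64_t free_mask;  /* 4096 / 64 = 64 slots, one bit each; init to ~0ull */
};

static void *slab_suballoc(struct slab *s)
{
    if (!s->free_mask)
        return NULL;                       /* slab full: grab a new slab */
    int slot = __builtin_ctzll(s->free_mask);
    s->free_mask &= ~(1ull << slot);
    return s->mem + slot * 64;
}

static void slab_subfree(struct slab *s, void *ptr)
{
    int slot = (int)(((uint8_t *)ptr - s->mem) / 64);
    s->free_mask |= 1ull << slot;
}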

When performance is involved, it’s important to consider the time-cost of allocations, and so yes, it’s useful to have already allocated another 63 instances of 64 bytes when I need a second one, but there’s another, deeper issue that’s also necessary to address, especially as it relates to gaming: 32bit environments.

In a 32bit process, the amount of address space available is limited to 4GB, regardless of how much actual memory is physically present, some of which is dedicated to system resources and unavailable for general use. Any time a buffer or image is mapped by the driver in a process, this uses up address space in order to create an addressable region of memory that can be read or written to. Once all the address space has been used up, no other resources can be mapped, and it becomes impossible to continue normal operations.

In short, the game crashes.

In Vulkan, and just generally in driver work, it’s important to keep allocation sizes aligned to the preference of the hardware for a given usage; this amounts to minMemoryMapAlignment, which is 4096 bytes on many drivers. Similarly, vkGetBufferMemoryRequirements and vkGetImageMemoryRequirements return aligned memory sizes, so even if only 64 bytes are needed, 4096 bytes must still be allocated, leaving 4032 bytes unused. This ends up wasting tons of memory when an app is allocating lots of smaller regions, and it further wastes address space since Vulkan prohibits memory from being mapped multiple times, meaning that each 64-byte buffer wastes an additional 4032 bytes of address space.

While 4k of memory may seem like a small amount, and why would anyone ever need more than 256kb memory anyway, these allocations all add up, fast enough that zink runs out of address space in a 32bit game like Tomb Raider within a couple minutes.

Playable?

Probably not.

The Solution, As Always

If you’re working in Mesa, you basically have two options when you come across a new problem: delete some code or copy some code. It’s not often that I come across an issue which can’t be resolved by one of the two.

In this case, I had known for quite a while that the solution was going to be copying some code. Thus I entered the realm of Gallium’s awesome auxiliary/pipebuffer, a fearsome component that had only been leveraged by one driver.

zink_bo.png

Yup, it was time to throw more galaxybrain.jpg code into the blender and see what came out. Ultimately, I was able to repurpose a lot of the core calculation code for sizing allocations, which saved me from having to do any kind of thinking or maffs. This let me cut down my suballocator implementation to a little under 700 lines, leaving much, much, much more space for bugs (I mean activities).

At a high level, here’s an overview of aux/pb:

  • call pb_cache_init to set up a memory cache
  • initialize slab allocators with pb_slabs_init
  • when allocating a new resource, determine if it can be slab allocated; if yes, use pb_slab_alloc to reuse/reclaim a slab allocation, otherwise manually allocate new memory
  • when destroying a resource, use pb_reference_with_winsys

There’s more under the hood, but it mostly boils down to filling in the interface functions to manage detecting whether resources are busy or can be reclaimed for reuse. The actual caching/reclaiming/reusing are all handled by aux/pb, meaning I was free to go about breaking everything with all the leftover time that I had.

Cultured users of zink-wip can now enjoy massively improved performance (and have already been enjoying it for the past month) in many apps. The rest of you get to sit around and watch while I bang my head against CI while ajax showers me with memes.

June 14, 2021

There’s a lot that has happened in the world of Zink since my last update, so let’s see if I can bring you up to date on the most important stuff.

Upstream development

Gosh, when I last blogged about Zink, it hadn’t even landed upstream in Mesa yet! Well, by now it’s been upstream for quite a while, and most development has moved there.

At the time of writing, we have merged 606 merge-requests labeled “zink”. The current tip of Mesa’s main branch totals 1717 commits touching the src/gallium/drivers/zink/ sub-folder, written by 42 different contributors. That’s pretty awesome in my eyes; Zink has truly become a community project!

Another noteworthy change is that Mike Blumenkrantz has come aboard the project, and has churned out an incredible number of improvements to Zink! He got hired by Valve to work on Zink (among other things), and is now the most prolific contributor, with more than twice as many commits as I have written.

If you want a job in Open Source graphics, Zink has a proven track-record as a job-creator! :smile:

In addition to Mike, there’s some other awesome people who have been helping out lately.

Half-Life 2 running with Zink.

OpenGL 4.6 support

Thanks to a lot of hard work by Mike, assisted by Dave Airlie and Adam Jackson, both of Red Hat, Zink is now able to expose the OpenGL 4.6 (Core Profile) feature set, given enough Vulkan features! :tada:

Please note that this doesn’t mean that Zink is a conformant implementation yet; there are some details left to be ironed out before we can claim that. In particular, we need to pass the conformance tests and submit a conformance report to Khronos. We’re not there yet.

I’m also happy to see that Zink is currently at the top of MesaMatrix (together with LLVMpipe, i965 and RadeonSI), reporting a total of 160 OpenGL extensions at the time of writing!

In theory, that means you can run any OpenGL application you can think of on top of Zink. Mike is hard at work testing the entire Steam game library, and things are working pretty well.

Is this the end of the line for Zink? Are we done now? Not at all! :laughing:

OpenGL compatibility profile

We’re still stuck at OpenGL 3.0 for compatibility contexts, mainly due to lack of testing. There are a lot of features that need to work together in relatively complicated ways for this to work for us.

Note that this only matters for applications that rely on legacy OpenGL features. Modern OpenGL programs get OpenGL 4.6 support, as mentioned previously.

I don’t think this is going to be a big deal to enable, but I haven’t spent time on it.

OpenGL ES 3.1 support

Similar to the OpenGL 4.6 support, we’re now able to expose the OpenGL ES 3.1 feature set. This is again thanks to a lot of hard work by Mike and the gang.

Why not OpenGL ES 3.2? This comes down to the GL_KHR_blend_equation_advanced feature. Mike blogged about the issue a while ago.

Lavapipe and continuous integration

To prevent regressions, we’ve started testing Zink on the Mesa CI system for every change. This is made possible thanks to Lavapipe, a Vulkan software implementation in Mesa that reuses the rasterizer from LLVMpipe.

This means we can run tests on virtual cloud machines without having to depend on unreliable hardware. :robot:

At the time of writing, we’re only exposing OpenGL 4.1 on top of Lavapipe, due to some lacking features. But we have patches in the works to bring this up to OpenGL 4.5, and OpenGL 4.6 probably won’t be far off when that lands.

Windows support

Basic support for Zink on Microsoft Windows has landed. This isn’t particularly useful at the moment, because we need better window-system integration to get anywhere near reasonable performance. But it’s there.

macOS support

Thanks to work by Duncan Hopkins of The Foundry, there’s also some support for macOS. This uses MoltenVK as the Vulkan implementation, meaning that we also support the Vulkan Portability Extension to some degree.

This support isn’t quite as drop-in as on other platforms, because it’s completely lacking window-system integration. But it seems to work for the use-cases they have at The Foundry, so it’s worth mentioning as well.

Driver support

Beyond this, Igalia has brought up Zink on the V3DV driver, and I’ve heard some whispers that there’s some people running Zink on top of Turnip, an open-source Vulkan driver for recent Qualcomm Adreno GPUs.

I’ve heard some people have some success getting things running on NVIDIA, but there’s a few obvious problems in the way there due to the lack of proper DRI support… Which brings us to:

Window System Integration

Another awesome new development is that Adam is working on Penny. So, what’s Penny?

Penny is another way of bringing up Zink, on systems without DRI support. It works as a dedicated GLX integration that uses the VK_KHR_swapchain extension to integrate properly with the native Vulkan driver’s window-system integration instead of Mesa baking its own.

This solves a lot of small, nasty issues in the DRI code-path. I’ll say the magic “implicit synchronization” word, and hope that scares away anyone wondering what it’s about.

Performance

A lot more has happened on the performance front as well, again all thanks to Mike. However, much of this is still out-of-tree, and waiting in Mike’s zink-wip branch.

So instead, I suggest you check out Mike’s blog for the latest performance information (and much more up-to-date info on Zink). There’s been a lot going on, and I’m sure there’s even more to come!

Closing words

I think this should cover the most interesting bits of development.

On a personal note, I recently became a dad for the first time, and as a result I’ll be away for a while on paternity leave, starting early this fall. Luckily, Zink is in good hands with Mike and the rest of the upstream community taking care of things.

I would like to again plug Mike’s blog as a great source of Zink-related news, if you’re not already following it. He posts a lot more frequently than I do, and he’s also an epic meme master, so it’s all great fun!

(Added a section on load/store pairs on June 14th)

This question probably seems absurd. An unoptimized memcpy is a simple loop that copies bytes. How hard can that be? Well...

There's a fascinating thread on llvm-dev started by George Mitenkov proposing a new family of "byte" types. I found the proposal and discussion difficult to follow. In my humble opinion, this is because the proposal touches some rather subtle and underspecified aspects of LLVM IR semantics, and rather than address those fundamentals systematically, it jumps right into the minutiae of the instruction set. I look forward to seeing how the proposal evolves. In the meantime, this article is a byproduct of me attempting to digest the problem space.

Here is a fairly natural way to (attempt to) implement memcpy in LLVM IR:

define void @memcpy(i8* %dst, i8* %src, i64 %n) {
entry:
  %dst.end = getelementptr i8, i8* %dst, i64 %n
  %isempty = icmp eq i64 %n, 0
  br i1 %isempty, label %out, label %loop

loop:
  %src.loop = phi i8* [ %src, %entry ], [ %src.next, %loop ]
  %dst.loop = phi i8* [ %dst, %entry ], [ %dst.next, %loop ]
  %ch = load i8, i8* %src.loop
  store i8 %ch, i8* %dst.loop
  %src.next = getelementptr i8, i8* %src.loop, i64 1
  %dst.next = getelementptr i8, i8* %dst.loop, i64 1
  %done = icmp eq i8* %dst.next, %dst.end
  br i1 %done, label %out, label %loop

out:
  ret void
}

Unfortunately, the copy that is written to the destination is not a perfect copy of the source.

Hold on, I hear you think, each byte of memory holds one of 256 possible bit patterns, and this bit pattern is perfectly copied by the `load`/`store` sequence! The catch is that in LLVM's model of execution, a byte of memory can in fact hold more than just one of those 256 values. For example, a byte of memory can be poison, which means that there are at least 257 possible values. Poison is forwarded perfectly by the code above, so that's fine. The trouble starts because of pointer provenance.


What and why is pointer provenance?

From a machine perspective, a pointer is just an integer that is interpreted as a memory address.

For the compiler, alias analysis -- that is, the ability to prove that different pointers point at different memory addresses -- is crucial for optimization. One basic tool in the alias analysis toolbox is to recognize that if pointers point into different "memory objects" -- different stack or heap allocations -- then they cannot alias.

Unfortunately, many pointers are obtained via getelementptr (GEP) using dynamic (non-constant) indices. These dynamic indices could be such that the resulting pointer points into a different memory object than the base pointer. This makes it nearly impossible to determine at compile time whether two pointers point into the same memory object or not.

Which is why there is a rule which says (among other things) that if a pointer P obtained via GEP ends up going out-of-bounds and pointing into a different memory object than the pointer on which the GEP was based, then dereferencing P is undefined behavior even though the pointer's memory address is valid from the machine perspective.

As a corollary, a situation is possible in which there are two pointers whose underlying memory address is identical but whose provenance is different. In that case, it's possible that one of them can be dereferenced while dereferencing the other is undefined behavior.
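
For intuition, here is the textbook C-level version of that situation (a sketch for illustration, not taken from the llvm-dev thread): a one-past-the-end pointer may share its machine address with an unrelated object, yet only one of the two pointers may legally be dereferenced.

#include <stdio.h>

int main(void)
{
    int a[1] = {1}, b[1] = {2};

    int *p = a + 1;   /* one-past-the-end of a: valid to form, not to dereference */
    int *q = b;       /* may happen to have the exact same address as p */

    if (p == q) {
        /* same address, different provenance: *q is fine,
         * while *p would be undefined behavior */
        printf("%d\n", *q);
    }
    return 0;
}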

This only makes sense if, in the formal semantics of LLVM IR, pointer values carry more information than just an integer interpreted as a memory address. They also carry provenance information, which is essentially the set of memory objects that can be accessed via this pointer and any pointers derived from it.


Bytes in memory carry provenance information

What is the provenance of a pointer that results from a load instruction? In a clean operational semantics, the load must derive this provenance from the values stored in memory.

If bytes of memory can only hold one of 256 bit patterns (or poison), that doesn't give us much to work with. We could say that the provenance of the pointer is "empty", meaning the pointer cannot be used to access any memory objects -- but that's clearly useless. Or we could say that the provenance of the pointer is "all", meaning the pointer (or pointers derived from it) can be freely used to access all memory objects, assuming the underlying address is adjusted appropriately. That isn't much better.[0]

Instead, we must say that -- as far as LLVM IR semantics are concerned -- each byte of memory holds pointer provenance information in addition to its i8 content. The provenance information in memory is written by pointer store, and pointer load uses it to reconstruct the original provenance of the loaded pointer.

What happens to provenance information in non-pointer load/store? A load can simply ignore the additional information in memory. For store, I see 3 possible choices:

1. Leave the provenance information that already happens to be in memory unmodified.
2. Set the provenance to "empty".
3. Set the provenance to "all".

Looking back at our attempt to implement memcpy, there is no choice which results in a perfect copy. All of the choices lose provenance information.

Without major changes to LLVM IR, only the last choice is potentially viable because it is the only choice that allows dereferencing pointers that are loaded from the memcpy destination.

Should we care about losing provenance information?

Without major changes to LLVM IR, we can only implement a memcpy that loses provenance information during the copy.

So what? Alias analysis around memcpy and code like it ends up being conservative, but reasonable people can argue that this doesn't matter. The burden of evidence lies on whoever wants to make a large change here in order to improve alias analysis.

That said, we cannot just call it a day and go (or stay) home either, because there are related correctness issues in LLVM today, e.g. bug 37469 mentioned in the initial email of that llvm-dev thread.

Here's a simpler example of a correctness issue using our hand-coded memcpy:

define i32 @sample(i32** %pp) {
  %tmp = alloca i32*
  %pp.8 = bitcast i32** %pp to i8*
  %tmp.8 = bitcast i32** %tmp to i8*
  call void @memcpy(i8* %tmp.8, i8* %pp.8, i64 8)
  %p = load i32*, i32** %tmp
  %x = load i32, i32* %p
  ret i32 %x
}

A transform that should be possible is to eliminate the memcpy and temporary allocation:

define i32 @sample(i32** %pp) {
  %p = load i32*, i32** %pp
  %x = load i32, i32* %p
  ret i32 %x
}

This transform is incorrect because it introduces undefined behavior.

To see why, remember that this is the world where we agree that integer stores write an "all" provenance to memory, so %p in the original program has "all" provenance. In the transformed program, this may no longer be the case. If @sample is called with a pointer that was obtained through an out-of-bounds GEP whose resulting address just happens to fall into a different memory object, then the transformed program has undefined behavior where the original program didn't.

We could fix this correctness issue by introducing an unrestrict instruction which elevates a pointer's provenance to the "all" provenance:

define i32 @sample(i32** %pp) {
  %p = load i32*, i32** %pp
  %q = unrestrict i32* %p
  %x = load i32, i32* %q
  ret i32 %x
}

Here, %q has "all" provenance and therefore no undefined behavior is introduced.

I believe that (at least for address spaces that are well-behaved?) it would be correct to fold inttoptr(ptrtoint(x)) to unrestrict(x). The two are really the same.

For that reason, unrestrict could also be used to fix the above-mentioned bug 37469. Several folks in the bug's discussion stated the opinion that the bug is caused by incorrect store forwarding that should be weakened via inttoptr(ptrtoint(x)). unrestrict(x) is simply a clearer spelling of the same idea.


A dead end: integers cannot have provenance information

A natural thought at this point is that the situation could be improved by adding provenance information to integers. This is technically correct: our hand-coded memcpy would then produce a perfect copy of the memory contents.

However, we would get into serious trouble elsewhere because global value numbering (GVN) and similar transforms become incorrect: two integers could compare equal using the icmp instruction, but still be different because of different provenance. Replacing one by the other could result in miscompilation.

GVN is important enough that adding provenance information to integers is a no-go.
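
To make this concrete, here is a small C-level illustration (my own example, not from the proposal; whether the two addresses actually compare equal is implementation-dependent):

#include <stdint.h>
#include <stdio.h>

int main(void) {
   int x = 1, y = 2;
   /* &x + 1 is a valid one-past-the-end pointer; its address may happen
    * to equal &y, but its provenance is x (or empty), never y. */
   uintptr_t ix = (uintptr_t)(&x + 1);
   uintptr_t iy = (uintptr_t)&y;

   if (ix == iy) {
      /* If integers carried provenance, ix and iy would compare equal here
       * yet still be different values: casting iy back to a pointer may be
       * used to access y, casting ix back may not. A GVN-style replacement
       * of iy by ix would therefore change the program's behavior. */
      *(int *)iy = 3;
   }
   printf("%d %d\n", x, y);
   return 0;
}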

I suspect that the unrestrict instruction would allow us to apply GVN to pointers, at the cost of making later alias analysis more conservative and sprinkling unrestrict instructions that may inhibit other transforms. I have no idea what the trade-off is on that.


The "byte" types: accurate representation of memory contents

With all the above in mind, I can see the first-principles appeal of the proposed "byte" types. They allow us to represent the contents of memory accurately in SSA values, and so they fill a real gap in the expressiveness of LLVM IR.

That said, the software development cost of adding a whole new family of types to LLVM is very high, so it better be justified by more than just aesthetics.

Our hand-coded memcpy can be turned into a perfect copier with straightforward replacement of i8 by b8:

define void @memcpy(b8* %dst, b8* %src, i64 %n) {
entry:
  %dst.end = getelementptr b8, b8* %dst, i64 %n
  %isempty = icmp eq i64 %n, 0
  br i1 %isempty, label %out, label %loop

loop:
  %src.loop = phi b8* [ %src, %entry ], [ %src.next, %loop ]
  %dst.loop = phi b8* [ %dst, %entry ], [ %dst.next, %loop ]
  %ch = load b8, b8* %src.loop
  store b8 %ch, b8* %dst.loop
  %src.next = getelementptr b8, b8* %src.loop, i64 1
  %dst.next = getelementptr b8, b8* %dst.loop, i64 1
  %done = icmp eq b8* %dst.next, %dst.end
  br i1 %done, label %out, label %loop

out:
  ret void
}

Looking at the concrete choices made in the proposal, I disagree with some of them.

Memory should not be typed. In the proposal, storing an integer always results in different memory contents than storing a pointer (regardless of its provenance), and implicitly trying to mix pointers and integers is declared to be undefined behavior. In other words, a sequence such as:

store i64 %x, i64* %p
%q = bitcast i64* %p to i8**
%y = load i8*, i8** %q

... is undefined behavior under the proposal instead of being effectively inttoptr(%x). That seems fine for C/C++, but is it going to be fine for other frontends?

The corresponding distinction between bytes-as-integers and bytes-as-pointers complicates the proposal overall, e.g. it forces them to add a bytecast instruction.

Conversely, the benefits of the distinction are unclear to me. One benefit appears to be guaranteed non-aliasing between pointer and non-pointer memory accesses, but that is a form of type-based alias analysis which in LLVM should idiomatically be done via TBAA metadata. (Update: see the addendum for another potential argument in favor of typed memory.)

So let's keep memory untyped, please.

Bitwise poison in byte values makes me really nervous due to the arbitrary deviation from how poison works in other types. I don't see any justification for it in the proposal. I can kind of see how one could be motivated by implementing memcpy with vector intrinsics operating on, for example, <8 x b32>, but a simpler solution would be to just use <32 x b8> instead. And if poison is indeed bitwise, then certainly pointer provenance would also have to be bitwise!

Finally, no design discussion is complete without a little bit of bike-shedding. I believe the name "byte" is inspired by C++'s std::byte, but given that types such as b256 are possible, this name would forever be a source of confusion. Naming is hard, and I think we should at least try to look for a better one. Let me kick off the brainstorming by suggesting we think of them as "memory content" values, because that's what they are. The types could be spelled m8, m32, etc. in IR assembly.

A variation: adding a pointer provenance type

In the llvm-dev thread, Jeroen Dobbelaere points out work being done to introduce explicit `ptr_provenance` operands on certain instructions, in service of C99's restrict keyword. I haven't properly digested this work, but it inspired the thoughts of this section.

Values of the proposed byte types have both a bit pattern and a pointer provenance. Do we really need to have both pieces of information in the same SSA value? We could instead split them up into an integer bit pattern value and a pointer provenance value with an explicit provenance type. Loads of integers could read out the provenance information stored in memory and provide it as a secondary result. Similarly, stores of integers could accept the desired provenance to be stored in memory as a secondary data operand. This would allow us to write a perfect memcpy by replacing the core load/store sequence with something like:

%ch, %provenance = load_with_provenance i8, i8* %src
store_with_provenance i8 %ch, provenance %provenance, i8* %dst

The syntax and instruction names in the example are very much straw men. Don't take them too seriously, especially because LLVM IR doesn't currently allow multiple result values.

Interestingly, this split allows the derivation of pointer provenance to follow a different path than the calculation of the pointer's bit pattern. This in turn allows us in principle to perform GVN on pointers without being conservative for alias analysis.

One of the steps in bug 37469 is not quite GVN, but morally similar. Simplifying a lot, the original program sequence:

%ch1 = load i8, i8* %p1
%ch2 = load i8, i8* %p2
%eq = icmp eq i8 %ch1, %ch2
%ch = select i1 %eq, i8 %ch1, i8 %ch2
store i8 %ch, i8* %p3

... is transformed into:

%ch2 = load i8, i8* %p2
store i8 %ch2, i8* %p3

This is correct for the bit patterns being loaded and stored, but the program also indirectly relies on pointer provenance of the data. Of course, there is no pointer provenance information being copied here because i8 only holds a bit pattern. However, with the "byte" proposal, all the i8s would be replaced by b8s, and then the transform becomes incorrect because it changes the provenance information.

If we split the proposed use of b8 into a use of i8 and explicit provenance values, the original program becomes:

%ch1, %prov1 = load_with_provenance i8, i8* %p1
%ch2, %prov2 = load_with_provenance i8, i8* %p2
%eq = icmp eq i8 %ch1, %ch2
%ch = select i1 %eq, i8 %ch1, i8 %ch2
%prov = select i1 %eq, provenance %prov1, provenance %prov2
store_with_provenance i8 %ch, provenance %prov, i8* %p3

This could be transformed into something like:

%prov1 = load_only_provenance i8* %p1
%ch2, %prov2 = load_with_provenance i8, i8* %p2
%prov = merge provenance %prov1, %prov2
store_with_provenance i8 %ch2, provenance %prov, i8* %p3

... which is just as good for code generation but loses only very little provenance information.

Aside: loop idioms

Without major changes to LLVM IR, a perfect memcpy cannot be implemented because pointer provenance information is lost.

Nevertheless, one could still define the @llvm.memcpy intrinsic to be a perfect copy. This helps memcpys in the original source program be less conservative in terms of alias analysis. However, it also makes it incorrect to replace a memcpy loop idiom with a use of @llvm.memcpy: without adding unrestrict instructions, the replacement may introduce undefined behavior; and there is no way to bound the locations where such unrestricts may be needed.

We could augment @llvm.memcpy with an immediate argument that selects its provenance behavior.

In any case, one can argue that bug 37469 is really a bug in the loop idiom recognizer. It boils down to the details of how everything is defined, and unfortunately, these weird corner cases are currently underspecified in the LangRef.

Conclusion

We started with the question of whether memcpy can be implemented in LLVM IR. The answer is a qualified Yes. It is possible, but the resulting copy is imperfect because pointer provenance information is lost. This has surprising implications which in turn happen to cause real miscompilation bugs -- although those bugs could be fixed even without a perfect memcpy.

The "byte" proposal has a certain aesthetic appeal because it fixes a real gap in the expressiveness of LLVM IR, but its software engineering cost is large and I object to some of its details. There are also alternatives to consider.

The miscompilation bugs obviously need to be fixed, but they can be fixed much less intrusively, albeit at the cost of more conservative alias analysis in the affected places. It is not clear to me whether improving alias analysis justifies the more complex solutions.

I would like to understand better how all of this interacts with the C99 restrict work. That work introduces mechanisms for explicitly talking about pointer provenance in the IR, which may allow us to kill two birds with one stone.

In any case, this is a fascinating topic and discussion, and I feel like we're only at the beginning.


Addendum: storing back previously loaded integers

(Added this section on June 14th)

Harald van Dijk on Phabricator and Ralf Jung on llvm-dev, referring to a Rust issue, explicitly and implicitly point out a curious issue with loading and storing integers.

Here is Harald's example:

define i8* @f(i8* %p) {
  %buf = alloca i8*
  %buf.i32 = bitcast i8** %buf to i32*
  store i8* %p, i8** %buf
  %i = load i32, i32* %buf.i32
  store i32 %i, i32* %buf.i32
  %q = load i8*, i8** %buf
  ret i8* %q
}

There is a pair of load/store of i32 which is fully redundant from a machine perspective and so we'd like to optimize that away, after which it becomes obvious that the function really just returns %p -- at least as far as bit patterns are concerned.

However, in a world where memory is untyped but has provenance information, this optimization is incorrect because it can introduce undefined behavior: the load/store of i32 resets the provenance information in memory to "all", so that the original function returns an unrestricted version of %p. This is no longer the case after the optimization.

There are at least two possible ways of resolving this conflict.

We could define memory to be typed, in the sense that each byte of memory remembers whether it was most recently stored as a pointer or a non-pointer. A load with the wrong type returns poison. In that case, the example above returns poison before the optimization (because %i is guaranteed to be poison). After the optimization it returns non-poison, which is an acceptable refinement, so the optimization is correct.

The alternative is to keep memory untyped and say that directly eliminating the i32 store in the example is incorrect.

We are facing a tradeoff that depends on how important that optimization is for performance.

Two observations to that end. First, the more common case of dead store elimination is one where there are multiple stores to the same address in a row, and we remove all but the last one of them. That more common optimization is unaffected by provenance issues either way.

Second, we can still perform store forwarding / peephole optimization across such load/store pairs, as long as we are careful to introduce unrestrict where needed. The example above can be optimized via store forwarding to:

define i8* @f(i8* %p) {
  %buf = alloca i8*
  %buf.i32 = bitcast i8** %buf to i32*
  store i8* %p, i8** %buf
  %i = load i32, i32* %buf.i32
  store i32 %i, i32* %buf.i32
  %q = unrestrict i8* %p
  ret i8* %q
}

We can then dead-code eliminate the bulk of the function and obtain:

define i8* @f(i8* %p) {
  %q = unrestrict i8* %p
  ret i8* %q
}

... which is as good as it can possibly get.

So, there is a good chance that preventing this particular optimization is relatively cheap in terms of code quality, and the gain in overall design simplicity may well be worth it.




[0] We could also say that the loaded pointer's provenance is magically the memory object that happens to be at the referenced memory address. Either way, provenance would become a useless no-op in most cases. For example, mem2reg would have to insert unrestrict instructions (defined later) everywhere because pointers become effectively "unrestricted" when loaded from alloca'd memory.

June 13, 2021

In an earlier article I showed how reading from VRAM with the CPU can be very slow. However, it turns out that there are ways to make it less slow.

The key to this is instructions with non-temporal hints, in particular VMOVNTDQA. The Intel Instruction Manual says the following about this instruction:

“MOVNTDQA loads a double quadword from the source operand (second operand) to the destination operand (first operand) using a non-temporal hint if the memory source is WC (write combining) memory type. For WC memory type, the nontemporal hint may be implemented by loading a temporary internal buffer with the equivalent of an aligned cache line without filling this data to the cache. Any memory-type aliased lines in the cache will be snooped and flushed. Subsequent MOVNTDQA reads to unread portions of the WC cache line will receive data from the temporary internal buffer if data is available. “ (Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2)

This sounds perfect for our VRAM and WC System Memory buffers, as we typically only read 16 bytes per instruction and this allows us to read entire cache lines at a time.

It turns out that Mesa already implemented a streaming memcpy using these instructions so all we had to do was throw that into our benchmark and write a corresponding memcpy that does non-temporal stores to benchmark writing to these memory regions.
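
To give an idea of what such a streaming memcpy looks like, here is a minimal sketch using SSE4.1 intrinsics. This is not the Mesa implementation: it assumes 16-byte-aligned pointers, a size that is a multiple of 64 and compilation with SSE4.1 enabled, and it omits the unaligned head/tail handling and fencing that real code needs.

#include <immintrin.h>
#include <stddef.h>

/* Copy out of write-combined (VRAM/USWC) memory using non-temporal loads.
 * Assumes dst and src are 16-byte aligned and n is a multiple of 64. */
static void streaming_load_memcpy(void *dst, const void *src, size_t n)
{
   __m128i *d = (__m128i *)dst;
   __m128i *s = (__m128i *)src;

   for (size_t i = 0; i < n; i += 64) {
      /* Four 16-byte MOVNTDQA loads cover one cache line, which the CPU
       * can then satisfy from its temporary internal WC read buffer. */
      __m128i a = _mm_stream_load_si128(s + 0);
      __m128i b = _mm_stream_load_si128(s + 1);
      __m128i c = _mm_stream_load_si128(s + 2);
      __m128i e = _mm_stream_load_si128(s + 3);
      _mm_store_si128(d + 0, a);
      _mm_store_si128(d + 1, b);
      _mm_store_si128(d + 2, c);
      _mm_store_si128(d + 3, e);
      s += 4;
      d += 4;
   }
}

The write direction is analogous, using _mm_stream_si128() for the non-temporal stores followed by an _mm_sfence().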

As a reminder, we look into three allocation types that are exposed by the amdgpu Linux kernel driver:

  • VRAM. This lives on the GPU and is mapped with Uncacheable Speculative Write Combining (USWC) on the CPU. This means that accesses from the CPU are not cached, but writes can be write-combined.

  • Cacheable system memory. This is system memory that has caching enabled on the CPU, and there is cache snooping to ensure the memory is coherent between the CPU and GPU (up to the top-level caches; the GPU caches do not participate in the coherence).

  • USWC system memory. This is system memory that is mapped with Uncacheable Speculative Write Combining on the CPU. This can lead to slight performance benefits compared to cacheable system memory due to lack of cache snooping.

Furthermore, this still uses an RX 6800 XT + a 2990WX with 4-channel 3200 MT/s RAM.

method                     | VRAM (MiB/s) | Cacheable System Memory (MiB/s) | USWC System Memory (MiB/s)
read via memcpy            |           15 |                           11488 |                        137
write via memcpy           |        10028 |                           18249 |                      11480
read via streaming memcpy  |          756 |                            6719 |                       4409
write via streaming memcpy |        10550 |                           14737 |                      11652

Using this memcpy implementation we get significantly better performance in uncached memory situations: 50x for VRAM and 26x for USWC system memory. If this is a significant bottleneck in your workload, this can be a game changer. Or, if you were using SDMA to avoid this hit, you might be able to do things at significantly lower latency. That said, it is not at a level where it does not matter: for big copies, using DMA can still be a significant win.

Note that I initially gave an explanation on why the non-temporal loads should be faster, but the increases in performance are significantly above what something that just fiddles with loading entire cachelines would achieve. I have not dug into the why of the performance increase.

DMA performance

I have been claiming DMA is faster for CPU readbacks of VRAM in both this article and the previous article on the topic. One might ask how fast DMA is then. To demonstrate this I benchmarked VRAM<->Cacheable System Memory copies using the SDMA hardware block on Radeon GPUs.

Note that there is a significant overhead per copy here due to submitting work to the GPU, so I will show results vs. copy size. The rate is measured while doing a wait after each individual copy and taking the wall clock time, as these use cases tend to be latency sensitive and hence batching is not too interesting.

copy size | copy from VRAM (MiB/s) | copy to VRAM (MiB/s)
4 KiB     |                     62 |                   63
16 KiB    |                    245 |                  240
64 KiB    |                    953 |                 1015
256 KiB   |                   3106 |                 3082
1 MiB     |                   6715 |                 7281
4 MiB     |                   9737 |                11636
16 MiB    |                  12129 |                12158
64 MiB    |                  13041 |                12975
256 MiB   |                  13429 |                13387

This shows that for reads, DMA is faster than a normal memcpy at 4 KiB and faster than a streaming memcpy at 64 KiB. Of course, one still needs to do the CPU access at that point, but at both these thresholds, even with an additional CPU memcpy, the total process should still be fast with DMA.

June 10, 2021

TL;DR: Tag your GPT partitions with the right, descriptive partition types, and the world will become a better place.

A number of years ago we started the Discoverable Partitions Specification which defines GPT partition type UUIDs and partition flags for the various partitions Linux systems typically deal with. Before the specification all Linux partitions usually just used the same type, basically saying "Hey, I am a Linux partition" and not much else. With this specification the GPT partition type, flags and label system becomes a lot more expressive, as it can tell you:

  1. What kind of data a partition contains (i.e. is this swap data, a file system or Verity data?)
  2. What the purpose/mount point of a partition is (i.e. is this a /home/ partition or a root file system?)
  3. What CPU architecture a partition is intended for (i.e. is this a root partition for x86-64 or for aarch64?)
  4. Shall this partition be mounted automatically? (i.e. without being specifically configured via /etc/fstab)
  5. And if so, shall it be mounted read-only?
  6. And if so, shall the file system be grown to its enclosing partition size, if smaller?
  7. Which partition contains the newer version of the same data (i.e. multiple root file systems, with different versions)

By embedding all of this information inside the GPT partition table disk images become self-descriptive: without requiring any other source of information (such as /etc/fstab) if you look at a compliant GPT disk image it is clear how an image is put together and how it should be used and mounted. This self-descriptiveness in particular breaks one philosophical weirdness of traditional Linux installations: the original source of information which file system the root file system is, typically is embedded in the root file system itself, in /etc/fstab. Thus, in a way, in order to know what the root file system is you need to know what the root file system is. 🤯 🤯 🤯

(Of course, the way this recursion is traditionally broken up is by then copying the root file system information from /etc/fstab into the boot loader configuration, resulting in a situation where the primary source of information for this — i.e. /etc/fstab — is actually mostly irrelevant, and the secondary source — i.e. the copy in the boot loader — becomes the configuration that actually matters.)

Today, the GPT partition type UUIDs defined by the specification have been adopted quite widely, by distributions and their installers, as well as a variety of partitioning tools and other tools.

In this article I want to highlight how the various tools the systemd project provides make use of the concepts the specification introduces.

But before we start with that, let's underline why tagging partitions with these descriptive partition type UUIDs (and the associated partition flags) is a good thing, besides the philosophical points made above.

  1. Simplicity: in particular OS installers become simpler — adjusting /etc/fstab as part of the installation is not necessary anymore, as the partitioning step already put all information into place for assembling the system properly at boot. i.e. installing doesn't mean that you always have to get fdisk and /etc/fstab into place, the former suffices entirely.

  2. Robustness: since partition tables mostly remain static after installation the chance of corruption is much lower than if the data is stored in file systems (e.g. in /etc/fstab). Moreover by associating the metadata directly with the objects it describes the chance of things getting out of sync is reduced. (i.e. if you lose /etc/fstab, or forget to rerun your initrd builder you still know what a partition is supposed to be just by looking at it.)

  3. Programmability: if partitions are self-descriptive it's much easier to automatically process them with various tools. In fact, this blog story is mostly about that: various systemd tools can naturally process disk images prepared like this.

  4. Alternative entry points: on traditional disk images, the boot loader needs to be told which kernel command line option root= to use, which then provides access to the root file system, where /etc/fstab is then found which describes the rest of the file systems. Where precisely root= is configured for the boot loader highly depends on the boot loader and distribution used, and is typically encoded in a Turing complete programming language (Grub…). This makes it very hard to automatically determine the right root file system to use, to implement alternative entry points to the system. By alternative entry points I mean other ways to boot the disk image, specifically for running it as a systemd-nspawn container — but this extends to other mechanisms where the boot loader may be bypassed to boot up the system, for example qemu when configured without a boot loader.

  5. User friendliness: it's simply a lot nicer for the user looking at a partition table if the partition table explains what is what, instead of just saying "Hey, this is a Linux partition!" and nothing else.

Uses for the concept

Now that we cleared up the Why?, lets have a closer look how this is currently used and exposed in systemd's various components.

Use #1: Running a disk image in a container

If a disk image follows the Discoverable Partition Specification then systemd-nspawn has all it needs to just boot it up. Specifically, if you have a GPT disk image in a file foobar.raw and you want to boot it up in a container, just run systemd-nspawn -i foobar.raw -b, and that's it (you can specify a block device like /dev/sdb too if you like). It becomes easy and natural to prepare disk images that can be booted either on a physical machine, inside a virtual machine manager or inside such a container manager: the necessary meta-information is included in the image, easily accessible before actually looking into its file systems.

Use #2: Booting an OS image on bare-metal without /etc/fstab or kernel command line root=

If a disk image follows the specification, in many cases you can remove /etc/fstab (or never even install it) — as the basic information needed is already included in the partition table. The systemd-gpt-auto-generator logic implements automatic discovery of the root file system as well as all auxiliary file systems. (Note that the former requires an initrd that uses systemd; some more conservative distributions do not support that yet, unfortunately). Effectively this means you can boot up a kernel/initrd with an entirely empty kernel command line, and the initrd will automatically find the root file system (by looking for a suitably marked partition on the same drive the EFI System Partition was found on).

(Note: if /etc/fstab or root= exist and contain relevant information, they always take precedence over the automatic logic. This is in particular useful to tweak things by specifying additional mount options and such.)

Use #3: Mounting a complex disk image for introspection or manipulation

The systemd-dissect tool may be used to introspect and manipulate OS disk images that implement the specification. If you pass the path to a disk image (or block device) it will extract various bits of useful information from the image (e.g. what OS is this? what partitions to mount?) and display it.

With the --mount switch a disk image (or block device) can be mounted to some location. This is useful for looking at what is inside it, or for changing its contents. This will dissect the image and then automatically mount all contained file systems matching their GPT partition description to the right places, so that you could subsequently chroot into it. (But why chroot if you can just use systemd-nspawn? 😎)

Use #4: Copying files in and out of a disk image

The systemd-dissect tool also has two switches --copy-from and --copy-to which allow copying files out of or into a compliant disk image, taking all included file systems and the resulting mount hierarchy into account.

Use #5: Running services directly off a disk image

The RootImage= setting in service unit files accepts paths to compliant disk images (or block device nodes), and can mount them automatically, running service binaries directly off them (in chroot() style). In fact, this is the base for the Portable Service concept of systemd.

Use #6: Provisioning disk images

systemd provides various tools that can run operations provisioning disk images in an "offline" mode. Specifically:

systemd-tmpfiles

With the --image= switch systemd-tmpfiles can directly operate on a disk image, and for example create all directories and other inodes defined in its declarative configuration files included in the image. This can be useful for example to set up the /var/ or /etc/ tree according to such configuration before first boot.

systemd-sysusers

Similarly, the --image= switch of systemd-sysusers tells the tool to read the declarative system user specifications included in the image and synthesize system users from them, writing them to the /etc/passwd (and related) files in the image. This is useful for provisioning these users before the first boot, for example to ensure UID/GID numbers are pre-allocated rather than having such allocations delayed until first boot.

systemd-machine-id-setup

The --image= switch of systemd-machine-id-setup may be used to provision a fresh machine ID into /etc/machine-id of a disk image, before first boot.

systemd-firstboot

The --image= switch of systemd-firstboot may be used to set various basic system settings (such as root password, locale information, hostname, …) on the specified disk image, before booting it up.

Use #7: Extracting log information

The journalctl switch --image= may be used to show the journal log data included in a disk image (or, as usual, the specified block device). This is very useful for analyzing failed systems offline, as it gives direct access to the logs without any further, manual analysis.

Use #8: Automatic repartitioning/growing of file systems

The systemd-repart tool may be used to repartition a disk or image in a declarative and additive way. One primary use case for it is to run during boot on physical or VM systems to grow the root file system to the disk size, or to add in, format, encrypt and populate additional partitions at boot.

With its --image= switch the tool may operate on compliant disk images in an offline mode of operation: it will then read the partition definitions that shall be grown or created off the image itself, and then apply them to the image. This is particularly useful in combination with the --size= switch, which allows growing disk images to the specified size.

Specifically, consider the following workflow: you download a minimized disk image foobar.raw that contains only the minimized root file system (and maybe an ESP, if you want to boot it on bare metal, too). You then run systemd-repart --image=foobar.raw --size=15G to enlarge the image to 15G, based on the declarative rules defined in the repart.d/ drop-in files included in the image (this means it can grow the root partition, and/or add in more partitions, for example for /srv or so, maybe encrypted with a locally generated key). Then, you proceed to boot it up with systemd-nspawn --image=foobar.raw -b, making use of the full 15G.

Versioning + Multi-Arch

Disk images implementing this specifications can carry OS executables in one of three ways:

  1. Only a root file system

  2. Only a /usr/ file system (in which case the root file system is automatically picked as tmpfs).

  3. Both a root and a /usr/ file system (in which case the two are combined, the /usr/ file system mounted into the root file system, and the former possibly in read-only fashion)

They may also contain OS executables for different architectures, permitting "multi-arch" disk images that can safely boot up on multiple CPU architectures. As the root and /usr/ partition type UUIDs are specific to architectures, this is easily done by including one such partition for x86-64 and another for aarch64. If the image is now used on an x86-64 system, the former partition is automatically used; on aarch64, the latter.

Moreover, these OS executables may be contained in different versions, to implement a simple versioning scheme: when tools such as systemd-nspawn or systemd-gpt-auto-generator dissect a disk image, and they find two or more root or /usr/ partitions of the same type UUID, they will automatically pick the one whose GPT partition label (a 36 character free-form string every GPT partition may have) is the newest according to strverscmp() (OK, truth be told, we don't use strverscmp() as-is, but a modified version with some more modern syntax and semantics, but conceptually identical).

This logic makes it possible to implement a very simple and natural A/B update scheme: an updater can drop multiple versions of the OS into separate root or /usr/ partitions, always updating the partition label to the version included therein once the download is complete. All of the tools described here will then honour this, and always automatically pick the newest version of the OS.
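
As a small aside on the label comparison: even plain strverscmp() (a glibc extension) already orders embedded version numbers numerically, which is what makes labels like the hypothetical ones below sort correctly where a plain strcmp() would not:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>

int main(void) {
   /* Hypothetical GPT partition labels for two versions of a /usr/ partition. */
   const char *a = "myos_usr_0.9";
   const char *b = "myos_usr_0.10";

   /* strverscmp() compares the digit runs numerically, so 0.10 > 0.9;
    * strcmp() would pick 0.9 instead, because '9' > '1'. */
   printf("newest: %s\n", strverscmp(a, b) < 0 ? b : a);
   return 0;
}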

Verity

When building modern OS appliances, security is highly relevant. Specifically, offline security matters: an attacker with physical access should have a difficult time modifying the OS in a way that isn't noticed. i.e. think of a car or a cell network base station: these appliances are usually parked/deployed in environments attackers can get physical access to: it's essential that in this case the OS itself is sufficiently protected, so that the attacker cannot just mount the OS file system image, make modifications (inserting a backdoor, spying software or similar) and have the system otherwise continue to run without this being immediately detected.

A great way to implement offline security is via Linux' dm-verity subsystem: it allows securely binding immutable disk IO to a single, short trusted hash value: if an attacker manages to modify the disk image offline, the modified disk image won't match the trusted hash anymore and will not be trusted anymore (depending on policy this then just results in IO errors being generated, or in automatic reboot/power-off).

The Discoverable Partitions Specification declares how to include Verity validation data in disk images, and how to relate them to the file systems they protect, thus making it very easy to deploy and work with such protected images. For example, systemd-nspawn supports a --root-hash= switch, which accepts the Verity root hash and will then automatically assemble dm-verity with it, automatically matching up the payload and verity partitions. (Alternatively, just place a .roothash file next to the image file.)

Future

The above already is a powerful tool set for working with disk images. However, there are some more areas I'd like to extend this logic to:

bootctl

Similar to the other tools mentioned above, bootctl (which is a tool to interface with the boot loader, and install/update systemd's own EFI boot loader sd-boot) should learn a --image= switch, to make installation of the boot loader on disk images easy and natural. It would automatically find the ESP and other relevant partitions in the image, and copy the boot loader binaries into them (or update them).

coredumpctl

Similar to the existing journalctl --image= logic the coredumpctl tool should also gain an --image= switch for extracting coredumps from compliant disk images. The combination of journalctl --image= and coredumpctl --image= would make it exceptionally easy to work with OS disk images of appliances and extracting logging and debugging information from them after failures.

And that's all for now. Please refer to the specification and the man pages for further details. If your distribution's installer does not yet tag the GPT partitions it creates with the right GPT type UUIDs, consider asking them to do so.

Thank you for your time.

June 09, 2021

Memes

We’ve all been there. No matter how 10x someone is or feels, everyone has had a moment where abruptly they say to themselves, HOW THE FUCK DO THREADS EVEN WORK?

This may be precipitated by any number of events, including, but not limited to:

  • forgetting a lock
  • forgetting to unlock
  • missing an unlock at an early return
  • forgetting to initialize a lock
  • forgetting to spawn a thread
  • forgetting to signal a conditional
  • forgetting to initialize a conditional
  • running the test case with the wrong driver

I’m not going to say that I’ve been there recently.

I’m not going to say that it was today, nor am I going to state, on the record, that at least one existing zink-wip snapshot may or may not be affected by an issue which may or may not be on the above list.

I’m not going to say any of these things.

What I am going to do is talk about a new oom handler I’ve been working on to handle the dreaded spec@!opengl 1.1@streaming-texture-leak case from piglit.

The Case

This test is annoying in that it is effectively a test of a driver’s ability to throttle itself when an app is generating and using $infinity textures without ever explicitly triggering a flush.

In short, it’s:

for (i = 0; i < 5000; i++) {
   glGenTextures(1, &texture);
   glBindTexture(GL_TEXTURE_2D, texture);
   glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, TEX_SIZE, TEX_SIZE, 0, GL_RGBA, GL_UNSIGNED_BYTE, tex_buffer);
   piglit_draw_rect_tex(0, 0, piglit_width, piglit_height, 0, 0, 1, 1);
   glDeleteTextures(1, &texture);
}

The textures are “deleted”, yes, but because they’re in use, the driver can’t actually delete them at the time of the call, meaning that they can only truly be deleted once they are no longer in use by the GPU. At some iteration, this will begin to oom the GPU, and the driver will have to determine how to handle things.

The Zink Case

At present, mainline zink uses a hammer-and-nail methodology that I came up with last year: the total amount of GPU memory in use by resources in a given cmdbuf is tracked, and that amount is tracked per-context. If the in-use context memory exceeds a threshold of the total VRAM, the driver stalls, thereby freeing up all the resources that are in use so they can be recycled into new ones.

There’s a number of problems with this approach, but the biggest one is that it fails to account for cases like an AAA game that just uses as much memory as it can in order to optimize performance/resolution/graphics. I discovered such a case some time ago while running Tomb Raider, and then I set out to improve things since it was costing me about 10% of my perf on the title screen.

The annoying part of this problem is that the piglit test is a very uncommon case, and it’s tricky to handle it in a way that doesn’t also impact other cases which appear similar but need to not get memory-clamped. As a result, it’s tough to really do anything based on “overall” memory usage.

In the end, what I decided on was using the per-cmdbuf memory usage counter to trigger a check for completed cmdbufs on submit, iterating over all the pending ones to check whether they’ve completed, resetting them and freeing associated resources when possible. This yields good memory reclaiming behavior for problem cases while leaving games like Tomb Raider untouched and definitely not deadlocking or anything like that.
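
In rough, toy C (invented names and types, not the actual zink code), the shape of that submit-time check is something like this:

#include <stdbool.h>
#include <stddef.h>

struct toy_batch {
   bool submitted;        /* cmdbuf has been handed to the GPU */
   bool gpu_done;         /* what a fence/timeline query would tell us */
   size_t resource_bytes; /* memory pinned by this cmdbuf's resources */
};

/* Called on submit once the running memory counter crosses a threshold:
 * walk the pending cmdbufs and reclaim whatever the GPU has finished,
 * without stalling on the ones that are still in flight. */
size_t toy_reclaim_completed(struct toy_batch *batches, unsigned count,
                             size_t in_use, size_t threshold)
{
   if (in_use < threshold)
      return in_use;

   for (unsigned i = 0; i < count; i++) {
      struct toy_batch *b = &batches[i];
      if (!b->submitted || !b->gpu_done)
         continue;                 /* still in flight: leave it alone */
      in_use -= b->resource_bytes; /* deleted textures can now really die */
      b->resource_bytes = 0;
      b->submitted = false;        /* cmdbuf is reset and can be reused */
   }
   return in_use;
}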

June 02, 2021

Remember When…

I said I’d be blogging every day about some changes? And that was a month ago or however long it’s been? And we all had a good chuckle at the idea that I could blog every day like how things used to be?

Yeah, I remember that too.

Anyway, Bas still hasn’t blogged, so let’s check the blogenda:

  • handwaving about C++ draw templates
  • some obscure vbuf thing
  • shower
  • make zink usable for gaming
  • complain about construction
  • improve shader caching
  • this week’s queue rewrite
  • some other stuff
  • suballocator?

I guess it’s that time of the week again because the schedule says it’s time to talk about this week’s (or whenever it was) major rewrite of zink’s queue handling. But first, only 90s kids will remember that time I blogged about a major queue rewrite and was excited to almost be hitting 70% of native performance.

Synchronization

A common use of GL for big games is using multiple GL contexts to parallelize work. There’s a lot of tricky restrictions for this, both on the app side and the driver side, but this is sort of the closest thing to multiple cmdbufs that GL provides.

We all recall how zink now uses a monotonic queue: upon commencing recording, each cmdbuf gets tagged with a 32bit integer id that doubles as a timeline semaphore id for fencing. The queue iterates, the cmdbuf counter increments, queue submission is done per-context in a thread, the GPU gets triangles, everyone is happy.

But how well does that work out with multiple contexts?

Pretty well, it turns out, as long as you’re using a Vulkan driver that doesn’t actually check to ensure you’re using monotonic ids for your timeline values. Let’s check out a totally hypothetical scenario that isn’t just Steam:

  • Have contexts A and B
  • Context A starts recording, gets id 1 (nonzero id club represent)
  • Context B starts recording, gets id 2
  • Context A finishes recording, submits cmdbuf
  • Context B finishes recording, submits cmdbuf
  • Timeline wait on id 1
  • Timeline wait on id 2

So far so good. But then we get past the “Checking for updates” window:

  • Context A starts recording, gets id 3
  • Context B starts recording, gets id 4
  • Context B finishes recording, submits cmdbuf
  • Context A finishes recording, submits cmdbuf
  • Timeline wait on id 3
  • Timeline wait on id 4

thonking.png

So now context B’s submit thread is dumping cmdbuf 4’s triangles into the GPU, then context A’s submit thread is also trying to dump cmdbuf 3’s triangles into the GPU, but the wait order for the timeline is still A -> B, meaning that the values are not monotonic.

Will any drivers care?

Magic 8-ball says no, no drivers care about this and everything still works fine. That’s cool and interesting, but probably it’d be better to not do that.

This Time It’s Definitely Fixed

The problem here is two problems:

  • the queue submission thread is context-based when it should be screen-based
  • cmdbufs get an id when they start recording, not when they get submitted

The first problem is easy to fix: just deduplicate the thread and move the struct member.

The second one is trickier because everything in zink relies on cmdbufs getting an id as soon as they become active. This is done so that any resources written to by a given cmdbuf can have their usage tracked for synchronization purposes, e.g., reading back a buffer only after all its writes have landed.

The problem is further complicated by zink not having a great API barrier between directly accessing the “usage” for a resource and the value itself, by which I mean parts of the codebase directly reading the integer value vs having an API to wrap it; the latter would enable replacing the mechanism with whatever I wanted, so I decided to start by creating such a wrapper based on this:

struct zink_batch_usage {
   uint32_t usage;
   bool unflushed;
};

This is the existing struct zink_batch_usage but now with a bool value indicating that this cmdbuf is yet to be flushed. Each cmdbuf batch now has this sub-struct inlined onto it, and resources in zink can take references (pointers) to a specific cmdbuf’s usage struct. Because batches are never destroyed, this means the wrapper API can always dereference the struct to determine how to synchronize the usage: if it’s unflushed, it can flush or sync the flush thread; if it’s real, pending usage, it can safely wait on that usage as a timeline value and guarantee monotonic ordering.

bool
zink_screen_usage_check_completion(struct zink_screen *screen, const struct zink_batch_usage *u)
{
   if (!zink_batch_usage_exists(u))
      return true;
   if (zink_batch_usage_is_unflushed(u))
      return false;

   return zink_screen_batch_id_wait(screen, u->usage, 0);
}

bool
zink_batch_usage_check_completion(struct zink_context *ctx, const struct zink_batch_usage *u)
{
   if (!zink_batch_usage_exists(u))
      return true;
   if (zink_batch_usage_is_unflushed(u))
      return false;
   return zink_check_batch_completion(ctx, u->usage);
}

void
zink_batch_usage_wait(struct zink_context *ctx, const struct zink_batch_usage *u)
{
   if (!zink_batch_usage_exists(u))
      return;
   if (zink_batch_usage_is_unflushed(u))
      zink_fence_wait(&ctx->base);
   else
      zink_wait_on_batch(ctx, u->usage);
}
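
As an illustration of how the wrapper gets used (the resource type here is invented, but zink_batch_usage_wait() is the function shown above), a readback path simply waits on whatever usage pointer the resource last recorded, regardless of whether that cmdbuf has been flushed yet:

struct toy_resource {
   struct zink_batch_usage *write_usage; /* set whenever a cmdbuf writes the resource */
};

static void
toy_readback_prepare(struct zink_context *ctx, struct toy_resource *res)
{
   /* Handles flushed and unflushed cmdbufs alike; see zink_batch_usage_wait() above. */
   zink_batch_usage_wait(ctx, res->write_usage);
}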

Now things render exactly the same, but with a truly monotonic queue underneath that’s conformant to specifications.

June 01, 2021

As the President of the GNOME Foundation Board of Directors, I’m really pleased to see the number and breadth of candidates we have for this year’s election. Thank you to everyone who has submitted their candidacy and volunteered their time to support the Foundation. Allan has recently blogged about how the board has been evolving, and I wanted to follow that post by talking about where the GNOME Foundation is in terms of its strategy. This may be helpful as people consider which candidates might bring the best skills to shape the Foundation’s next steps.

Around three years ago, the Foundation received a number of generous donations, and Rosanna (Director of Operations) gave a presentation at GUADEC about her and Neil’s (Executive Director, essentially the CEO of the Foundation) plans to use these funds to transform the Foundation. We would grow our activities, increasing the pace of events, outreach, development and infrastructure that supported the GNOME project and the wider desktop ecosystem – and, crucially, would grow our funding to match this increased level of activity.

I think it’s fair to say that half of this has been a great success – we’ve got a larger staff team than GNOME has ever had before. We’ve widened the GNOME software ecosystem to include related apps and projects under the GNOME Circle banner, we’ve helped get GTK 4 out of the door, run a wider-reaching program in the Community Engagement Challenge, and consistently supported better infrastructure for both GNOME and the Linux app community in Flathub.

Aside from another grant from Endless (note: my employer), our fundraising hasn’t caught up with this pace of activities. As a result, the Board recently approved a budget for this financial year which will spend more funds from our reserves than we expect to raise in income. Due to our reserves policy, this is essentially the last time we can do this: over the next 6-12 months we need to either raise more money, or start spending less.

For clarity – the Foundation is fit and well from a financial perspective – we have a very healthy bank balance, and a very conservative “12 month run rate” reserve policy to handle fluctuations in income. If we do have to slow down some of our activities, we will return to a “steady state” where our regular individual donations and corporate contributions can support a smaller staff team that supports the events and infrastructure we’ve come to rely on.

However, this isn’t what the Board wants to do – the previous and current boards were unanimous in their support of the idea that we should be ambitious: try to do more in the world and bring the benefits of GNOME to more people. We want to take our message of trusted, affordable and accessible computing to the wider world.

Typically, a lot of the activities of the Foundation have been very inwards-facing – supporting and engaging with either the existing GNOME or Open Source communities. This is a very restricted audience in terms of fundraising – many corporate actors in our community already support GNOME hugely in terms of both financial and in-kind contributions, and many OSS users are already supporters either through volunteer contributions or donating to those nonprofits that they feel are most relevant and important to them.

To raise funds from new sources, the Foundation needs to take the message and ideals of GNOME and Open Source software to new, wider audiences that we can help. We’ve been developing themes such as affordability, privacy/trust and education as promising areas for new programs that broaden our impact. The goal is to find projects and funding that allow us to both invest in the GNOME community and find new ways for FOSS to benefit people who aren’t already in our community.

Bringing it back to the election, I’d like to make clear that I see this – reaching the outside world, and finding funding to support that – as the main priority and responsibility of the Board for the next term. GNOME Foundation elections are a slightly unusual process that “filters” our board nominees by being existing Foundation members, which means that candidates already work inside our community when they stand for election. If you’re a candidate and are already active in the community – THANK YOU – you’re doing great work, keep doing it! That said, you don’t need to be a Director to achieve things within our community or gain the support of the Foundation: being a community leader is already a fantastic and important role.

The Foundation really needs support from the Board to make a success of the next 12-18 months. We need to understand our financial situation and the trade-offs we have to make, and help to define the strategy with the Executive Director so that we can launch some new programs that will broaden our impact – and funding – for the future. As people cast their votes, I’d like people to think about what kind of skills – building partnerships, commercial background, familiarity with finances, experience in nonprofit / impact spaces, etc – will help the Board make the Foundation as successful as it can be during the next term.

I Hate Construction.

Specifically when it’s right outside my house and starts at 5:30AM with heavy machinery moving around.

With this said, I’m overdue for a post, and if I don’t set a good example by continuing to blog, why would anyone else? PS. BAS IT’S TIME.

Let’s see what’s on the agenda:

  • handwaving about C++ draw templates
  • some obscure vbuf thing
  • shower
  • make zink usable for gaming
  • complain about construction
  • improve shader caching

Looks like the next thing on the list is shader caching.

The Art Of The Cache

If you’re a long-time zink connoisseur, or if you’re just a casual reader of the blog, you know that zink has a shader cache.

But did you know that it doesn’t actually do anything at present?

Indeed, it was to my chagrin that, upon diving back into my slapdash pipeline cache implementation, I discovered that it was doing absolutely nothing. And this was a different nothing than that one time I didn’t actually pass the cache back to the vulkan driver! Yes, this was the nothing of I have a cache, why am I still compiling a hundred pipelines per frame? that the occasional lucky developer runs into every now and again.

But hwhy? Who would do such a thing?

spideymeme.jpg

Past recriminations aside, how does a shader/pipeline cache work, anyway? The gist of it in most Mesa drivers is this:
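
In rough, toy C, that flow looks something like this (invented names and a trivial hash, not the real Mesa disk-cache code, which keys on a SHA-1 of the shader plus driver identifiers):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static uint64_t hash_text(const char *text)
{
   uint64_t h = 0xcbf29ce484222325ull;   /* FNV-1a, just to keep the sketch short */
   while (*text) {
      h ^= (uint8_t)*text++;
      h *= 0x100000001b3ull;
   }
   return h;
}

/* Stand-in for the real compiler: just returns a "binary" for the shader. */
static char *compile_shader(const char *text)
{
   return strdup(text);
}

/* The key is a hash of the shader text, so identical shaders used by
 * different programs (or different runs) share one cache entry. */
static char *get_shader_binary(const char *text)
{
   char path[64], *blob;
   FILE *f;

   snprintf(path, sizeof(path), "/tmp/toy-shader-cache-%016llx",
            (unsigned long long)hash_text(text));

   if ((f = fopen(path, "rb"))) {              /* cache hit: skip compilation */
      fseek(f, 0, SEEK_END);
      long size = ftell(f);
      fseek(f, 0, SEEK_SET);
      blob = malloc(size + 1);
      size_t got = fread(blob, 1, size, f);
      blob[got] = 0;
      fclose(f);
      return blob;
   }

   blob = compile_shader(text);                /* cache miss: compile... */
   if ((f = fopen(path, "wb"))) {              /* ...and store for next time */
      fwrite(blob, 1, strlen(blob), f);
      fclose(f);
   }
   return blob;
}

int main(void)
{
   char *bin = get_shader_binary("void main() { gl_Position = vec4(0.0); }");
   puts(bin);
   free(bin);
   return 0;
}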

Thus a shader gets cached based on its text representation, enabling matching shaders across programs to use the same cache entry. After noting the success of Steam’s fossilize-based single file cache, I decided to use a single file for zink’s shader cache.

Oops.

The problem in this case was that I was just jamming all the pipelines into a single file, written once at program exit, and expecting the Vulkan driver to figure things out.

But what if the program didn’t exit cleanly? Or what if the write failed for some reason?

In short, the pipeline cache was mostly being written as a big block of garbage data. Not very useful.

Next-Level Caching Technique

Clearly I needed to reeducate myself in the ways of managing a cache, something that, in my former life as a GUI expert, I did routinely, but that I could no longer comprehend now that I only speak bitfields and command buffers.

I sought out the reclusive Timothy Arceri, a well-known sage in many esoteric, arcane arts, and, as I recall it, purveyor of great wisdom such as (paraphrased because the original text has been lost to the ages): We Both Know The GLSL Compiler Code For Uniform Blocks Is Unfathomable, Why Do You Insist On Attempting To Modify It?

The answers I received from my sojourn were swift and concise:

Stop that. Fossilize caching wasn’t meant to work that way.

My thoughts whirling, confidence badly shaken, I stumbled and fell from the summit of the mountain and dashed my heretical cache implementation against the solid foundation of git rebase -i.

What had I been thinking?

It was back to the charts for me, and this time I had a number of different goals:

  • go back to multi-file caching (since it’s the only option)
  • smaller caches
  • more frequent updates
  • fully async

Turns out this wasn’t actually as hard as expected?

More Flowcharts (Fulfilling Image Quota For Graphics Blog)

Because we’re all big Vulkan adults, we do big Vulkan pipeline caches instead of wimpy OpenGL shader caches like so:
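
Roughly, with invented names, the grouped cache key looks something like this: one key per group of program shaders, plus the driver identifier that comes up again below.

#include <stdint.h>
#include <string.h>

#define TOY_MAX_STAGES 5   /* vert, tesc, tese, geom, frag */

/* Toy sketch: the pipeline-cache key covers every shader stage in the
 * program, plus a driver/build identifier so stale caches are ignored
 * after a driver update. */
struct toy_pipeline_cache_key {
   uint8_t stage_sha1[TOY_MAX_STAGES][20];
   uint8_t driver_sha1[20];
};

void toy_pipeline_cache_key_init(struct toy_pipeline_cache_key *key,
                                 const uint8_t (*stage_sha1)[20],
                                 unsigned num_stages,
                                 const uint8_t driver_sha1[20])
{
   memset(key, 0, sizeof(*key));
   for (unsigned i = 0; i < num_stages && i < TOY_MAX_STAGES; i++)
      memcpy(key->stage_sha1[i], stage_sha1[i], 20);
   memcpy(key->driver_sha1, driver_sha1, 20);
}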

This has the added benefit of providing all the state variants for a given shader pipeline, saving additional lookups and ensuring that all the potential compiled pipelines are available at once. Furthermore, because there’s a (very) short delay between knowing what shaders are grouped together and needing the actual compiled pipeline, I can dump this all into a thread and handle the lookup while I update descriptors #2021 ASYNC THREADS++++++ BAYBEEEEEE.

But also, also, the one thing to absolutely not ever forget or else it’ll be really embarrassing is to ensure that you add your driver’s sha1 hash to your disk cache lookup, otherwise the whole thing explodes and @tarceri will frown down upon you.