planet.freedesktop.org
June 18, 2021

Fast Friday

In short, an issue was filed recently about getting the Nine state tracker working with zink.

Was it the first? No..

Was it the first one this year? Yes.

Thus began a minutes-long, helter-skelter sequence of events to get Nine up and running, spread out over the course of a day or two. In need of a skilled finagler knowledgeable in the mysterium of Gallium state trackers, I contacted the only developer I know with a rockstar name, Axel Davy. We set out at dawn, and I strapped on my parachute. It was almost immediately that I heard a familiar call: there’s a build issue.

Next stop was crashing in unimplemented interface methods, with a stopover in flailing about wildly in TGSI (what even is this?), before I arrived at my target:

nine.png

Ah, glorious triangles.

June 16, 2021

I Said I Would

A long, long time ago in a month far, far away I said I was going to blog about some improvements I’d been working on for zink. I blogged about some of them, but one was conspicuously absent from the original list:

  • make zink usable for gaming

There’s a lot that goes into this item. The post you’re reading now isn’t about to go so far as to claim that zink(-wip) is usable for gaming. No, that day is still far, far away. But this post is going to be the first step.

To begin with, a riddle: what change was made to zink between these two screenshots?

tr-slow.png

tr-zoom.png

That’s right, I put the punchline in the title.

A suballocator.

What Is A Suballocator?

A suballocator is a mechanism by which small blocks of memory can be suballocated out of a larger one. For example, if I want to allocate a 64-byte chunk of memory, I could allocate it directly and get my block, or I could allocate a 4096-byte chunk of memory and then take 64 bytes out of it.
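To make that concrete, here's a toy sketch in C (mine, not zink's code) that carves 64-byte blocks out of a single 4096-byte slab using a free-block bitmask:

#include <stdint.h>
#include <stdlib.h>

/* Toy example only: 64 blocks of 64 bytes inside one 4096-byte slab. */
#define SLAB_SIZE  4096
#define BLOCK_SIZE 64

struct slab {
   uint8_t *base;        /* the one real allocation */
   uint64_t free_mask;   /* bit i set = block i is free */
};

static void *
slab_suballoc(struct slab *s)
{
   if (!s->base) {
      s->base = malloc(SLAB_SIZE);   /* allocate the big chunk once */
      if (!s->base)
         return NULL;
      s->free_mask = UINT64_MAX;     /* all 64 blocks start out free */
   }
   if (!s->free_mask)
      return NULL;                   /* slab exhausted: a real allocator grabs a new slab */
   unsigned i = __builtin_ctzll(s->free_mask);   /* lowest free block */
   s->free_mask &= ~(UINT64_C(1) << i);
   return s->base + i * BLOCK_SIZE;
}

static void
slab_subfree(struct slab *s, void *ptr)
{
   unsigned i = (unsigned)(((uint8_t *)ptr - s->base) / BLOCK_SIZE);
   s->free_mask |= UINT64_C(1) << i;
}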

When performance is involved, it's important to consider the time-cost of allocations, so yes, it's useful to have already allocated another 63 instances of 64 bytes when I need a second one. But there's another, deeper issue that also needs to be addressed, especially as it relates to gaming: 32-bit environments.

In a 32-bit process, the amount of address space available is limited to 4GB regardless of how much memory is physically present, and some of that address space is reserved for system resources and unavailable for general use. Any time a buffer or image is mapped by the driver in a process, this uses up address space in order to create an addressable region of memory that can be read or written. Once all the address space has been used up, no other resources can be mapped, and it becomes impossible to continue normal operations.

In short, the game crashes.

In Vulkan, and in driver work generally, it's important to keep allocation sizes aligned to the preference of the hardware for a given usage; this amounts to minMemoryMapAlignment, which is 4096 bytes on many drivers. Similarly, vkGetBufferMemoryRequirements and vkGetImageMemoryRequirements return aligned memory sizes, so even if only 64 bytes are needed, 4096 bytes must still be allocated, leaving 4032 bytes unused. This ends up wasting tons of memory when an app allocates lots of smaller regions, and it further wastes address space since Vulkan prohibits memory from being mapped multiple times, meaning that each 64-byte buffer also wastes an additional 4032 bytes of address space.
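As an illustration of where those numbers come from, here's a minimal sketch of the query at the Vulkan API level; it's not zink's code, and it assumes you already have a VkDevice in hand:

#include <vulkan/vulkan.h>

/* Minimal sketch: ask the driver what a tiny 64-byte buffer actually costs. */
static VkDeviceSize
tiny_buffer_mem_size(VkDevice device)
{
   const VkBufferCreateInfo info = {
      .sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO,
      .size = 64,
      .usage = VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT,
      .sharingMode = VK_SHARING_MODE_EXCLUSIVE,
   };
   VkBuffer buf;
   if (vkCreateBuffer(device, &info, NULL, &buf) != VK_SUCCESS)
      return 0;

   VkMemoryRequirements reqs;
   vkGetBufferMemoryRequirements(device, buf, &reqs);
   vkDestroyBuffer(device, buf, NULL);

   /* reqs.size and reqs.alignment are driver-dependent and usually larger than
    * 64; mapping a dedicated allocation for this still burns at least a page of
    * address space, which is exactly what a suballocator avoids. */
   return reqs.size;
}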

While 4k of memory may seem like a small amount (and why would anyone ever need more than 256kb of memory anyway?), these allocations all add up, fast enough that zink runs out of address space in a 32-bit game like Tomb Raider within a couple of minutes.

Playable?

Probably not.

The Solution, As Always

If you’re working in Mesa, you basically have two options when you come across a new problem: delete some code or copy some code. It’s not often that I come across an issue which can’t be resolved by one of the two.

In this case, I had known for quite a while that the solution was going to be copying some code. Thus I entered the realm of Gallium's awesome auxiliary/pipebuffer, a fearsome component that had only been leveraged by one driver.

zink_bo.png

Yup, it was time to throw more galaxybrain.jpg code into the blender and see what came out. Ultimately, I was able to repurpose a lot of the core calculation code for sizing allocations, which saved me from having to do any kind of thinking or maffs. This let me cut down my suballocator implementation to a little under 700 lines, leaving much, much, much more space for bugs, er, activities.

At a high level, here’s an overview of aux/pb:

  • call pb_cache_init to set up a memory cache
  • initialize slab allocators with pb_slabs_init
  • when allocating a new resource, determine if it can be slab allocated; if yes, use pb_slab_alloc to reuse/reclaim a slab allocation, otherwise manually allocate new memory
  • when destroying a resource, use pb_reference_with_winsys

There’s more under the hood, but it mostly boils down to filling in the interface functions to manage detecting whether resources are busy or can be reclaimed for reuse. The actual caching/reclaiming/reusing are all handled by aux/pb, meaning I was free to go about breaking everything with all the leftover time that I had.
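As a rough sketch of how that decision plays out, something like the following; the zink_bo/zink_screen types, the helpers, and the size cutoff are illustrative stand-ins, and the include path and pb_slab_alloc() signature are quoted from memory, so treat them as approximate:

#include <stdint.h>
#include "pipebuffer/pb_slab.h"

#define MAX_SLAB_ENTRY_SIZE (64 * 1024)   /* made-up cutoff for the example */

struct zink_bo;
struct zink_screen { struct pb_slabs slabs; };

struct zink_bo *bo_from_slab_entry(struct pb_slab_entry *entry);      /* hypothetical */
struct zink_bo *allocate_dedicated_bo(struct zink_screen *screen,
                                      uint64_t size, unsigned heap);  /* hypothetical */

static struct zink_bo *
allocate_bo(struct zink_screen *screen, uint64_t size, unsigned heap)
{
   /* small allocations get suballocated out of slabs managed by aux/pb */
   if (size <= MAX_SLAB_ENTRY_SIZE) {
      struct pb_slab_entry *entry =
         pb_slab_alloc(&screen->slabs, (unsigned)size, heap);
      if (entry)
         return bo_from_slab_entry(entry);
   }
   /* everything else gets its own dedicated allocation, possibly
    * recycled through the pb_cache */
   return allocate_dedicated_bo(screen, size, heap);
}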

Cultured users of zink-wip can now enjoy massively improved performance (and have already been enjoying it for the past month) in many apps. The rest of you get to sit around and watch while I bang my head against CI while ajax showers me with memes.

June 14, 2021

There’s a lot that has happened in the world of Zink since my last update, so let’s see if I can bring you up to date on the most important stuff.

Upstream development

Gosh, when I last blogged about Zink, it hadn’t even landed upstream in Mesa yet! Well, by now it’s been upstream for quite a while, and most development has moved there.

At the time of writing, we have merged 606 merge-requests labeled “zink”. The current tip of Mesa's main branch totals 1717 commits touching the src/gallium/drivers/zink/ sub-folder, written by 42 different contributors. That's pretty awesome in my eyes; Zink has truly become a community project!

Another noteworthy change is that Mike Blumenkrantz has come aboard the project, and has churned out an incredible amount of improvements to Zink! He got hired by Valve to work on Zink (among other things), and is now the most prolific contributor, with more than twice as many commits as I have written.

If you want a job in Open Source graphics, Zink has a proven track-record as a job-creator! :smile:

In addition to Mike, there’s some other awesome people who have been helping out lately.

Half-Life 2 running with Zink.

OpenGL 4.6 support

Thanks to a lot of hard work by Mike, assisted by Dave Airlie and Adam Jackson, both of Red Hat, Zink is now able to expose the OpenGL 4.6 (Core Profile) feature set, given enough Vulkan features! :tada:

Please note that this doesn't mean that Zink is a conformant implementation yet; there are some details left to be ironed out before we can claim that. In particular, we need to pass the conformance tests and submit a conformance report to Khronos. We're not there yet.

I’m also happy to see that Zink is currently at the top of MesaMatrix (together with LLVMpipe, i965 and RadeonSI), reporting a total of 160 OpenGL extensions at the time of writing!

In theory, that means you can run any OpenGL application you can think of on top of Zink. Mike is hard at work testing the entire Steam game library, and things are working pretty well.

Is this the end of the line for Zink? Are we done now? Not at all! :laughing:

OpenGL compatibility profile

We're still stuck at OpenGL 3.0 for compatibility contexts, mainly due to lack of testing. There are a lot of features that need to work together in relatively complicated ways for this to work for us.

Note that this only matters for applications that rely on legacy OpenGL features. Modern OpenGL programs get OpenGL 4.6 support, as mentioned previously.

I don’t think this is going to be a big deal to enable, but I haven’t spent time on it.

OpenGL ES 3.1 support

Similar to the OpenGL 4.6 support, we’re now able to expose the OpenGL ES 3.1 feature set. This is again thanks to a lot of hard work by Mike and the gang.

Why not OpenGL ES 3.2? This comes down to the GL_KHR_blend_equation_advanced feature. Mike blogged about the issue a while ago, but since then the situation has luckily changed a bit.

The VK_EXT_blend_operation_advanced extension is now available in some drivers like ANV, and we should be able to get this going. But there’s quite a bit of churn needed for this for several reasons, so don’t expect this to be checked off anytime soon.

Lavapipe and continuous integration

To prevent regressions, we've started testing Zink on the Mesa CI system for every change. This is made possible thanks to Lavapipe, a Vulkan software implementation in Mesa that reuses the rasterizer from LLVMpipe.

This means we can run tests on virtual cloud machines without having to depend on unreliable hardware. :robot:

At the time of writing, we’re only exposing OpenGL 4.1 on top of Lavapipe, due to some lacking features. But we have patches in the works to bring this up to OpenGL 4.5, and OpenGL 4.6 probably won’t be far off when that lands.

Windows support

Basic support for Zink on Microsoft Windows has landed. This isn’t particularly useful at the moment, because we need better window-system integration to get anywhere near reasonable performance. But it’s there.

macOS support

Thanks to work by Duncan Hopkins of The Foundry, there’s also some support for macOS. This uses MoltenVK as the Vulkan implementation, meaning that we also support the Vulkan Portability Extension to some degree.

This support isn’t quite as drop-in as on other platforms, because it’s completely lacking window-system integration. But it seems to work for the use-cases they have at The Foundry, so it’s worth mentioning as well.

Driver support

Beyond this, Igalia has brought up Zink on the V3DV driver, and I’ve heard some whispers that there’s some people running Zink on top of Turnip, an open-source Vulkan driver for recent Qualcomm Adreno GPUs.

I’ve heard some people have some success getting things running on NVIDIA, but there’s a few obvious problems in the way there due to the lack of proper DRI support… Which brings us to:

Window System Integration

Another awesome new development is that Adam is working on Penny. So, what’s Penny?

Penny is another way of bringing up Zink, on systems without DRI support. It works as a dedicated GLX integration that uses the VK_KHR_swapchain extension to integrate properly with the native Vulkan driver’s window-system integration instead of Mesa baking its own.

This solves a lot of small, nasty issues in the DRI code-path. I’ll say the magic “implicit synchronization” word, and hope that scares away anyone wondering what it’s about.

Performance

A lot more has happened on the performance front as well, again all thanks to Mike. However, much of this is still out-of-tree, and waiting in Mike’s zink-wip branch.

So instead, I suggest you check out Mike’s blog for the latest performance information (and much more up-to-date info on Zink). There’s been a lot going on, and I’m sure there’s even more to come!

Closing words

I think this should cover the most interesting bits of development.

On a personal note, I recently became a dad for the first time, and as a result I’ll be away for a while on paternity leave, starting early this fall. Luckily, Zink is in good hands with Mike and the rest of the upstream community taking care of things.

I would like to again plug Mike's blog as a great source of Zink-related news, if you're not already following it. He posts a lot more frequently than I do, and he's also an epic meme master, so it's all great fun!

(Added a section on load/store pairs on June 14th)

This question probably seems absurd. An unoptimized memcpy is a simple loop that copies bytes. How hard can that be? Well...

There's a fascinating thread on llvm-dev started by George Mitenkov proposing a new family of "byte" types. I found the proposal and discussion difficult to follow. In my humble opinion, this is because the proposal touches some rather subtle and underspecified aspects of LLVM IR semantics, and rather than address those fundamentals systematically, it jumps right into the minutiae of the instruction set. I look forward to seeing how the proposal evolves. In the meantime, this article is a byproduct of me attempting to digest the problem space.

Here is a fairly natural way to (attempt to) implement memcpy in LLVM IR:

define void @memcpy(i8* %dst, i8* %src, i64 %n) {
entry:
  %dst.end = getelementptr i8, i8* %dst, i64 %n
  %isempty = icmp eq i64 %n, 0
  br i1 %isempty, label %out, label %loop

loop:
  %src.loop = phi i8* [ %src, %entry ], [ %src.next, %loop ]
  %dst.loop = phi i8* [ %dst, %entry ], [ %dst.next, %loop ]
  %ch = load i8, i8* %src.loop
  store i8 %ch, i8* %dst.loop
  %src.next = getelementptr i8, i8* %src.loop, i64 1
  %dst.next = getelementptr i8, i8* %dst.loop, i64 1
  %done = icmp eq i8* %dst.next, %dst.end
  br i1 %done, label %out, label %loop

out:
  ret void
}

Unfortunately, the copy that is written to the destination is not a perfect copy of the source.

Hold on, I hear you think, each byte of memory holds one of 256 possible bit patterns, and this bit pattern is perfectly copied by the `load`/`store` sequence! The catch is that in LLVM's model of execution, a byte of memory can in fact hold more than just one of those 256 values. For example, a byte of memory can be poison, which means that there are at least 257 possible values. Poison is forwarded perfectly by the code above, so that's fine. The trouble starts because of pointer provenance.


What and why is pointer provenance?

From a machine perspective, a pointer is just an integer that is interpreted as a memory address.

For the compiler, alias analysis -- that is, the ability to prove that different pointers point at different memory addresses -- is crucial for optimization. One basic tool in the alias analysis toolbox is to recognize that if pointers point into different "memory objects" -- different stack or heap allocations -- then they cannot alias.

Unfortunately, many pointers are obtained via getelementptr (GEP) using dynamic (non-constant) indices. These dynamic indices could be such that the resulting pointer points into a different memory object than the base pointer. This makes it nearly impossible to determine at compile time whether two pointers point into the same memory object or not.

Which is why there is a rule which says (among other things) that if a pointer P obtained via GEP ends up going out-of-bounds and pointing into a different memory object than the pointer on which the GEP was based, then dereferencing P is undefined behavior even though the pointer's memory address is valid from the machine perspective.

As a corollary, a situation is possible in which there are two pointers whose underlying memory address is identical but whose provenance is different. In that case, it's possible that one of them can be dereferenced while dereferencing the other is undefined behavior.
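Here's a C-level analogy of that corollary (my example, not one from the llvm-dev thread):

/* Same address, different provenance, shown at the source level. */
int f(void) {
   int a[4], b[4];
   int *p = a + 4;   /* one-past-the-end of a: fine to compute, provenance is 'a' */
   int *q = &b[0];   /* provenance is 'b' */
   /* On some executions p and q happen to have the same address, yet *q is
    * well-defined while *p is undefined behavior: p must not be used to
    * access the object b. */
   return *q;
}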

This only makes sense if, in the formal semantics of LLVM IR, pointer values carry more information than just an integer interpreted as a memory address. They also carry provenance information, which is essentially the set of memory objects that can be accessed via this pointer and any pointers derived from it.


Bytes in memory carry provenance information

What is the provenance of a pointer that results from a load instruction? In a clean operational semantics, the load must derive this provenance from the values stored in memory.

If bytes of memory can only hold one of 256 bit patterns (or poison), that doesn't give us much to work with. We could say that the provenance of the pointer is "empty", meaning the pointer cannot be used to access any memory objects -- but that's clearly useless. Or we could say that the provenance of the pointer is "all", meaning the pointer (or pointers derived from it) can be freely used to access all memory objects, assuming the underlying address is adjusted appropriately. That isn't much better.[0]

Instead, we must say that -- as far as LLVM IR semantics are concerned -- each byte of memory holds pointer provenance information in addition to its i8 content. The provenance information in memory is written by pointer store, and pointer load uses it to reconstruct the original provenance of the loaded pointer.

What happens to provenance information in non-pointer load/store? A load can simply ignore the additional information in memory. For store, I see 3 possible choices:

1. Leave the provenance information that already happens to be in memory unmodified.
2. Set the provenance to "empty".
3. Set the provenance to "all".

Looking back at our attempt to implement memcpy, there is no choice which results in a perfect copy. All of the choices lose provenance information.

Without major changes to LLVM IR, only the last choice is potentially viable because it is the only choice that allows dereferencing pointers that are loaded from the memcpy destination.

Should we care about losing provenance information?

Without major changes to LLVM IR, we can only implement a memcpy that loses provenance information during the copy.

So what? Alias analysis around memcpy and code like it ends up being conservative, but reasonable people can argue that this doesn't matter. The burden of evidence lies on whoever wants to make a large change here in order to improve alias analysis.

That said, we cannot just call it a day and go (or stay) home either, because there are related correctness issues in LLVM today, e.g. bug 37469 mentioned in the initial email of that llvm-dev thread.

Here's a simpler example of a correctness issue using our hand-coded memcpy:

define i32 @sample(i32** %pp) {
  %tmp = alloca i32*
  %pp.8 = bitcast i32** %pp to i8*
  %tmp.8 = bitcast i32** %tmp to i8*
  call void @memcpy(i8* %tmp.8, i8* %pp.8, i64 8)
  %p = load i32*, i32** %tmp
  %x = load i32, i32* %p
  ret i32 %x
}

A transform that should be possible is to eliminate the memcpy and temporary allocation:

define i32 @sample(i32** %pp) {
  %p = load i32*, i32** %pp
  %x = load i32, i32* %p
  ret i32 %x
}

This transform is incorrect because it introduces undefined behavior.

To see why, remember that this is the world where we agree that integer stores write an "all" provenance to memory, so %p in the original program has "all" provenance. In the transformed program, this may no longer be the case. If @sample is called with a pointer that was obtained through an out-of-bounds GEP whose resulting address just happens to fall into a different memory object, then the transformed program has undefined behavior where the original program didn't.

We could fix this correctness issue by introducing an unrestrict instruction which elevates a pointer's provenance to the "all" provenance:

define i32 @sample(i32** %pp) {
  %p = load i32*, i32** %pp
  %q = unrestrict i32* %p
  %x = load i32, i32* %q
  ret i32 %x
}

Here, %q has "all" provenance and therefore no undefined behavior is introduced.

I believe that (at least for address spaces that are well-behaved?) it would be correct to fold inttoptr(ptrtoint(x)) to unrestrict(x). The two are really the same.

For that reason, unrestrict could also be used to fix the above-mentioned bug 37469. Several folks in the bug's discussion stated the opinion that the bug is caused by incorrect store forwarding that should be weakened via inttoptr(ptrtoint(x)). unrestrict(x) is simply a clearer spelling of the same idea.


A dead end: integers cannot have provenance information

A natural thought at this point is that the situation could be improved by adding provenance information to integers. This is technically correct: our hand-coded memcpy would then produce a perfect copy of the memory contents.

However, we would get into serious trouble elsewhere because global value numbering (GVN) and similar transforms become incorrect: two integers could compare equal using the icmp instruction, but still be different because of different provenance. Replacing one by the other could result in miscompilation.

GVN is important enough that adding provenance information to integers is a no-go.

I suspect that the unrestrict instruction would allow us to apply GVN to pointers, at the cost of making later alias analysis more conservative and sprinkling unrestrict instructions that may inhibit other transforms. I have no idea what the trade-off is on that.


The "byte" types: accurate representation of memory contents

With all the above in mind, I can see the first-principles appeal of the proposed "byte" types. They allow us to represent the contents of memory accurately in SSA values, and so they fill a real gap in the expressiveness of LLVM IR.

That said, the software development cost of adding a whole new family of types to LLVM is very high, so it better be justified by more than just aesthetics.

Our hand-coded memcpy can be turned into a perfect copier with straightforward replacement of i8 by b8:

define void @memcpy(b8* %dst, b8* %src, i64 %n) {
entry:
  %dst.end = getelementptr b8, b8* %dst, i64 %n
  %isempty = icmp eq i64 %n, 0
  br i1 %isempty, label %out, label %loop

loop:
  %src.loop = phi b8* [ %src, %entry ], [ %src.next, %loop ]
  %dst.loop = phi b8* [ %dst, %entry ], [ %dst.next, %loop ]
  %ch = load b8, b8* %src.loop
  store b8 %ch, b8* %dst.loop
  %src.next = getelementptr b8, b8* %src.loop, i64 1
  %dst.next = getelementptr b8, b8* %dst.loop, i64 1
  %done = icmp eq b8* %dst.next, %dst.end
  br i1 %done, label %out, label %loop

out:
  ret void
}

Looking at the concrete choices made in the proposal, I disagree with some of them.

Memory should not be typed. In the proposal, storing an integer always results in different memory contents than storing a pointer (regardless of its provenance), and implicitly trying to mix pointers and integers is declared to be undefined behavior. In other words, a sequence such as:

store i64 %x, i64* %p
%q = bitcast i64* %p to i8**
%y = load i8*, i8** %q

... is undefined behavior under the proposal instead of being effectively inttoptr(%x). That seems fine for C/C++, but is it going to be fine for other frontends?

The corresponding distinction between bytes-as-integers and bytes-as-pointers complicates the proposal overall, e.g. it forces them to add a bytecast instruction.

Conversely, the benefits of the distinction are unclear to me. One benefit appears to be guaranteed non-aliasing between pointer and non-pointer memory accesses, but that is a form of type-based alias analysis which in LLVM should idiomatically be done via TBAA metadata. (Update: see the addendum for another potential argument in favor of typed memory.)

So let's keep memory untyped, please.

Bitwise poison in byte values makes me really nervous due to the arbitrary deviation from how poison works in other types. I don't see any justification for it in the proposal. I can kind of see how one could be motivated by implementing memcpy with vector intrinsics operating on, for example, <8 x b32>, but a simpler solution would be to just use <32 x b8> instead. And if poison is indeed bitwise, then certainly pointer provenance would also have to be bitwise!

Finally, no design discussion is complete without a little bit of bike-shedding. I believe the name "byte" is inspired by C++'s std::byte, but given that types such as b256 are possible, this name would forever be a source of confusion. Naming is hard, and I think we should at least try to look for a better one. Let me kick off the brainstorming by suggesting we think of them as "memory content" values, because that's what they are. The types could be spelled m8, m32, etc. in IR assembly.

A variation: adding a pointer provenance type

In the llvm-dev thread, Jeroen Dobbelaere points out work being done to introduce explicit `ptr_provenance` operands on certain instructions, in service of C99's restrict keyword. I haven't properly digested this work, but it inspired the thoughts of this section.

Values of the proposed byte types have both a bit pattern and a pointer provenance. Do we really need to have both pieces of information in the same SSA value? We could instead split them up into an integer bit pattern value and a pointer provenance value with an explicit provenance type. Loads of integers could read out the provenance information stored in memory and provide it as a secondary result. Similarly, stores of integers could accept the desired provenance to be stored in memory as a secondary data operand. This would allow us to write a perfect memcpy by replacing the core load/store sequence with something like:

%ch, %provenance = load_with_provenance i8, i8* %src
store_with_provenance i8 %ch, provenance %provenance, i8* %dst

The syntax and instruction names in the example are very much straw men. Don't take them too seriously, especially because LLVM IR doesn't currently allow multiple result values.

Interestingly, this split allows the derivation of pointer provenance to follow a different path than the calculation of the pointer's bit pattern. This in turn allows us in principle to perform GVN on pointers without being conservative for alias analysis.

One of the steps in bug 37469 is not quite GVN, but morally similar. Simplifying a lot, the original program sequence:

%ch1 = load i8, i8* %p1
%ch2 = load i8, i8* %p2
%eq = icmp eq i8 %ch1, %ch2
%ch = select i1 %eq, i8 %ch1, i8 %ch2
store i8 %ch, i8* %p3

... is transformed into:

%ch2 = load i8, i8* %p2
store i8 %ch2, i8* %p3

This is correct for the bit patterns being loaded and stored, but the program also indirectly relies on pointer provenance of the data. Of course, there is no pointer provenance information being copied here because i8 only holds a bit pattern. However, with the "byte" proposal, all the i8s would be replaced by b8s, and then the transform becomes incorrect because it changes the provenance information.

If we split the proposed use of b8 into a use of i8 and explicit provenance values, the original program becomes:

%ch1, %prov1 = load_with_provenance i8, i8* %p1
%ch2, %prov2 = load_with_provenance i8, i8* %p2
%eq = icmp eq i8 %ch1, %ch2
%ch = select i1 %eq, i8 %ch1, i8 %ch2
%prov = select i1 %eq, provenance %prov1, provenance %prov2
store_with_provenance i8 %ch, provenance %prov, i8* %p3

This could be transformed into something like:

%prov1 = load_only_provenance i8* %p1
%ch2, %prov2 = load_with_provenance i8, i8* %p2
%prov = merge provenance %prov1, %prov2
store_with_provenance i8 %ch2, provenance %prov, i8* %p3

... which is just as good for code generation but loses only very little provenance information.

Aside: loop idioms

Without major changes to LLVM IR, a perfect memcpy cannot be implemented because pointer provenance information is lost.

Nevertheless, one could still define the @llvm.memcpy intrinsic to be a perfect copy. This helps memcpys in the original source program be less conservative in terms of alias analysis. However, it also makes it incorrect to replace a memcpy loop idiom with a use of @llvm.memcpy: without adding unrestrict instructions, the replacement may introduce undefined behavior; and there is no way to bound the locations where such unrestricts may be needed.

We could augment @llvm.memcpy with an immediate argument that selects its provenance behavior.

In any case, one can argue that bug 37469 is really a bug in the loop idiom recognizer. It boils down to the details of how everything is defined, and unfortunately, these weird corner cases are currently underspecified in the LangRef.

Conclusion

We started with the question of whether memcpy can be implemented in LLVM IR. The answer is a qualified Yes. It is possible, but the resulting copy is imperfect because pointer provenance information is lost. This has surprising implications which in turn happen to cause real miscompilation bugs -- although those bugs could be fixed even without a perfect memcpy.

The "byte" proposal has a certain aesthetic appeal because it fixes a real gap in the expressiveness of LLVM IR, but its software engineering cost is large and I object to some of its details. There are also alternatives to consider.

The miscompilation bugs obviously need to be fixed, but they can be fixed much less intrusively, albeit at the cost of more conservative alias analysis in the affected places. It is not clear to me whether improving alias analysis justifies the more complex solutions.

I would like to understand better how all of this interacts with the C99 restrict work. That work introduces mechanisms for explicitly talking about pointer provenance in the IR, which may allow us to kill two birds with one stone.

In any case, this is a fascinating topic and discussion, and I feel like we're only at the beginning.


Addendum: storing back previously loaded integers

(Added this section on June 14th)

Harald van Dijk on Phabricator and Ralf Jung on llvm-dev, referring to a Rust issue, explicitly and implicitly point out a curious issue with loading and storing integers.

Here is Harald's example:

define i8* @f(i8* %p) {
  %buf = alloca i8*
  %buf.i32 = bitcast i8** %buf to i32*
  store i8* %p, i8** %buf
  %i = load i32, i32* %buf.i32
  store i32 %i, i32* %buf.i32
  %q = load i8*, i8** %buf
  ret i8* %q
}

There is a pair of load/store of i32 which is fully redundant from a machine perspective and so we'd like to optimize that away, after which it becomes obvious that the function really just returns %p -- at least as far as bit patterns are concerned.

However, in a world where memory is untyped but has provenance information, this optimization is incorrect because it can introduce undefined behavior: the load/store of i32 resets the provenance information in memory to "all", so that the original function returns an unrestricted version of %p. This is no longer the case after the optimization.

There are at least two possible ways of resolving this conflict.

We could define memory to be typed, in the sense that each byte of memory remembers whether it was most recently stored as a pointer or a non-pointer. A load with the wrong type returns poison. In that case, the example above returns poison before the optimization (because %i is guaranteed to be poison). After the optimization it returns non-poison, which is an acceptable refinement, so the optimization is correct.

The alternative is to keep memory untyped and say that directly eliminating the i32 store in the example is incorrect.

We are facing a tradeoff that depends on how important that optimization is for performance.

Two observations to that end. First, the more common case of dead store elimination is one where there are multiple stores to the same address in a row, and we remove all but the last one of them. That more common optimization is unaffected by provenance issues either way.

Second, we can still perform store forwarding / peephole optimization across such load/store pairs, as long as we are careful to introduce unrestrict where needed. The example above can be optimized via store forwarding to:

define i8* @f(i8* %p) {
  %buf = alloca i8*
  %buf.i32 = bitcast i8** %buf to i32*
  store i8* %p, i8** %buf
  %i = load i32, i32* %buf.i32
  store i32 %i, i32* %buf.i32
  %q = unrestrict i8* %p
  ret i8* %q
}

We can then dead-code eliminate the bulk of the function and obtain:

define i8* @f(i8* %p) {
  %q = unrestrict i8* %p
  ret i8* %q
}

... which is as good as it can possibly get.

So, there is a good chance that preventing this particular optimization is relatively cheap in terms of code quality, and the gain in overall design simplicity may well be worth it.




[0] We could also say that the loaded pointer's provenance is magically the memory object that happens to be at the referenced memory address. Either way, provenance would become a useless no-op in most cases. For example, mem2reg would have to insert unrestrict instructions (defined later) everywhere because pointers become effectively "unrestricted" when loaded from alloca'd memory.

June 13, 2021

In an earlier article I showed how reading from VRAM with the CPU can be very slow. It turns out, however, that there are ways to make it less slow.

The key to this is instructions with non-temporal hints, in particular VMOVNTDQA. The Intel Instruction Manual says the following about this instruction:

“MOVNTDQA loads a double quadword from the source operand (second operand) to the destination operand (first operand) using a non-temporal hint if the memory source is WC (write combining) memory type. For WC memory type, the nontemporal hint may be implemented by loading a temporary internal buffer with the equivalent of an aligned cache line without filling this data to the cache. Any memory-type aliased lines in the cache will be snooped and flushed. Subsequent MOVNTDQA reads to unread portions of the WC cache line will receive data from the temporary internal buffer if data is available.” (Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2)

This sounds perfect for our VRAM and WC system memory buffers, as we typically only read 16 bytes per instruction and this allows us to read entire cachelines at a time.

It turns out that Mesa already implemented a streaming memcpy using these instructions so all we had to do was throw that into our benchmark and write a corresponding memcpy that does non-temporal stores to benchmark writing to these memory regions.
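For reference, a streaming copy along those lines looks roughly like this; it's a simplified sketch rather than Mesa's actual helper, and it assumes 16-byte-aligned pointers and a size that is a multiple of 16:

#include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128 (MOVNTDQA) */
#include <stddef.h>

static void
streaming_memcpy(void *dst, const void *src, size_t n)
{
   __m128i *d = (__m128i *)dst;
   const __m128i *s = (const __m128i *)src;

   for (size_t i = 0; i < n / 16; i++) {
      /* non-temporal load: pulls a whole WC line into an internal buffer
       * instead of the cache, so subsequent 16-byte loads of that line are cheap */
      __m128i v = _mm_stream_load_si128((__m128i *)&s[i]);
      /* non-temporal store: write-combined on the way out, bypassing the cache */
      _mm_stream_si128(&d[i], v);
   }
   _mm_sfence();   /* order the non-temporal stores before later accesses */
}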

As a reminder, we look into three allocation types that are exposed by the amdgpu Linux kernel driver:

  • VRAM. This lives on the GPU and is mapped with Uncacheable Speculative Write Combining (USWC) on the CPU. This means that accesses from the CPU are not cached, but writes can be write-combined.

  • Cacheable system memory. This is system memory that has caching enabled on the CPU, and there is cache snooping to ensure the memory is coherent between the CPU and GPU (up to the top-level caches; the GPU caches do not participate in the coherence).

  • USWC system memory. This is system memory that is mapped with Uncacheable Speculative Write Combining on the CPU. This can lead to slight performance benefits compared to cacheable system memory due to lack of cache snooping.

Furthermore, this still uses an RX 6800 XT + a 2990WX with 4-channel 3200 MT/s RAM.

method (MiB/s)               VRAM   Cacheable System Memory   USWC System Memory
read via memcpy                15                     11488                  137
write via memcpy            10028                     18249                11480
read via streaming memcpy     756                      6719                 4409
write via streaming memcpy  10550                     14737                11652

Using this memcpy implementation we get significantly better performance in uncached memory situations, 50x for VRAM and 26x for USWC system memory. If this is a significant bottleneck in your workload this can be a gamechanger. Or if you were using SDMA to avoid this hit, you might be able to do things at significantly lower latency. That said it is not at a level where it does not matter. For big copies using DMA can still be a significant win.

Note that I initially gave an explanation of why the non-temporal loads should be faster, but the increase in performance is significantly above what something that just fiddles with loading entire cachelines would achieve. I have not dug into the why of the performance increase.

DMA performance

I have been claiming DMA is faster for CPU readbacks of VRAM in both this article and the previous article on the topic. One might ask how fast DMA is then. To demonstrate this I benchmarked VRAM<->Cacheable System Memory copies using the SDMA hardware block on Radeon GPUs.

Note that there is a significant overhead per copy here due to submitting work to the GPU, so I will show results vs. copy size. The rate is measured while doing a wait after each individual copy and taking the wall-clock time, as these use cases tend to be latency-sensitive and hence batching is not too interesting.

copy size   copy from VRAM (MiB/s)   copy to VRAM (MiB/s)
4 KiB                           62                     63
16 KiB                         245                    240
64 KiB                         953                   1015
256 KiB                       3106                   3082
1 MiB                         6715                   7281
4 MiB                         9737                  11636
16 MiB                       12129                  12158
64 MiB                       13041                  12975
256 MiB                      13429                  13387

This shows that for reads DMA is faster than a normal memcpy at 4 KiB and faster than a streaming memcpy at 64 KiB. Of course one still needs to do their CPU access at that point, but at both these thresholds even with an additional CPU memcpy the total process should still be fast with DMA.

June 10, 2021

TL;DR: Tag your GPT partitions with the right, descriptive partition types, and the world will become a better place.

A number of years ago we started the Discoverable Partitions Specification which defines GPT partition type UUIDs and partition flags for the various partitions Linux systems typically deal with. Before the specification all Linux partitions usually just used the same type, basically saying "Hey, I am a Linux partition" and not much else. With this specification the GPT partition type, flags and label system becomes a lot more expressive, as it can tell you:

  1. What kind of data a partition contains (i.e. is this swap data, a file system or Verity data?)
  2. What the purpose/mount point of a partition is (i.e. is this a /home/ partition or a root file system?)
  3. What CPU architecture a partition is intended for (i.e. is this a root partition for x86-64 or for aarch64?)
  4. Shall this partition be mounted automatically? (i.e. without being specifically configured via /etc/fstab)
  5. And if so, shall it be mounted read-only?
  6. And if so, shall the file system be grown to its enclosing partition size, if smaller?
  7. Which partition contains the newer version of the same data (i.e. multiple root file systems, with different versions)

By embedding all of this information inside the GPT partition table disk images become self-descriptive: without requiring any other source of information (such as /etc/fstab) if you look at a compliant GPT disk image it is clear how an image is put together and how it should be used and mounted. This self-descriptiveness in particular breaks one philosophical weirdness of traditional Linux installations: the original source of information which file system the root file system is, typically is embedded in the root file system itself, in /etc/fstab. Thus, in a way, in order to know what the root file system is you need to know what the root file system is. 🤯 🤯 🤯

(Of course, the way this recursion is traditionally broken up is by then copying the root file system information from /etc/fstab into the boot loader configuration, resulting in a situation where the primary source of information for this — i.e. /etc/fstab — is actually mostly irrelevant, and the secondary source — i.e. the copy in the boot loader — becomes the configuration that actually matters.)

Today, the GPT partition type UUIDs defined by the specification have been adopted quite widely, by distributions and their installers, as well as a variety of partitioning tools and other tools.

In this article I want to highlight how the various tools the systemd project provides make use of the concepts the specification introduces.

But before we start with that, let's underline why tagging partitions with these descriptive partition type UUIDs (and the associated partition flags) is a good thing, besides the philosophical points made above.

  1. Simplicity: in particular OS installers become simpler — adjusting /etc/fstab as part of the installation is not necessary anymore, as the partitioning step already put all information into place for assembling the system properly at boot. i.e. installing doesn't mean that you always have to get fdisk and /etc/fstab into place, the former suffices entirely.

  2. Robustness: since partition tables mostly remain static after installation the chance of corruption is much lower than if the data is stored in file systems (e.g. in /etc/fstab). Moreover by associating the metadata directly with the objects it describes the chance of things getting out of sync is reduced. (i.e. if you lose /etc/fstab, or forget to rerun your initrd builder you still know what a partition is supposed to be just by looking at it.)

  3. Programmability: if partitions are self-descriptive it's much easier to automatically process them with various tools. In fact, this blog story is mostly about that: various systemd tools can naturally process disk images prepared like this.

  4. Alternative entry points: on traditional disk images, the boot loader needs to be told which kernel command line option root= to use, which then provides access to the root file system, where /etc/fstab is then found which describes the rest of the file systems. Where precisely root= is configured for the boot loader highly depends on the boot loader and distribution used, and is typically encoded in a Turing complete programming language (Grub…). This makes it very hard to automatically determine the right root file system to use, to implement alternative entry points to the system. By alternative entry points I mean other ways to boot the disk image, specifically for running it as a systemd-nspawn container — but this extends to other mechanisms where the boot loader may be bypassed to boot up the system, for example qemu when configured without a boot loader.

  5. User friendliness: it's simply a lot nicer for the user looking at a partition table if the partition table explains what is what, instead of just saying "Hey, this is a Linux partition!" and nothing else.

Uses for the concept

Now that we cleared up the Why?, lets have a closer look how this is currently used and exposed in systemd's various components.

Use #1: Running a disk image in a container

If a disk image follows the Discoverable Partition Specification then systemd-nspawn has all it needs to just boot it up. Specifically, if you have a GPT disk image in a file foobar.raw and you want to boot it up in a container, just run systemd-nspawn -i foobar.raw -b, and that's it (you can specify a block device like /dev/sdb too if you like). It becomes easy and natural to prepare disk images that can be booted either on a physical machine, inside a virtual machine manager or inside such a container manager: the necessary meta-information is included in the image, easily accessible before actually looking into its file systems.

Use #2: Booting an OS image on bare-metal without /etc/fstab or kernel command line root=

If a disk image follows the specification in many cases you can remove /etc/fstab (or never even install it) — as the basic information needed is already included in the partition table. The systemd-gpt-auto-generator logic implements automatic discovery of the root file system as well as all auxiliary file systems. (Note that the former requires an initrd that uses systemd, some more conservative distributions do not support that yet, unfortunately). Effectively this means you can boot up a kernel/initrd with an entirely empty kernel command line, and the initrd will automatically find the root file system (by looking for a suitably marked partition on the same drive the EFI System Partition was found on).

(Note, if /etc/fstab or root= exist and contain relevant information they always take precedence over the automatic logic. This is in particular useful to tweak things by specifying additional mount options and such.)

Use #3: Mounting a complex disk image for introspection or manipulation

The systemd-dissect tool may be used to introspect and manipulate OS disk images that implement the specification. If you pass the path to a disk image (or block device) it will extract various bits of useful information from the image (e.g. what OS is this? what partitions to mount?) and display it.

With the --mount switch a disk image (or block device) can be mounted to some location. This is useful for looking at what is inside it, or changing its contents. This will dissect the image and then automatically mount all contained file systems matching their GPT partition description to the right places, so that you could subsequently chroot into it. (But why chroot if you can just use systemd-nspawn? 😎)

Use #4: Copying files in and out of a disk image

The systemd-dissect tool also has two switches --copy-from and --copy-to which allow copying files out of or into a compliant disk image, taking all included file systems and the resulting mount hierarchy into account.

Use #5: Running services directly off a disk image

The RootImage= setting in service unit files accepts paths to compliant disk images (or block device nodes), and can mount them automatically, running service binaries directly off them (in chroot() style). In fact, this is the base for the Portable Service concept of systemd.

Use #6: Provisioning disk images

systemd provides various tools that can run operations provisioning disk images in an "offline" mode. Specifically:

systemd-tmpfiles

With the --image= switch systemd-tmpfiles can directly operate on a disk image, and for example create all directories and other inodes defined in its declarative configuration files included in the image. This can be useful for example to set up the /var/ or /etc/ tree according to such configuration before first boot.

systemd-sysusers

Similarly, the --image= switch of systemd-sysusers tells the tool to read the declarative system user specifications included in the image and synthesize system users from them, writing them to the /etc/passwd (and related) files in the image. This is useful for provisioning these users before the first boot, for example to ensure UID/GID numbers are pre-allocated, and such allocations are not delayed until first boot.

systemd-machine-id-setup

The --image= switch of systemd-machine-id-setup may be used to provision a fresh machine ID into /etc/machine-id of a disk image, before first boot.

systemd-firstboot

The --image= switch of systemd-firstboot may be used to set various basic system settings (such as root password, locale information, hostname, …) on the specified disk image, before booting it up.

Use #7: Extracting log information

The journalctl switch --image= may be used to show the journal log data included in a disk image (or, as usual, the specified block device). This is very useful for analyzing failed systems offline, as it gives direct access to the logs without any further, manual analysis.

Use #8: Automatic repartitioning/growing of file systems

The systemd-repart tool may be used to repartition a disk or image in a declarative and additive way. One primary use-case for it is to run during boot on physical or VM systems to grow the root file system to the disk size, or to add in, format, encrypt and populate additional partitions at boot.

With its --image= switch the tool may operate on compliant disk images in an offline mode of operation: it will then read the partition definitions that shall be grown or created off the image itself, and then apply them to the image. This is particularly useful in combination with the --size= switch, which allows growing disk images to the specified size.

Specifically, consider the following work-flow: you download a minimized disk image foobar.raw that contains only the minimized root file system (and maybe an ESP, if you want to boot it on bare-metal, too). You then run systemd-repart --image=foobar.raw --size=15G to enlarge the image to 15G, based on the declarative rules defined in the repart.d/ drop-in files included in the image (this means it can grow the root partition, and/or add in more partitions, for example for /srv or so, maybe encrypted with a locally generated key). Then you proceed to boot it up with systemd-nspawn --image=foobar.raw -b, making use of the full 15G.

Versioning + Multi-Arch

Disk images implementing this specifications can carry OS executables in one of three ways:

  1. Only a root file system

  2. Only a /usr/ file system (in which case the root file system is automatically picked as tmpfs).

  3. Both a root and a /usr/ file system (in which case the two are combined: the /usr/ file system is mounted into the root file system, possibly in read-only fashion)

They may also contain OS executables for different architectures, permitting "multi-arch" disk images that can safely boot up on multiple CPU architectures. As the root and /usr/ partition type UUIDs are specific to architectures, this is easily done by including one such partition for x86-64 and another for aarch64. If the image is then used on an x86-64 system the former partition is automatically used, on aarch64 the latter.

Moreover, these OS executables may be contained in different versions, to implement a simple versioning scheme: when tools such as systemd-nspawn or systemd-gpt-auto-generator dissect a disk image, and they find two or more root or /usr/ partitions of the same type UUID, they will automatically pick the one whose GPT partition label (a 36 character free-form string every GPT partition may have) is the newest according to strverscmp() (OK, truth be told, we don't use strverscmp() as-is, but a modified version with some more modern syntax and semantics, but conceptually identical).
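As a toy illustration of that "pick the newest" rule (using plain strverscmp() rather than systemd's modified comparison, and with made-up labels):

#define _GNU_SOURCE
#include <string.h>

/* Toy illustration, not systemd's code: given two partition labels,
 * pick the "newer" one by version-aware comparison. */
static const char *
newest_label(const char *a, const char *b)
{
   return strverscmp(a, b) >= 0 ? a : b;
}

/* newest_label("usr-x86-64-7.3", "usr-x86-64-7.10") picks the 7.10 label,
 * because strverscmp() compares embedded digit runs numerically. */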

This logic makes it possible to implement a very simple and natural A/B update scheme: an updater can drop multiple versions of the OS into separate root or /usr/ partitions, always updating the partition label to the version included therein once the download is complete. All of the tools described here will then honour this, and always automatically pick the newest version of the OS.

Verity

When building modern OS appliances, security is highly relevant. Specifically, offline security matters: an attacker with physical access should have a difficult time modifying the OS in a way that isn't noticed. i.e. think of a car or a cell network base station: these appliances are usually parked/deployed in environments attackers can get physical access to. It's essential that in this case the OS itself is sufficiently protected, so that the attacker cannot just mount the OS file system image, make modifications (inserting a backdoor, spying software or similar) and have the system otherwise continue to run without this being immediately detected.

A great way to implement offline security is via Linux' dm-verity subsystem: it allows securely binding immutable disk IO to a single, short trusted hash value: if an attacker manages to modify the disk image offline, the modified disk image won't match the trusted hash anymore, and will not be trusted anymore (depending on policy this then just results in IO errors being generated, or an automatic reboot/power-off).

The Discoverable Partitions Specification declares how to include Verity validation data in disk images, and how to relate them to the file systems they protect, thus making it very easy to deploy and work with such protected images. For example systemd-nspawn supports a --root-hash= switch, which accepts the Verity root hash and then will automatically assemble dm-verity with this, automatically matching up the payload and verity partitions. (Alternatively, just place a .roothash file next to the image file).

Future

The above already is a powerful tool set for working with disk images. However, there are some more areas I'd like to extend this logic to:

bootctl

Similar to the other tools mentioned above, bootctl (which is a tool to interface with the boot loader, and install/update systemd's own EFI boot loader sd-boot) should learn a --image= switch, to make installation of the boot loader on disk images easy and natural. It would automatically find the ESP and other relevant partitions in the image, and copy the boot loader binaries into them (or update them).

coredumpctl

Similar to the existing journalctl --image= logic the coredumpctl tool should also gain an --image= switch for extracting coredumps from compliant disk images. The combination of journalctl --image= and coredumpctl --image= would make it exceptionally easy to work with OS disk images of appliances and extracting logging and debugging information from them after failures.

And that's all for now. Please refer to the specification and the man pages for further details. If your distribution's installer does not yet tag the GPT partition it creates with the right GPT type UUIDs, consider asking them to do so.

Thank you for your time.

June 09, 2021

Memes

We’ve all been there. No matter how 10x someone is or feels, everyone has had a moment where abruptly they say to themselves, HOW THE FUCK DO THREADS EVEN WORK?

This may be precipitated by any number of events, including, but not limited to:

  • forgetting a lock
  • forgetting to unlock
  • missing an unlock at an early return
  • forgetting to initialize a lock
  • forgetting to spawn a thread
  • forgetting to signal a conditional
  • forgetting to initialize a conditional
  • running the test case with the wrong driver

I’m not going to say that I’ve been there recently.

I’m not going to say that it was today, nor am I going to state, on the record, that at least one existing zink-wip snapshot may or may not be affected by an issue which may or may not be on the above list.

I’m not going to say any of these things.

What I am going to do is talk about a new oom handler I’ve been working on to handle the dreaded spec@!opengl 1.1@streaming-texture-leak case from piglit.

The Case

This test is annoying in that it is effectively a test of a driver’s ability to throttle itself when an app is generating and using $infinity textures without ever explicitly triggering a flush.

In short, it’s:

for (i = 0; i < 5000; i++) {
   glGenTextures(1, &texture);
   glBindTexture(GL_TEXTURE_2D, texture);
   glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, TEX_SIZE, TEX_SIZE, 0, GL_RGBA, GL_UNSIGNED_BYTE, tex_buffer);
   piglit_draw_rect_tex(0, 0, piglit_width, piglit_height, 0, 0, 1, 1);
   glDeleteTextures(1, &texture);
}

The textures are “deleted”, yes, but because they're in use, the driver can't actually delete them at the time of the call, meaning that they can only truly be deleted once they are no longer in use by the GPU. At some iteration, this will begin to oom the GPU, and the driver will have to determine how to handle things.

The Zink Case

At present, mainline zink uses a hammer-and-nail methodology that I came up with last year: the total amount of GPU memory in use by resources in a given cmdbuf is tracked, and that amount is tracked per-context. If the in-use context memory exceeds a threshold of the total VRAM, the driver stalls, thereby freeing up all the resources that are in use so they can be recycled into new ones.

There's a number of problems with this approach, but the biggest one is that it fails to account for cases like an AAA game that just uses as much memory as it can in order to optimize performance/resolution/graphics. I discovered such a case some time ago while running Tomb Raider, and then I set out to improve things since it was costing me about 10% of my perf on the title screen.

The annoying part of this problem is that the piglit test is a very uncommon case, and it’s tricky to handle it in a way that doesn’t also impact other cases which appear similar but need to not get memory-clamped. As a result, it’s tough to really do anything based on “overall” memory usage.

In the end, what I decided on was using the per-cmdbuf memory usage counter to trigger a check for completed cmdbufs on submit, iterating over all the pending ones to check whether they’ve completed, resetting them and freeing associated resources when possible. This yields good memory reclaiming behavior for problem cases while leaving games like Tomb Raider untouched and definitely not deadlocking or anything like that.
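In rough C, the idea looks something like this; every name and field here is an illustrative stand-in rather than zink's actual code:

#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins only; not zink's real structures or API. */
struct cmdbuf {
   struct cmdbuf *next;        /* pending (submitted, not yet known-complete) list */
   uint64_t resource_mem;      /* GPU memory referenced by this cmdbuf */
};

bool cmdbuf_is_complete(struct cmdbuf *c);         /* hypothetical fence check */
void cmdbuf_reset_and_release(struct cmdbuf *c);   /* hypothetical: frees resources */

static void
maybe_reclaim_on_submit(struct cmdbuf *pending, uint64_t submitted_mem,
                        uint64_t threshold)
{
   /* only scan when the submit referenced enough memory to be worth it */
   if (submitted_mem < threshold)
      return;

   for (struct cmdbuf *c = pending; c; c = c->next) {
      /* non-blocking check: no stall if the GPU isn't done yet */
      if (cmdbuf_is_complete(c))
         cmdbuf_reset_and_release(c);
   }
}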

June 02, 2021

Remember When…

I said I’d be blogging every day about some changes? And that was a month ago or however long it’s been? And we all had a good chuckle at the idea that I could blog every day like how things used to be?

Yeah, I remember that too.

Anyway, Bas still hasn’t blogged, so let’s check the blogenda:

  • handwaving about C++ draw templates
  • some obscure vbuf thing
  • shower
  • make zink usable for gaming
  • complain about construction
  • improve shader caching
  • this week’s queue rewrite
  • some other stuff
  • suballocator?

I guess it’s that time of the week again because the schedule says it’s time to talk about this week’s (or whenever it was) major rewrite of zink’s queue handling. But first, only 90s kids will remember that time I blogged about a major queue rewrite and was excited to almost be hitting 70% of native performance.

Synchronization

A common use of GL for big games is using multiple GL contexts to parallelize work. There’s a lot of tricky restrictions for this, both on the app side and the driver side, but this is sort of the closest thing to multiple cmdbufs that GL provides.

We all recall how zink now uses a monotonic queue: upon commencing recording, each cmdbuf gets tagged with a 32bit integer id that doubles as a timeline semaphore id for fencing. The queue iterates, the cmdbuf counter increments, queue submission is done per-context in a thread, the GPU gets triangles, everyone is happy.
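
For reference, waiting on one of these ids maps onto the timeline semaphore API roughly like so (a sketch, not zink’s actual code; the semaphore and device member names are assumptions):

uint64_t value = batch_id; /* the cmdbuf's 32bit id */
VkSemaphoreWaitInfo wait_info = {
   .sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO,
   .semaphoreCount = 1,
   .pSemaphores = &screen->timeline_semaphore, /* assumed member name */
   .pValues = &value,
};
vkWaitSemaphores(screen->dev, &wait_info, UINT64_MAX);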

But how well does that work out with multiple contexts?

Pretty well, it turns out, as long as you’re using a Vulkan driver that doesn’t actually check to ensure you’re using monotonic ids for your timeline values. Let’s check out a totally hypothetical scenario that isn’t just Steam:

  • Have contexts A and B
  • Context A starts recording, gets id 1 (nonzero id club represent)
  • Context B starts recording, gets id 2
  • Context A finishes recording, submits cmdbuf
  • Context B finishes recording, submits cmdbuf
  • Timeline wait on id 1
  • Timeline wait on id 2

So far so good. But then we get past the “Checking for updates” window:

  • Context A starts recording, gets id 3
  • Context B starts recording, gets id 4
  • Context B finishes recording, submits cmdbuf
  • Context A finishes recording, submits cmdbuf
  • Timeline wait on id 3
  • Timeline wait on id 4

thonking.png

So now context B’s submit thread dumps cmdbuf 4’s triangles into the GPU, and then context A’s submit thread dumps cmdbuf 3’s triangles into the GPU, which means the timeline gets signaled with value 4 followed by value 3. The waits still happen in the order A -> B, but the signal values are no longer monotonically increasing, which the spec requires them to be.

Will any drivers care?

Magic 8-ball says no, no drivers care about this and everything still works fine. That’s cool and interesting, but probably it’d be better to not do that.

This Time It’s Definitely Fixed

The problem here is two problems:

  • the queue submission thread is context-based when it should be screen-based
  • cmdbufs get an id when they start recording, not when they get submitted

The first problem is easy to fix: just deduplicate the thread and move the struct member.

The second one is trickier because everything in zink relies on cmdbufs getting an id as soon as they become active. This is done so that any resources written to by a given cmdbuf can have their usage tracked for synchronization purposes, e.g., reading back a buffer only after all its writes have landed.

The problem is further complicated by zink not having a great API barrier between a resource’s “usage” and the value underneath it: parts of the codebase read the integer value directly instead of going through a wrapper API. Having such an API would let me replace the mechanism with whatever I wanted, so I decided to start by creating one based on this:

struct zink_batch_usage {
   uint32_t usage;
   bool unflushed;
};

This is the existing struct zink_batch_usage but now with a bool value indicating that this cmdbuf is yet to be flushed. Each cmdbuf batch now has this sub-struct inlined onto it, and resources in zink can take references (pointers) to a specific cmdbuf’s usage struct. Because batches are never destroyed, this means the wrapper API can always dereference the struct to determine how to synchronize the usage: if it’s unflushed, it can flush or sync the flush thread; if it’s real, pending usage, it can safely wait on that usage as a timeline value and guarantee monotonic ordering.

bool
zink_screen_usage_check_completion(struct zink_screen *screen, const struct zink_batch_usage *u)
{
   if (!zink_batch_usage_exists(u))
      return true;
   if (zink_batch_usage_is_unflushed(u))
      return false;

   return zink_screen_batch_id_wait(screen, u->usage, 0);
}

bool
zink_batch_usage_check_completion(struct zink_context *ctx, const struct zink_batch_usage *u)
{
   if (!zink_batch_usage_exists(u))
      return true;
   if (zink_batch_usage_is_unflushed(u))
      return false;
   return zink_check_batch_completion(ctx, u->usage);
}

void
zink_batch_usage_wait(struct zink_context *ctx, const struct zink_batch_usage *u)
{
   if (!zink_batch_usage_exists(u))
      return;
   if (zink_batch_usage_is_unflushed(u))
      zink_fence_wait(&ctx->base);
   else
      zink_wait_on_batch(ctx, u->usage);
}
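
For completeness, the state transitions behind those checks look roughly like this (hypothetical helper names, not the actual zink functions):

/* when a cmdbuf starts recording: it exists, but has no timeline id yet */
static void
batch_usage_begin_recording(struct zink_batch_usage *u)
{
   u->unflushed = true;
}

/* when the cmdbuf is actually submitted: ids are handed out in submission
 * order, so waiting on u->usage as a timeline value is now monotonic */
static void
batch_usage_on_submit(struct zink_batch_usage *u, uint32_t timeline_id)
{
   u->usage = timeline_id;
   u->unflushed = false;
}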

Now things render exactly the same, but with a truly monotonic queue underneath that’s conformant to specifications.

June 01, 2021

As the President of the GNOME Foundation Board of Directors, I’m really pleased to see the number and breadth of candidates we have for this year’s election. Thank you to everyone who has submitted their candidacy and volunteered their time to support the Foundation. Allan has recently blogged about how the board has been evolving, and I wanted to follow that post by talking about where the GNOME Foundation is in terms of its strategy. This may be helpful as people consider which candidates might bring the best skills to shape the Foundation’s next steps.

Around three years ago, the Foundation received a number of generous donations, and Rosanna (Director of Operations) gave a presentation at GUADEC about her and Neil’s (Executive Director, essentially the CEO of the Foundation) plans to use these funds to transform the Foundation. We would grow our activities, increasing the pace of events, outreach, development and infrastructure that supported the GNOME project and the wider desktop ecosystem – and, crucially, would grow our funding to match this increased level of activity.

I think it’s fair to say that half of this has been a great success – we’ve got a larger staff team than GNOME has ever had before. We’ve widened the GNOME software ecosystem to include related apps and projects under the GNOME Circle banner, we’ve helped get GTK 4 out of the door, run a wider-reaching program in the Community Engagement Challenge, and consistently supported better infrastructure for both GNOME and the Linux app community in Flathub.

Aside from another grant from Endless (note: my employer), our fundraising hasn’t caught up with this pace of activities. As a result, the Board recently approved a budget for this financial year which will spend more funds from our reserves than we expect to raise in income. Due to our reserves policy, this is essentially the last time we can do this: over the next 6-12 months we need to either raise more money, or start spending less.

For clarity – the Foundation is fit and well from a financial perspective – we have a very healthy bank balance, and a very conservative “12 month run rate” reserve policy to handle fluctuations in income. If we do have to slow down some of our activities, we will return to a “steady state” where our regular individual donations and corporate contributions can support a smaller staff team that supports the events and infrastructure we’ve come to rely on.

However, this isn’t what the Board wants to do – the previous and current boards were unanimous in their support of the idea that we should be ambitious: try to do more in the world and bring the benefits of GNOME to more people. We want to take our message of trusted, affordable and accessible computing to the wider world.

Typically, a lot of the activities of the Foundation have been very inwards-facing – supporting and engaging with either the existing GNOME or Open Source communities. This is a very restricted audience in terms of fundraising – many corporate actors in our community already support GNOME hugely in terms of both financial and in-kind contributions, and many OSS users are already supporters either through volunteer contributions or donating to those nonprofits that they feel are most relevant and important to them.

To raise funds from new sources, the Foundation needs to take the message and ideals of GNOME and Open Source software to new, wider audiences that we can help. We’ve been developing themes such as affordability, privacy/trust and education as promising areas for new programs that broaden our impact. The goal is to find projects and funding that allow us to both invest in the GNOME community and find new ways for FOSS to benefit people who aren’t already in our community.

Bringing it back to the election, I’d like to make clear that I see this – reaching the outside world, and finding funding to support that – as the main priority and responsibility of the Board for the next term. GNOME Foundation elections are a slightly unusual process in that board nominees are “filtered” by the requirement to be existing Foundation members, which means that candidates already work inside our community when they stand for election. If you’re a candidate and are already active in the community – THANK YOU – you’re doing great work, keep doing it! That said, you don’t need to be a Director to achieve things within our community or gain the support of the Foundation: being a community leader is already a fantastic and important role.

The Foundation really needs support from the Board to make a success of the next 12-18 months. We need to understand our financial situation and the trade-offs we have to make, and help to define the strategy with the Executive Director so that we can launch some new programs that will broaden our impact – and funding – for the future. As people cast their votes, I’d like people to think about what kind of skills – building partnerships, commercial background, familiarity with finances, experience in nonprofit / impact spaces, etc – will help the Board make the Foundation as successful as it can be during the next term.

I Hate Construction.

Specifically when it’s right outside my house and starts at 5:30AM with heavy machinery moving around.

With this said, I’m overdue for a post, and if I don’t set a good example by continuing to blog, why would anyone else? PS. BAS IT’S TIME.

Let’s see what’s on the agenda:

  • handwaving about C++ draw templates
  • some obscure vbuf thing
  • shower
  • make zink usable for gaming
  • complain about construction
  • improve shader caching

Looks like the next thing on the list is shader caching.

The Art Of The Cache

If you’re a long-time zink connoisseur, or if you’re just a casual reader of the blog, you know that zink has a shader cache.

But did you know that it doesn’t actually do anything at present?

Indeed, it was to my chagrin that, upon diving back into my slapdash pipeline cache implementation, I discovered that it was doing absolutely nothing. And this was a different nothing than that one time I didn’t actually pass the cache back to the vulkan driver! Yes, this was the nothing of I have a cache, why am I still compiling a hundred pipelines per frame? that the occasional lucky developer runs into every now and again.

But hwhy? Who would do such a thing?

spideymeme.jpg

Past recriminations aside, how does a shader/pipeline cache work, anyway? The gist of it in most Mesa drivers is this:
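
In (hedged) code form, using the helpers from Mesa’s util/disk_cache.h, it’s roughly:

/* sketch of the common pattern; the variable names and the compile step
 * are placeholders. The key is a hash of the shader text, and the driver's
 * own build id is typically folded in when the disk_cache is created. */
cache_key key;
disk_cache_compute_key(screen->disk_cache, shader_text, strlen(shader_text), key);

size_t size = 0;
void *binary = disk_cache_get(screen->disk_cache, key, &size);
if (!binary) {
   binary = compile_shader(shader_text, &size); /* hypothetical: cache miss */
   disk_cache_put(screen->disk_cache, key, binary, size, NULL);
}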

Thus a shader gets cached based on its text representation, enabling matching shaders across programs to use the same cache entry. After noting the success of Steam’s fossilize-based single file cache, I decided to use a single file for zink’s shader cache.

Oops.

The problem in this case was that I was just jamming all the pipelines into a single file, written once at program exit, and expecting the Vulkan driver to figure things out.

But what if the program didn’t exit cleanly? Or what if the write failed for some reason?

In short, the pipeline cache was mostly being written as a big block of garbage data. Not very useful.

Next-Level Caching Technique

Clearly I needed to reeducate myself in the ways of managing a cache, something that, in my former life as a GUI expert, I did routinely, but that I could no longer comprehend now that I only speak bitfields and command buffers.

I sought out the reclusive Timothy Arceri, a well-known sage in many esoteric, arcane arts, and, as I recall it, purveyor of great wisdom such as (paraphrased because the original text has been lost to the ages): We Both Know The GLSL Compiler Code For Uniform Blocks Is Unfathomable, Why Do You Insist On Attempting To Modify It?

The answers I received from my sojourn were swift and concise:

Stop that. Fossilize caching wasn’t meant to work that way.

My thoughts whirling, confidence badly shaken, I stumbled and fell from the summit of the mountain and dashed my heretical cache implementation against the solid foundation of git rebase -i.

What had I been thinking?

It was back to the charts for me, and this time I had a number of different goals:

  • go back to multi-file caching (since it’s the only option)
  • smaller caches
  • more frequent updates
  • fully async

Turns out this wasn’t actually as hard as expected?

More Flowcharts (Fulfilling Image Quota For Graphics Blog)

Because we’re all big Vulkan adults, we do big Vulkan pipeline caches instead of wimpy OpenGL shader caches like so:
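
Something along these lines (a sketch of the approach rather than the actual zink code; disk_blob and disk_blob_size are assumed to come from the disk cache lookup):

/* seed a per-program VkPipelineCache with whatever the disk cache had */
VkPipelineCacheCreateInfo pcci = {
   .sType = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO,
   .initialDataSize = disk_blob_size, /* 0 on a cold cache */
   .pInitialData = disk_blob,
};
VkPipelineCache cache;
vkCreatePipelineCache(screen->dev, &pcci, NULL, &cache);

/* every pipeline variant for this group of shaders is created against `cache` */

/* later, from a thread: serialize it and hand the blob back to the disk cache */
size_t size = 0;
vkGetPipelineCacheData(screen->dev, cache, &size, NULL);
void *data = malloc(size);
vkGetPipelineCacheData(screen->dev, cache, &size, data);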

This has the added benefit of providing all the state variants for a given shader pipeline, saving additional lookups and ensuring that all the potential compiled pipelines are available at once. Furthermore, because there’s a (very) short delay between knowing what shaders are grouped together and needing the actual compiled pipeline, I can dump this all into a thread and handle the lookup while I update descriptors #2021 ASYNC THREADS++++++ BAYBEEEEEE.

But also, also, the one thing to absolutely not ever forget or else it’ll be really embarrassing is to ensure that you add your driver’s sha1 hash to your disk cache lookup, otherwise the whole thing explodes and @tarceri will frown down upon you.

May 24, 2021

Stop The Optimizing

I had planned to write more posts about some optimizations and whatever other cool stuff I’ve been working on.

I had planned to make more zink-wip snapshots.

I did shower; stop spamming frog emotes at me.

But I also encountered a bug so bizarre, so infuriating, so esoteric, that I need to take a bit of a victory lap now that I’ve successfully corralled it. So let’s get into a real, vintage SGC blog post like we used to have back when SGC was a good blog and dive into what it took to fix a bug that took me four full days to resolve.

The Problem

In the course of writing a suballocator, I ran zero tests, as is my way. When the coding fugue ended, I stumbled weakly to my local CI script and managed to hit the Enter key before collapsing into a restless sleep where I was chased by angry triangles. Things were different when I awoke; namely I now had a lot of failing tests.

But I fixed them, because that’s what driver developers do.

All except one, which I assumed was a flake after running it a few times and seeing no failures.

This was how I got to know the horror of dEQP-GLES3.functional.vertex_array_objects.all_attributes.

The test itself is awful to debug. It generates GL_MAX_VERTEX_ATTRIBS vertex attributes to use with the maximum number of vertex buffers and does a series of draws, verifying the results. Normal enough.

Except the attributes are completely randomized, even whether they’re enabled, so no two runs are the same.

And the bisect hit just right.

The Basics Of Problem Solving

When a new bug is found with a driver, the first question is usually “Did this used to work?” followed quickly by “When did it start?” if the first answer was yes. The reason for this is that determining exactly when a problem began and what caused it to manifest gives the developer some vague starting point for determining what is happening to cause a bug.

So it was that I embarked on a bisect to figure out why dEQP-GLES3.functional.vertex_array_objects.all_attributes was suddenly failing. But obviously I couldn’t just bisect for this test. No, no, that would be far too easy.

This test only fails if run in conjunction with a series of other tests. Thus my deqp caselist file:

dEQP-GLES3.functional.vertex_array_objects.all_attributes
dEQP-GLES3.functional.vertex_arrays.single_attribute.first.byte.first6_offset1_stride2_quads256
dEQP-GLES3.functional.vertex_arrays.single_attribute.output_types.int.components2_ivec4_quads256
dEQP-GLES3.functional.vertex_arrays.single_attribute.usages.static_copy.stride32_fixed_quads1

The problem test always runs last, so something was clearly going on over time that was causing the failure. Armed with this knowledge, and so sure that this would end up being some trivial one-liner that I could fix in a few minutes, I set up my startpoint and endpoint for the bisect and went to work.

Zink By Bisection

Generally speaking, I assume every bug I find is going to be a zink bug. Just by the numbers, that’s usually the case, but then also it’s just always the case. It was therefore no surprise that my bisect landed on a certain commit:

commit 6b13e7cede95504ce8309744d8b9d83c7dbab7c9
Author: Mike Blumenkrantz <michael.blumenkrantz@gmail.com>
Date:   Mon May 17 08:44:02 2021 -0400

    try better map flags

diff --git a/src/gallium/drivers/zink/zink_resource.c b/src/gallium/drivers/zink/zink_resource.c
index 55f37380d9f..121f6f0076e 100644
--- a/src/gallium/drivers/zink/zink_resource.c
+++ b/src/gallium/drivers/zink/zink_resource.c
@@ -1201,7 +1201,7 @@ buffer_transfer_map(struct zink_context *ctx, struct zink_resource *res, unsigne
          /* At this point, the buffer is always idle (we checked it above). */
          usage |= PIPE_MAP_UNSYNCHRONIZED;
       }
-   } else if ((usage & PIPE_MAP_READ) && !(usage & PIPE_MAP_PERSISTENT)) {
+   } else if (((usage & PIPE_MAP_READ) && !(usage & PIPE_MAP_PERSISTENT)) || !res->obj->host_visible) {
       assert(!(usage & (TC_TRANSFER_MAP_THREADED_UNSYNC | PIPE_MAP_THREAD_SAFE)));
       if (usage & PIPE_MAP_DONTBLOCK) {
          /* sparse/device-local will always need to wait since it has to copy */
@@ -1209,7 +1209,7 @@ buffer_transfer_map(struct zink_context *ctx, struct zink_resource *res, unsigne
             return NULL;
          if (!zink_resource_usage_check_completion(ctx, res, ZINK_RESOURCE_ACCESS_WRITE))
             return NULL;
-      } else if (!res->obj->host_visible) {
+      } else if (!res->obj->host_visible || res->base.b.usage != PIPE_USAGE_STAGING) {
          trans->staging_res = pipe_buffer_create(&screen->base, PIPE_BIND_LINEAR, PIPE_USAGE_STAGING, box->x + box->width);
          if (!trans->staging_res)
             return NULL;
@@ -1218,8 +1218,12 @@ buffer_transfer_map(struct zink_context *ctx, struct zink_resource *res, unsigne
          zink_copy_buffer(ctx, NULL, staging_res, res, box->x, box->x, box->width);
          res = staging_res;
          zink_fence_wait(&ctx->base);
-      } else
-         zink_resource_usage_wait(ctx, res, ZINK_RESOURCE_ACCESS_WRITE);
+      } else {
+         if (!(usage & PIPE_MAP_WRITE))
+            zink_resource_usage_wait(ctx, res, ZINK_RESOURCE_ACCESS_WRITE);
+         else
+            zink_resource_usage_wait(ctx, res, ZINK_RESOURCE_ACCESS_RW);
+      }
    }
 
    if (!ptr) {

As clearly explained by my laconic commit log, this patch aims to improve non-persistent buffer mappings by forcing non-staging resources to use a snooped staging resource. For more details on why this is desirable, check out this encyclopedia of wisdom on the topic, written by RADV co-founder and Commander Of The Rays, Bas Nieuwenhuizen.

But somehow this small patch was breaking the test, so I set out to investigate.

Isolation

Once a problem area is identified, it’s usually helpful to try and isolate the exact hunks of a patch which cause the problem. In this case, I had three distinct and only vaguely-related hunks, so it was an ideal case for this strategy. The middle hunk ended up being the culprit:

@@ -1209,7 +1209,7 @@ buffer_transfer_map(struct zink_context *ctx, struct zink_resource *res, unsigne
             return NULL;
          if (!zink_resource_usage_check_completion(ctx, res, ZINK_RESOURCE_ACCESS_WRITE))
             return NULL;
-      } else if (!res->obj->host_visible) {
+      } else if (!res->obj->host_visible || res->base.b.usage != PIPE_USAGE_STAGING) {
          trans->staging_res = pipe_buffer_create(&screen->base, PIPE_BIND_LINEAR, PIPE_USAGE_STAGING, box->x + box->width);
          if (!trans->staging_res)
             return NULL;

It seemed a bit odd to me, but nothing that stood out as impossible; perhaps there was some manner of issue with buffer copying offsets for setting up the staging resource, or some synchronization issue, or whatever. There were options, and I now knew that the problem was caused by setting up a staging buffer. Further printfs revealed that this conditional was only hit for read access, so it was now narrowed down even further.

Initial Testing

Was it a buffer offset problem with copying the data for the staging resource?

Well.

No.

As interesting as it would’ve been for that to have been the case, there’s zero chance that this one test case was invoking a magical offset that wasn’t also triggered in other cases. If the general buffer copying code here was broken, it was probably broken everywhere in zink, so there would’ve been many, many more failures. There was only this one case, however, and deeper investigation confirmed this, as I directly mapped both buffers and compared the data ranges, which matched.

Synchronization it was, then, and I can hear the disembodied voice of Dave Airlie now shouting “Barriers!” before vanishing off into the ether.

First, I tried adding more GPU stalls. Like, lots more. Like, so many that the test took minutes to complete. There was no change, however. Just for hahas, I even added some usleep calls around.

Still nothing.

At this point I was seriously stumped. By now I’d fully instrumented all of the buffer access codepaths with asserts to verify the mapped contents matched the real buffer contents in all cases, and none of the asserts were ever hit.

But if it wasn’t actually an issue with synchronizing the staging buffer, what could it be?

I decided to check the test with ANV at this point, since I always run CTS against lavapipe to avoid killing my session in case I’ve foolishly added some code which triggers hangs, and…

And the test passed with ANV.

confused_nick_young.jpg

This was a real thinker, so I went to get a second opinion from Bas and RADV. RADV told me that ANV didn’t know what it was talking about, and the test was definitely failing, so I went with that answer because it seemed more sane.

As a final idea, I did the truly unthinkable: I threw in a malloc call, allocated some host memory, and copied the map contents directly into that buffer.

And leaked it.

Yes, I know, I know, We Don’t Do That, but it was just this one time. Just a little bit. Just to see if I could valg—Of course valgrind crashes when running anything in lavapipe due to unimplemented instructions, so why did I bother?

Getting Deeper

There comes a time when saying We Need To Go Deeper isn’t just a meme. That time was now, and I was about to get, as they say in technical terms when such depth is approached, deep as fuck.

Continuing to experiment with my memory leaking, the conditional block in question had by now degenerated into spaghetti:

         trans->staging_res = pipe_buffer_create(&screen->base, PIPE_BIND_LINEAR, PIPE_USAGE_STAGING, box->x + box->width);
         if (!trans->staging_res)
            return NULL;
         struct zink_resource *staging_res = zink_resource(trans->staging_res);
         trans->offset = staging_res->obj->offset;
         uint8_t *p = map_resource(screen, res);
         trans->data = malloc(box->x + box->width);
         trans->data2 = malloc(box->x + box->width);
         memset(trans->data, 0, box->x + box->width);
         memset(trans->data2, 0, box->x + box->width);
         memcpy(trans->data, p, box->x + box->width);
         memcpy(trans->data2, p, box->x + box->width);
         printf("SIZE NEEDED %u\n", box->x + box->width);
      for (unsigned i = 0; i < box->x + box->width; i++) {
         uint8_t *map = res->obj->map;
         assert(trans->data[i] == trans->data2[i]);
         assert(map[i] == trans->data2[i]);
         printf("MAP[%u] = %u\n", i, trans->data2[i]);
      }
         //zink_copy_buffer(ctx, NULL, staging_res, res, box->x, box->x, MIN2(box->width + 4, res->base.b.width0-box->x));
         //zink_copy_buffer(ctx, NULL, staging_res, res, box->x, box->x, box->width);
         ptr = trans->data;

         res = staging_res;
         zink_fence_wait(&ctx->base);

Obviously I’m gonna double buffer my memory leak so I can verify that it’s not secretly being modified on unmap (it wasn’t), and then also verify that the data matches before returning the pointer. And print it all, of course, because if you can actually read your terminal when you reach this sort of depth in the course of a debugging session, probably you’re doing it wrong.

But the time had come to start applying hacks elsewhere: namely the test itself. Being a random test case made it impossible to figure out what was going on between runs, but I’d determined one thing of interest: no matter what, unless I returned the direct mapping for the buffer, the test failed.

Let’s see what Mr. Crowbar had to say about that though when I applied him to the CTS case:

diff --git a/modules/gles3/functional/es3fVertexArrayObjectTests.cpp b/modules/gles3/functional/es3fVertexArrayObjectTests.cpp
index 82578b1ce..e231c4b1a 100644
--- a/modules/gles3/functional/es3fVertexArrayObjectTests.cpp
+++ b/modules/gles3/functional/es3fVertexArrayObjectTests.cpp
@@ -765,14 +765,14 @@ void MultiVertexArrayObjectTest::init (void)
 		m_spec.buffers.push_back(shortCoordBuffer48);
 
 		m_spec.state.attributes.push_back(Attribute());
-		m_spec.state.attributes[attribNdx].enabled		= (m_random.getInt(0, 4) == 0) ? GL_FALSE : GL_TRUE;
-		m_spec.state.attributes[attribNdx].size			= m_random.getInt(2,4);
-		m_spec.state.attributes[attribNdx].stride		= 2*m_random.getInt(1, 3);
+		m_spec.state.attributes[attribNdx].enabled		= GL_TRUE;
+		m_spec.state.attributes[attribNdx].size			= (attribNdx % 2) + 2;
+		m_spec.state.attributes[attribNdx].stride	= 2 * ((attribNdx % 2) + 1);
 		m_spec.state.attributes[attribNdx].type			= GL_SHORT;
-		m_spec.state.attributes[attribNdx].integer		= m_random.getBool();
-		m_spec.state.attributes[attribNdx].divisor		= m_random.getInt(0, 1);
-		m_spec.state.attributes[attribNdx].offset		= 2*m_random.getInt(0, 2);
-		m_spec.state.attributes[attribNdx].normalized	= m_random.getBool();
+		m_spec.state.attributes[attribNdx].integer		= attribNdx % 3 == 1;
+		m_spec.state.attributes[attribNdx].divisor		= 0;
+		m_spec.state.attributes[attribNdx].offset		= attribNdx % 5;
+		m_spec.state.attributes[attribNdx].normalized	= attribNdx % 3 == 1;
 		m_spec.state.attributes[attribNdx].bufferNdx	= attribNdx+1;
 
 		if (attribNdx == 0)
@@ -783,14 +783,14 @@ void MultiVertexArrayObjectTest::init (void)
 		}
 
 		m_spec.vao.attributes.push_back(Attribute());
-		m_spec.vao.attributes[attribNdx].enabled		= (m_random.getInt(0, 4) == 0) ? GL_FALSE : GL_TRUE;
-		m_spec.vao.attributes[attribNdx].size			= m_random.getInt(2,4);
-		m_spec.vao.attributes[attribNdx].stride			= 2*m_random.getInt(1, 3);
+		m_spec.vao.attributes[attribNdx].enabled		= GL_TRUE;
+		m_spec.vao.attributes[attribNdx].size			= (attribNdx % 2) + 2;
+		m_spec.vao.attributes[attribNdx].stride			= 2 * ((attribNdx % 2) + 1);
 		m_spec.vao.attributes[attribNdx].type			= GL_SHORT;
-		m_spec.vao.attributes[attribNdx].integer		= m_random.getBool();
-		m_spec.vao.attributes[attribNdx].divisor		= m_random.getInt(0, 1);
-		m_spec.vao.attributes[attribNdx].offset			= 2*m_random.getInt(0, 2);
-		m_spec.vao.attributes[attribNdx].normalized		= m_random.getBool();
+		m_spec.vao.attributes[attribNdx].integer		= attribNdx % 3 == 1;
+		m_spec.vao.attributes[attribNdx].divisor		= 0;
+		m_spec.vao.attributes[attribNdx].offset			= attribNdx % 5;
+		m_spec.vao.attributes[attribNdx].normalized		= attribNdx % 3 == 1;
 		m_spec.vao.attributes[attribNdx].bufferNdx		= attribCount - attribNdx;
 
 		if (attribNdx == 0)

Now I had a consistently failing test (as long as I ran it with the other test cases so it didn’t feel too lonely and accidentally pass) with consistent data, and I was dumping it all to logs that I could compare if I returned the direct pointer for the map to legitimately pass the test.

Naturally the output data that I was printing matched. It’d be pretty weird if it didn’t considering all the asserts that I had in the code, right? Hah, yeah, that’d be… That’d be pretty weird, all right…

The Forgotten Depths

By this point I had determined that it was a specific range of buffer mappings causing the problem, specifically those sized between 50 and 100 bytes. I also knew that these buffers were being mapped by u_vbuf, also known colloquially as the hinterlands of Gallium, an obscure component used to handle translating unsupported vertex buffer formats.

Veteran Mesa developers reading along are going full sensiblechuckle.gif right now, but I’ll request that we continue our no spoiler policy.

If the buffer contents were the same as the mapped contents but the test was still failing, then there had to be a reason for that. I fumbled my way over to the vertex attribute translator and fingerpainted in a printf to dump the translated vertex attributes. This enabled me to diff between a good run and a bad run.

It was then that I made a bewildering discovery.

Any time I had a 96-byte buffer map, the attributes starting at offset 92 didn’t match in cases when the test failed.

This was another thinker, so I decided to enhance my memory leaks a bit to copy more buffer since this was all 4096-aligned and it wasn’t like I was going to be copying out of bounds. This was when things started to get really weird.

Returning a copy of the requested 96 bytes of the buffer failed the test, but returning 100 bytes passed it.

Uh-oh.

Now that I took a closer look at those vertex attribs, I realized that the ones which were failing were the ones being read from bytes 96 and 97 of a buffer which only had 96 bytes mapped, meaning that only the range [0..95] was valid…

At Last

Resolution. What I had tripped over was a buffer overrun, one that was undetectable through normal means because of reasons like:

  • this is a GPU buffer, so tools which would normally catch buffer overruns wouldn’t detect it
  • this is u_vbuf, which is code that’s generally known to work pretty well given that it’s 10+ years old and is widely used and tested
  • RadeonSI is likely the only other driver which uses the same sorts of buffer mapping optimizations, and it doesn’t use u_vbuf

Iteration on various fixes finally yielded a patch that was upstreamable; the crux of the problem here was that the stride of vertex attributes was being used to calculate the size of the region to map, but the stride only determines the number of bytes between elements, not their size. For example, if the stride was 4 bytes but the element was 8 bytes, the overrun would be 4 bytes for the last element. The solution was to calculate the offset of the last element being mapped, then add the size of the element using the attribute’s format block size, which guarantees that the last attribute won’t be truncated.
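
In code terms, the difference is roughly this (placeholder names, not the exact upstream patch):

/* broken: assumes an element is never larger than the stride */
unsigned bad_size  = num_vertices * stride;

/* fixed: end the mapping at the end of the last element actually read,
 * using the vertex format's block size rather than the stride */
unsigned good_size = (num_vertices - 1) * stride +
                     util_format_get_blocksize(velem_format);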

Fuck that bug.

May 20, 2021

So we are looking to hire quite a few people into the Desktop team currently. First of all we are looking to hire two graphics engineers to help us work on Linux Graphics drivers. The first of those two jobs is now online on the Red Hat jobs site. This is a job in our core graphics team focusing on RHEL, Fedora and upstream around the Intel, AMD and NVidia open source drivers. This is an opportunity to join a team of incredibly talented engineers working on everything from the graphics system of the Linux kernel to the userspace bits like Vulkan, OpenGL and Wayland. The job is listed as Senior Engineer, but for the right candidate we have flexibility there. We also have flexibility for people who want to work remotely, so as long as there is a Red Hat office in your home country you can work remotely for us. The second job, which we hope to have up soon, will be looking more at ARM graphics and be tied to our automotive effort, but we will be looking at the applications for either position in combination, so feel free to apply for the already listed job even if you are more interested in the second one, as we will discuss both jobs with potential candidates.

The second job we have up is for Software Engineer – GPU, Input and Multimedia, which is also for joining our Graphics team. This job is targeted at our office in Brno, Czechia and is a great entry-level position if you are interested in the field of graphics. The job listing can be found here and outlines the kind of things we want you to look at, but do expect that initially your job will be focused on helping the rest of the team manage their backlog and then grow from there.

The last job we have online now is for the automotive team, where we are looking for someone at the Senior/Principal level to join our Infotainment team, working with car makers around issues related to multimedia, helping to identify needs and gaps and then working with upstream communities to figure out how we can resolve those issues. The job is targeted at Madrid, Spain, as that is where we hope to center some of the infotainment effort and it makes things easier in terms of hardware access and similar, but for the right candidate we might be open to candidates wanting to work remotely or in another Red Hat office. You can find this job listing here.

We expect to be posting further jobs for the infotainment team within a week or two, so I will update once they are up.

May 19, 2021

Using Power For Evil

There’s no shortage of very smart people working on Mesa. One of those, aspiring benchmark-quadrupler Marek Olšák, had a novel idea some time ago: could C++ function templates be used to optimize draw dispatch in a driver?

The answer was yes, and so began what was probably five or ten minutes of furiously jamming brackets and braces into a C++ file in order to achieve the intended result. Let’s check out what’s going on here.

Setup

To start, the templates must be accessible from C, as this is what the driver is written in. The methodology here is simple: generate an array of function pointers from the templates so that the right instantiation can be looked up by indexing the array with the template values. Here’s what the code looks like:

template <chip_class GFX_VERSION, si_has_tess HAS_TESS, si_has_gs HAS_GS,
          si_has_ngg NGG, si_has_prim_discard_cs ALLOW_PRIM_DISCARD_CS>
static void si_init_draw_vbo(struct si_context *sctx)
{
   /* Prim discard CS is only useful on gfx7+ because gfx6 doesn't have async compute. */
   if (ALLOW_PRIM_DISCARD_CS && GFX_VERSION < GFX7)
      return;

   if (NGG && GFX_VERSION < GFX10)
      return;

   sctx->draw_vbo[GFX_VERSION - GFX6][HAS_TESS][HAS_GS][NGG][ALLOW_PRIM_DISCARD_CS] =
      si_draw_vbo<GFX_VERSION, HAS_TESS, HAS_GS, NGG, ALLOW_PRIM_DISCARD_CS>;
}

template <chip_class GFX_VERSION, si_has_tess HAS_TESS, si_has_gs HAS_GS>
static void si_init_draw_vbo_all_internal_options(struct si_context *sctx)
{
   si_init_draw_vbo<GFX_VERSION, HAS_TESS, HAS_GS, NGG_OFF, PRIM_DISCARD_CS_OFF>(sctx);
   si_init_draw_vbo<GFX_VERSION, HAS_TESS, HAS_GS, NGG_OFF, PRIM_DISCARD_CS_ON>(sctx);
   si_init_draw_vbo<GFX_VERSION, HAS_TESS, HAS_GS, NGG_ON, PRIM_DISCARD_CS_OFF>(sctx);
   si_init_draw_vbo<GFX_VERSION, HAS_TESS, HAS_GS, NGG_ON, PRIM_DISCARD_CS_ON>(sctx);
}

template <chip_class GFX_VERSION>
static void si_init_draw_vbo_all_pipeline_options(struct si_context *sctx)
{
   si_init_draw_vbo_all_internal_options<GFX_VERSION, TESS_OFF, GS_OFF>(sctx);
   si_init_draw_vbo_all_internal_options<GFX_VERSION, TESS_OFF, GS_ON>(sctx);
   si_init_draw_vbo_all_internal_options<GFX_VERSION, TESS_ON, GS_OFF>(sctx);
   si_init_draw_vbo_all_internal_options<GFX_VERSION, TESS_ON, GS_ON>(sctx);
}

static void si_init_draw_vbo_all_families(struct si_context *sctx)
{
   si_init_draw_vbo_all_pipeline_options<GFX6>(sctx);
   si_init_draw_vbo_all_pipeline_options<GFX7>(sctx);
   si_init_draw_vbo_all_pipeline_options<GFX8>(sctx);
   si_init_draw_vbo_all_pipeline_options<GFX9>(sctx);
   si_init_draw_vbo_all_pipeline_options<GFX10>(sctx);
   si_init_draw_vbo_all_pipeline_options<GFX10_3>(sctx);
}

static void si_invalid_draw_vbo(struct pipe_context *pipe,
                                const struct pipe_draw_info *info,
                                const struct pipe_draw_indirect_info *indirect,
                                const struct pipe_draw_start_count *draws,
                                unsigned num_draws)
{
   unreachable("vertex shader not bound");
}

extern "C"
void si_init_draw_functions(struct si_context *sctx)
{
   si_init_draw_vbo_all_families(sctx);

   /* Bind a fake draw_vbo, so that draw_vbo isn't NULL, which would skip
    * initialization of callbacks in upper layers (such as u_threaded_context).
    */
   sctx->b.draw_vbo = si_invalid_draw_vbo;
   sctx->blitter->draw_rectangle = si_draw_rectangle;

   si_init_ia_multi_vgt_param_table(sctx);
}

This calls through a series of functions, ultimately reaching si_init_draw_vbo, where the template instantiation selected by the parameters is stored into the corresponding slot of the function pointer array. Specialized functions can thus be generated based on hardware type, pipeline shader presence, and more.

Application

Once initialized, there’s an inline function used to set the current function pointer:

static inline void si_select_draw_vbo(struct si_context *sctx)
{
   sctx->b.draw_vbo = sctx->draw_vbo[sctx->chip_class - GFX6]
                                    [!!sctx->shader.tes.cso]
                                    [!!sctx->shader.gs.cso]
                                    [sctx->ngg]
                                    [si_compute_prim_discard_enabled(sctx)];
   assert(sctx->b.draw_vbo);
}

Thus the parameters are pulled directly from the context, and the function can be called whenever the draw function pointer needs to be updated, such as when new shaders are bound or primitive discard is enabled.

Result

The result is that now the draw dispatch can be fully optimized for the codepath required by the active hardware and graphics pipeline, reducing the CPU overhead and making the draw code the tiniest bit faster. For example, here’s just the top part of the templated function:

template <chip_class GFX_VERSION, si_has_tess HAS_TESS, si_has_gs HAS_GS, si_has_ngg NGG,
          si_has_prim_discard_cs ALLOW_PRIM_DISCARD_CS>
static void si_draw_vbo(struct pipe_context *ctx,
                        const struct pipe_draw_info *info,
                        unsigned drawid_offset,
                        const struct pipe_draw_indirect_info *indirect,
                        const struct pipe_draw_start_count_bias *draws,
                        unsigned num_draws)
{
   /* Keep code that uses the least number of local variables as close to the beginning
    * of this function as possible to minimize register pressure.
    *
    * It doesn't matter where we return due to invalid parameters because such cases
    * shouldn't occur in practice.
    */
   struct si_context *sctx = (struct si_context *)ctx;

   /* Recompute and re-emit the texture resource states if needed. */
   unsigned dirty_tex_counter = p_atomic_read(&sctx->screen->dirty_tex_counter);
   if (unlikely(dirty_tex_counter != sctx->last_dirty_tex_counter)) {
      sctx->last_dirty_tex_counter = dirty_tex_counter;
      sctx->framebuffer.dirty_cbufs |= ((1 << sctx->framebuffer.state.nr_cbufs) - 1);
      sctx->framebuffer.dirty_zsbuf = true;
      si_mark_atom_dirty(sctx, &sctx->atoms.s.framebuffer);
      si_update_all_texture_descriptors(sctx);
   }

   unsigned dirty_buf_counter = p_atomic_read(&sctx->screen->dirty_buf_counter);
   if (unlikely(dirty_buf_counter != sctx->last_dirty_buf_counter)) {
      sctx->last_dirty_buf_counter = dirty_buf_counter;
      /* Rebind all buffers unconditionally. */
      si_rebind_buffer(sctx, NULL);
   }

   si_decompress_textures(sctx, u_bit_consecutive(0, SI_NUM_GRAPHICS_SHADERS));
   si_need_gfx_cs_space(sctx, num_draws);

   /* If we're using a secure context, determine if cs must be secure or not */
   if (GFX_VERSION >= GFX9 && unlikely(radeon_uses_secure_bos(sctx->ws))) {
      bool secure = si_gfx_resources_check_encrypted(sctx);
      if (secure != sctx->ws->cs_is_secure(&sctx->gfx_cs)) {
         si_flush_gfx_cs(sctx, RADEON_FLUSH_ASYNC_START_NEXT_GFX_IB_NOW |
                               RADEON_FLUSH_TOGGLE_SECURE_SUBMISSION, NULL);
      }
   }

   if (HAS_TESS) {
      struct si_shader_selector *tcs = sctx->shader.tcs.cso;

      /* The rarely occuring tcs == NULL case is not optimized. */
      bool same_patch_vertices =
         GFX_VERSION >= GFX9 &&
         tcs && info->vertices_per_patch == tcs->info.base.tess.tcs_vertices_out;

      if (sctx->same_patch_vertices != same_patch_vertices) {
         sctx->same_patch_vertices = same_patch_vertices;
         sctx->do_update_shaders = true;
      }

      if (GFX_VERSION == GFX9 && sctx->screen->info.has_ls_vgpr_init_bug) {
         /* Determine whether the LS VGPR fix should be applied.
          *
          * It is only required when num input CPs > num output CPs,
          * which cannot happen with the fixed function TCS. We should
          * also update this bit when switching from TCS to fixed
          * function TCS.
          */
         bool ls_vgpr_fix =
            tcs && info->vertices_per_patch > tcs->info.base.tess.tcs_vertices_out;

         if (ls_vgpr_fix != sctx->ls_vgpr_fix) {
            sctx->ls_vgpr_fix = ls_vgpr_fix;
            sctx->do_update_shaders = true;
         }
      }

Note that the hardware version parts are templated, as is the HAS_TESS conditional, enabling it to be skipped entirely if there’s no tessellation shader active.

With techniques like this, it’s no surprise that RadeonSI is the driver to beat in performance and low overhead. The latest zink-wip snapshots include similar work, skipping considerable amounts of the draw dispatch when possible, and (hopefully) lowering the CPU overhead of the draw dispatch.

May 18, 2021

TL;DR: don't use select() + bump the RLIMIT_NOFILE soft limit to the hard limit in your modern programs.

The primary way to reference, allocate and pin runtime OS resources on Linux today is file descriptors ("fds"). Originally they were used to reference open files and directories and maybe a bit more, but today they may be used to reference almost any kind of runtime resource in Linux userspace, including open devices, memory (memfd_create(2)), timers (timerfd_create(2)) and even processes (with the new pidfd_open(2) system call). In a way, the philosophically skewed UNIX concept of "everything is a file" through the proliferation of fds actually acquires a bit of sensible meaning: "everything has a file descriptor" is certainly a much better motto to adopt.

Because of this proliferation of fds, non-trivial modern programs tend to have to deal with substantially more fds at the same time than they traditionally did. Today, you'll often encounter real-life programs that have a few thousand fds open at the same time.

As with most runtime resources on Linux, limits are enforced on file descriptors: once you hit the resource limit configured via RLIMIT_NOFILE, any attempt to allocate more is refused with the EMFILE error, until you close a couple of those you already have open.

Because fds weren't such a universal concept traditionally, the limit of RLIMIT_NOFILE used to be quite low. Specifically, when the Linux kernel first invokes userspace it still sets RLIMIT_NOFILE to a low value of 1024 (soft) and 4096 (hard). (Quick explanation: the soft limit is what matters and causes the EMFILE issues, the hard limit is a secondary limit that processes may bump their soft limit to — if they like — without requiring further privileges to do so. Bumping the limit further would require privileges however.). A limit of 1024 fds made fds a scarce resource: APIs tried to be careful with using fds, since you simply couldn't have that many of them at the same time. This resulted in some questionable coding decisions and concepts at various places: often secondary descriptors that are very similar to fds — but were not actually fds — were introduced (e.g. inotify watch descriptors), simply to avoid for them the low limits enforced on true fds. Or code tried to aggressively close fds when not absolutely needing them (e.g. ftw()/nftw()), losing the nice + stable "pinning" effect of open fds.

Worse though is that certain OS level APIs were designed having only the low limits in mind. The worst offender being the BSD/POSIX select(2) system call: it only works with fds in the numeric range of 0…1023 (aka FD_SETSIZE-1). If you have an fd outside of this range, tough luck: select() won't work, and only if you are lucky you'll detect that and can handle it somehow.
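
If you are stuck with select() for some reason, about the only robust thing you can do is check the fd before ever touching an fd_set. A minimal sketch:

#include <sys/select.h>
#include <errno.h>

/* FD_SET() on an fd >= FD_SETSIZE writes out of bounds: undefined behavior.
 * Refuse instead of corrupting memory. */
static int safe_fd_set(int fd, fd_set *set) {
        if (fd < 0 || fd >= FD_SETSIZE) {
                errno = EINVAL;
                return -1;
        }
        FD_SET(fd, set);
        return 0;
}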

Linux fds are exposed as simple integers, and for most calls it is guaranteed that the lowest unused integer is allocated for new fds. Thus, as long as the RLIMIT_NOFILE soft limit is set to 1024 everything remains compatible with select(): the resulting fds will also be below 1024. Yay. If we'd bump the soft limit above this threshold though and at some point in time an fd higher than the threshold is allocated, this fd would not be compatible with select() anymore.

Because of that, indiscriminately increasing the soft RLIMIT_NOFILE resource limit today for every userspace process is problematic: as long as there's userspace code still using select() doing so will risk triggering hard-to-handle, hard-to-debug errors all over the place.

However, given the nowadays ubiquitous use of fds for all kinds of resources (did you know, an eBPF program is an fd? and a cgroup too? and attaching an eBPF program to cgroup is another fd? …), we'd really like to raise the limit anyway. 🤔

So before we continue thinking about this problem, let's make the problem more complex (…uh, I mean… "more exciting") first. Having just one hard and one soft per-process limit on fds is boring. Let's add more limits on fds to the mix. Specifically on Linux there are two system-wide sysctls: fs.nr_open and fs.file-max. (Don't ask me why one uses a dash and the other an underscore, or why there are two of them...) On today's kernels they kinda lost their relevance. They had some originally, because fds weren't accounted by any other counter. But today, the kernel tracks fds mostly as small pieces of memory allocated on userspace requests — because that's ultimately what they are —, and thus charges them to the memory accounting done anyway.

So now, we have four limits (actually: five if you count the memory accounting) on the same kind of resource, and all of them make a resource artificially scarce that we don't want to be scarce. So what to do?

Back in systemd v240 already (i.e. 2019) we decided to do something about it. Specifically:

  • Automatically at boot we'll now bump the two sysctls to their maximum, making them effectively ineffective. This one was easy. We got rid of two pretty much redundant knobs. Nice!

  • The RLIMIT_NOFILE hard limit is bumped substantially to 512K. Yay, cheap fds! You may have an fd, and you, and you as well, everyone may have an fd!

  • But … we left the soft RLIMIT_NOFILE limit at 1024. We weren't quite ready to break all programs still using select() in 2019 yet. But it's not as bad as it might sound I think: given the hard limit is bumped every program can easily opt-in to a larger number of fds, by setting the soft limit to the hard limit early on — without requiring privileges.

So effectively, with this approach fds should be much less scarce (at least for programs that opt into that), and the limits should be much easier to configure, since there are only two knobs now one really needs to care about:

  • Configure the RLIMIT_NOFILE hard limit to the maximum number of fds you actually want to allow a process.

  • In the program code then either bump the soft to the hard limit, or not. If you do, you basically declare "I understood the problem, I promise to not use select(), drown me fds please!". If you don't then effectively everything remains as it always was.

Apparently this approach worked, since the negative feedback on the change was even scarcer than fds traditionally were (ha, fun!). We got reports from pretty much only two projects that were bitten by the change (one being a JVM implementation): they already bumped their soft limit automatically to their hard limit during program initialization, and then allocated an array with one entry per possible fd. With the new high limit this resulted in one massive allocation that traditionally was just a few K, and this caused memory checks to be hit.

Anyway, here's the take away of this blog story:

  • Don't use select() anymore in 2021. Use poll(), epoll, io_uring, …, but for heaven's sake don't use select(). It might have been all the rage in the 1990s but it doesn't scale and is simply not designed for today's programs. I wish the man page of select() would make clearer how icky it is and that there are plenty of preferable APIs.

  • If you hack on a program that potentially uses a lot of fds, add some simple code somewhere to its start-up that bumps the RLIMIT_NOFILE soft limit to the hard limit. But if you do this, you have to make sure your code (and any code that you link to from it) refrains from using select(). (Note: there's at least one glibc NSS plugin using select() internally. Given that NSS modules can end up being loaded into pretty much any process such modules should probably be considered just buggy.) A minimal sketch of this bump, and of the reset described in the next item, follows after this list.

  • If said program you hack on forks off foreign programs, make sure to reset the RLIMIT_NOFILE soft limit back to 1024 for them. Just because your program might be fine with fds >= 1024 it doesn't mean that those foreign programs might. And unfortunately RLIMIT_NOFILE is inherited down the process tree unless explicitly set.
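
Both steps are just a few lines of C. A minimal sketch (error handling kept terse):

#include <sys/resource.h>

/* opt in: raise the soft limit to the hard limit early in main(), but only
 * if nothing in the process (or the libraries it loads) uses select() */
static void bump_rlimit_nofile(void) {
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
                return;
        rl.rlim_cur = rl.rlim_max; /* no privileges required for this */
        (void) setrlimit(RLIMIT_NOFILE, &rl);
}

/* before forking off foreign programs: restore the traditional value */
static void reset_rlimit_nofile_for_child(void) {
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
                return;
        rl.rlim_cur = 1024; /* the select()-compatible soft limit */
        (void) setrlimit(RLIMIT_NOFILE, &rl);
}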

And that's all I have for today. I hope this was enlightening.

Click Play

It’s been a while.

I meant to blog. I meant to make new zink-wip snapshots. I meant to shower.

Look, none of us are perfect, and I’m just gonna get into some graphics so nobody remembers how this post started.

tombraider-suballocated.png

Boom, beautiful triangles. Look at that ultra smooth fps in mangohud. Protip: if you’re seeing weird flickering or misrenders in your app/game, try throwing mangohud in front of the zink bus to see if it fixes them.

So what has been going on for however long it’s been since the last post?

In a word: lots.

Here’s the rundown.

The Rundown

The 20210517 zink-wip snapshot is the biggest one in history. I say this with no exaggeration.

Changes since the last snapshot include:

  • an imperial units (and I measured this precisely) fuckton of general driver overhead reduction
  • (yet another) queue/dispatch rewrite, this one more optimized for threaded and multi-context use
  • an actually working disk cache implementation
  • an entire suballocator

One way or another, this is going to feel like a new driver. Ideally I’ll be doing a post every day detailing one of the items on that list, but for now I’ll close the post by saying that zink should be 100%-1000% faster (not a typo) in most scenarios where it was previously much slower than native GL drivers.

Yeah, Big Triangle knows who we are now.

May 09, 2021
This all started with a Mele PCG09. Before testing Linux on it I took a quick look under Windows, where the device-manager showed an exclamation mark next to a Realtek 8723BS bluetooth device, so BT did not work. Under Linux I quickly found out why: the device actually uses a Broadcom Wifi/BT chipset attached over SDIO and an UART for the Wifi and BT parts respectively. The UART-connected BT part was described in the ACPI tables with a HID (Hardware-ID) of "OBDA8723", not good.

Now I could have easily fixed this with an extra initrd with a DSDT-override, but that did not feel right. There was an option in the BIOS, named "WIFI", which actually controls what HID gets advertised for the Wifi/BT. It was set to "RTL8723", which obviously is wrong, but that option was grayed out. So instead of going for the DSDT-override I really wanted to be able to change that BIOS option and set it to the right value. Some duckduckgo-ing found this blogpost on changing locked BIOS settings.

The flashrom packaged in Fedora dumped the BIOS in one go, and after building UEFITool and ifrextract from source from their git repos I could extract the interface description for the BIOS Setup menus without issues (as described in the blogpost). Here is the interesting part of the IFR for changing the Wifi/BT model:


0xC521 One Of: WIFI, VarStoreInfo (VarOffset/VarName): 0x110, VarStore: 0x1, QuestionId: 0x1AB, Size: 1, Min: 0x0, Max 0x2, Step: 0x0 {05 91 53 03 54 03 AB 01 01 00 10 01 10 10 00 02 00}
0xC532 One Of Option: RTL8723, Value (8 bit): 0x1 (default) {09 07 55 03 10 00 01}
0xC539 One Of Option: AP6330, Value (8 bit): 0x2 {09 07 56 03 00 00 02}
0xC540 One Of Option: Disabled, Value (8 bit): 0x0 {09 07 01 04 00 00 00}
0xC547 End One Of {29 02}

So to fix the broken BT I need to change the byte at offset 0x110 in the "Setup" EFI variable which contains the BIOS settings from 0x01 to 0x02. Easy, one problem though, the "dd on /sys/firmware/efi/efivars/Setup-..." method described in the blogpost does not work on most devices. Most devices protect the BIOS settings from being modified this way by having 2 Setup-${GUID} EFI variables (with different GUIDs), hiding the real one leaving a fake one which is only a couple of bytes large.

But the BIOS Setup-menu itself is just another EFI executable, so how can it access the real Setup variable? The trick is that the hiding happens when the OS calls ExitBootServices() to tell EFI it is ready to take over control of the machine. This means that under Linux the real Setup EFI variable has been hidden early on during boot, but when grub is running it is still available! And there is a patch adding a new setup_var command to grub, which allows changing BIOS settings from within grub.

The original setup_var command picks the first Setup EFI variable it finds, but as mentioned already in most cases there are two, so later an improved setup_var_3 command was added which instead skips Setup EFI variables which are too small (as the fake ones are only a few bytes). After building an EFI version of grub with the setup_var* commands added, it is just a matter of booting into a grub commandline and then running "setup_var_3 0x110 2". From then on the BIOS shows the WIFI type as being AP6330, the ACPI tables now report "BCM2E67" as the HID for the BT, and just like that the bluetooth issue has been fixed.


For your convenience I've uploaded a grubia32.efi and a grubx64.efi with the setup_var patches added here. These are built from this branch at this commit (this was just a random branch which I had checked out while working on this).

The Mele PCG09 use-case for modifying hidden BIOS-settings is a bit of a corner case. Intel Bay- and Cherry-Trail SoCs come with an embedded OTG XHCI controller to allow them to function as a USB device/gadget rather than only being capable of operating as a USB host. Since most devices ship with Windows, and Windows does not really do anything useful with USB device controllers, this controller is disabled by most BIOSes and there is no visible option to enable it. The same approach from above can be used to enable the "USB OTG" option in the BIOS so that we can use this under Linux. Let's take the Teclast X89 (Windows version) tablet as an example. Extracting the IFR and then looking for the "USB OTG" function results in finding this IFR snippet:


0x9560 One Of: USB OTG Support, VarStoreInfo (VarOffset/VarName): 0xDA, VarStore: 0x1, QuestionId: 0xA5, Size: 1, Min: 0x0, Max 0x1, Step: 0x0 {05 91 DE 02 DF 02 A5 00 01 00 DA 00 10 10 00 01 00}
0x9571 Default: DefaultId: 0x0, Value (8 bit): 0x1 {5B 06 00 00 00 01}
0x9577 One Of Option: PCI mode, Value (8 bit): 0x1 {09 07 E0 02 00 00 01}
0x957E One Of Option: Disabled, Value (8 bit): 0x0 {09 07 3B 03 00 00 00}
0x9585 End One Of {29 02}

And then running "setup_var_3 0xda 1" on the grub commandline results in a new "00:16.0 USB controller: Intel Corporation Atom Processor Z36xxx/Z37xxx Series OTG USB Device" entry showing up in lspci.

Actually using this requires a kernel with UDC (USB Device Controller) support enabled as well as some USB gadget drivers; at least the Fedora kernel does not have these enabled by default. On Bay Trail devices an external device-mode USB PHY is necessary for device-mode to actually work. On a kernel with UDC enabled you can check if your hardware has such a PHY by running "cat /sys/bus/ulpi/devices/dwc3.4.auto.ulpi/modalias". If there is a PHY this will usually return "ulpi:v0451p1508". If you get "ulpi:v0000p0000" instead, then your hardware does not have a device-mode PHY and you cannot use gadget mode.

On Cherry Trail devices the device-mode PHY is built into the SoC, so on most Cherry Trail devices this just works. There is one caveat though: the x5-z83?0 Cherry Trail SoCs only have one set of USB3 superspeed data lines, and these are part of the USB data lines meant for the OTG port. So if you have a Cherry Trail device with an x5-z83?0 SoC and it has a superspeed (USB3) USB-A port, then that port is using the OTG superspeed lines. When the OTG XHCI controller is enabled and the micro-USB port gets switched to device-mode (which it also does when charging!), this will also switch the superspeed data lines to device-mode, disconnecting any superspeed USB device connected to the USB-A port. So on these devices you need to choose: you can either use the micro-USB port in device-mode, or get superspeed on the USB-A port, but not both at the same time.

If you have a kernel built with UDC support, a quick test is to run a USB-A to micro-B cable from a desktop or laptop to the tablet and then do "sudo modprobe g_serial" on the tablet. After this you should see a bunch of messages in dmesg on the desktop/laptop about a USB device showing up, ending with something like "cdc_acm 1-3:2.0: ttyACM0: USB ACM device". If you want you can run a serial console on the tablet on /dev/ttyGS0 and then connect to it on the desktop/laptop at /dev/ttyACM0.
As you may know I've been doing a lot of hw-enablement work on Bay- and Cherry-Trail tablets as a side-project for the last couple of years.

Some of these tablets have one interesting feature intended to "flash" Android on them. When turned on with both the volume-up and the volume-down buttons pressed at the same time, they enter something called DNX mode, which they will also print to the LCD panel. This is really just a variant of the Android fastboot protocol built into the BIOS. Quite a few models support this, although on Bay Trail it sometimes seems to be supported (it gets shown on the screen) but does not work, since many models which only shipped with Windows lack the external device/gadget-mode PHY which the Bay Trail SoC needs to be able to work in device/gadget mode (on Cherry Trail the gadget PHY has been integrated into the SoC).

So on to the topic of this blog post: I recently used DNX mode to unbrick a tablet which was dead because the BIOS settings had gotten corrupted in a way where it would not boot and it was also impossible to enter the BIOS setup. After some duckduckgo-ing I found a thread about how in DNX mode you can upload a replacement for the efilinux.efi bootloader normally used for "fastboot boot", and how you can use this to upload a binary to flash the BIOS. I did not have a BIOS image of this tablet, so that approach did not work for me. But it did point me in the direction of a different, safer (no BIOS flashing involved) solution to unbrick the tablet.

If you run the following 2 commands on a PC with a Bay- or Cherry-Trail device connected in DNX mode:

fastboot flash osloader some-efi-binary.efi
fastboot boot some-android-boot.img

Then the tablet will execute the some-efi-binary.efi. At first I tried getting an EFI shell this way, but this failed because the EFI binary gets passed some arguments about where in RAM it can find the some-android-boot.img. Then I tried booting a grubx64.efi file and that resulted in a grub commandline. But I had no way to interact with it, and replacing the USB connection to the PC with an OTG / USB-host cable with a keyboard attached did not result in working input.

So my next step was to build a new grubx64.efi with "-c grub.cfg" added to the commandline for the final grub2-mkimage step, embedding a grub.cfg with a single line in it: "fwsetup". This causes the tablet to reboot into its BIOS setup menu. Note that on some tablets you still will not have keyboard input if you just let the tablet sit there while it is rebooting. But during the reboot there is enough time to swap the USB cable for an OTG adapter with a keyboard attached before the reboot completes, and then you will have working keyboard input. At this point you can select "load setup defaults" and then "save and exit", and voila, the tablet works again.

For your convenience I've uploaded a grubia32.efi and a grubx64.efi with the necessary "fwsetup" grub.cfg here. These are built from this branch at this commit (this was just a random branch which I had checked out while working on this).

Note the image used for the "fastboot boot some-android-boot.img" command does not matter much, but it must be a valid Android boot.img format file, otherwise fastboot will refuse to try to boot it.
A while ago I worked on improving Logitech G15 LCD-screen support under Linux. I recently got an email from someone who wanted to add support for the LCD panel in the Logitech Z-10 speakers to lcdproc, asking me to describe the process I went through to improve G15 support in lcdproc and how I made it work without requiring the unmaintained g15daemon code.

So I wrote a long email describing the steps I went through and we both thought this could be interesting for others too, so here it is:

1. For some reason I decided that I did not have enough projects going on at the same time already and I started looking into improving support for the G15 family of keyboards.

2. I started studying the g15daemon code and did a build of it to check that it actually works. I believe making sure that you have a known-to-work codebase as a starting point, even if it is somewhat crufty and ugly, is important. This way you know you have code which speaks the right protocol and you can try to slowly morph it into what you want it to become (making small changes, testing every step). Even if you decide to mostly re-implement from scratch, you will likely use the old code as protocol documentation, and it is important to know it actually works.

3. There were a number of things which I did not like about g15daemon:

3.1 It uses libusb, taking control of the entire USB-interface used for the gaming functionality away from the kernel. 

3.2 As part of this it was not just dealing with the LCD, it was also acting as a dispatcher for G-key key-presses. IMHO the key-press handling clearly belongs in the kernel. These keys are just extra keys; all macro functionality is handled inside software on the host/PC side. So as a kernel dev I found that these really should be handled as normal keys and emit normal evdev events with KEY_FOO codes from a /dev/input/event# kernel node.

3.3 It is a daemon, rather than a library; and most other code which would deal with the LCD, such as lcdproc, was a daemon itself too, so now we would have lcdproc's LCDd talking to g15daemon to get to the LCD, which felt rather roundabout.

So I set about tackling all of these.

4. Kernel changes: I wrote a new drivers/hid/hid-lg-g15.c kernel driver:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hid/hid-lg-g15.c
Which sends KEY_MACRO1 .. KEY_MACRO18, KEY_MACRO_PRESET1 .. KEY_MACRO_PRESET3, KEY_MACRO_RECORD_START, KEY_KBD_LCD_MENU1 .. KEY_KBD_LCD_MENU4 keypresses for all the special keys. Note this requires that the kernel HID driver is left attached to the USB interface, which required changes on the g15daemon/lcdproc side.

This kernel driver also offers a /sys/class/leds/g15::kbd_backlight/ interface which allows controlling the backlight intensity through e.g. the GNOME kbd-backlight slider in the power pane of the control-center. It also offers a bunch of other LED interfaces under /sys/class/leds/ for controlling things like the LEDs under the M1 - M3 preset selection buttons. The idea here is that the kernel abstracts the USB protocol for these gaming keyboards away, so that a single userspace daemon for gaming-keyboard macro functionality can be written which relies only on the kernel interfaces and will thus work with any keyboard which has kernel support.
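As a quick illustration, the backlight can be driven from userspace with nothing but sysfs file I/O; here is a minimal C sketch (brightness and max_brightness are the standard LED-class attributes, the path is the one the driver exposes):

#include <stdio.h>

/* Minimal sketch: set the G15 keyboard backlight to half of its maximum
 * through the standard LED-class sysfs attributes exposed by hid-lg-g15. */
int main(void)
{
    const char *base = "/sys/class/leds/g15::kbd_backlight";
    char path[128];
    int max = 0;

    /* max_brightness reports the highest level the hardware supports */
    snprintf(path, sizeof(path), "%s/max_brightness", base);
    FILE *f = fopen(path, "r");
    if (!f || fscanf(f, "%d", &max) != 1)
        return 1;
    fclose(f);

    /* write the desired level to brightness */
    snprintf(path, sizeof(path), "%s/brightness", base);
    f = fopen(path, "w");
    if (!f)
        return 1;
    fprintf(f, "%d\n", max / 2);
    fclose(f);
    return 0;
}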

5. lcdproc changes

5.1 lcdproc already had a g15 driver, which talked to g15daemon. So at first I started testing with this (at this point all my kernel work would not work, since g15daemon would take full control of the USB interface, unbinding my kernel driver). I did a couple of bug-fixes / cleanups in this setup to get the code to a starting point where I could run it and it would not show any visible rendering bugs in any of the standard lcdproc screens.

5.2 I wrote a little lcdproc helper library, lib_hidraw, which can be used by lcdproc drivers to open a /dev/hidraw device for them. The main feature is that you give the lib_hidraw_open() helper a list of USB ids you are interested in, and it will then find the right /dev/hidraw# device node (which may be different every boot) and open it for you.
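One way to do this kind of matching is the HIDIOCGRAWINFO ioctl, which reports the bus type and the USB vendor/product id of a hidraw node; here is a rough sketch of the idea (not the actual lcdproc code):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/hidraw.h>

struct usb_id { unsigned short vendor, product; };

/* Sketch: scan /dev/hidraw0..15 and return an fd for the first node whose
 * USB vendor/product id matches one of the requested ids. */
int hidraw_open_by_ids(const struct usb_id *ids, int count)
{
    char path[32];

    for (int node = 0; node < 16; node++) {
        snprintf(path, sizeof(path), "/dev/hidraw%d", node);
        int fd = open(path, O_RDWR);
        if (fd < 0)
            continue;

        struct hidraw_devinfo info;
        if (ioctl(fd, HIDIOCGRAWINFO, &info) == 0) {
            for (int i = 0; i < count; i++) {
                if ((unsigned short)info.vendor == ids[i].vendor &&
                    (unsigned short)info.product == ids[i].product)
                    return fd; /* found our device */
            }
        }
        close(fd);
    }
    return -1;
}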

5.3 The actual sending of the bitmap to the LCD screen is quite simple, but it does need to be in a very specific format. The format rendering is done by libg15render. So now it was time to replace the code talking to g15daemon with code to directly talk to the LCD through a /dev/hidraw# interface. I kept the libg15render dependency since that was fine. After a bit of refactoring, the actual changeover to directly sending the data to the LCD was not that big.

5.4 The change to stop using the g15daemon meant that the g15 driver also lost support for detecting presses on the 4 buttons directly under the LCD which are meant for controlling the menu on the LCD. But now that the code is using /dev/hidraw# the kernel driver I wrote would actually work and report KEY_KBD_LCD_MENU1 .. KEY_KBD_LCD_MENU4 keypresses. So I did a bunch of cleanup / refactoring of lcdproc's linux_input driver and made it take over the reporting of those button presses.

5.5 I wanted everything to just work out of the box, so I wrote some udev rules which automatically generate a lcdproc.conf file configuring the g15 + linux_input drivers (including the key-mapping for the linux_input driver) when a G15 keyboard gets plugged in and the lcdproc.conf file does not exist yet.

All this together means that users under Fedora, where I also packaged all this, can now do "dnf install lcdproc", plug in their G15 keyboard, and everything will just work.




p.s.

After the email exchange I got curious and found a pair of these speakers 2nd hand for a nice price. The author of the initial email was happy with me doing the work on this. So I added support for the Z-10 speakers to lcdproc (easy) and wrote a set of kernel-patches to support the display and 1-4 keys on the speaker as LCD menu keys.

I've also prepared an update to the Fedora lcdproc packages so that they will now support the Z-10 speakers OOTB, if you have these and are running Fedora, then (once the update has reached the repos) a "sudo dnf install lcdproc" followed by unplugging + replugging the Z-10 speakers should make the display come alive and show the standard lcdproc CPU/mem usage screens.
For a long time Logitech produced wireless keyboards which use 27 MHz as their communications band. Although these have not been produced for a while now, they are still pretty common and a lot of them are still perfectly serviceable.

But when using them under Linux there is one downside: since the communication is one-way, the wireless link is unencrypted by default, which is kinda bad from a security point of view. These keyboards do support using an encrypted link, but this requires a one-time setup where the user manually enters a key on the keyboard.

I've written a small Linux utility to do this setup, which should help give these keyboards an extra lease on life and stop them from unnecessarily becoming e-waste. Sometimes these keyboards appear to be broken while the only problem is that the key in the keyboard and the receiver are out of sync; the README also contains instructions on how to reset the keyboard without needing the utility, restoring (unencrypted) functionality.

The 'lg-27MHz-keyboard-encryption-setup' utility is available on Fedora in the 'logitech-27mhz-keyboard-encryption-setup' package.
May 05, 2021

In Turnips in the wild (Part 1) we walked through two issues, one in TauCeti Benchmark and the other in Genshin Impact. Today, I have an update about the one I didn't plan to fix, and a showcase of two remaining issues I met in Genshin Impact.

Genshin Impact

Gameplay – Disco Water

In the previous post I said that I’m not planning to fix the broken water effect since it relied on undefined behavior.

Screenshot of the gameplay with body of water that has large colorful artifacts

However, I was notified that the same issue was fixed in the OpenGL driver for Adreno (Freedreno) and that the fix is rather easy. Even though for Vulkan it is clearly undefined behavior, with other APIs it might not be so clear. Thus, given that we want to support translation from other APIs, that there are already apps which rely on this behavior, and that it would be just a bit more performant - I made a fix for it.

Screenshot of the gameplay with body of water without artifacts

The issue was fixed by “tu: do not corrupt unwritten render targets (!10489)”

Login Screen

The login screen welcomes us with not-so-healthy colors:

Screenshot of a login screen in Genshin Impact which has wrong colors - columns and road are blue and white

And with a few failures to allocate registers in the logs. The failure to allocate registers isn’t good and may cause some important shader not to run, but let’s hope it’s not that. Thus, again, we should take a closer look at the frame.

Once the frame is loaded I’m staring at an empty image at the end of the frame… Not a great start.

Such things mostly happen due to a GPU hang. Since I’m inspecting frames on Linux I took a look at dmesg and confirmed the hang:

 [drm:a6xx_irq [msm]] *ERROR* gpu fault ring 0 fence ...

Fortunately, after walking through draw calls, I found that the mis-rendering happens before the call which hangs. Let’s look at it:

Screenshot of a correct draw call right before the wrong one being inspected in RenderDoc Draw call right before
Screenshot of a draw call, that draws the wrong colors, being inspected in RenderDoc Draw call with the issue

It looks like some fullscreen effect. As in the previous case - the inputs are fine, the only image input is a depth buffer. Also, there are always uniforms passed to the shaders, but when there is only a single problematic draw call - they are rarely an issue (also they are easily comparable with the proprietary driver if I spot some nonsensical values among them).

Now it's time to look at the shader: ~150 assembly instructions, nothing fancy, nothing obvious, and a lonely kill near the top. Before going into the most "fun" part, it's a good idea to make sure that the issue is 99% in the shader. RenderDoc has a cool feature which allows debugging a shader (its SPIR-V code) at a certain fragment (or vertex, or CS invocation); it does the evaluation on the CPU, so I can use it as some kind of reference implementation. In our case the output between RenderDoc and the actual shader evaluation on the GPU is different:

Screenshot of the color value calculated on CPU by RenderDoc Evaluation on CPU: color = vec4(0.17134, 0.40289, 0.69859, 0.00124)
Screenshot of the color value calculated on GPU On GPU: color = vec4(3.1875, 4.25, 5.625, 0.00061)

Knowing the above, there is only one thing left to do - reduce the shader until we find the problematic instruction(s). Fortunately there is a proprietary driver which renders the scene correctly, therefore instead of relying on intuition, luck, and persistence - we can quickly bisect to the issue by editing the shader and comparing the result against the reference driver. Actually, it's possible to do this with shader debugging in RenderDoc, but I had problems with it at that moment and it's not that easy to do.

The process goes like this:

  1. Decompile SPIRV into GLSL and check that it compiles back (sometimes it requires some editing)
  2. Remove half of the code, write the most promising temporary variable as a color, and take a look at results
  3. Copy the edited code to RenderDoc instance which runs on proprietary driver
  4. Compare the results
  5. If there is a difference - restore the deleted code; now we know that the issue is probably in it. Thus, bisect it by returning to step 2.

This way I bisected to this fragment:

_243 = clamp(_243, 0.0, 1.0);
_279 = clamp(_279, 0.0, 1.0);
float _290;
if (_72.x) {
  _290 = _279;
} else {
  _290 = _243;
}

color0 = vec4(_290);
return;

Writing _279 or _243 to color0 produced reasonable results, but writing _290 produced nonsense. The only difference was the presence of the condition. Now, having a minimal change which reproduces the issue, it's possible to compare the native assembly.

Bad:

mad.f32 r0.z, c0.y, r0.x, c6.w
sqrt r0.y, r0.y
mul.f r0.x, r1.y, c1.z
(ss)(nop2) mad.f32 r1.z, c6.x, r0.y, c6.y
(nop3) cmps.f.ge r0.y, r0.x, r1.w
(sat)(nop3) sel.b32 r0.w, r0.z, r0.y, r1.z

Good:

(sat)mad.f32 r0.z, c0.y, r0.x, c6.w
sqrt r0.y, r0.y
(ss)(sat)mad.f32 r1.z, c6.x, r0.y, c6.y
(nop2) mul.f r0.y, r1.y, c1.z
add.f r0.x, r0.z, r1.z
(nop3) cmps.f.ge r0.w, r0.y, r1.w
cov.u r1.w, r0.w
(rpt2)nop
(nop3) add.f r0.w, r0.x, r1.w

By running them in my head I reasoned that they should produce the same results, yet something was not working as expected. After a few more changes in the GLSL, it became apparent that something was wrong with clamp(x, 0, 1), which is translated into the (sat) modifier on instructions. A bit more digging and I found out that the hardware doesn't understand the saturation modifier when it is placed on a sel instruction (sel is a selection between two values based on a third).
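Conceptually, the workaround is a small constraint in the compiler backend: only fold a clamp(x, 0.0, 1.0) into the (sat) modifier when the producing instruction actually honors it. A hand-wavy sketch with made-up types and names, not the actual ir3 code:

#include <stdbool.h>

/* Hypothetical, simplified IR types just for illustration. */
enum opcode { OPC_ADD, OPC_MUL, OPC_SEL /* ... */ };

struct instr {
    enum opcode opc;
    bool sat; /* (sat) modifier: clamp the result to [0, 1] */
};

/* Only fold clamp(x, 0, 1) into the (sat) bit of the producing instruction
 * when the hardware honors it there; sel silently ignores it, so for sel the
 * clamp has to stay as a separate saturating move. */
static bool can_fold_saturate(const struct instr *producer)
{
    return producer->opc != OPC_SEL;
}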

Disallowing the compiler to place saturation on sel instructions resolved the bug:

Login screen after the fix

The issue was fixed by “ir3: disallow .sat on SEL instructions (!9666)”

Gameplay – Where did the trees go?

Screenshot of the gameplay with trees and grass being almost black

The trees and grass seem to be rendered incorrectly. After looking through the trace and not finding where they were actually rendered, I studied the trace on the proprietary driver and found them there. However, there weren't any such draw calls on Turnip!

The answer was simple: the shaders failed to compile due to the register allocation failures I mentioned earlier... The general solution would be an implementation of register spilling. However, in this case there is a pending merge request that implements a new register allocator, which would later also help us implement register spilling. With it the shaders can now be compiled!

Screenshot of the gameplay with trees and grass being rendered correctly

More Turnip adventures to come!

wew

After a traumatic, tearful goodbye to weird glxgears which took more out of me than expected, I’m back and blogging again.

It’s been…

Well, I guess it’s been a while since my last post, huh.

So what’s happened since then?

Lots? Yeah, definitely lots. It’s May now. And there’s even been a new Mesa release since then.

Quick zink roundup in case you missed any of these:

  • shader clocks are in
  • sparse buffer support is in
  • 16bit ALU support is in
  • GPU memory queries are in

Cool.

But What’s Really Been Going On?

The truth is that I’ve been taking some time off from zink in a completely futile attempt to make progress on something else while zink-wip continues to land. Inspired by this ticket describing issues getting CS:GO working, I decided to tackle part of Mesa that I haven’t worked on much and that hasn’t seen much work in a long time:

PBOs.

PBO TIME

Pixel Buffer Objects are used when an application needs to perform a transfer of an image from host memory to the GPU or vice versa. A PBO is said to be downloaded when it is copied from the GPU into a host memory buffer, and it is uploaded when it is copied from a memory buffer into GPU memory. PBO uploads are a common way to load assets from disk (e.g., textures), and PBO downloads are common for…ideally nothing performance related, but RPCS3 uses them in such a way.
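For reference, here's roughly what a PBO download looks like from the application's side with desktop GL (a sketch assuming a GL 4.5 context and a loader such as libepoxy; this is plain app code, not driver internals):

#include <epoxy/gl.h>  /* any GL loader exposing GL 4.5 works */

/* Illustrative PBO download: read back a 1024x1024 GL_R32F texture into a
 * pixel pack buffer, then map the buffer to access the pixels on the CPU. */
void download_texture(GLuint tex)
{
    GLuint pbo;
    const GLsizei size = 1024 * 1024 * sizeof(float);

    glGenBuffers(1, &pbo);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_PACK_BUFFER, size, NULL, GL_STREAM_READ);

    /* with a pack buffer bound, the 'pixels' argument is an offset into it */
    glGetTextureSubImage(tex, 0, 0, 0, 0, 1024, 1024, 1,
                         GL_RED, GL_FLOAT, size, (void *)0);

    float *pixels = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, size,
                                     GL_MAP_READ_BIT);
    /* ... use pixels ... */
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    glDeleteBuffers(1, &pbo);
}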

Uploading of textures in Mesa can take a number of different codepaths depending on various factors using this priority chain:

  • If the format and data type of the pixels is compatible with the format of the texture used for the upload, they can be directly copied by the driver. this is the fastest method
  • If the driver supports shader images and the texture format is supported, Gallium will generate a fragment shader which binds the data-to-be-uploaded as a bufferview, the texture as a framebuffer attachment, and then samples to the framebuffer. this is pretty fast
  • If the format of the host memory buffer (the data being uploaded) is supported by the driver, Gallium will create a staging texture, memcpy the pixel data into it, and then blit it to the target texture. now we’re slowing down a bit
  • Finally, if all else fails, Mesa will just demand the driver maps the target texture so it can dump the pixel data in on the CPU, usually resulting in what looks like a frozen screen because of all the staging textures, flushing, and stalling going on in order to get the pixels where they need to go. this is the bad place

The reason CS:GO takes forever to start with zink is because it performs texture uploads of formats which zink does not support, namely alpha and luminance formats that cannot be emulated in Vulkan, during startup in order to load assets onto the GPU. This hits the CPU copy path 100% of the time, which is actually going to be slower than using something like llvmpipe because GPU-based graphics drivers are not intended to be doing everything on the CPU.

Set PBOs To Ludicrous Speed

I spent a while considering the problem at hand, then decided to start in the place where I could understand what was going on: namely texture downloads. In contrast to uploads, downloads have fewer and simpler codepaths to choose from:

  • If the format and data type of the texture matches the requested pixel format, the data is copied directly with memcpy. this is the fastest method
  • If the format and data type of the requested pixel format are supported by the driver and the texture format is also compatible with the requested format and type, Gallium creates a staging texture to blit to and then memcpy. this is still pretty okay
  • Software fallback. oh no

So to effectively improve this with a new mechanism, all I had to do was meet two criteria:

  • Ensure that all formats are supported so that the software fallback is not needed
  • Ensure that all format conversions are supported so that the software fallback is not needed

There was also, of course, the additional checklist item of Don’t Be Slower Than The Existing Semi-Fastpath Implementation, but I figured I could get to that later.

The implementation I settled on, after much trial and error, was to set up an SSBO descriptor and then use the source texture as a sampler to texelFetch from in a compute shader. A small, vec4-sized constant buffer is passed to the shader, and then everything is done on the GPU to convert the image to the requested pixel format. The SSBO can then be copied to the output memory buffer, and boom, done.
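To give a flavor of the conversion work the shader does, here's a CPU-side equivalent for one of the cases benchmarked below, packing an RGB texel into GL_UNSIGNED_SHORT_5_6_5. This is an illustration only, not the actual Mesa code:

#include <stdint.h>

/* Clamp a channel to [0, 1] and scale it to its bit width. */
static uint16_t quantize(float v, uint16_t max)
{
    if (v < 0.0f) v = 0.0f;
    if (v > 1.0f) v = 1.0f;
    return (uint16_t)(v * (float)max + 0.5f);
}

/* Pack a float RGB texel into a GL_UNSIGNED_SHORT_5_6_5 value:
 * red in bits 15..11, green in 10..5, blue in 4..0. */
static uint16_t pack_rgb565(float r, float g, float b)
{
    return (uint16_t)((quantize(r, 31) << 11) |
                      (quantize(g, 63) << 5) |
                       quantize(b, 31));
}

/* In the compute path, each invocation texelFetch()es one texel from the
 * source sampler, packs it like this, and stores the result at the right
 * byte offset in the destination SSBO. */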

It took some time to get right, but the results are already quite promising. Here are some results from pbobench, a tool I'm working on to help measure PBO performance. At present, it populates a 1024x1024 pixel texture using GL_R32F data, then downloads (glGetTextureSubImage) it as fast as possible and measures the number of calls per second, iterating over a number of different formats and types for the download.

Here are the latest results that I've run on IRIS using GL_PACK_ALIGNMENT==4 and GL_PACK_SWAP_BYTES==1 for variety.

32x32

# Format Type calls/s (current) calls/s (compute)
1 GL_R32F GL_FLOAT 458692 20171
2 GL_RGBA8 GL_UNSIGNED_BYTE 22263 20857
3 GL_RGB5_A1 GL_UNSIGNED_BYTE 22234 20211
4 GL_RGBA4 GL_UNSIGNED_BYTE 22301 20247
5 GL_SRGB8_ALPHA8 GL_UNSIGNED_BYTE 22088 20324
6 GL_RGBA8_SNORM GL_BYTE 22320 20315
7 GL_RGBA4 GL_UNSIGNED_SHORT_4_4_4_4 22644 20094
8 GL_RGB5_A1 GL_UNSIGNED_SHORT_5_5_5_1 23189 20229
9 GL_RGB10_A2 GL_UNSIGNED_INT_2_10_10_10_REV 23678 19763
10 GL_RGB5_A1 GL_UNSIGNED_INT_2_10_10_10_REV 23618 19914
11 GL_RGBA16F GL_HALF_FLOAT 208212 19306
12 GL_RGBA32F GL_FLOAT 231616 19411
13 GL_RGBA16F GL_FLOAT 227953 18887
14 GL_RGB8 GL_UNSIGNED_BYTE 22917 19691
15 GL_RGB565 GL_UNSIGNED_BYTE 23000 20002
16 GL_SRGB8 GL_UNSIGNED_BYTE 22852 20011
17 GL_RGB8_SNORM GL_BYTE 26064 20006
18 GL_RGB565 GL_UNSIGNED_SHORT_5_6_5 24226 19813
19 GL_R11F_G11F_B10F GL_UNSIGNED_INT_10F_11F_11F_REV 140785 19755
20 GL_R11F_G11F_B10F GL_HALF_FLOAT 221776 17852
21 GL_R11F_G11F_B10F GL_FLOAT 193169 19660
22 GL_RGB9_E5 GL_UNSIGNED_INT_5_9_9_9_REV 18616 18715
23 GL_RGB9_E5 GL_HALF_FLOAT 207255 18441
24 GL_RGB9_E5 GL_FLOAT 221998 19171
25 GL_RGB16F GL_HALF_FLOAT 228688 18704
26 GL_RGB32F GL_FLOAT 215211 19294
27 GL_RGB16F GL_FLOAT 233145 18858
28 GL_RG8 GL_UNSIGNED_BYTE 21850 18832
29 GL_RG8_SNORM GL_BYTE 19445 18902
30 GL_RG16F GL_HALF_FLOAT 248270 18413
31 GL_RG32F GL_FLOAT 270652 19426
32 GL_RG16F GL_FLOAT 286874 18964
33 GL_R8 GL_UNSIGNED_BYTE 22093 20270
34 GL_R8_SNORM GL_BYTE 21951 20154
35 GL_R16F GL_HALF_FLOAT 300217 19514
36 GL_R16F GL_FLOAT 454349 19784
37 GL_RGBA GL_UNSIGNED_BYTE 21023 19926
38 GL_RGBA GL_UNSIGNED_SHORT_4_4_4_4 21408 19664
39 GL_RGBA GL_UNSIGNED_SHORT_5_5_5_1 22183 19090
40 GL_RGB GL_UNSIGNED_BYTE 21791 20054
41 GL_RGB GL_UNSIGNED_SHORT_5_6_5 23290 19164
42 GL_LUMINANCE_ALPHA GL_UNSIGNED_BYTE 21032 20300
43 GL_LUMINANCE GL_UNSIGNED_BYTE 21897 20565
44 GL_ALPHA GL_UNSIGNED_BYTE 21941 20049

The 32x32 size is not very favorable for the new implementation. Drivers are already extremely optimized for small PBO operations, so at best, my work here manages to keep up with the existing codebase for some cases.

Let’s go up a power of two to 64x64.

# Format Type calls/s (current) calls/s (compute)
45 GL_R32F GL_FLOAT 187911 18895
46 GL_RGBA8 GL_UNSIGNED_BYTE 21094 18852
47 GL_RGB5_A1 GL_UNSIGNED_BYTE 20121 18515
48 GL_RGBA4 GL_UNSIGNED_BYTE 19995 18532
49 GL_SRGB8_ALPHA8 GL_UNSIGNED_BYTE 20829 19453
50 GL_RGBA8_SNORM GL_BYTE 21167 18914
51 GL_RGBA4 GL_UNSIGNED_SHORT_4_4_4_4 5547 19544
52 GL_RGB5_A1 GL_UNSIGNED_SHORT_5_5_5_1 5686 19762
53 GL_RGB10_A2 GL_UNSIGNED_INT_2_10_10_10_REV 5934 18229
54 GL_RGB5_A1 GL_UNSIGNED_INT_2_10_10_10_REV 5917 18035
55 GL_RGBA16F GL_HALF_FLOAT 69373 17646
56 GL_RGBA32F GL_FLOAT 66885 18507
57 GL_RGBA16F GL_FLOAT 69091 17522
58 GL_RGB8 GL_UNSIGNED_BYTE 5529 18198
59 GL_RGB565 GL_UNSIGNED_BYTE 5489 19197
60 GL_SRGB8 GL_UNSIGNED_BYTE 5735 19455
61 GL_RGB8_SNORM GL_BYTE 6496 19538
62 GL_RGB565 GL_UNSIGNED_SHORT_5_6_5 5971 19152
63 GL_R11F_G11F_B10F GL_UNSIGNED_INT_10F_11F_11F_REV 40364 19262
64 GL_R11F_G11F_B10F GL_HALF_FLOAT 76933 18650
65 GL_R11F_G11F_B10F GL_FLOAT 67688 18642
66 GL_RGB9_E5 GL_UNSIGNED_INT_5_9_9_9_REV 4957 18894
67 GL_RGB9_E5 GL_HALF_FLOAT 73775 17660
68 GL_RGB9_E5 GL_FLOAT 69293 18703
69 GL_RGB16F GL_HALF_FLOAT 74131 17808
70 GL_RGB32F GL_FLOAT 67877 18735
71 GL_RGB16F GL_FLOAT 68759 17787
72 GL_RG8 GL_UNSIGNED_BYTE 21194 19150
73 GL_RG8_SNORM GL_BYTE 20644 19174
74 GL_RG16F GL_HALF_FLOAT 90086 19010
75 GL_RG32F GL_FLOAT 88349 19285
76 GL_RG16F GL_FLOAT 89450 19041
77 GL_R8 GL_UNSIGNED_BYTE 21215 19813
78 GL_R8_SNORM GL_BYTE 21280 19457
79 GL_R16F GL_HALF_FLOAT 107419 19180
80 GL_R16F GL_FLOAT 189485 19045
81 GL_RGBA GL_UNSIGNED_BYTE 20784 19454
82 GL_RGBA GL_UNSIGNED_SHORT_4_4_4_4 5645 19375
83 GL_RGBA GL_UNSIGNED_SHORT_5_5_5_1 5625 19078
84 GL_RGB GL_UNSIGNED_BYTE 5753 19196
85 GL_RGB GL_UNSIGNED_SHORT_5_6_5 5917 17889
86 GL_LUMINANCE_ALPHA GL_UNSIGNED_BYTE 5553 19014
87 GL_LUMINANCE GL_UNSIGNED_BYTE 21420 18528
88 GL_ALPHA GL_UNSIGNED_BYTE 21506 19944

64x64 starts to show some interesting results, with cases like GL_SRGB8 where the compute implementation has dominant performance.

128x128

# Format Type calls/s (current) calls/s (compute)
89 GL_R32F GL_FLOAT 55577 16096
90 GL_RGBA8 GL_UNSIGNED_BYTE 17315 15264
91 GL_RGB5_A1 GL_UNSIGNED_BYTE 17735 15541
92 GL_RGBA4 GL_UNSIGNED_BYTE 17191 15486
93 GL_SRGB8_ALPHA8 GL_UNSIGNED_BYTE 17343 14831
94 GL_RGBA8_SNORM GL_BYTE 17362 15710
95 GL_RGBA4 GL_UNSIGNED_SHORT_4_4_4_4 1367 15981
96 GL_RGB5_A1 GL_UNSIGNED_SHORT_5_5_5_1 1367 16372
97 GL_RGB10_A2 GL_UNSIGNED_INT_2_10_10_10_REV 1475 15181
98 GL_RGB5_A1 GL_UNSIGNED_INT_2_10_10_10_REV 1482 14789
99 GL_RGBA16F GL_HALF_FLOAT 19254 14916
100 GL_RGBA32F GL_FLOAT 18465 13109
101 GL_RGBA16F GL_FLOAT 18141 13075
102 GL_RGB8 GL_UNSIGNED_BYTE 1439 16143
103 GL_RGB565 GL_UNSIGNED_BYTE 1441 16252
104 GL_SRGB8 GL_UNSIGNED_BYTE 1407 16106
105 GL_RGB8_SNORM GL_BYTE 1583 16071
106 GL_RGB565 GL_UNSIGNED_SHORT_5_6_5 1520 16246
107 GL_R11F_G11F_B10F GL_UNSIGNED_INT_10F_11F_11F_REV 10888 16137
108 GL_R11F_G11F_B10F GL_HALF_FLOAT 22128 14779
109 GL_R11F_G11F_B10F GL_FLOAT 18807 13986
110 GL_RGB9_E5 GL_UNSIGNED_INT_5_9_9_9_REV 1251 15138
111 GL_RGB9_E5 GL_HALF_FLOAT 22464 14482
112 GL_RGB9_E5 GL_FLOAT 18562 14075
113 GL_RGB16F GL_HALF_FLOAT 22072 15038
114 GL_RGB32F GL_FLOAT 18801 14100
115 GL_RGB16F GL_FLOAT 18774 13864
116 GL_RG8 GL_UNSIGNED_BYTE 18446 16890
117 GL_RG8_SNORM GL_BYTE 18493 17353
118 GL_RG16F GL_HALF_FLOAT 26391 15989
119 GL_RG32F GL_FLOAT 25502 15230
120 GL_RG16F GL_FLOAT 25498 15027
121 GL_R8 GL_UNSIGNED_BYTE 18754 17213
122 GL_R8_SNORM GL_BYTE 16275 17254
123 GL_R16F GL_HALF_FLOAT 31097 16525
124 GL_R16F GL_FLOAT 54923 16005
125 GL_RGBA GL_UNSIGNED_BYTE 17905 15956
126 GL_RGBA GL_UNSIGNED_SHORT_4_4_4_4 1434 16266
127 GL_RGBA GL_UNSIGNED_SHORT_5_5_5_1 1462 16251
128 GL_RGB GL_UNSIGNED_BYTE 1465 16370
129 GL_RGB GL_UNSIGNED_SHORT_5_6_5 1525 16550
130 GL_LUMINANCE_ALPHA GL_UNSIGNED_BYTE 1416 15919
131 GL_LUMINANCE GL_UNSIGNED_BYTE 18876 16872
132 GL_ALPHA GL_UNSIGNED_BYTE 17523 16271

By the 128x128 texture size, the massive performance lead that the existing glGetTextureSubImage handling had has dwindled to 20-50% for some cases, while the compute shader is now outperforming it by a factor of ten or more for other cases.

256x256

# Format Type calls/s (current) calls/s (compute)
133 GL_R32F GL_FLOAT 14768 8850
134 GL_RGBA8 GL_UNSIGNED_BYTE 10980 9142
135 GL_RGB5_A1 GL_UNSIGNED_BYTE 11034 9063
136 GL_RGBA4 GL_UNSIGNED_BYTE 11104 9160
137 GL_SRGB8_ALPHA8 GL_UNSIGNED_BYTE 11196 9085
138 GL_RGBA8_SNORM GL_BYTE 10843 9139
139 GL_RGBA4 GL_UNSIGNED_SHORT_4_4_4_4 500 10001
140 GL_RGB5_A1 GL_UNSIGNED_SHORT_5_5_5_1 496 9775
141 GL_RGB10_A2 GL_UNSIGNED_INT_2_10_10_10_REV 484 8868
142 GL_RGB5_A1 GL_UNSIGNED_INT_2_10_10_10_REV 508 8994
143 GL_RGBA16F GL_HALF_FLOAT 5025 7952
144 GL_RGBA32F GL_FLOAT 4731 6635
145 GL_RGBA16F GL_FLOAT 4722 6519
146 GL_RGB8 GL_UNSIGNED_BYTE 497 9356
147 GL_RGB565 GL_UNSIGNED_BYTE 499 9181
148 GL_SRGB8 GL_UNSIGNED_BYTE 479 9067
149 GL_RGB8_SNORM GL_BYTE 784 9704
150 GL_RGB565 GL_UNSIGNED_SHORT_5_6_5 527 9569
151 GL_R11F_G11F_B10F GL_UNSIGNED_INT_10F_11F_11F_REV 1396 8938
152 GL_R11F_G11F_B10F GL_HALF_FLOAT 5697 8283
153 GL_R11F_G11F_B10F GL_FLOAT 4760 6599
154 GL_RGB9_E5 GL_UNSIGNED_INT_5_9_9_9_REV 444 8123
155 GL_RGB9_E5 GL_HALF_FLOAT 5836 8305
156 GL_RGB9_E5 GL_FLOAT 4782 6944
157 GL_RGB16F GL_HALF_FLOAT 5669 8313
158 GL_RGB32F GL_FLOAT 4759 6819
159 GL_RGB16F GL_FLOAT 4779 6912
160 GL_RG8 GL_UNSIGNED_BYTE 11772 10298
161 GL_RG8_SNORM GL_BYTE 11771 10555
162 GL_RG16F GL_HALF_FLOAT 6900 9324
163 GL_RG32F GL_FLOAT 6601 7928
164 GL_RG16F GL_FLOAT 6461 7965
165 GL_R8 GL_UNSIGNED_BYTE 12249 10936
166 GL_R8_SNORM GL_BYTE 12423 11080
167 GL_R16F GL_HALF_FLOAT 8790 10254
168 GL_R16F GL_FLOAT 15005 7751
169 GL_RGBA GL_UNSIGNED_BYTE 11094 8086
170 GL_RGBA GL_UNSIGNED_SHORT_4_4_4_4 506 9767
171 GL_RGBA GL_UNSIGNED_SHORT_5_5_5_1 494 9790
172 GL_RGB GL_UNSIGNED_BYTE 497 9689
173 GL_RGB GL_UNSIGNED_SHORT_5_6_5 532 9917
174 GL_LUMINANCE_ALPHA GL_UNSIGNED_BYTE 487 10353
175 GL_LUMINANCE GL_UNSIGNED_BYTE 12297 10944
176 GL_ALPHA GL_UNSIGNED_BYTE 12310 10940

256x256 is the point at which the difference starts to become even more pronounced. The cases where the compute implementation is definitively worse are few and far between, and many of the cases in which it was previously worse are now neck and neck.

512x512

# Format Type calls/s (current) calls/s (compute)
177 GL_R32F GL_FLOAT 3739 3348
178 GL_RGBA8 GL_UNSIGNED_BYTE 4533 3409
179 GL_RGB5_A1 GL_UNSIGNED_BYTE 4327 3354
180 GL_RGBA4 GL_UNSIGNED_BYTE 4609 3274
181 GL_SRGB8_ALPHA8 GL_UNSIGNED_BYTE 4520 3333
182 GL_RGBA8_SNORM GL_BYTE 4296 3437
183 GL_RGBA4 GL_UNSIGNED_SHORT_4_4_4_4 236 3614
184 GL_RGB5_A1 GL_UNSIGNED_SHORT_5_5_5_1 237 3512
185 GL_RGB10_A2 GL_UNSIGNED_INT_2_10_10_10_REV 235 2715
186 GL_RGB5_A1 GL_UNSIGNED_INT_2_10_10_10_REV 230 3331
187 GL_RGBA16F GL_HALF_FLOAT 1291 2694
188 GL_RGBA32F GL_FLOAT 1127 1020
189 GL_RGBA16F GL_FLOAT 1127 1000
190 GL_RGB8 GL_UNSIGNED_BYTE 223 3648
191 GL_RGB565 GL_UNSIGNED_BYTE 225 3550
192 GL_SRGB8 GL_UNSIGNED_BYTE 227 3487
193 GL_RGB8_SNORM GL_BYTE 358 3586
194 GL_RGB565 GL_UNSIGNED_SHORT_5_6_5 252 3567
195 GL_R11F_G11F_B10F GL_UNSIGNED_INT_10F_11F_11F_REV 661 3084
196 GL_R11F_G11F_B10F GL_HALF_FLOAT 1509 3055
197 GL_R11F_G11F_B10F GL_FLOAT 1164 1222
198 GL_RGB9_E5 GL_UNSIGNED_INT_5_9_9_9_REV 209 2636
199 GL_RGB9_E5 GL_HALF_FLOAT 1499 2571
200 GL_RGB9_E5 GL_FLOAT 1182 1234
201 GL_RGB16F GL_HALF_FLOAT 1510 2955
202 GL_RGB32F GL_FLOAT 1179 1112
203 GL_RGB16F GL_FLOAT 1172 1247
204 GL_RG8 GL_UNSIGNED_BYTE 5019 3572
205 GL_RG8_SNORM GL_BYTE 5043 4201
206 GL_RG16F GL_HALF_FLOAT 1796 3471
207 GL_RG32F GL_FLOAT 1677 2701
208 GL_RG16F GL_FLOAT 1668 2638
209 GL_R8 GL_UNSIGNED_BYTE 5374 4084
210 GL_R8_SNORM GL_BYTE 5409 2985
211 GL_R16F GL_HALF_FLOAT 2222 880
212 GL_R16F GL_FLOAT 3689 2904
213 GL_RGBA GL_UNSIGNED_BYTE 4490 3179
214 GL_RGBA GL_UNSIGNED_SHORT_4_4_4_4 220 3366
215 GL_RGBA GL_UNSIGNED_SHORT_5_5_5_1 237 3405
216 GL_RGB GL_UNSIGNED_BYTE 228 3392
217 GL_RGB GL_UNSIGNED_SHORT_5_6_5 253 3180
218 GL_LUMINANCE_ALPHA GL_UNSIGNED_BYTE 229 3763
219 GL_LUMINANCE GL_UNSIGNED_BYTE 5400 4099
220 GL_ALPHA GL_UNSIGNED_BYTE 5473 3543

512x512 is even better for the compute-based implementation, which remains extremely consistent across nearly all cases. It now exhibits a commanding lead of almost 1500% for some cases, while at worst, it maintains a consistent rate and achieves only 60% of baseline performance for a few cases.

1024x1024

# Format Type calls/s (current) calls/s (compute)
221 GL_R32F GL_FLOAT 831 711
222 GL_RGBA8 GL_UNSIGNED_BYTE 839 724
223 GL_RGB5_A1 GL_UNSIGNED_BYTE 856 707
224 GL_RGBA4 GL_UNSIGNED_BYTE 831 644
225 GL_SRGB8_ALPHA8 GL_UNSIGNED_BYTE 869 679
226 GL_RGBA8_SNORM GL_BYTE 827 730
227 GL_RGBA4 GL_UNSIGNED_SHORT_4_4_4_4 77 709
228 GL_RGB5_A1 GL_UNSIGNED_SHORT_5_5_5_1 77 718
229 GL_RGB10_A2 GL_UNSIGNED_INT_2_10_10_10_REV 75 664
230 GL_RGB5_A1 GL_UNSIGNED_INT_2_10_10_10_REV 75 662
231 GL_RGBA16F GL_HALF_FLOAT 304 550
232 GL_RGBA32F GL_FLOAT 253 403
233 GL_RGBA16F GL_FLOAT 253 358
234 GL_RGB8 GL_UNSIGNED_BYTE 74 522
235 GL_RGB565 GL_UNSIGNED_BYTE 74 594
236 GL_SRGB8 GL_UNSIGNED_BYTE 74 618
237 GL_RGB8_SNORM GL_BYTE 156 575
238 GL_RGB565 GL_UNSIGNED_SHORT_5_6_5 80 593
239 GL_R11F_G11F_B10F GL_UNSIGNED_INT_10F_11F_11F_REV 160 585
240 GL_R11F_G11F_B10F GL_HALF_FLOAT 353 489
241 GL_R11F_G11F_B10F GL_FLOAT 282 366
242 GL_RGB9_E5 GL_UNSIGNED_INT_5_9_9_9_REV 67 518
243 GL_RGB9_E5 GL_HALF_FLOAT 352 488
244 GL_RGB9_E5 GL_FLOAT 279 365
245 GL_RGB16F GL_HALF_FLOAT 356 472
246 GL_RGB32F GL_FLOAT 282 352
247 GL_RGB16F GL_FLOAT 269 322
248 GL_RG8 GL_UNSIGNED_BYTE 936 273
249 GL_RG8_SNORM GL_BYTE 954 354
250 GL_RG16F GL_HALF_FLOAT 434 494
251 GL_RG32F GL_FLOAT 393 384
252 GL_RG16F GL_FLOAT 392 284
253 GL_R8 GL_UNSIGNED_BYTE 1398 624
254 GL_R8_SNORM GL_BYTE 1415 630
255 GL_R16F GL_HALF_FLOAT 535 530
256 GL_R16F GL_FLOAT 819 490
257 GL_RGBA GL_UNSIGNED_BYTE 861 504
258 GL_RGBA GL_UNSIGNED_SHORT_4_4_4_4 76 540
259 GL_RGBA GL_UNSIGNED_SHORT_5_5_5_1 78 685
260 GL_RGB GL_UNSIGNED_BYTE 74 676
261 GL_RGB GL_UNSIGNED_SHORT_5_6_5 82 706
262 GL_LUMINANCE_ALPHA GL_UNSIGNED_BYTE 75 825
263 GL_LUMINANCE GL_UNSIGNED_BYTE 1431 997
264 GL_ALPHA GL_UNSIGNED_BYTE 1399 962

1024x1024 is the true gauntlet for PBO downloads, and it's really not something that is likely to be seen in many places. But pbobench covers it because why not, so here we are. The results here are much less pronounced just because the amount of time spent in memcpy on the CPU ends up being so large for both implementations that the GPU doesn't get much time to do work.

Improvements are ongoing for compute-accelerated PBOs, and performance continues to rise. I’m looking forward to tackling uploads next, as that should directly improve load times for GL-based games across the board.

Also, Graphics.

As part of the ongoing saga of working for a company that has an interest in gaming, games have been played recently.

One of those games is Tomb Raider (2013), and until very recently it had some issues when run on zink:

tombraider.png

Here we see the wild triangle in its native habitat, foraging for sustenance among its siblings.

All of that changed today, however, when I rebased zink-wip from a couple weeks ago and then hucked it back into the repo with a new snapshot name and zero testing.

I was subsequently informed that I had broken our beautiful triangle princess, and Senior Misrendering Connoisseur Witold Baryluk has provided us with actual gameplay footage with all settings set to EXTREME:

May 02, 2021
glxgears rendered on an Apple M1

After beginning a compiler for the Apple M1 GPU, the next step is to develop a graphics driver exercising the compiler. Since the last post two weeks ago, I’ve begun a Gallium driver for the M1, implementing much of the OpenGL 2.1 and ES 2.0 specifications. With the compiler and driver together, we’re now able to run OpenGL workloads like glxgears and scenes from glmark2 on the M1 with an open source stack. We are passing about 75% of the OpenGL ES 2.0 tests in the drawElements Quality Program used to establish Khronos conformance. To top it off, the compiler and driver are now upstreamed in Mesa!

Gallium is a driver framework inside Mesa. It splits drivers into frontends, like OpenGL and OpenCL, and backends, like Intel and AMD. In between, Gallium has a common caching system for graphics and compute state, reducing the CPU overhead of every Gallium driver. The code sharing, central to Gallium’s design, allows high-performance drivers to be written at a low cost. For us, that means we can focus on writing a Gallium backend for Apple’s GPU and pick up OpenGL and OpenCL support “for free”.

More sharing is possible. A key responsibility of the Gallium backend is to translate Gallium’s state objects into hardware packets, so we need a good representation of hardware packets. While packed bitfields can work, C’s bitfields have performance and safety (overflow) concerns. Hand-coded C structures lack pretty-printing needed for efficient debugging. Finally, while reverse-engineering, hand-coded C structures tend to accumulate random magic numbers in driver code, which is undesirable. These issues are not new; systems like Intel’s GenXML and Nouveau’s envytools solve them by allowing the hardware packets to be described as XML while all necessary C code is auto-generated. For Asahi, I’ve opted to use GenXML, providing a concise description of my reverse-engineering results and an ergonomic API for the driver.

The XML contains recently reverse-engineered packets, like those describing textures and samplers. Fortunately, these packets map closely to both Metal and Gallium, allowing a natural translation. Coupled with Dougall Johnson’s latest notes on texture instructions, adding texture support to the Gallium driver and NIR compiler was a cinch.

The resulting XML is somewhat smaller than that of other reverse-engineered drivers of similar maturity. As discussed in the previous post, Apple relies on shader code in lieu of fixed-function graphics hardware for tasks like vertex attribute fetch and blending. Apple’s design reduces the hardware surface area, in turn reducing the number of packets in the XML at the expense of extra driver code to produce the needed shader variants. Here is yet another win for code sharing: the most complex code needed is for blending and logic operations, for which Mesa already has lowering code. Mali (Panfrost) needs some blending lowered to shader code, and all Mesa drivers need advanced blending equations lowered (modes like overlay, colour dodge, and screen). As a result, it will be a simple task to wire in the “load tilebuffer to register” instruction and Mesa’s blending code and see a thousand more tests pass.

Although Apple culled a great deal of legacy functionality from their GPU, some features are retained to support older APIs like compatibility contexts of desktop OpenGL, even when the features are inaccessible from Metal. Index buffers and primitive types exhibit this phenomenon. In Metal, an application can draw using a 16-bit or 32-bit index buffer, selecting primitives like triangles and triangle strips, with primitive restart always enabled. Most graphics developers want to use this fast path; by only supporting the fast path, Metal prevents a game developer from accidentally introducing slow code. However, this limits our ability to implement Khronos APIs like OpenGL and Vulkan on top of the Apple hardware… or does it?

In addition to the subset supported by Metal, the Khronos APIs also support 8-bit index buffers, additional primitive types like triangle fans, and an option to disable primitive restart. True, some of this functionality is unnecessary. Real geometry usually requires more than 256 vertices, so 8-bit index buffers are a theoretical curiosity. Primitives like triangle fans can bring a hardware penalty relative to triangle strips due to poor cache locality while offering no generality over indexed triangles. Well-written apps generally require primitive restart, and it’s almost free for apps that don’t need it. Even so, real applications (and the Khronos conformance tests) do use these features, so we need to support them in our OpenGL and Vulkan drivers. The issue does not only affect us – drivers layered on top of Metal like MoltenVK struggle with these exact gaps.

If Apple left these features in the hardware but never wired them into Metal, we’d like to know. But clean room reverse-engineering the hardware requires observing the output of the proprietary driver and looking for patterns, so if the proprietary (Metal) driver doesn’t use the functionality, how can we ever know it exists?

This is a case for an underappreciated reverse-engineering technique: guesswork. Hardware designers are logical engineers. If we can understand how Apple approached the hardware design, we can figure out where these hidden features should be in the command stream. We’re looking for conspicuous gaps in our understanding of the hardware, like looking for black holes by observing the absence of light. Consider index buffers, which are configured by an indexed draw packet with a field describing the size. After trying many indexed draws in Metal, I was left with the following reverse-engineered fragment:

<enum name="Index size">
  <value name="U16" value="1"/>
  <value name="U32" value="2"/>
</enum>

<struct name="Indexed draw" size="32">
...
  <field name="Index size" size="2" start="2:17" type="Index size"/>
...
</struct>

If the hardware only supported what Metal supports, this fragment would be unusual. Note 16-bit and 32-bit index buffers are encoded as 1 and 2. In binary, that’s 01 and 10, occupying two different bits, so we know the index size is at least 2 bits wide. That leaves 00 (0) and 11 (3) as possible unidentified index sizes. Now observe index sizes are multiples of 8 bits and powers of two, so there is a natural encoding as the base-2 logarithm of the index size in bytes. This matches our XML fragment, since log2(16 / 8) = 1 and log2(32 / 8) = 2. From here, we make our leap of faith: if the hardware supports 8-bit index buffers, it should be encoded with value log2(8 / 8) = 0. Sure enough, if we try out this guess, we’ll find it passes the relevant OpenGL tests. Success.
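In code form, the guessed encoding is simply the base-2 logarithm of the index size in bytes; a tiny sketch (the helper name is made up):

#include <assert.h>

/* Hypothetical helper: encode the "Index size" field as it appears to be
 * defined: log2 of the index size in bytes, so 8-bit -> 0, 16-bit -> 1,
 * 32-bit -> 2 (the last two matching what Metal's usage revealed). */
static unsigned encode_index_size(unsigned bits)
{
    assert(bits == 8 || bits == 16 || bits == 32);

    unsigned bytes = bits / 8;
    unsigned log2 = 0;
    while (bytes >>= 1)
        log2++;
    return log2;
}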

Finding the missing primitive types works the same way. The primitive type field is 4-bits in the hardware, allowing for 16 primitive types to be encoded, while Metal only uses 5, leaving only 11 to brute force with the tests in hand. Likewise, few bits vary between an indexed draw of triangles (no primitive restart) and an indexed draw of triangle strips (with primitive restart), giving us a natural candidate for a primitive restart enable bit. Our understanding of the hardware is coming together.

One outstanding difficulty is ironically specific to macOS: the IOGPU interface with the kernel. Traditionally, open source drivers on Linux use simple kernel space drivers together with complex userspace drivers. The userspace (Mesa) handles all 3D state, and the kernel simply handles memory management and scheduling. However, macOS does not follow this model; the IOGPU kernel extension, common to all GPUs on macOS, is made aware of graphics state like surface dimensions and even details about mipmapping. Many of these mechanisms can be ignored in Mesa, but there is still an uncomfortably large volume of “magic” to interface with the kernel, like the memory mapping descriptors. The good news is many of these elements can be simplified when we write a Linux kernel driver. The bad news is that they do need to be reverse-engineered and implemented in Mesa if we would like native Vulkan support on Macs. Still, we know enough to drive the GPU from macOS… and hey, soon enough, we’ll all be running Linux on our M1 machines anyway :-)

April 26, 2021

You Will Be Missed

glxgears.png

RIP weird glxgears 2018-2021.

April 21, 2021

The Meson Build System provides support for running on Microsoft Windows, including support for Microsoft Visual Studio C++. GitHub Actions provides public access to CI machines running Microsoft Windows. But trying to tie both together is not as straightforward as it sounds.

Sometimes you stumble over a task you never thought you would have to deal with. This story is about one of those times. In particular, I was faced with running CI tests for a simple C library on Microsoft Visual Studio C++ (MSVC). Gladly, GitHub already provides simple access to machines running Microsoft Windows Server 2016 and 2019, so this sounded like a straightforward task. Unfortunately, my infinite ignorance of anything Windows made this harder than it should have been.

The root of this problem is that the Meson Build System needs to run in the MSVC Developer Shell. This shell has all the necessary environment variables prepared for a particular install of MSVC. Since you can have multiple versions installed in parallel, Meson cannot know which install to use if run outside of such a shell. Unfortunately, GitHub Actions has no simple way to enter this shell. Therefore, running Meson on GitHub Actions will end up using GCC rather than MSVC, since this is what it detects by default in the GitHub Actions Environment. This is not what we wanted, so adjustments are needed.

Luckily, Microsoft provides a tool called vswhere which finds MSVC installs on a Windows system. We can use this to find the required setup scripts and then import the environment variables into our GitHub Actions setup. This tool is pre-deployed on GitHub Actions, so we can simply invoke it to find a suitable MSVC install. From there on, we look for DevShell.dll, which provides the required integration. We load it into PowerShell and invoke the provided Enter-VsDevShell function. By comparing our own environment variables before and after that call, we can extract the changes and export them into the GitHub Actions environment. Thus, the following workflow-steps will have access to those variables as well.

I plugged this into a re-usable GitHub Action using the new composite type. To use it in a GitHub Actions workflow, simply use:

- name: Prepare MSVC
  uses: bus1/cabuild/action/msdevshell@v1
  with:
    architecture: x64

This queries the MSVC environment and exports it to your GitHub Actions job. Following steps will thus run as if in an MSVC Developer Shell. A full example is appended at the bottom, which shows how to get Meson to compile and test a project on MSVC for both Windows Server 2016 and 2019.

If you rather import the code into your own project, you can find it on GitHub. Note that this uses PowerShell syntax, so it might look alien to linux developers.

While this is only roughly 50 lines of PowerShell scripting, it still feels a bit too hacky. The Meson developers are aware of this, but so far no patches have found their way upstream. Let's hope that this workaround will one day be obsolete and Meson will invoke vswhere itself.


A full example workflow follows:

name: Continuous Integration

on: [push, pull_request]

jobs:
  ci-msvc:
    name: CI with MSVC
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [windows-2016, windows-latest]

    steps:
    - name: Fetch Sources
      uses: actions/checkout@v2
    - name: Setup Python
      uses: actions/setup-python@v2
      with:
        python-version: '3.x'
    - name: Install Python Dependencies
      run: pip install meson ninja
    - name: Prepare MSVC
      uses: bus1/cabuild/action/msdevshell@v1
      with:
        architecture: x64
    - name: Prepare Build
      run: meson setup build
    - name: Run Build
      run: meson compile -v -C build
    - name: Run Test Suite
      run: meson test -v -C build
April 20, 2021

Running games and benchmarks is much more exciting than trying to fix a handful of remaining synthetic tests. Turnip, which is an open-source Vulkan driver for recent Adreno GPUs, should already be capable of running real world applications, and they always have a way to break the driver in new, unexpected ways.

TauCeti Vulkan Technology Benchmark

The benchmark greeted me with this wonderful field which looked a little blocky:

Screenshot of the benchmark with grass/dirt field which looks more like a patchwork

It’s not a crash, it’s not a hang, there is something wrong either with textures or with a shader. So, let’s take a closer look at a frame. Here is the first major draw call which looks wrong:

Screenshot of a bad draw call being inspected in RenderDoc

Now, let’s take a look at textures used by the draw:

More than twelve textures used by the draw call

That’s a lot of textures! But all of them look fine; yes, there are some textures that look blocky, but these textures are just small, so nothing wrong here.

The next stop is the fragment shader. I recently implemented an extension which helps with that: VK_KHR_pipeline_executable_properties allows reporting additional information about the shaders in a pipeline. In the case of Turnip that is statistics about registers and the assembly of the shader:

Part of the problematic shader with a few instructions having a warning printed near them Excerpt from the problematic shader
sam.base0 (f32)(xy)r0.x, r0.z, a1.x	; no field 'HAS_SAMP', WARNING: unexpected bits[0:7] in #cat5-samp-s2en-bindless-a1: 0x6 vs 0x0

Bingo! Usually when the issue is in a shader I have to either make educated guesses and/or reduce the shader until the issue is apparent. However, here we can immediately spot instructions with a warning. It may not be the cause of the ground mis-rendering, but these instructions do sampling from textures, soooo.

After fixing their encoding, which was caused by a mistake in the XML definition, the ground is now rendered correctly:

Screenshot of the benchmark with a field which now has proper grass and dirt

After looking a bit more at the frame I saw that the same issue plagued all the rock formations, and now it is gone too. The final fix can be seen in

ir3/isa,parser: fix encoding and parsing of bindless s2en SAM (!9628)

Genshin Impact

Now it's time for a more heavyweight opponent - Genshin Impact, one of the most popular mobile games at the moment. Genshin Impact supports both GLES and Vulkan, and defaults to GLES on all but a few devices. In any case, Vulkan can be enabled by editing a config file.

There are several mis-renderings in the game; however, here I will describe the one which shows why it is important to use all available validation tooling for such a complex API as Vulkan. The other mis-renderings will be a matter for the second post.

Gameplay – Disco Water

Proceeding to the gameplay and running around for a bit revealed major artifacts on the water surface:

Screenshot of the gameplay with body of water that has large colorful artifacts

This one was a multi-frame effect, so I had to capture a trace of all frames and then find the one where the issue began. The draw call I found:

GIF showing before and after problematic draw call

It adds water to the first framebuffer and a lot of artifacts to the second one. Looking at the framebuffer configuration, I could see that the fragment shader doesn't write to the second framebuffer. Is it allowed? Yes. Is the behavior defined? No! VK_LAYER_KHRONOS_validation warns us about it, if warnings are enabled:

UNASSIGNED-CoreValidation-Shader-InputNotProduced(WARN / SPEC): Validation Warning:
Attachment 1 not written by fragment shader; undefined values will be written to attachment

"undefined values will be written to attachment" may mean that nothing is written into it, as expected by the game, or that random values are written to the second attachment, as Turnip does. Such behavior should not be relied upon, so Turnip does not contradict the specification here and there is nothing to fix. Case closed.

More mis-renderings and investigations to come in the next post(s).

Turnips in the wild (Part 2)

April 19, 2021

Last year I worked on implementing in Turnip the support for a HW feature present in Qualcomm Adreno GPUs: the low-resolution Z buffer (aka LRZ). This is a HW feature already supported in Freedreno, which is the open-source OpenGL driver for these GPUs.

What is low-resolution Z buffer

Low-resolution Z buffer is very similar to a depth prepass that helps the HW avoid executing the fragment shader on those fragments that will be subsequently discarded by the depth test afterwards (Hidden surface removal). This feature comes with some limitations though, such as the fragment shader not being allowed to have side effects (writing to SSBOs, atomic operations, etc) among others.

The interesting part of this feature is that it allows applications to submit their vertices in any order (saving CPU time that would otherwise be used to order them); the HW will process them in the binning pass as explained below, detecting which ones are occluded, which gives an increase in performance in some specific use cases.

Tiled-rendering

To understand better how LRZ works, we need to talk a bit about tiled-based rendering. This is a way of rendering based on subdividing the framebuffer in tiles and rendering each tile separately. The advantage of this design is that the amount of memory and bandwidth is reduced compared to immediate mode rendering systems that draw the entire frame at once. Tile-based rendering is very popular on embedded GPUs, including Qualcomm Adreno.

Entering into more details, the graphics pipeline is divided into three different passes executed per tile of the frame.

Tiled-rendering architecture diagram
Tiled-rendering architecture diagram.
  • The binning pass. This pass processes the geometry of the scene and records in a table which tiles each primitive will be rendered on. By doing this, the HW only needs to render the primitives that affect a specific tile when that tile is processed.

  • The rendering pass. This pass gets the rasterized primitives and executes all the fragment related processes of the pipeline (fragment shader execution, depth pass, stencil pass, blending, etc). Once it finishes, the resolve pass starts.

  • The resolve pass. It first resolves the tile buffer (GMEM) if it is multisampled, and copies the final color and depth values for all tile pixels back to system memory. If it is the last tile of the framebuffer, it swaps buffers and starts the binning pass for the next frame.

Where is LRZ used then? Well, in both binning and rendering passes. In the binning pass, it is possible to store the depth value of each vertex of the geometries of the scene in a buffer as the HW has that data available. That is the depth buffer used internally for LRZ. It has lower resolution as too much detail is not needed, which helps to save bandwidth while transferring its contents to system memory.

Thanks to LRZ, the rendering pass is only executed on the fragments that are going to be visible at the end. However, there are some limitations as mentioned before: if a fragment shader has side effects (SSBO writes, atomics, etc.), or if blending is enabled, or if the fragment shader could modify the fragment’s depth… then LRZ cannot be used, as the results may be wrong.

However, LRZ brings a couple of things to the table that make it interesting. One is that applications don’t need to reorder their primitives before submission to be efficient; the HW does that automatically with LRZ. Another is a performance improvement in some use cases. For example, imagine a fragment shader that discards some fragments but has no other side effects. In that case, although we cannot do early depth testing, we can do early LRZ, as we know that some fragments won’t pass the depth test even if they are not discarded by the fragment shader.
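
These rules map fairly directly to code. As a rough illustration only (the structure and names below are made up and do not match actual Turnip code), the per-draw decision looks something like this:

#include <stdbool.h>

/* Illustrative only: a made-up per-draw LRZ decision mirroring the rules
 * described above. Names and structure do not match actual Turnip code. */
enum lrz_mode {
   LRZ_OFF,              /* LRZ cannot be used for this draw */
   LRZ_EARLY,            /* cull against the low-res Z before shading */
   LRZ_EARLY_LATE_DEPTH, /* cull with LRZ, but run the real depth test late */
};

struct draw_state {
   bool fs_has_side_effects; /* SSBO writes, atomics, ... */
   bool fs_writes_depth;
   bool fs_discards;
   bool blend_enabled;
};

static enum lrz_mode
choose_lrz_mode(const struct draw_state *d)
{
   /* side effects, blending or shader depth writes: results may be wrong */
   if (d->fs_has_side_effects || d->fs_writes_depth || d->blend_enabled)
      return LRZ_OFF;

   /* a discarding shader blocks early depth testing, but LRZ can still
    * reject fragments that would fail the depth test anyway */
   if (d->fs_discards)
      return LRZ_EARLY_LATE_DEPTH;

   return LRZ_EARLY;
}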

Turnip implementation

Talking about the LRZ implementation, I took Freedreno’s code as a starting point to implement LRZ on Turnip. After some months of work, it finally landed in Mesa master.

Last week, more patches related to LRZ landed in Mesa master: the ones fixing LRZ interactions with VK_EXT_extended_dynamic_state. With this extension the application can change, at command buffer recording time, some state that could affect LRZ, so we need to track it accordingly.

I also implemented some LRZ improvements that landed recently as well (thanks Eric Anholt!), such as support for the early-LRZ-late-depth test I mentioned before, which could bring a performance improvement in some applications.

LRZ improvements
Left: original vulkan tutorial demo implementation. Right: same demo modified to discard fragments with red component lower than 0.5f.

For instance, I did some measurements in a vulkan-tutorial.com implementation of my own that I modified to discard a significant amount of fragments (see previous figure). This is one of the cases that early-LRZ-late-depth test helps to improve performance.

When running the modified demo with these patches, I found a performance improvement between 13-16%.

Acknowledgments

All this LRZ work was my first big contribution to this open-source reverse-engineered driver! I don’t want to finish this post without publicly thanking Rob Clark for the original Freedreno implementation and his reviews of my work, as well as Jonathan Marek and Connor Abbott for their insightful reviews, advice and tips to make it work. Edited: Many thanks to Eric Anholt for his reviews in the last two patch series!

Happy hacking!

For the fun of it I decided to run some real apps on lavapipe.

Talos Principle is still randomly crashing on startup; occasionally whatever magic value ends up being right in uninit memory and it suddenly runs fine.

I started Rise of the Tomb Raider, and it renders really slowly up to the menu.

Then I gave DOOM 2016 with the Vulkan renderer a go, and with a few lavapipe hacks to enable some feature bits, I managed to get it to load a game image. It's taking 5-6s per frame to render. However, most of the slowness in the frame is the BPTC texture loading, which is a path I've done no tuning for, so it's definitely running very slowly. I think RoTR is also hitting that slow path, so I guess I've got some incentive to look at cleaning it up.

 


April 18, 2021
Shaded cube rendered on an Apple M1

After a few weeks of investigating the Apple M1 GPU in January, I was able to draw a triangle with my own open source code. Although I began dissecting the instruction set, the shaders there were specified as machine code. A real graphics driver needs a compiler from high-level shading languages (GLSL or Metal) to a native binary. Our understanding of the M1 GPU’s instruction set has advanced over the past few months. Last week, I began writing a free and open source shader compiler targeting the Apple GPU. Progress has been rapid: at the end of its first week, it can compile both basic vertex and fragment shaders, sufficient to render 3D scenes. The spinning cube pictured above has its shaders written in idiomatic GLSL, compiled with the nascent free software compiler, and rendered with native code like the first triangle in January. No proprietary blobs here!

Over the past few months, Dougall Johnson has investigated the instruction set in-depth, building on my initial work. His findings on the architecture are outstanding, focusing on compute kernels to complement my focus on graphics. Armed with his notes and my command stream tooling, I could chip away at a compiler.

The compiler’s design must fit into the development context. Asahi Linux aims to run a Linux desktop on Apple Silicon, so our driver should follow Linux’s best practices like upstream development. That includes using the New Intermediate Representation (NIR) in Mesa, the home for open source graphics drivers. NIR is a lightweight library for shader compilers, with a GLSL frontend and backend targets including Intel and AMD. NIR is an alternative to LLVM, the compiler framework used by Apple. Just because Apple prefers LLVM doesn’t mean we have to. A team at Valve famously rewrote AMD’s LLVM backend as a NIR compiler, improving performance. If it’s good enough for Valve, it’s good enough for me.

Supporting NIR as input does not dictate our compiler’s own intermediate representation, which reflects the hardware’s design. The instruction set of AGX2 (Apple’s GPU) has:

  • Scalar arithmetic
  • Vectorized input/output
  • 16-bit types
  • Free conversions between 16-bit and 32-bit
  • Free floating-point absolute value, negate, saturate
  • 256 registers (16-bits each)
  • Register usage / thread occupancy trade-off
  • Some form of multi-issue or out-of-order (superscalar) execution

Each hardware property induces a compiler property:

  • Scalar sources. Don’t complicate the compiler by allowing unrestricted vectors.
  • Vector values at the periphery separated with vector combine and extract pseudo-instructions, optimized out during register allocation.
  • 16-bit units.
  • Sources and destinations are sized. The optimizer folds size conversion instructions into uses and definitions.
  • Sources have absolute value and negate bits; instructions have a saturate bit. Again, the optimizer folds these away.
  • A large register file means running out of registers is rare, so don’t optimize for register spilling performance.
  • Minimizing register pressure is crucial. Use static single assignment (SSA) form to facilitate pressure estimates, informing optimizations.
  • The scheduler simply reorders instructions without leaking details to the rest of the backend. Scheduling is feasible both before and after register allocation.

Putting it together, a design for an AGX compiler emerges: a code generator translating NIR to an SSA-based intermediate representation, optimized by instruction combining passes, scheduled to minimize register pressure, register allocated while going out of SSA, scheduled again to maximize instruction-level parallelism, and finally packed to binary instructions.

These decisions reflect the hardware traits visible to software, which are themselves “shadows” cast by the hardware design. Investigating these traits offers insight into the hardware itself. Consider the register file. While every thread can access up to 256 half-word registers, there is a performance penalty: the more registers used, the fewer concurrent threads possible, since threads share a register file. The number of threads allowed in a given shader is reported in Metal as the maxTotalThreadsPerThreadgroup property. So, we can study the register pressure versus occupancy trade-off by varying the register pressure of Metal shaders (confirmed via our disassembler) and correlating with the value of maxTotalThreadsPerThreadgroup:

Registers Threads
<= 104 1024
112 896
120, 128 832
136 768
144 704
152, 160 640
168-184 576
192-208 512
216-232 448
240-256 384

From the table, it’s clear that up until a threshold, it doesn’t matter how many registers the program uses; occupancy is unaffected. Most well-written shaders fall in this bracket and need not worry. After hitting the threshold, other GPUs might spill registers to memory, but Apple doesn’t need to spill until more than 256 registers are required. Between 112 and 256 registers, the number of threads decreases in an almost linear fashion, in increments of 64 threads. Carefully considering rounding, it’s easy to recover the formula Metal uses to map register usage to thread count.

What’s less obvious is that we can infer the size of the machine’s register file. On one hand, if 256 registers are used, the machine can still support 384 threads, so the register file must be at least 256 half-words * 2 bytes per half-word * 384 threads = 192 KiB large. Likewise, to support 1024 threads at 104 registers requires at least 104 * 2 * 1024 = 208 KiB. If the file were any bigger, we would expect more threads to be possible at higher pressure, so we guess each threadgroup has exactly 208 KiB in its register file.
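
For the curious, one mapping consistent with the table and the 208 KiB estimate is sketched below; it is a reconstruction from the observed maxTotalThreadsPerThreadgroup values, not a documented formula.

#include <stdint.h>

/* One register-count -> max-threads mapping consistent with the table above,
 * assuming a 208 KiB register file per threadgroup (104 * 2 * 1024 bytes)
 * and thread counts rounded down to a multiple of 64. A reconstruction, not
 * anything Apple documents. */
static uint32_t
max_threads_for_registers(uint32_t regs /* 16-bit registers used */)
{
   const uint32_t file_half_words = 104 * 1024; /* 208 KiB / 2 bytes each */
   uint32_t threads = (file_half_words / regs) & ~63u; /* round down to 64 */
   return threads > 1024 ? 1024 : threads;
}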

The story does not end there. From Apple’s public specifications, the M1 GPU supports 24576 = 1024 * 24 simultaneous threads. Since the table shows a maximum of 1024 threads per threadgroup, we infer 24 threadgroups may execute in parallel across the chip, each with its own register file. Putting it together, the GPU has 208 KiB * 24 = 4.875 MiB of register file! This size puts it in league with desktop GPUs.

For all the visible hardware features, it’s equally important to consider what hardware features are absent. Intriguingly, the GPU lacks some fixed-function graphics hardware ubiquitous among competitors. For example, I have not encountered hardware for reading vertex attributes or uniform buffer objects. The OpenGL and Vulkan specifications assume dedicated hardware for each, so what’s the catch?

Simply put – Apple doesn’t need to care about Vulkan or OpenGL performance. Their only properly supported API is their own Metal, which they may shape to fit the hardware rather than contorting the hardware to match the API. Indeed, Metal de-emphasizes vertex attributes and uniform buffers, favouring general constant buffers, a compute-focused design. The compiler is responsible for translating the fixed-function attribute and uniform state to shader code. In theory, this has a slight runtime cost; conventional wisdom says dedicated hardware is faster and lower power than software emulation. In practice, the code is so simple it may make no difference, although application developers should be mindful of the vertex formats used in case conversion code is inserted. As always, there is a trade-off: omitting features allows Apple to squeeze more arithmetic logic units (or register file!) onto the chip, speeding up everything else.

The more significant concern is the increased time on the CPU spent compiling shaders. If changing fixed-function attribute state can affect the shader, the compiler could be invoked at inopportune times during random OpenGL calls. Here, Apple has another trick: Metal requires the layout of vertex attributes to be specified when the pipeline is created, allowing the compiler to specialize formats at no additional cost. The OpenGL driver pays the price of the design decision; Metal is exempt from shader recompile tax.

The silver lining is that there is nothing to reverse-engineer for “missing” features like attributes and uniform buffers. As long as we know how to access memory in compute kernels, we can write the lowering code ourselves with no hardware mysteries. So far, I’ve implemented enough to spin a cube.

At present, the in-progress compiler supports most arithmetic and input/output instructions found in OpenGL ES 2.0, with a simple optimizer and native instruction packing. Support for control flow, textures, scheduling, and register allocation will come further down the line as we work towards a real driver.

Get the code while it’s hot!

April 15, 2021

Given that the new RDNA2 GPUs provide some support for hardware accelerated raytracing and there is even a new shiny Vulkan extension for it, it may not be a surprise that we’re working on implementing raytracing support in RADV.

Already some time ago I wrote documentation for the hardware raytracing support. As these GPUs contain quite minimal hardware to implement this, there is a large software and shader side to the implementation.

And that is what I’ve been up to for the last couple of weeks. And I now have achieved my first personal milestones for the implementation:

  1. A fully recursive Fibonacci shader
  2. And a raytraced cube:

Raytraced cube

This involves writing initial versions for a lot of the software infrastructure needed, so really shows that the basis is getting there.

At the same time we’re quite a ways off from really testing using CTS or running our first real demos. In particular we are missing things like

  • GPU-side BVH building
  • any-hit and intersection shaders
  • Supporting BVH instances, geometry transforms etc.
  • pipeline libraries

and much more, in addition to some of these initial implementations likely not really being performant.

TFW Long Game Loads

I’m typing this up between loads and runs of various games I’m testing, since bug reports for games are somehow already a thing, and there’s a lot of them.

The worst part about testing games is the unbelievably long load times (and startup videos) most of them have, not to mention those long, panning camera shots at the start of the game before gameplay begins and I can start crashing.

But this isn’t a post about games.

No, no, there’s plenty of time for such things.

This is a combo post: part roundup because blogging has been sporadic the past couple weeks, and part feature.

The Roundup

The big Mesa 21.1 branchpoint happened yesterday, and I’m pretty pleased with the state of zink in this upcoming release.

Things you should expect to see:

  • GL 4.6
  • ES 3.1
  • Reasonable performance in many cases

Things you should not expect to see:

  • Most (any?) AAA games working; I’ve kept GL compat contexts clamped to 3.0 in a certainly-futile attempt to cut down on the absolute deluge of bug tickets I’m expecting once everyone tries to run their favorite games/emulators/whathaveyou with the shipped version
    • This functionality remains enabled in zink-wip and will be dumped into mainline soon
  • ???
  • Tough to say, honestly, since this is effectively a version of zink that is, to me, 5-6 months old with a few other patches sprinkled in here and there

And here’s the zink-wip roundup from the past however long since I did the last one:

  • I doubled blending performance in many cases by fixing an incredibly old TODO item regarding using linear tiled images for scanout; this got rushed into the 21.1 release solely to avoid super embarrassing numbers on Phoronix benchmarks.

Yeah, I’m talking to you.

  • I fixed a ton of bugs. Like, actually a ton. Tomb Raider went from an all-you-can-crash buffet to…well, I’ll leave it as a fun surprise for anyone feeling especially intrepid, but it’s definitely playable. So are things that use queries. Don’t ask.

  • There’s totally some other stuff, but I’m too fried to remember it.

The Real Post

Everything up there was just fluff, but this is where the post gets real.

Zink supports formats with alpha-to-one, e.g., RGBX, BGRX, and even (on zink-wip, very illegal) XRGB and XBGR. This is handy for 24bit visuals, such as (probably) all your windows in Xorg. But the method by which these formats are supported comes with its own issues, part of which I’d fixed some time ago in my quest to reduce GPU overhead, and the other part I discovered more recently.

In zink, an alpha-to-one format is just the equivalent format with alpha. So RGBX is just RGBA. This is due to Vulkan not having format equivalents for these types; when sampling from them, a swizzle is applied to force the alpha channel to the maximum value, which yields the correct result.

But what happens when an RGBX framebuffer attachment is used in a blending operation?

Let’s look at VK_BLEND_OP_ADD as a very simple example. The spec defines this operation as:

As0 × Sa + Ad × Da

That’s alpha-of-src times src-alpha-blend-factor plus alpha-of-dest times dest-alpha-blend-factor yielding the resulting pixel color that gets written to the attachment.

But what if the dest value is expected to always be one, and the actual buffer is always zero because its alpha channel is never written?

Such is the case with RGBX and the like, and so more steps are required here for full emulation.

The Real Roundup

Here’s how I went about solving the issue.

First, framebuffer attachments have to be monitored when they’re updated, and any time an alpha-to-one attachment is bound, I set a bitflag for it in the pipeline state. This then triggers a pipeline update for the blend state at the time of draw, where I apply clamping to the appropriate blend factors like so:

if (state->zero_alpha_attachments) {
   for (unsigned i = 0; i < state->num_attachments; i++) {
      blend_att[i] = state->blend_state->attachments[i];
      if (state->zero_alpha_attachments & BITFIELD_BIT(i)) {
         blend_att[i].dstAlphaBlendFactor = VK_BLEND_FACTOR_ZERO;
         blend_att[i].srcColorBlendFactor = clamp_zero_blend_factor(blend_att[i].srcColorBlendFactor);
         blend_att[i].dstColorBlendFactor = clamp_zero_blend_factor(blend_att[i].dstColorBlendFactor);
      }
   }
   blend_state.pAttachments = blend_att;
} else
   blend_state.pAttachments = state->blend_state->attachments;

For any of the attachments in the bitfield, three clamps are performed:

  • dstAlphaBlendFactor is clamped to zero, because there will never be any contribution from the dest component of alpha blending
  • srcColorBlendFactor and dstColorBlendFactor are both clamped using the following check:
if (f == VK_BLEND_FACTOR_ONE_MINUS_DST_ALPHA)
   return VK_BLEND_FACTOR_ZERO;
if (f == VK_BLEND_FACTOR_DST_ALPHA)
   return VK_BLEND_FACTOR_ONE;
return f;

Thus, alpha blending is a passthrough operation from the src component, and for color blending, the dest component is always one or zero. This yields correct results in piglit’s spec@arb_texture_float@fbo-blending-formats test, and also potentially enables the hardware to employ some optimizations to reduce the burden of blending.

Exciting.

April 14, 2021

With dbus-broker we have introduced the resource-accounting of bus1 into the D-Bus world. We believe it greatly improves and strengthens the resource distribution of the D-Bus message bus, and we have already found a handful of resource leaks that way. However, it can be a daunting task to solve resource exhaustion bugs, so I decided to describe the steps we took to resolve a recent resource leak in the openQA package.

A few days ago, Adam Williamson approached me 1 with a bug in the openQA package, where he saw the log stream filled with messages like:

dbus-broker[<pid>]: Peer :1.<id> is being disconnected as it does not have the resources to receive a reply or unicast signal it expects.
dbus-broker[<pid>]: UID <uid> exceeded its 'bytes' quota on UID <uid>.

This is the typical sign of a resource exhaustion in dbus-broker. When the message broker generates or forwards messages to an individual client, it will queue them as outgoing-messages and push them into the unix-socket of the client. If this client does not dequeue messages, this queue might fill up. If a limit is reached, something needs to be done. Since D-Bus is not a lossy protocol, dropping messages is not an option. Instead, the message broker will either refuse new incoming operations or disconnect a client. All resources are accounted on UIDs; this means multiple clients of the same user share the same resource limits.

Depending on what message is sent, it is accounted either on the receiver or sender. Furthermore, some messages can be refused by the broker, others cannot. The exact rules are described in the wiki 2.

In the case of openQA, the first step was to query the accounting information of the running message broker:

sudo dbus-send --system --dest=org.freedesktop.DBus --type=method_call --print-reply /org/freedesktop/DBus org.freedesktop.DBus.Debug.Stats.GetStats

(Replace --system with --session to query the session or user bus.)

While preferably this query is performed when the resource exhaustion happens, it will often yield useful information under normal operation as well. Resources are often consumed slowly, so the accumulation will still show up.

The output 3 of this query shows a list of all D-Bus clients with their accounting information. Furthermore, it lists all UIDs that have clients connected to this message bus, again with all accounting information. The challenge is to find suspicious entries in this huge data dump. The most promising solution so far was to search for "OutgoingBytes" and check for big numbers. This shows the number of bytes queued in the message broker for a particular client. It is usually 0, since the kernel queues are big enough to hold most normal messages. Even if it is not 0, it is usually just a couple of KiB.

In this case, we checked for "OutgoingBytes", and found:

dict entry(
    string "OutgoingBytes"
    uint32 62173024
)

62 MiB of messages are waiting to be delivered to that client. Expanding the output to show the surrounding block, we see:

struct {
    string ":1.211366"
    array [
        dict entry(
            string "UnixUserID"
            variant                            uint32 991
        )
        dict entry(
            string "ProcessID"
            variant                            uint32 674968
        )
        [...]
    ]

    array [
        [...]
        dict entry(
            string "Matches"
            uint32 1
        )
        [...]
        dict entry(
            string "OutgoingBytes"
            uint32 62173024
        )
        [...]
    ]
}

This tells us the PID 674968 of user 991 has roughly 62 MiB of data queued, and it is likely not dequeuing the data. Furthermore, we see it has 1 message filter (D-Bus match rule) installed. D-Bus message filters will cause matching D-Bus signals to be delivered to a client. So a likely problem is that this client keeps receiving signals, but does not dispatch its client socket.

We dug further, and the data dump includes more such clients. Matching the PIDs back to processes via ps auxf, we found that each and every one of those suspicious entries was /usr/bin/isotovideo: backend. The code of this process is part of the os-autoinst repository, in this case qemu.pm. A quick look showed only a single use of D-Bus 4. At first glance, this looks alright. It creates a system-bus connection via the Net::DBus perl module, dispatches a method-call, and returns the result. However, we know this process has a match-rule installed (assuming the dbus-broker logs are correct), so we checked further and found that the Net::DBus module always installs a match-rule on NameOwnerChanged. Furthermore, it caches the system-bus connection in a global variable, sharing it across users in the same code-base.

Long story short, the os-autoinst qemu module created a D-Bus connection which was idle in the background and never dispatched by any code. However, the connection has a match-rule installed, and the message broker kept sending matching signals to that connection. This data accumulated and eventually exceeded the resource quota of that client. A workaround was quickly provided, and it will hopefully resolve this problem 5.

Hopefully, this short recap will be helpful to debug other similar situations. You are always welcome to message us on bus1-devel@googlegroups or on the dbus-broker GitHub issue tracker if you need help.

April 08, 2021

In RADV we just added an option to speed up rendering by rendering less pixels.

These kinds of techniques have become more common over the past decade, with examples such as checkerboarding, TAA-based upscaling and recently DLSS. Fundamentally all they do is trade off rendering quality for rendering cost, and many of them include some amount of postprocessing to try to change the curve of that tradeoff. Most notably, DLSS has been wildly successful at that, to the point that many people claim it is barely a quality regression.

Of course, increasing GPU performance by up to 50% or so with barely any quality regression seems like a must-have, and I think it would be pretty cool if we could have the same improvements on Linux. I think it has the potential to be a game changer, making games playable on APUs, or allowing really high resolutions or framerates on desktops.

And today we took our first baby steps in RADV by allowing users to force Variable Rate Shading (VRS) with an experimental environment variable:

RADV_FORCE_VRS=2x2

VRS is a hardware capability that allows us to reduce the number of fragment shader invocations per pixel rendered. You could, for example, configure the hardware to use one fragment shader invocation per 2x2 pixels. The hardware still renders the edges of geometry exactly, but the inner area of each triangle is rendered with a reduced number of fragment shader invocations.

There are a couple of ways this capability can be configured:

  1. On a per-draw level
  2. On a per-primitive level (e.g. per triangle)
  3. Using an image to configure on a per-region level

This is a new feature for AMD on RDNA2 hardware.
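
For reference, this is roughly what the per-draw option looks like from the application’s side via VK_KHR_fragment_shading_rate; a generic Vulkan sketch, not what RADV_FORCE_VRS does internally:

#include <vulkan/vulkan.h>

/* Generic sketch of the per-draw path via VK_KHR_fragment_shading_rate.
 * Assumes the extension and the pipelineFragmentShadingRate feature are
 * enabled; in real code the entry point is fetched via vkGetDeviceProcAddr. */
static void
use_coarse_shading_rate(VkCommandBuffer cmd)
{
   VkExtent2D fragment_size = { .width = 2, .height = 2 };
   VkFragmentShadingRateCombinerOpKHR combiners[2] = {
      VK_FRAGMENT_SHADING_RATE_COMBINER_OP_KEEP_KHR, /* pipeline vs. primitive rate */
      VK_FRAGMENT_SHADING_RATE_COMBINER_OP_KEEP_KHR, /* result vs. attachment rate */
   };
   /* one fragment shader invocation per 2x2 pixels for subsequent draws */
   vkCmdSetFragmentShadingRateKHR(cmd, &fragment_size, combiners);
}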

With RADV_FORCE_VRS we use this to improve performance at the cost of visual quality. Since we did not implement any postprocessing, the quality loss can be pretty bad, so we restrict the reduced shading rate when we detect one of the following:

  1. Something is rendered in 2D, as that is likely some UI where you’d really want some crispness
  2. When the shader can discard pixels, as this implicitly introduces geometry edges that the hardware doesn’t see but that significantly impact the visual quality.

As a result, there are some games where this has barely any effect but where you also don’t notice the quality regression, and there are games where it really improves performance by 30%+ but where you really notice the quality regression.

VRS is by far the easiest of these techniques to make work in almost all games. Most alternatives, like checkerboarding, TAA and DLSS, need a modified render target size, significant shader fixups, or even a proprietary integration with games. Making changes that deep gets more complicated the more advanced the game is.

If we want to reduce render resolution (which would be a key thing in e.g. checkerboarding or DLSS) it is very hard to confidently tie all resolution dependent things together. For example a big cost for some modern games is raytracing, but the information flow to the main render targets can be very hard to track automatically and hence such a thing would require a lot of investigation or a bunch of per game customizations.

And hence we decided to introduce this first baby step. Enjoy!

April 07, 2021

The lavapipe vulkan software rasterizer in Mesa is now reporting Vulkan 1.1 support.

It passes all CTS tests for the new features in 1.1, but it still fails the same 1.0 tests as before, so it isn't that close to conformant (line/point rendering are the main areas of issue).

A bunch of the 1.2 features are also implemented, so that might not be too far away, though 16-bit shader ops and depth resolve are looking a bit tricky.

If there are any specific features anyone wants to see or any crazy places/ideas for using lavapipe out there, please either file a gitlab issue or hit me up on twitter @DaveAirlie


Buffering

The great thing about tomorrow is that it never comes.

Let’s talk about sparse buffers.

What is a sparse buffer? A sparse buffer is a buffer that is not required to be contiguously or fully backed. This means that a buffer larger than the GPU’s available memory can be created, and only some parts of it are utilized at any given time. Because of the non-resident nature of the backing memory, they can never be mapped, instead needing to go through a staging buffer for any host read/write.

In a gallium-based driver, provided that an effective implementation for staging buffers exists, sparse buffer implementation goes almost exclusively through the pipe_context::resource_commit hook, which manages residency of a sparse resource’s backing memory, passing a range to change residency for and an on/off switch.

In zink(-wip), the hook looks like this:

static bool
zink_resource_commit(struct pipe_context *pctx, struct pipe_resource *pres, unsigned level, struct pipe_box *box, bool commit)
{
   struct zink_context *ctx = zink_context(pctx);
   struct zink_resource *res = zink_resource(pres);
   struct zink_screen *screen = zink_screen(pctx->screen);

   /* if any current usage exists, flush the queue */
   if (zink_batch_usage_matches(&res->obj->reads, ctx->curr_batch) ||
       zink_batch_usage_matches(&res->obj->writes, ctx->curr_batch))
      zink_flush_queue(ctx);

   VkBindSparseInfo sparse;
   sparse.sType = VK_STRUCTURE_TYPE_BIND_SPARSE_INFO;
   sparse.pNext = NULL;
   sparse.waitSemaphoreCount = 0;
   sparse.bufferBindCount = 1;
   sparse.imageOpaqueBindCount = 0;
   sparse.imageBindCount = 0;
   sparse.signalSemaphoreCount = 0;

   VkSparseBufferMemoryBindInfo sparse_bind;
   sparse_bind.buffer = res->obj->buffer;
   sparse_bind.bindCount = 1;
   sparse.pBufferBinds = &sparse_bind;

   VkSparseMemoryBind mem_bind;
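   /* the backing memory mirrors the buffer's range 1:1, so the pipe_box
    * offset and size can be used directly as the bind offsets/size */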
   mem_bind.resourceOffset = box->x;
   mem_bind.size = box->width;
   mem_bind.memory = commit ? res->obj->mem : VK_NULL_HANDLE;
   mem_bind.memoryOffset = box->x;
   mem_bind.flags = 0;
   sparse_bind.pBinds = &mem_bind;
   VkQueue queue = util_queue_is_initialized(&ctx->batch.flush_queue) ? ctx->batch.thread_queue : ctx->batch.queue;

   VkResult ret = vkQueueBindSparse(queue, 1, &sparse, VK_NULL_HANDLE);
   if (!zink_screen_handle_vkresult(screen, ret)) {
      check_device_lost(ctx);
      return false;
   }
   return true;
}

Naturally there’s a need to enjoy the verbosity of Vulkan structs here, but there are two key takeaways.

The first is that this implementation is likely suboptimal; it should be making better use of semaphores to avoid having to flush the queue if the resource has current-batch usage. That’s complex to implement, however, so I took the same shortcut that RadeonSI does here.

The second is that this is just copying the pipe_box struct to the VkSparseMemoryBind struct. The reason this works with a 1:1 mapping is because the backing resource is allocated with a 1:1 range mapping, so the values can be directly used.

Other than that, the only changes required for this implementation were to add a bunch of checks for the sparse flag on resources during map/unmap to force staging buffers and to use device-local memory instead of host-visible.

Sometimes zink can be simple!

April 05, 2021

The crocus project was recently mentioned in a phoronix article. The article covered most of the background for the project.

Crocus is a gallium driver to cover the gen4-gen7 families of Intel GPUs. The basic GPU list is 965, GM45, Ironlake, Sandybridge, Ivybridge and Haswell, with some variants thrown in. This hardware currently uses the Intel classic 965 driver. This hardware is all gallium capable, and since we'd like to put the classic drivers out to pasture and remove support for the old infrastructure, it would be nice to have these generations supported by a modern gallium driver.

The project was initiated by Ilia Mirkin last year, and I've spent some time in small bursts moving it forward. There have been some other small contributions from the community. The basis of the project is a fork of the iris driver with the old relocation-based batchbuffer and state management added back in. I started my focus mostly on the older gen4/5 hardware, since it was simpler and only supports GL 2.1 in the current drivers. I've tried to clean up support for Ivybridge along the way.

The current status of the driver is in my crocus branch.

Ironlake is the best supported: it runs openarena and supertuxkart, piglit has only around a 100-test delta vs i965 (mostly edgeflag related), and there is only one missing feature (vertex shader push constants).

Ivybridge just stopped hanging on the second batch submission, and glxgears runs on it. Openarena starts to the menu but is misrendering, and a piglit run completes with some GPU hangs and a quite large delta. I expect IVB to move faster now that I've solved the worst hang.

Haswell runs glxgears as well.

I think once I take a closer look at Ivybridge/Haswell and can get Ilia (or anyone else) to do some rudimentary testing on Sandybridge, I will start taking a closer look at upstreaming it into Mesa proper.


Woosh

After last week’s post touting the “final” features being added to the upcoming Mesa release, naturally now that this is a new week, I have to outdo myself.

I’ve heard some speculation about zink’s future regarding features. Specifically regarding all the mesamatrix features that aren’t green-ified for zink yet.

So you want features is what you’re saying.

Let’s see where things stand in today’s zink-wip snapshot:

  • GL_OES_tessellation_shader, GL_OES_gpu_shader5 - this is a mesamatrix bug; zink can’t reach GL 4.0 without supporting them, so obviously they are supported
  • GL_ARB_bindless_texture - the final boss
  • GL_ARB_cl_event - not (yet) supported by mesa
  • GL_ARB_compute_variable_group_size - done
  • GL_ARB_ES3_2_compatibility - missing advanced blend from ES3.2
  • GL_ARB_fragment_shader_interlock - done
  • GL_ARB_gpu_shader_int64 - done
  • GL_ARB_parallel_shader_compile - done
  • GL_ARB_post_depth_coverage - done (thanks ajax)
  • GL_ARB_robustness_isolation - not supported by mesa
  • GL_ARB_sample_locations - done
  • GL_ARB_seamless_cubemap_per_texture - needs a new Vulkan extension
  • GL_ARB_shader_ballot - done
  • GL_ARB_shader_clock - done
  • GL_ARB_shader_stencil_export - done
  • GL_ARB_shader_viewport_layer_array - done
  • GL_ARB_shading_language_include - done
  • GL_ARB_sparse_buffer - done
  • GL_ARB_sparse_texture - not supported by mesa
  • GL_ARB_sparse_texture2 - not supported by mesa
  • GL_ARB_sparse_texture_clamp - not supported by mesa
  • GL_ARB_texture_filter_minmax - done
  • GL_EXT_memory_object - TODO
  • GL_EXT_memory_object_fd - TODO
  • GL_EXT_memory_object_win32 - not supported by mesa
  • GL_EXT_render_snorm - done
  • GL_EXT_semaphore - TODO
  • GL_EXT_semaphore_fd - TODO
  • GL_EXT_semaphore_win32 - not supported by mesa
  • GL_EXT_sRGB_write_control - TODO
  • GL_EXT_texture_norm16 - done
  • GL_EXT_texture_sRGB_R8 - TODO
  • GL_KHR_blend_equation_advanced_coherent - same as regular advanced blend
  • GL_KHR_texture_compression_astc_hdr - TODO
  • GL_KHR_texture_compression_astc_sliced_3d - TODO
  • GL_OES_depth_texture_cube_map - done
  • GL_OES_EGL_image - done
  • GL_OES_EGL_image_external - done
  • GL_OES_EGL_image_external_essl3 - done
  • GL_OES_required_internalformat - done
  • GL_OES_surfaceless_context - done
  • GL_OES_texture_compression_astc - TODO
  • GL_OES_texture_float - done
  • GL_OES_texture_float_linear - done
  • GL_OES_texture_half_float - done
  • GL_OES_texture_half_float_linear - done
  • GL_OES_texture_view - same mesamatrix bug since this is a GL 4.3 extension
  • GL_OES_viewport_array - done
  • GLX_ARB_context_flush_control - not supported by mesa
  • GLX_ARB_robustness_application_isolation - not supported by mesa
  • GLX_ARB_robustness_share_group_isolation - not supported by mesa
  • GL_EXT_shader_group_vote - done
  • GL_EXT_multisampled_render_to_texture - TODO
  • GL_EXT_color_buffer_half_float - TODO
  • GL_EXT_depth_bounds_test - done

By my calculations, that’s 11 TODO, 10 not supported, 2 advanced blend, and 1 final boss, a total of 24 out-of-version-extensions not yet implemented out of 54, meaning that 30 are done, tying with i965 and second only to RadeonSI at 33.

New in today’s snapshot: GL_ARB_fragment_shader_interlock, GL_ARB_sparse_buffer, GL_ARB_sample_locations, GL_ARB_shader_ballot, GL_ARB_shader_clock, GL_ARB_texture_filter_minmax

Cross-referencing

Writing blog posts like this is easy, but you know what’s not easy?

Writing good blog posts.

And new to the blogging game is the one, the only, Bas Nieuwenhuizen of RADV founding fame! If you’re at all curious about how drivers actually work, his is definitely a site to follow, as he’s already gone much deeper into explaining my RPCS3 memcpy fail than I ever did.

Tomorrow

Is sparse buffer implementation 101. I’ve said it, so now the blog post has to happen.

April 03, 2021

In this article I show how reading from VRAM can be a catastrophe for game performance and why.

To illustrate I will go back to fall 2015. AMDGPU was just released, it didn’t even have re-clocking yet and I was just a young student trying to play Skyrim on my new AMD R9 285.

Except it ran slowly. 10-15 FPS slowly. Now one might think that is no surprise as due to lack of re-clocking the GPU ran with a shader clock of 300 MHz. However the real surprise was that the game was not at all GPU bound.

As usual with games of that era there was a single thread doing a lot of the work and that thread was very busy doing something inside the game binary. After a bunch of digging with profilers and gdb, it turned out that the majority of time was spent in a single function that accessed less than 1 MiB from a GPU buffer each frame.

At the time DXVK was not a thing yet and I ran the game with wined3d on top of OpenGL. In OpenGL an application does not specify the location of GPU buffers directly, but specifies some properties about how it is going to be used and the driver decides. Poorly in this case.

There was a clear tweak to the driver heuristics that choose the memory location, and with it the frame rate of the game more than doubled and the game was now properly GPU bound.

Some Data

After the anecdote above you might be wondering how slow reading from VRAM can really be? 1 MiB is not a lot of data so even if it is slow it cannot be that bad right?

To show you how bad it can be I ran some benchmarks on my system (Threadripper 2990wx, 4-channel DDR4-3200 and an RX 6800 XT). I checked read/write performance using a 16 MiB buffer (512 MiB for system memory, to avoid the test being contained in the L3 cache).

We look into three allocation types that are exposed by the amdgpu Linux kernel driver:

  • VRAM. This lives on the GPU and is mapped with Uncacheable Speculative Write Combining (USWC) on the CPU. This means that accesses from the CPU are not cached, but writes can be write-combined.

  • Cacheable system memory. This is system memory that has caching enabled on the CPU, and there is cache snooping to ensure the memory is coherent between the CPU and GPU (up to the top-level caches; the GPU caches do not participate in the coherence).

  • USWC system memory. This is system memory that is mapped with Uncacheable Speculative Write Combining on the CPU. This can lead to slight performance benefits compared to cacheable system memory due to lack of cache snooping.

For context, in Vulkan this would roughly correspond to the following memory types:

Hardware Vulkan
VRAM VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT
Cacheable system memory VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT
USWC system memory VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT

The benchmark resulted in the following throughput numbers:

method (MiB/s) VRAM Cacheable System Memory USWC System Memory
read via memcpy 15 11488 137
write via memcpy 10028 18249 11480

I furthermore tested handwritten for-loops accessing 8, 16, 32 and 64-bit elements at a time, and those got similar performance.

This clearly shows that reads from VRAM using memcpy are ~766x slower than memcpy reads from cacheable system memory, and that even non-cacheable (USWC) system memory is ~84x slower than cacheable system memory. Reading even small amounts from these can cause severe performance degradations.

Writes show a difference as well, but the difference is not nearly as significant. So if an application does not select the best memory location for its data for CPU access, it is still likely to result in a reasonable experience.
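
The measurement itself doesn’t need anything fancy. A minimal sketch, assuming src points at a mapping of the memory type under test (e.g. obtained via vkMapMemory), looks roughly like this:

#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Minimal read-throughput sketch: time memcpy() from a mapped allocation.
 * `src` is assumed to point at a mapping of the memory type under test
 * (e.g. returned by vkMapMemory), `size` is the buffer size (e.g. 16 MiB). */
static double
read_throughput_mib_s(const void *src, size_t size, int iterations)
{
   void *dst = malloc(size);
   struct timespec start, end;

   clock_gettime(CLOCK_MONOTONIC, &start);
   for (int i = 0; i < iterations; i++)
      memcpy(dst, src, size);
   clock_gettime(CLOCK_MONOTONIC, &end);

   double seconds = (end.tv_sec - start.tv_sec) +
                    (end.tv_nsec - start.tv_nsec) * 1e-9;
   free(dst);
   return (double)size * iterations / (1024.0 * 1024.0) / seconds;
}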

APUs Are Affected Too

Even though APUs do not have VRAM they still are affected by the same issue. Typically the GPU gets a certain amount of memory pre-allocated at boot time as a carveout. There are some differences in how this is accessed from the GPU so from the perspective of the GPU this memory can be faster.

At the same time the Linux kernel only gives uncached access to that region from the CPU, so one could expect similar performance issues to crop up.

I did the same test as above on a laptop with a Ryzen 5 2500U (Raven Ridge) APU, and got results that are not dissimilar from those on my workstation.

method (MiB/s) Carveout Snooped System Memory USWC System Memory
read via memcpy 108 10426 108
write via memcpy 11797 20743 11821

The carveout performance is now virtually identical to the uncached system memory, which is still ~97x slower than cacheable system memory. So even though it is all system memory on an APU, care still has to be taken with how the memory is allocated.

What To Do Instead

Since the performance cliff is so large it is recommended to avoid this issue if at all possible. The following three methods are good ways to avoid the issue:

  1. If the data is only written from the CPU, it is advisable to use a shadow buffer in cacheable system memory (it can even be outside of the graphics API, e.g. malloc) and read from that instead (see the sketch after this list).

  2. If this is written by the GPU but not frequently, one could consider putting the buffer in snooped system memory. This makes the GPU traffic go over the PCIE bus though, so it has a trade-off.

  3. Let the GPU copy the data to a buffer in snooped system memory. This is basically an extension of the previous item by making sure that the GPU accesses the data exactly once in system memory. The GPU roundtrip can take a non-trivial wall-time though (up to ~0.5 ms measured on some low end APUs), some of which is size-independent, such as command submission. Additionally this may need to wait till the hardware unit used for the copy is available, which may depend on other GPU work. The SDMA unit (Vulkan transfer queue) is a good option to avoid that.
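
To illustrate the first option, here is a minimal sketch of the shadow-buffer idea; the structure and names are made up for the example, and the VRAM pointer is assumed to come from vkMapMemory:

#include <stddef.h>
#include <string.h>

/* Sketch of option 1: keep a shadow copy in cacheable system memory (plain
 * malloc), write through to the mapped VRAM allocation, and satisfy CPU
 * reads from the shadow instead of from VRAM. */
struct shadowed_buffer {
   void  *vram_map; /* assumed: mapping of the VRAM allocation (vkMapMemory) */
   void  *shadow;   /* cacheable system memory copy of the same data */
   size_t size;
};

static void
shadowed_write(struct shadowed_buffer *buf, size_t offset,
               const void *data, size_t size)
{
   memcpy((char *)buf->shadow + offset, data, size);   /* fast: cached */
   memcpy((char *)buf->vram_map + offset, data, size); /* fine: write-combined */
}

static void
shadowed_read(struct shadowed_buffer *buf, size_t offset,
              void *data, size_t size)
{
   /* never read back through the VRAM mapping; that is the cliff above */
   memcpy(data, (char *)buf->shadow + offset, size);
}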

Other Limitations

Another problem with CPU access from VRAM is the BAR size. Typically only the first 256 MiB of VRAM is configured to be accessible from the CPU and for anything else one needs to use DMA.

If the working set of what is allocated in VRAM and accessed from the CPU is large enough the kernel driver may end up moving buffers frequently in the page fault handler. System memory would be an obvious target, but due to the GPU performance trade-off that is not always the decision that gets made.

Luckily, due to the recent push from AMD for Smart Access Memory, large BARs that encompass the entire VRAM are now much more common on consumer platforms.

April 02, 2021

This is the first post of this blog and with it being past midnight I couldn’t be bothered making one about a technical topic. So instead here is an explanation of my plans with the blog.

I got inspired by the prolific blogging of Mike Blumenkrantz and some discussion on the VKx discord that some actually written updates can be very useful, and that I don’t need to make a paper out of each one.

At the same time I have been involved in some longer running things on the driver side which I think could really use some updates as progress is made. Consider for example raytracing, DRM format modifiers, RGP support and more.

I have no plans at all to be as prolific as Mike by a long shot, but I think the style of articles is probably a good template of what to expect from this blog.

April 01, 2021

I’m Trying

Blogging is tough, but I’m getting the posts out there one way or another.

Today marks what is likely to be the last of the “big” changes to zink in Mesa 21.1 before the merge window closes in less than two weeks, and what changes they were.

Threaded context support is now implemented, which, on its own, makes zink vaguely competitive against native GL drivers. I’d expect that for many scenarios, people should start seeing upwards of 60-70% native perf when previously the numbers were much lower, excepting things like furmark, where a weird problem with alpha blending is still causing a massive perf hit.

If that wasn’t enough, my timeline semaphore handling also snuck in, providing a reduction in CPU overhead for asynchronous queue-related operations where supported. Special thanks to Vulkan crash test dummy and uninitialized variable enthusiast Lionel Landwerlin for tripping and falling over basically every line of code in this implementation to help get it to the finish line for your consumption.

And if that still wasn’t enough, my RADV draw dispatch refactor also landed yesterday, both paving the way for some totally secret future work of mine and also bringing a 3-4% reduction in CPU overhead for draws that will make your gaming feel faster now that you’re aware of it but realistically won’t have any discernible effect. Basically racing stripes.

March 31, 2021

Do As I Say, Not As Zink Does

Today, a brief post and lamentation.

I’m sure everyone is well aware of Vulkan semantics regarding array vs non-array image types: when using an array type, e.g., VK_IMAGE_TYPE_2D with VkImageCreateInfo::arrayLayers > 1, always use array members for accessing/copying/blitting, and when using a 3D type, always use depth members.

This means array types should use baseArrayLayer and layerCount for copying/blitting and arrayPitch for accessing subresource regions. Non-array types, specifically 3D types, should use VkExtent3D::depth and VkSubresourceLayout::depthPitch.
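
To make that concrete, here is an illustrative snippet (not zink code) showing the same buffer-to-image copy of `count` layers/slices starting at `first`, expressed for a 2D array image versus a 3D image:

#include <vulkan/vulkan.h>

/* Illustrative only: the same copy of `count` layers/slices starting at
 * `first`, expressed for a 2D array image vs. a 3D image. */
static void
fill_copy_regions(uint32_t first, uint32_t count, uint32_t width, uint32_t height,
                  VkBufferImageCopy *array_copy, VkBufferImageCopy *volume_copy)
{
   /* 2D array image: address the layers through the subresource, depth = 1 */
   *array_copy = (VkBufferImageCopy) {
      .imageSubresource = {
         .aspectMask = VK_IMAGE_ASPECT_COLOR_BIT,
         .mipLevel = 0,
         .baseArrayLayer = first,
         .layerCount = count,
      },
      .imageOffset = { 0, 0, 0 },
      .imageExtent = { width, height, 1 },
   };

   /* 3D image: exactly one layer; address the slices through z/depth instead */
   *volume_copy = (VkBufferImageCopy) {
      .imageSubresource = {
         .aspectMask = VK_IMAGE_ASPECT_COLOR_BIT,
         .mipLevel = 0,
         .baseArrayLayer = 0,
         .layerCount = 1,
      },
      .imageOffset = { 0, 0, (int32_t)first },
      .imageExtent = { width, height, count },
   };
}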

This is really important, as I’ve found out over the past week, given that this has not been handled as it should’ve been in many places throughout the zink stack. Some drivers were cool and didn’t make a big deal about it. Other drivers were more accurate and have just been failing all along.

And Before I Forget

I was recently interviewed by Boiling Steam, a small Linux gaming-oriented news site focused on creating original content and interviewing only the most important figures within the community (like me). If you’ve ever wanted to know more about the rich open source pedigree of Super Good Code, the interview goes deep into the back catalogue of how things got to this point.

March 30, 2021

Yeah, Again

It’s been a while since I blogged about descriptors, so let’s fix that since this site may as well be called Super Good Descriptors.

Last time, I talked a bit about descriptors 3.0: lazy descriptors. The idea I settled on here was to do templated updates, and to do the absolute minimal amount of work possible for each draw while still never reusing any written-to descriptor sets once they’d been cycled out.
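
For anyone unfamiliar with descriptor update templates, here’s a generic sketch of the Vulkan mechanism involved; this is illustrative only, not zink’s actual implementation:

#include <vulkan/vulkan.h>

/* Generic sketch of descriptor update templates (not zink's actual code).
 * The point: descriptor data is consumed from a packed CPU struct instead
 * of an array of VkWriteDescriptorSet, which is what makes "templated
 * updates" cheap. Binding 0 is assumed to be a single uniform buffer. */
static void
update_with_template(VkDevice device, VkDescriptorSetLayout set_layout,
                     VkDescriptorSet set, VkBuffer buffer)
{
   VkDescriptorUpdateTemplateEntry entry = {
      .dstBinding = 0,
      .dstArrayElement = 0,
      .descriptorCount = 1,
      .descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,
      .offset = 0,                              /* offset into the packed data */
      .stride = sizeof(VkDescriptorBufferInfo),
   };
   VkDescriptorUpdateTemplateCreateInfo tci = {
      .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_UPDATE_TEMPLATE_CREATE_INFO,
      .descriptorUpdateEntryCount = 1,
      .pDescriptorUpdateEntries = &entry,
      .templateType = VK_DESCRIPTOR_UPDATE_TEMPLATE_TYPE_DESCRIPTOR_SET,
      .descriptorSetLayout = set_layout,
   };
   VkDescriptorUpdateTemplate tmpl;
   vkCreateDescriptorUpdateTemplate(device, &tci, NULL, &tmpl);

   /* in real code the template is created once and reused; the per-draw
    * work is just filling the packed data and firing off the update */
   VkDescriptorBufferInfo info = { .buffer = buffer, .offset = 0, .range = VK_WHOLE_SIZE };
   vkUpdateDescriptorSetWithTemplate(device, set, tmpl, &info);
}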

Lazy descriptors worked out great, and I’m sure many of you have grown to enjoy the ZINK: USING LAZY DESCRIPTORS log message that got spammed on startup over the past couple months of zink-wip.

Well, times have changed, and this message will no longer fill your terminal by default after today’s (20210330) zink-wip snapshot.

Modes

New today is the ZINK_DESCRIPTORS environment variable, supporting three modes of operation:

  • auto - the default, which attempts to detect system capabilities and use caching with templated updates
  • lazy - the mode we all know and love from the past few months
  • notemplates - this is the old-style caching mechanism

auto mode moves zink one step closer to my eventual goal, which is to be able to use driconf to do application-based mode changes to disable caching for apps which never reuse resources. Also potentially useful would be the ability to dynamically disable caching on a pipeline-by-pipeline basis while an application is running if too many cache misses are detected.

Necessary?

With that said, I’ve come to the conclusion that any form of caching may actually be, at best, equivalent to uncached mode for the general desktop user, and it may only be worthwhile for special cases, like Vulkan drivers which can’t do descriptor templates or embedded devices. In my latest testing (on desktop systems), I have yet to see any scenarios where lazy mode fails to provide the best performance.

ARM in particular seems to gain a lot from it, as the post shows a ~40% perf improvement. It’s unclear to me, however, whether any benchmarking was done against a highly optimized uncached implementation like I’ve done. The overhead from doing basic descriptor updating without templates is definitely significant, so it might just be that things are different now that more functionality is available. On Linux systems, at least, every Vulkan driver that matters supports descriptor templates, so this is functionality that can be relied upon.

Is zink-wip slow now?

No.

The goal of zink-wip is to provide an optimal testing environment with the absolute bleeding edge in terms of performance and features. The auto mode should provide that, and the cases I’ve seen where its performance is noticeably worse number exactly one, and it’s a subtest for drawoverhead. If anyone finds any other cases where auto is worse than lazy, I’m interested, but it shouldn’t be a concern.

With that said, it might be worth doing some benchmarking between the two for some extremely high CPU usage scenarios, as that’s the only case where it may be possible to detect a difference. Gone are the days of zink(-wip) hogging the whole CPU, so probably this is just useless pontificating to fill more of a blog page.

But also, if you’re doing any kind of benchmarking on a high-end CPU, I’d probably recommend going with the lazy mode for now.

Game-changers

I’m pleased with the current state of descriptor caching, but it does bother me that it isn’t dramatically better than the uncached mode on my desktop. I think this ultimately just comes down to the current cache implementation being split into two steps:

  • compute the descriptor cache key
  • lookup the set

This effectively sits on top of the lazy mode, moving work out of the Vulkan driver and into the cache lookup any time there’s a cache hit. As such, I’ve been considering working to shift some of this work into threads, though this is somewhat challenging given the current gallium API. Specifically, only SSBOs and shader images can be per-stage updated immediately after bind, as both UBOs and samplers bind only a single descriptor slot at a time, meaning there’s no way to know when the “final” one is bound.

But then again, I’ve certainly reached the point of diminishing returns now. Most applications that I test have minimal CPU usage for the zink driver thread (e.g., Unigine Superposition is only at about 10% utilization on an i7-6700K), and are instead bottlenecking hard in the GPU, so I think it’s time to call things “good enough” here unless things change in a significant way.

March 28, 2021

A restrictive end-user license agreement is one way a company can exert power over the user. When the free software movement was founded thirty years ago, these restrictive licenses were the primary user-hostile power dynamic, so permissive and copyleft licenses emerged as synonyms to software freedom. Licensing does matter; user autonomy is lost with subscription models, revocable licenses, binary-only software, and onerous legal clauses. Yet these issues pertinent to desktop software do not scratch the surface of today’s digital power dynamics.

Today, companies exert power over their users by: tracking, selling data, psychological manipulation, intrusive advertising, planned obsolescence, and hostile Digital “Rights” Management (DRM) software. These issues affect every digital user, technically inclined or otherwise, on desktops and smartphones alike.

The free software movement promised to right these wrongs via free licenses on the source code, with adherents arguing free licenses provide immunity to these forms of malware since users could modify the code. Unfortunately most users lack the resources to do so. While the most egregious violations of user freedom come from companies publishing proprietary software, these ills can remain unchecked even in open source programs, and not all proprietary software exhibits these issues. The modern browser is nominally free software containing the trifecta of telemetry, advertisement, and DRM; a retro video game is proprietary software but relatively harmless.

As such, it’s not enough to look at the license. It’s not even enough to consider the license and a fixed set of issues endemic to proprietary software; the context matters. Software does not exist in a vacuum. Just as proprietary software tends to integrate with other proprietary software, free software tends to integrate with other free software. Software freedom in context demands a gentle nudge towards software in user interests, rather than corporate interests.

How then should we conceptualize software freedom?

Consider the three adherents to free software and open source: hobbyists, corporations, and activists. Individual hobbyists care about tinkering with the software of their choice, emphasizing freely licensed source code. These concerns do not affect those who do not make a sport out of modifying code. There is nothing wrong with this, but it will never be a household issue.

For their part, large corporations claim to love “open source”. No, they do not care about the social movement, only the cost reduction achieved by taking advantage of permissively licensed software. This corporate emphasis on licensing is often to the detriment of software freedom in the broader context. In fact, it is this irony that motivates software freedom beyond the license.

It is the activist whose ethos must apply to everyone regardless of technical ability or financial status. There is no shortage of open source software, often of corporate origin, but this is insufficient – it is the power dynamic we must fight.

We are not alone. Software freedom is intertwined with contemporary social issues, including copyright reform, privacy, sustainability, and Internet addiction. Each issue arises as a hostile power dynamic between a corporate software author and the user, with complicated interactions with software licensing. Disentangling each issue from licensing provides a framework to address nuanced questions of political reform in the digital era.

Copyright reform generalizes the licensing approaches of the free software and free culture movements. Indeed, free licenses empower us to freely use, adapt, remix, and share media and software alike. However, proprietary licenses micromanaging the core of human community and creativity are doomed to fail. Proprietary licenses have had little success preventing the proliferation of the creative works they seek to “protect”, and the rights to adapt and remix media have long been exercised by dedicated fans of proprietary media, producing volumes of fanfiction and fan art. The same observation applies to software: proprietary end-user license agreements have stopped neither file sharing nor reverse-engineering. In fact, a unique creative fandom around proprietary software has emerged in video game modding communities. Regardless of legal concerns, the human imagination and spirit of sharing persists. As such, we need not judge anyone for proprietary software and media in their life; rather, we must work towards copyright reform and free licensing to protect them from copyright overreach.

Privacy concerns are also traditional in software freedom discourse. True, secure communications software can never be proprietary, given the possibility of backdoors and impossibility of transparent audits. Unfortunately, the converse fails: there are freely licensed programs that inherently compromise user privacy. Consider third-party clients to centralized unencrypted chat systems. Although two users of such a client privately messaging one another are using only free software, if their messages are being data mined, there is still harm. The need for context is once more underscored.

Sustainability is an emergent concern, tying to software freedom via the electronic waste crisis. In the mobile space, where deprecating smartphones after a few short years is the norm and lithium batteries are hanging around in landfills indefinitely, we see the paradox of a freely licensed operating system with an abysmal social track record. A curious implication is the need for free device drivers. Where proprietary drivers force devices into obsolescence shortly after the vendor abandons them in favour of a new product, free drivers enable long-term maintenance. As before, licensing is not enough; the code must also be upstreamed and mainlined. Simply throwing source code over a wall is insufficient to resolve electronic waste, but it is a prerequisite. At risk is the right of a device owner to continue use of a device they have already purchased, even after the manufacturer no longer wishes to support it. Desired by climate activists and the dollar conscious alike, we cannot allow software to override this right.

Beyond copyright, privacy, and sustainability concerns, no software can be truly “free” if the technology itself shackles us, dumbing us down and driving us to outrage for clicks. Thanks to television culture spilling onto the Internet, the typical citizen has less to fear from government wiretaps than from themselves. For every encrypted message broken by an intelligence agency, thousands of messages are willingly broadcast to the public, seeking instant gratification. Why should a corporation or a government bother snooping into our private lives, if we present them on a silver platter? Indeed, popular open source implementations of corrupt technology do not constitute success, an issue epitomized by free software responses to social media. No, even without proprietary software, centralization, or cruel psychological manipulation, the proliferation of social media still endangers society.

Overall, focusing on concrete software freedom issues provides room for nuance, rather than the traditional binary view. End-users may make more informed decisions, with awareness of technologies’ trade-offs beyond the license. Software developers gain a framework to understand how their software fits into the bigger picture, as a free license is necessary but not sufficient for guaranteeing software freedom today. Activists can divide-and-conquer.

Many outside of our immediate sphere understand and care about these issues; long-term success requires these allies. Claims of moral superiority by licenses are unfounded and foolish; there is no success backstabbing our friends. Instead, a nuanced approach broadens our reach. While abstract moral philosophies may be intellectually valid, they are inaccessible to all but academics and the most dedicated supporters. Abstractions are perpetually on the political fringe, but these concrete issues are already understood by the general public. Furthermore, we cannot limit ourselves to technical audiences; understanding network topology cannot be a prerequisite to private conversations. Overemphasizing the role of source code and under-emphasizing the power dynamics at play is a doomed strategy; for decades we have tried and failed. In a post-Snowden world, there is too much at stake for more failures. Reforming the specific issues paves the way to software freedom. After all, social change is harder than writing code, but with incremental social reform, licenses become the easy part.

The nuanced analysis even helps individual software freedom activists. Purist attempts to refuse non-free technology categorically are laudable, but outside a closed community, going against the grain leads to activist burnout. During the day, employers and schools invariably mandate proprietary software, sometimes used to facilitate surveillance. At night, popular hobbies and social connections today are mediated by questionable software, from the DRM in a video game to the surveillance of a chat with a group of friends. Cutting ties with friends and abandoning self-care as a prerequisite to fighting powerful organizations seems noble, but is futile. Even without politics, there remain technical challenges to using only free software. Layering in other concerns, or perhaps foregoing a mobile smartphone, only amplifies the risk of software freedom burnout.

As an application, this approach to software freedom brings to light disparate issues with the modern web raising alarm in the free software community. The traditional issue is proprietary JavaScript, a licensing question, yet considering only JavaScript licensing prompts both imprecise and inaccurate conclusions about web “applications”. Deeper issues include rampant advertising and tracking; the Internet is the largest surveillance network in human history, largely for commercial aims. To some degree, these issues are mitigated by script, advertisement, and tracker blockers; these may be pre-installed in a web browser for harm reduction in pursuit of a gentler web. However, the web’s fatal flaw is yet more fundamental. By design, when a user navigates to a URL, their browser executes whatever code is piped on the wire. Effectively, the web implies an automatic auto-update, regardless of the license of the code. Even if the code is benign, it is still every year more expensive to run, forcing a hardware upgrade cycle deprecating old hardware which would work if only the web weren’t bloated by corporate interests. A subtler point is the “attention economy” tied into the web. While it’s hard to become addicted to reading in a text-only browser, binge-watching DRM-encumbered television is a different story. Half-hearted advances like “Reading Mode” are limited by the ironic distribution of documents over an app store. On the web, disparate issues of DRM, forced auto-update, privacy, sustainability, and psychological dark patterns converge to a single worst case scenario for software freedom. The licenses were only the beginning.

Nevertheless, there is cause for optimism. Framed appropriately, the fight for software freedom is winnable. To fight for software freedom, fight for privacy. Fight for copyright reform. Fight for sustainability. Resist psychological dark patterns. At the heart of each is a software freedom battle – keep fighting and we can win.

See also

Declaration of Digital Autonomy

Local-first software: You own your data, in spite of the cloud

The WWWorst App Store

March 25, 2021

Mike, the zink dev, mentioned that swiftshader seemed slow at some things, and I realised I've never spent much effort checking swiftshader vs llvmpipe in benchmarks.

The thing is, CPU rendering is pretty much going to top out on memory bandwidth fairly quickly, but I decided to do some rough napkin benchmarks using the Vulkan samples from Sascha Willems.

I'd also assumed that llvmpipe would be slower, since swiftshader has a few dedicated devs and is used by Google instead of Mesa for lots of things, while llvmpipe hasn't really had dedicated development resources.

I picked a random smattering of Vulkan samples and ran them on my Ryzen workstation without doing anything else, in their default window size.

The first number is lavapipe fps, the second swiftshader.

  • gears: 336 309
  • instancing: 3 3
  • ssao: 19 9
  • deferredmultisampling:  11 4
  • computeparticles: 9 8
  • computeshader: 73 57
  • computeshader sharpen: 54 34

I guess "swift" is just a good marketing name. Now I'm not sure why llvmpipe/lavapipe isn't more of a development target for those devs; imagine how much better it could be if it had full-time dedicated devs on it.

Enhance Your Pipe!

It’s no secret that CPU renderers are slower than GPU renderers. But at the same time, CPU renderers are crucial for things like CI and also not hanging your current session by testing on a live GPU when you’re deep into Critical Rewrites.

So I’ve spent some time today doing some rewrites in the *pipe section of mesa, and let’s just say that the pipe was good with zink before, but it’s much, much better now.

Here’s where I started on piglit’s drawoverhead test under Lavapipe:

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                 1517, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                 1519, 100.2%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                 1112, 73.3%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                  524, 34.5%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                  583, 38.4%

Nothing too incredible. Modern CPUs with a discrete GPU should be pulling upwards of 15,000k draws/s, and hitting 10% of that is sort of okay.jpg.

Pipe Faster!

But what if I implemented my Mesa multidraw extensions in Lavapipe?

goku.jpg

The gist of this extension is that when no other draw parameters have changed except for the starting vertex/index and the number of vertices/indices, an array of those values can just be dumped into the Vulkan driver to optimize out a lot of draw dispatch/validation code, mirroring Marek Olšák’s multidraw work in Gallium.
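To make that concrete, here is a minimal, self-contained sketch of what such an interface boils down to. The names are hypothetical, not the actual Mesa or Vulkan multidraw API; the point is that the state validation cost is paid once per array instead of once per draw.

#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch only: the names here are hypothetical, not the real
 * Mesa/Vulkan multidraw entry points; the point is the shape of the interface. */

struct multi_draw {
   uint32_t start; /* first vertex (or first index for indexed draws) */
   uint32_t count; /* number of vertices (or indices) */
};

/* Stand-in for the expensive per-draw work that multidraw amortizes away. */
static void validate_draw_state(void)
{
   /* bind pipeline, descriptors, vertex buffers, ... */
}

static void emit_draw(uint32_t start, uint32_t count)
{
   printf("draw: start=%u count=%u\n", (unsigned)start, (unsigned)count);
}

/* One validation pass, then a tight loop over the per-draw parameters. */
static void multi_draw(const struct multi_draw *draws, uint32_t num_draws)
{
   validate_draw_state();
   for (uint32_t i = 0; i < num_draws; i++)
      emit_draw(draws[i].start, draws[i].count);
}

int main(void)
{
   const struct multi_draw draws[] = { { 0, 3 }, { 3, 3 }, { 6, 3 } };
   multi_draw(draws, 3);
   return 0;
}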

The results, just from doing basic support in Lavapipe without adding any handling in LLVMpipe underneath, speak for themselves:

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                 2945, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                 2626, 89.2%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                 1702, 57.8%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                 2881, 97.8%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                 2888, 98.1%

Yup, that’s a 100% perf increase.

Pipe Beyond Space And Time!

But then I thought to myself: What if LLVMpipe could also handle me dumping all this info in?

Indeed, by spending way too much time rewriting and refactoring deep Mesa internals, I have gone beyond:

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                 5273, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                 4804, 91.1%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                 2941, 55.8%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                 5052, 95.8%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                 5057, 95.9%

A 3.5x performance boost in drawoverhead seemed pretty good, and it also cut my GLES3 CTS runtime by about 20%, so I think I’m done here for now.

Pipe Fight!

After my recent Swiftshader misadventure, wherein I discovered that it’s missing (at least) transform feedback and conditional render extension support, meaning that it can’t be used to create a legitimate GL 3.0+ context, esoteric rendering expert Dave Airlie took benchmarking matters into his own hands today to battle it out against the other name-brand Vulkan CPU renderer.

The results were shocking.

March 23, 2021

Iago recently wrote a blog post about performance improvements in the v3d compiler, and introduced our plans to improve pipeline caching (specifically of compiled shaders) (full blog post here). We merged some improvements recently, so let’s talk about that work.

Pipeline cache improvements

While analysing the performance of RBDOOM-3-BFG, we noticed that significant CPU time was being spent every frame linking shaders.

RBDOOM-3-BFG screenshot

After some investigation, we found that the game was calling ClearAttachment twice every frame. The implementation of those ClearAttachments relied on a full job with a graphics pipeline. On v3dv, by default, any pipeline is created with a pipeline cache (provided by the user, or a default one). On v3dv (and in general on any Vulkan driver) the main cached data is the compiled shaders, so the main objective of the pipeline cache is to avoid full shader re-compilation for compatible pipelines that are used often. So why was that time being spent linking shaders?

The issue was that for each lookup in the pipeline cache we were doing two cache lookups. The first was against a cache holding the shaders in NIR, the main intermediate representation for shaders in Mesa (more info about intermediate representations here). We then used those shaders to fill the key for a second cache lookup which, if successful, returned the compiled shader in Broadcom (QPU) assembly format.

The reason for this two-step lookup is that to compile a shader we call the Broadcom compiler, which is common to both OpenGL and Vulkan, and we use some data structures that contain info affecting the compilation (like whether blending is enabled). When we implemented pipeline cache support, for simplicity, we used the same data structures as part of the cache key. But as those keys were filled with info coming from the NIR shaders, the shaders needed to be linked together in the case of graphics pipelines.

When we analyzed how to improve this, we realized that in order to identify the compiled shader we don’t really need the NIR shaders, as the info derived from them is implicit in the SPIR-V shaders provided to create the pipeline. The NIR shaders are only really needed to compile the shader. The improvement was to use a different data structure as part of the cache key and replace the two cache lookups with a single one. We also needed some additional changes, as there were parts of the code that assumed the NIR shaders would be available; now, where possible, we skip generating them.
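To illustrate the shape of the change, here is a toy, self-contained sketch of the single-lookup flow; the names and data structures are hypothetical and far simpler than the real v3dv code.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy sketch of the single-lookup scheme described above; names and data
 * structures are hypothetical, not the actual v3dv implementation. */

struct cache_key {
   uint8_t  spirv_hash[20]; /* hash of the SPIR-V modules used to create the pipeline */
   uint32_t state_bits;     /* pipeline state that affects compilation (blending, etc.) */
};

struct cache_entry {
   struct cache_key key;
   const char *qpu_code;    /* stands in for the compiled Broadcom (QPU) shaders */
   int valid;
};

/* A one-entry "cache" keeps the sketch short; the real cache is a hash table. */
static struct cache_entry entry;

static const char *get_compiled_shaders(const struct cache_key *key)
{
   /* Hot path: a single lookup keyed on data derived from the SPIR-V and the
    * pipeline state. No NIR is generated and no shaders are linked on a hit. */
   if (entry.valid && memcmp(&entry.key, key, sizeof(*key)) == 0)
      return entry.qpu_code;

   /* Cold path only: SPIR-V -> NIR -> link -> compile to QPU assembly. */
   entry.key = *key;
   entry.qpu_code = "freshly compiled QPU shaders";
   entry.valid = 1;
   return entry.qpu_code;
}

int main(void)
{
   struct cache_key key = { .state_bits = 1 };
   printf("first call:  %s\n", get_compiled_shaders(&key)); /* miss: compiles */
   printf("second call: %s\n", get_compiled_shaders(&key)); /* hit: cached */
   return 0;
}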

Numbers

Let’s start by showing the improvement with a synthetic test. We took one CTS test using a complex shader and forced the pipeline to be re-created 1000 times, getting the following times:

  • Before this work, disabling the pipeline cache: 125.79 seconds
  • Before this work, enabling the pipeline cache (default): 11.41 seconds
  • With this work, enabling the pipeline cache: 0.87 seconds

That’s a clear improvement. So how about real applications? As mentioned, we started this work after analyzing RBDOOM-3-BFG’s use of ClearAttachment. That game got an improvement of ~1fps. But when we tested other games we didn’t get any improvement in that case. This is because using a full job for ClearAttachment isn’t the preferred option (it is in fact the slowest path), and because the other games are GPU limited.

But we got improvements in other cases. As mentioned, the advantage of creating a pipeline with a hot cache is that we avoid shader recompilation, so it is far faster. And there are some apps that create pipelines at runtime. We found that this happens with the Unreal Engine 4 demos, specifically the Shooter Game. So, for example, we start the demo like this:

UE4 demo screenshot, before shooting gun

and then we decide to shoot for the first time:

UE4 demo screenshot, while shooting gun

in order to render the shooting effect, new pipelines are created and several shaders are compiled. Due to all that extra work we experience a noticeable FPS hiccup. We can visualize it with this graph:

On that graph we can see how we go from ~25fps to ~2fps. That is the shooting moment.

For these cases, the ideal would be to start the game with a pipeline cache loaded with the outcome of previous executions of the game. So, adding a hack to simulate that situation, and focusing on the relevant stat in this case (the minimum FPS), we get the following:

  • Cold cache: ~2fps
  • Hot cache, before this work: ~6fps
  • Hot cache, with this work: ~8fps

on-disk-cache

As mentioned, the ideal would be for applications to use a pipeline cache when creating their pipelines, store its content on disk, and load it on following executions. Among other things, this is because the application has more control over what to store and load at each moment (for example, storing/loading the pipeline cache for a given level of a game).

The reality is that not all applications do that. As mentioned, the numbers above were simulating that ideal situation, but the application was not actually doing it. In order to mitigate that, we added support for an on-disk cache. Mesa provides a framework to store and load shaders on disk, so basically for any pipeline cache lookup we now have an extra fallback. In this case, for the UE4 Shooter demo with a hot on-disk cache, we get a minimum of ~7fps, which, as expected, is better than the situation before our work, but worse than the ideal case of the application handling the store/load of the pipeline cache on disk itself.

Some conclusions

So here are some conclusions from this work that apply to any Vulkan driver, not just v3dv in particular:

  • If possible, pre-create all the pipelines that you would use during your application runtime. Creating pipelines during runtime, even if cached, could lead to performance hiccups.
  • If that is not possible (for example, if the number of variable combinations is too high), use a pipeline cache and store/load it on disk, as in the sketch below. This reduces loading time and makes the runtime performance hiccups less noticeable. Note that the driver having support for a general on-disk cache doesn’t mean that v3dv will do this for you.
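For reference, here is a minimal sketch of the store/load pattern from the last point, using only core Vulkan calls. Error handling is omitted and the cache file path handling is up to the application.

#include <stdio.h>
#include <stdlib.h>
#include <vulkan/vulkan.h>

/* Minimal sketch of an application-managed on-disk pipeline cache.
 * Error handling is omitted and the file path is just an example. */

static VkPipelineCache load_pipeline_cache(VkDevice device, const char *path)
{
   void *data = NULL;
   size_t size = 0;
   FILE *f = fopen(path, "rb");
   if (f) {
      fseek(f, 0, SEEK_END);
      size = (size_t)ftell(f);
      fseek(f, 0, SEEK_SET);
      data = malloc(size);
      if (fread(data, 1, size, f) != size)
         size = 0; /* treat a short read as an empty cache */
      fclose(f);
   }

   VkPipelineCacheCreateInfo info = {
      .sType = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO,
      .initialDataSize = size, /* 0 on a cold first run: the cache starts empty */
      .pInitialData = data,
   };
   VkPipelineCache cache;
   vkCreatePipelineCache(device, &info, NULL, &cache);
   free(data);
   return cache;
}

static void save_pipeline_cache(VkDevice device, VkPipelineCache cache, const char *path)
{
   size_t size = 0;
   vkGetPipelineCacheData(device, cache, &size, NULL);
   void *data = malloc(size);
   vkGetPipelineCacheData(device, cache, &size, data);

   FILE *f = fopen(path, "wb");
   if (f) {
      fwrite(data, 1, size, f);
      fclose(f);
   }
   free(data);
}

/* Pass the cache handle to every vkCreate*Pipelines() call, and call
 * save_pipeline_cache() before shutdown so the next run starts hot. */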

March 18, 2021

Where We At

The blogging is happening once again, and it feels good.

I did a brief recap a couple posts ago, but I’ve gotten some sleep since then, and now it’s time for the real deal. Here’s what you missed while you were asleep and not following the Mesa RSS feed like all the coolest people I know:

  • zink-wip is shrinking. no, really. it’s down from close to 600 patches to around 200 after the past month
  • descriptor caching has landed, likely yielding around an 80-100% performance increase across the board for applications which were CPU-bound. huge, huge thanks to legendary reviewer, RADV developer, and all around great guy, Bas Nieuwenhuizen for tackling this way-too-fucking-big, 50+ patch MR in an amazingly short amount of time
  • no, zink still can’t render glxgears properly
  • zink now has lavapipe-based CI jobs going to prevent regressions!
  • Erik Faye-Lund has managed a one-liner MR that fixed over a hundred unit tests in CI
  • buffer invalidation is in, which lets zink avoid stalling the GPU in a lot of cases, also improving performance
  • with all this said, another giant thanks to l*v*pipe guru and part-time motivational graphics coach, Dave Airlie, who has reviewed nearly 200 of my patches in this release cycle
  • lavapipe is ramping up towards Vulkan 1.1 support, which will, other than being a great milestone and feat of software engineering, enable zink CI testing through GL 4.5
  • less than a hundred patches to go before zink’s threaded context support is up for merging

Good times.

March 17, 2021
I spent quite a bit of time getting a Sierra Wireless EM7345-LTE modem to work under Linux. So here are some quick instructions to help other people who may hit the same problem.

These modems are somewhat notorious for shipping with broken firmware. They work fine after a firmware upgrade, but under Windows they will only upgrade to "carrier approved" firmware versions, which requires being connected to the mobile network first so that the tool can identify the carrier. And with some carriers, connecting to the network does not work due to the broken firmware (ugh). There are a ton of forum threads on how to work around this under Windows, but they all require that you are at least able to register with the mobile network.

Luckily someone has figured out how to update these under Linux and posted instructions for it. The procedure is actually much more straightforward under Linux. The hardest part is extracting the firmware from the Windows driver download.

One problem is that the necessary Intel FlashTool is no longer available for download from Intel. I needed this tool a while ago for something else, and back then I used the PhoneFlashTool_5.8.4.0.rpm file from https://androiddatahost.com/nm466 . The rpm-file in the zip there has a sha256sum of: c5b964ed4fae470d1234c9bf24e0beb15a088a73c7e8e6b9369a68697020c17a

I now see that Intel seems to be offering this for download itself again; you can find it here: https://github.com/projectceladon/tools (projectceladon appears to be an official Intel project). Note this is not the version which I used; I used the PhoneFlashTool_5.8.4.0.rpm version.

Once you have the Intel FlashTool installed, just follow the posted instructions, and after that your modem should start working under Linux.

Last year I was working on Turnip driver development as my daily job at Igalia. One of those tasks was implementing support for the VK_KHR_depth_stencil_resolve extension.

VK_KHR_depth_stencil_resolve

This extension adds support for automatically resolving multisampled depth/stencil attachments in a subpass, in a similar manner as for color attachments. It does not, however, add support for resolving MSAA depth/stencil images with the vkCmdResolveImage() command.

As you can imagine, this extension is easy for any application to use. Unless you are using a driver that supports Vulkan 1.2 or higher (VK_KHR_depth_stencil_resolve was promoted to core in Vulkan 1.2), you should first check that it is supported. Once that is done, ask the driver which features are supported by extending VkPhysicalDeviceProperties2 with a VkPhysicalDeviceDepthStencilResolveProperties structure when calling vkGetPhysicalDeviceProperties2().

struct VkPhysicalDeviceDepthStencilResolveProperties {
    VkStructureType       sType;
    void*                 pNext;
    VkResolveModeFlags    supportedDepthResolveModes;
    VkResolveModeFlags    supportedStencilResolveModes;
    VkBool32              independentResolveNone;
    VkBool32              independentResolve;
}

This structure will be filled in by the driver to indicate the different depth/stencil resolve modes it supports (more info about their meaning and possible values in the spec).
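For example, the query could look roughly like this (a sketch assuming a VkPhysicalDevice obtained earlier and a driver exposing the extension or Vulkan 1.2):

#include <stdio.h>
#include <vulkan/vulkan.h>

/* Query which depth/stencil resolve modes the driver supports; assumes
 * physical_device was obtained earlier via vkEnumeratePhysicalDevices(). */
static void query_ds_resolve_props(VkPhysicalDevice physical_device)
{
   VkPhysicalDeviceDepthStencilResolveProperties ds_resolve_props = {
      .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_DEPTH_STENCIL_RESOLVE_PROPERTIES,
   };
   VkPhysicalDeviceProperties2 props2 = {
      .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2,
      .pNext = &ds_resolve_props,
   };

   vkGetPhysicalDeviceProperties2(physical_device, &props2);

   printf("supportedDepthResolveModes:   0x%x\n",
          (unsigned)ds_resolve_props.supportedDepthResolveModes);
   printf("supportedStencilResolveModes: 0x%x\n",
          (unsigned)ds_resolve_props.supportedStencilResolveModes);
   printf("independentResolveNone: %u, independentResolve: %u\n",
          (unsigned)ds_resolve_props.independentResolveNone,
          (unsigned)ds_resolve_props.independentResolve);
}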

Next step: just fill a VkSubpassDescriptionDepthStencilResolve struct with the resolve mode wanted and the depth/stencil attachment used for resolve. Then, extend the VkSubpassDescription2 struct with it. And that’s all.

struct VkSubpassDescriptionDepthStencilResolve {
    VkStructureType                  sType;
    const void*                      pNext;
    VkResolveModeFlagBits            depthResolveMode;
    VkResolveModeFlagBits            stencilResolveMode;
    const VkAttachmentReference2*    pDepthStencilResolveAttachment;
}
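Putting that together, the subpass setup could look roughly like the sketch below. Only the resolve-related fields are shown; the attachment references and the rest of the render pass creation (vkCreateRenderPass2) are assumed to be set up elsewhere.

#include <vulkan/vulkan.h>

/* Sketch of wiring up a depth/stencil resolve in a subpass. Only the
 * resolve-related fields are filled in here. */
static void fill_ds_resolve_subpass(VkSubpassDescription2 *subpass,
                                    VkSubpassDescriptionDepthStencilResolve *ds_resolve,
                                    const VkAttachmentReference2 *msaa_ds_ref,
                                    const VkAttachmentReference2 *resolve_ds_ref)
{
   ds_resolve->sType = VK_STRUCTURE_TYPE_SUBPASS_DESCRIPTION_DEPTH_STENCIL_RESOLVE;
   ds_resolve->pNext = NULL;
   /* Pick modes reported in supportedDepthResolveModes/supportedStencilResolveModes. */
   ds_resolve->depthResolveMode = VK_RESOLVE_MODE_SAMPLE_ZERO_BIT;
   ds_resolve->stencilResolveMode = VK_RESOLVE_MODE_SAMPLE_ZERO_BIT;
   ds_resolve->pDepthStencilResolveAttachment = resolve_ds_ref;

   subpass->sType = VK_STRUCTURE_TYPE_SUBPASS_DESCRIPTION_2;
   subpass->pNext = ds_resolve; /* this is the "extend VkSubpassDescription2" step */
   subpass->pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS;
   subpass->pDepthStencilAttachment = msaa_ds_ref; /* the multisampled attachment */
   /* color/input attachments and the rest of the render pass setup omitted */
}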

Turnip implementation

Implementing this extension in Turnip was more or less straightforward, although there were some issues to fix.

Support for all depth/stencil formats was added in both resolve paths used by the driver, one for sysmem (system memory) and the other for gmem (tile buffer), including flushing the depth cache when needed. However, for the VK_FORMAT_D32_SFLOAT_S8_UINT format, which has the stencil part in a separate plane, specific code was needed to extend the 2D path used in the driver for sysmem resolves.

However, the main issue appeared when testing the extension implementation on the Adreno 630 GPU. It turned out that the VK-GL-CTS tests exercising this extension for the MSAA VK_FORMAT_D24_UNORM_S8_UINT format were always failing, except when disabling UBWC via the TU_DEBUG=noubwc environment variable. UBWC (Universal Bandwidth Compression) is a HW feature designed to improve throughput to system memory by minimizing the bandwidth of data (which also gives some power savings). The problem was that UBWC support for the MSAA VK_FORMAT_D24_UNORM_S8_UINT format is known to be failing on Adreno 630 and Adreno 618 (see the merge request for the freedreno driver). I just needed to disable it in Turnip to fix these failures.

I also found other VK_KHR_depth_stencil_resolve CTS tests failing: the ones testing format compatibility for the VK_FORMAT_D32_SFLOAT_S8_UINT and VK_FORMAT_D24_UNORM_S8_UINT formats. For the VK_FORMAT_D32_SFLOAT_S8_UINT failures, we needed to take into account the particularity that it has a separate plane for the stencil part when resolving it to VK_FORMAT_S8_UINT. For the VK_FORMAT_D24_UNORM_S8_UINT failures, the problem was that we were setting the resolve mode used by the HW incorrectly: it was doing a sample average when we wanted to use the value of sample 0. This merge request fixed both issues.

And that’s all. This was an extension that allowed me to dive into the different resolve paths used by Turnip and learn one or two things about the HW ;-) Thanks a lot to Jonathan Marek for his reviews and suggestions to improve the implementation of this extension.

Happy hacking!

March 16, 2021

Lately, I have been looking at improving performance of the V3DV Vulkan driver for the Raspberry Pi 4. So far we had been toying a lot with some Vulkan ports of the Quake trilogy but we wanted to have a look at more modern games as well, and for that we started to look at some Unreal Engine 4 samples, particularly the Shooter demo.


Unreal Engine 4 Shooter Demo

In our initial tests with “High” settings, even at 480p we were running the sample at 8-15 fps, with 720p being mostly in the 5-10 fps range. Not great, obviously, but a good opportunity to guide and focus our performance efforts.

One aspect of the UE4 sample that was immediately obvious compared to the Quake games is that the shading is a lot more expensive in general, and more specifically, it involves more texture lookups and UBO loads, which require expensive accesses to memory from the shader cores, so this was the first area we targeted. The good thing about this is that because our Vulkan and OpenGL drivers share the compiler stack, any optimizations we do here benefit both drivers.

What follows is a brief discussion of some of the optimizations we did recently to our backend compiler and the results we have observed from this work.

Optimizing the backend compiler

So the first thing we tackled was better managing the latency of texture lookups. Interestingly, the NIR scheduler was setting this up so that we would try to put instructions that consumed the result of a texture lookup as far away as possible from the instruction that triggered the lookup, but then the backend compiler was not fully taking advantage of this and would still end up waiting on lookup results sooner than it should.

Fixing this helped performance by 1%-3%, although it could go a bit above that in some cases. It has a caveat though: doing this extends the liveness of our lookup sequences, and that makes spilling more difficult (we can’t emit spills/unspills in the middle of an outstanding memory lookup), so when register pressure is high enough that we need to inject register spills to compile the shader, we would typically be a lot more constrained and end up producing significantly worse code, or even worse, failing to register allocate for the shader completely. To avoid this, we recompile the shader without the optimization if we detect that we need to do any spilling. One can use V3D_DEBUG=perf to detect if this is happening for any shaders, looking for messages like this:

Falling back to strategy 'disable TMU pipelining' for MESA_SHADER_FRAGMENT.

While the above optimization was useful, I was expecting it to make a larger impact for this particular demo, so I kept looking for ways to make our memory lookups more efficient. One thing that is relevant to this analysis is that we were using the same hardware unit for both texture and UBO lookups, but for the latter we could really use a different strategy by handling our UBOs as uniform streams. This has some caveats, but making a long story short: given that many UBO loads use uniform addresses, that we usually read a bunch of consecutive scalars from a UBO load (such as a full vec4), and that applications usually emit UBO loads for nearby addresses together, we can emit fairly optimal code for many of these, leading to more efficient memory access in general.

Eventually, this turned out to be a big win, yielding 20%-30% improvements for the Shooter demo even with the initial basic implementation, which we would then tune and optimize further.

Again related to memory accesses, I have also been improving how we schedule the instructions involved in setting up memory lookups. Our scheduler was more restrictive here than it needed to be, and the extra flexibility can help reduce instruction counts, which will benefit these more modern games most, as they are more likely to emit a larger number of texture/image operations in their shaders.

Another thing we did was to improve instruction packing. The V3D GPU is able to emit multiple instructions in the same cycle so long as the instructions meet some requirements. Turns out our assembly scheduler was being too restrictive with this and we could do better. Going by shader-db results, this led to ~5% less instructions on our programs and added another modest 1%-2% performance improvement for the Shooter demo.

Another issue we noticed is that a lot of the shaders were passing a lot of varyings from the vertex to the fragment shader, and our setup for fragment shader inputs was not optimal. The issue here was that there is a specific instruction involved in this process that writes two registers, one of them with one instruction of delay, and our dependency tracking was not able to handle this properly, effectively assuming that both registers are written in the same instruction, which then had an impact on how we scheduled instructions. Fixing this required handling this case specially in our scheduler so we could be more effective at scheduling these instructions in a way that enables optimal pipelining of the instructions involved with varying setup for fragment shaders.

Another aspect we improved was related to our uniform handling. Generally, there are many instances in which we need to emit duplicate uniforms. There are reasons for that related to how the GPU works, but in some cases, for example with consecutive UBO loads, we would emit the uniform with the base address of the UBO multiple times very close to each other. We can obviously do better by trying to track previous uses of a uniform/constant value and reusing them in nearby instructions. Of course, this comes at the expense of increasing register pressure (for reasons beyond the scope of this post our shaders typically require a lot of uniforms), so there is a balancing game to play here too. Reducing the size of our uniform streams this way also plays an important role in lowering some of the CPU overhead of the driver, since these streams need to be rebuilt often when certain pipeline states change.

Finally, an optimization that was more targeted at reducing register pressure than at improving performance: we noticed that we would sometimes put instructions far away from their consumers with no obvious benefit. This was sometimes bad enough that it would even cause us to be unable to compile some shaders. Mesa has a handy NIR pass for this called nir_opt_sink, which proved to be helpful here. It allowed us to get a few more shaders from shader-db to compile and reduced spilling for a bunch of shaders. For the Shooter demo, it changed a large compute shader involved in histogram post-processing from 48 spills and 50 fills to only 8 spills and 15 fills. While the impact of this on performance is probably very small since the game only runs this pass once per frame, it made a notable difference in loading time, since compiling a shader with this much spilling is very slow at present. I have a mental note to improve this some day; I know Intel managed to fix this for their compiler, but for the time being, this alone managed to make the loading time much more reasonable.

Results

First, here are some shader-db stats, which describe how these optimizations change various statistics for a large collections of real world shaders:

total instructions in shared programs: 14992758 -> 13731927 (-8.41%)
instructions in affected programs: 14003658 -> 12742827 (-9.00%)
helped: 80448
HURT: 4297

total threads in shared programs: 407932 -> 412242 (1.06%)
threads in affected programs: 4514 -> 8824 (95.48%)
helped: 2189
HURT: 34

total uniforms in shared programs: 4069524 -> 3790401 (-6.86%)
uniforms in affected programs: 2267834 -> 1988711 (-12.31%)
helped: 40588
HURT: 1186

total max-temps in shared programs: 2388462 -> 2322009 (-2.78%)
max-temps in affected programs: 897803 -> 831350 (-7.40%)
helped: 30598
HURT: 2464

total spills in shared programs: 6241 -> 5940 (-4.82%)
spills in affected programs: 3963 -> 3662 (-7.60%)
helped: 75
HURT: 24

total fills in shared programs: 14587 -> 13372 (-8.33%)
fills in affected programs: 11192 -> 9977 (-10.86%)
helped: 90
HURT: 14

total sfu-stalls in shared programs: 28106 -> 31489 (12.04%)
sfu-stalls in affected programs: 16039 -> 19422 (21.09%)
helped: 4382
HURT: 6429

total inst-and-stalls in shared programs: 15020864 -> 13763416 (-8.37%)
inst-and-stalls in affected programs: 14028723 -> 12771275 (-8.96%)
helped: 80396
HURT: 4305

Lower is better for all stats, except threads. We can see significant improvements across the board: we generally produce shaders with fewer instructions and lower maximum register pressure, we reduce spills and uniform counts, and we can run with more threads. We are only worse at stalls, but that is generally because we now produce more compact code with fewer instructions, so more stalls are expected.

Another good thing in these stats is the large number of helped shaders compared to hurt shaders, meaning that it is very likely that these optimizations will help most applications to some extent.

But enough boring compiler statistics; most people won’t care about those, and what they want to know is how this impacts performance in actual games and applications, which is what the following graph shows (these numbers were obtained by replaying specific traces with gfx-reconstruct). Keep in mind that while I am using a collection of Vulkan samples/games here, these optimizations are expected to apply to OpenGL too.


Before vs After

Framerate improvement after optimization (in %)

As can be observed from the graph above, this optimization work made a significant impact on the observed framerate in all cases. It is not surprising that the UE4 demo is the one that sees the most improvement, considering it is the one we used to guide most of the optimization work.

Other optimizations and future work

In this post I have been focusing exclusively on compiler optimizations, but we have also been improving other parts of the Vulkan driver. While I won’t go into details to avoid making this post too long, we have also been improving aspects of the driver involved with buffer to image copies, depth buffer clears, dirty descriptor state management, usage of the TFU unit for transfer operations and more.

Finally, there is one other aspect of this UE4 demo that is pretty obvious as soon as you start a game: it can compile a lot of shaders in the middle of the gameplay loop, which can lead to significant stutter. While there is not much we can do about this on the driver side, adding support for a shader cache on disk should eliminate the problem on sessions after the first, so this is something that we may work on in the future.

We will certainly continue to look at improving the driver performance in the future so stay tuned for further updates on our progress or maybe join us at #videocore on Freenode.

Superposition

I got a weird bug report the other day. Apparently Unigine Superposition, the final boss of Unigine benchmarks, was broken in zink(-wip). And it was a regression, which meant that at some point it had worked, but because testing is hard, it then stopped working for a while.

The Problem (superposition wtf u doin?)

I had no idea what the problem was other than a vague impression it might be queue-related given the perf output in the bug report. The first step here was to figure out what I was dealing with. Off I went to the two horsemen of all queue-related bugs: zink_flush() and zink_fence_finish().

Here’s the latter of the two:

static bool
zink_fence_finish(struct zink_screen *screen, struct pipe_context *pctx, struct zink_tc_fence *mfence,
                  uint64_t timeout_ns)
{
   pctx = threaded_context_unwrap_sync(pctx);
   struct zink_context *ctx = zink_context(pctx);

   if (screen->device_lost)
      return true;

   if (pctx && mfence->deferred_ctx == pctx && mfence->deferred_id == ctx->curr_batch) {
      zink_context(pctx)->batch.has_work = true;
      /* this must be the current batch */
      pctx->flush(pctx, NULL, !timeout_ns ? PIPE_FLUSH_ASYNC : 0);
      if (!timeout_ns)
         return false;
   }

   /* need to ensure the tc mfence has been flushed before we wait */
   bool tc_finish = tc_fence_finish(ctx, mfence, &timeout_ns);
   struct zink_fence *fence = mfence->fence;
   if (!tc_finish || (fence && !fence->submitted))
      return fence ? p_atomic_read(&fence->completed) : false;

   /* this was an invalid flush, just return completed */
   if (!mfence->fence)
      return true;
   /* if the zink fence has a different batch id then it must have completed and been recycled already */
   if (mfence->fence->batch_id != mfence->batch_id)
      return true;

   return zink_vkfence_wait(screen, fence, timeout_ns);
}

In short, the control flow goes:

  • if the GPU rejected a previous cmdbuf, fail
  • detect and then submit deferred flushes (i.e., the app has requested a sync object and at some later point decided to wait on it)
  • verify that the cmdbuf represented by this fence has actually been submitted
  • detect cases where the internal fence (mfence->fence) no longer belongs to the gallium sync object (mfence)
  • finally, attempt to actually wait on the fence

So this is all fine and good, and it’s been working terrifically for many months. But I put a printf at the top, and it turns out this was being spammed nonstop during load with the app constantly checking the same sync object to see if it was finished.

It wasn’t finished.

Specifically, the return fence ? p_atomic_read(&fence->completed) : false; case was being hit, which led me to dig deeper into what was going on.

Another printf in zink_flush() (function contents omitted to spare the eyes of anyone under the legal drinking age) revealed that the sync object reaching zink_fence_finish was, in fact, the second one created by the app, which was then being waited upon twenty-something flushes later.

So in this case, mfence->deferred_id was something like 2 and the current batch id in ctx->curr_batch was around 20. The last-completed batch id was also around 20.

Revision

A slight modification here resolved the issue:

if (pctx && mfence->deferred_ctx == pctx) {
   if (mfence->deferred_id == ctx->curr_batch) {
      zink_context(pctx)->batch.has_work = true;
      /* this must be the current batch */
      pctx->flush(pctx, NULL, !timeout_ns ? PIPE_FLUSH_ASYNC : 0);
      if (!timeout_ns)
         return false;
   }
   /* this batch is known to have finished */
   if (mfence->deferred_id <= screen->last_finished)
      return true;
}

This way, cases where an app is randomly waiting on something that has finished so far in the past that there’s no longer any trace of it can just become a no-op success, and everything is happy.

Probably.

Testing is ongoing, but it seems like everything is happy now.

March 15, 2021

As we are heading towards April and the release of Fedora Workstation 34 I wanted to post an update on what we are working on for this release and what we are looking at going forward. 2020 was a year where we focused a lot on polishing what we had and getting things past the finish line and Fedora Workstation 34 is going to be the culmination of that effort in many ways.

Wayland:
The big ticket item we have wanted to close off on was Wayland, because while Wayland has been production ready for most of us for a while, there were still some cases it didn’t cover as well as X.org. The biggest of these was of course the lack of accelerated XWayland support with the binary NVidia driver. Fixing that issue of course wasn’t something we could do ourselves, but we have been working diligently with our friends at NVidia to help ensure everything was in place for them to enable that support in their driver, so I have been very happy to see the public reports confirming that NVidia will have accelerated 3D in the summer release of their driver. The other Wayland area we have put a lot of effort into has been the work undertaken by Jonas Ådahl to get headless display support working with Wayland. This is a critical feature for people who for instance want a desktop instance on their servers or in the cloud, accessed through things like VNC or RDP, to use for sysadmin related tasks. Jonas spent a lot of time laying the groundwork for this over the course of last year and we are now in the final stages of merging the patches to enable this feature in GNOME and Wayland in preparation for Fedora Workstation 34. Once those two items are out we consider our Wayland ramp-up/rollout to be complete, so while there will of course continue to be bugfixes and new features implemented, that will be part of a natural evolution of Wayland and not part of a ‘close gaps with X11’ effort like now.

PipeWire
Another big ticket item we are hoping to release fully in Fedora Workstation 34 is PipeWire. PipeWire, as most of you know, is the engine we use to deal with handling video streams in a secure and shareable way in Fedora Workstation, so when you interact with your web camera(s), do screen casting or take screenshots it is all routed and handled by PipeWire. But in Fedora Workstation 34 we are aiming to also switch to using PipeWire for audio, to replace both PulseAudio and Jack. For those of you who have read any previous blog post from me, you will know what an important step forward this will be, as we would finally be making the pro-audio community first class citizens in Fedora Workstation and Linux in general. When we decided to try to switch to PipeWire in Fedora Workstation 34, I have to admit I was a little skeptical about whether we would be able to get all things ready in time, as there are so many things that need to be tested and fixed when you switch out such a critical component. Up to that point we had a lot of people interested in PipeWire but only limited community involvement; however, I feel the announcement of bringing in PipeWire for Fedora Workstation 34 galvanized the community around the project, and we now have a very active community around PipeWire in #pipewire on the freenode IRC network. Not only is Wim Taymans getting a ton of help with testing and verification, but we also see a steady stream of patches coming in, with for instance improved Bluetooth audio support being contributed; in fact I believe that PipeWire will be able to usher in better Bluetooth audio support in Fedora than we ever had before, with great support for high quality Bluetooth audio codecs like LDAC.

I am especially happy to see so many of the key members of the pro-audio Linux community taking part in this effort, and am of course also happy to see many pro-audio folks testing Fedora Workstation for the first time because of it. The community is working closely with Wim to test and verify as many important pro-audio applications as possible and to update Fedora packaging as needed to ensure they can transition from Jack to PipeWire without dependency conflicts or issues. One last item to mention here is that you might have seen that Red Hat is getting into the automotive space. I can’t share a lot of details about that effort, but one thing I can say is that PipeWire will be a core part of it, and thus we will soon be looking to hire more engineers to work on PipeWire, so if that is of interest to you be sure to track my twitter feed or blog as I will announce our job openings there as they become available. For the community at large this should be great too, as it means that we can get a lot of synergy between automotive and the desktop around audio and video handling.

It is still somewhat of an open question whether we end up actually switching to PipeWire in Fedora Workstation 34, but things are looking good at this point in time, and in the worst case scenario it will be in place for Fedora Workstation 35.

Toolbox
Toolbox is another effort that is in a great spot now. Toolbox is our tool for making working with pet containers a breeze for developers. The initial version was prototyped quickly by writing it as a shell script, but we spent time last year getting it rewritten in Go in order to make it possible to keep expanding the project and allow us to implement all the things we envision for it. With that done, feature work is now in focus again, and Ondřej Michal has done some great work making it possible to set up RHEL containers in Toolbox. This means that you can run Fedora on your laptop and get the latest and greatest features that way, but you can do your development in a RHEL pet container, so you get an environment identical to what your applications will see once they are deployed into the cloud or onto company hardware. This gives you the best of both worlds in my opinion: the fast moving Fedora Workstation that brings the most out of your laptop and desktop hardware, but still easy access to the RHEL platform for development and testing. You can even test this today on Fedora Workstation 33, just open a terminal and type ‘toolbox create --distro rhel --release 8.3‘. The resulting toolbox will then be based on RHEL and not Fedora and thus perfect for doing RHEL targeted development. You will need to use the subscription-manager tool to register it (be sure to register on developer.redhat.com for your free RHEL developer subscription). Over time we hope to integrate this into GNOME Online Accounts like we do for the RHEL virtual machines you can set up with GNOME Boxes, so that once you set up your RHEL account you can create RHEL virtual machines and RHEL containers easily on Fedora Workstation.

Toolbox with RHEL

Toolbox pet container with RHEL UBI

Flatpak
Owen Taylor has been doing some incredible work behind the scenes for the last year trying to ensure the infrastructure we have in RHEL and Fedora provides a great integrated Flatpak experience. As we move forward we expect Flatpaks to be the primary packaging format that Fedora users consume their applications in, but to make that a reality we needed to ensure the experience is good both for Fedora maintainers and for the end users. So one of the big ticket items Owen has been working on is getting incremental updates working in Fedora. If you have used applications from Flathub you have probably noticed that their updates are small and nimble despite being packaged as Flatpak containers, while the Fedora Flatpaks cause big updates each time. The reason for this is that the Fedora Flatpaks are shipped as OCI (Open Container Initiative) images, while the Flatpaks on Flathub are shipped as OSTree repositories (if you don’t know OSTree, think of it as git for binaries). Shipping the Flatpaks as OCI images has the advantage of being the same format we at Red Hat use for our kubernetes/docker/openshift containers, and thus it allows us to reuse a lot of the work that Red Hat has put into ensuring we can provide and keep such containers up to date and secure, but the downside until now has been that these containers were shipped in a way which caused each update, no matter how small the change, to be a full re-download of the whole image. Well, Owen Taylor and Alex Larsson worked together to resolve this and came up with a method to allow incremental updates of such containers and thus bring the update sizes in line with what you see on Flathub for Flatpaks. This should be deployed in time for Fedora Workstation 34, and we also hope to eventually deploy it for kubernetes/docker containers too. Finally, to make even more applications available, we are doing work to enable people to get access to Flathub.org out of the box in Fedora when you enable 3rd party repositories in initial setup, so that your out of the box application selection will be even bigger.

Flathub frontpage

Flathub webpage

GNOME 40
Another major change in Fedora Workstation 34 is GNOME 40, which contains a revamp of the GNOME 3 user interface. This was a collaborative effort between a lot of GNOME 3 stakeholders, with Allan Day representing Red Hat. This was also an effort by the GNOME design community to up their game, and thus as part of the development process the GNOME Foundation paid a professional company to do user testing on the proposed changes and some of the alternatives. This means that the changes were verified to actually be experienced as an improvement by the experienced GNOME user participants and to feel intuitive for new users of GNOME. One advantage we have in Fedora is that we don’t do major tweaking of the GNOME user interface, which means that once Fedora Workstation 34 ships you are set to enjoy the new GNOME experience from day one. For long time GNOME users I hope and expect that the updates will be a welcome refresh, and at the same time that the changes provide an easier onramp for new GNOME and Fedora Workstation users. Some of the early versions did lead some long term fans of how multi-monitor support in GNOME 3 worked to be a bit concerned, but be assured that multi-monitor is a critical use case in our opinion and something we have been looking at and will keep looking at improving. In fact Allan Day wrote a great blog post about GNOME 40 multi-monitor support recently to explain what we are doing and how we see it evolving going forward.

Input
Another area where we keep putting in a lot of effort is input. Thanks to Peter Hutterer and Benjamin Tissoires we keep making sure Fedora Workstation and the world of Linux keep having access to the newest and best in input. The latest effort they have been working on is enabling haptic touchpads. Haptic touchpads should be familiar to people who have tried Apple hardware, but they are expected to appear in force on laptops in general this year, so we have been putting in the effort to ensure that we can support this new type of device as they come out. So if you see a laptop you want with a haptic touchpad, then Fedora Workstation should be ready for it, but of course until these devices are commonplace and we have had a chance to test and verify, I can make no guarantees.

Another major effort that we undertook in relation to input was to move the GNOME input handling to a separate thread. Carlos Garnacho worked on this patch to make that happen. This should provide a smoother experience with Fedora Workstation 34, as it means the mouse should not stall due to the main thread running Wayland being busy. This was done as part of the overall performance work we have been continuously doing over the last years to address performance issues and make Fedora and GNOME have the best performance and performance-related behaviour possible.

Lenovo Laptops
So one of the major announcements of last year was Lenovo laptops with Fedora Linux pre-installed. There are currently two models available with Linux, the X1 Carbon and the Lenovo P1. Between them they cover the two most common requests we see: an ultra-lightweight laptop with the X1 and a more powerful ‘portable workstation’ model with the P1. We are working on a couple more models and also on getting them sold globally, which was a key goal of the effort. Be aware that both models are on sale as I am writing this (hopefully still true when you read this), so it is a good time to grab a great laptop with a great OS.

Lenovo P1

Lenovo P1

Vision
So one thing I wanted to do is tie the work we do in Fedora Workstation together by articulating what we are trying to achieve. Fedora has for the longest time been the place where Linux as an operating system is evolving and being developed. There are very few major innovations that have come to Linux that haven’t been developed and incubated in Fedora by Fedora contributors, including of course Red Hat. This includes things like the Linux Vendor Firmware Service, Wayland, Flatpak, SilverBlue, PipeWire, SystemD, flicker free boot, HiDPI support, gaming mouse support and so much more. We have always done this work in close cooperation with the upstreams we are collaborating with, which is why the patch delta in any given Fedora release is small. We work hard to get improvements into the upstream kernel and into GNOME and so on right away, to avoid needing to ship downstream patches in Fedora. That of course saves us from having to maintain temporary forks, but more importantly it is the right way to collaborate with an open source community.

So looking back to when we launched Fedora Workstation, we realized that being at the front like that had come at the cost of not being stable and user friendly. So the big question we tried to ask ourselves when launching Fedora Workstation, and the question that still drives a lot of our decision making and focus, is: how can we preserve being the heart and center of Linux OS development, but at the same time provide end users with a stable and well functioning system? To achieve that we have made a lot of changes over the last years, ranging from some policy changes in terms of how and when we bring changes into Fedora, to, maybe even more importantly, focusing a lot on figuring out ways to reduce the challenges caused by a rapidly evolving OS, like the introduction of Flatpaks to allow applications to be developed and released without strong ties to the host system libraries, the concepts we are maturing in Silverblue around image based operating systems, or how we are looking at pet container development with Toolbox. All of these things combined remove a lot of the fragility we have seen in Linux up to this point and instead let us treat the rapidly evolving Linux landscape as a strength.

So where we are today is that I think we are very close to realizing the vision of letting Fedora be the place where exciting new stuff happens, yet at the same time providing the robustness and polish that end users need to be able to use it as their daily driver. It has been my daily driver for many years now, and judging by the rapid growth of users we have seen in Fedora over the last 5 years, I think that is true for a lot of other people too. The goal is to allow the wider community around Linux, especially the developers, sysadmins and creators relying on Linux to do their work, to come to Fedora and interact and collaborate with the developers working on the OS itself to the benefit of all. You all are probably better judges than me of whether we are succeeding with that, but I do take the increased chatter about and adoption of Fedora by a lot of people doing Linux related podcasts, news sites and so on as a sign that we are succeeding. And PipeWire for me is a perfect example of how this can look, where we want to bring the pro-audio creators in to Fedora Workstation and let them interact and work closely with Wim Taymans and the PipeWire development team to make the experience even better for themselves and their fellow creators, and at the same time give them a great stable platform to create their music on.

March 12, 2021

A New Post

Blogging is hard.

There’s a big switch that a brain needs to do in order to go from ramming code into a project at full speed to carefully crafting an internet post that potentially people might want to read, and some days the effort required to make that switch exceeds the available energy.

Such is (at least one of) the reason(s) why more graphics driver-y people don’t blog more, since it’s taxing enough for me to try and get posts out and I mostly just post memes.

But this time is definitely going to be a return to the normalcy that is posting more than once per week. The key is just to start posting and keep doing it.

So what’s been going on since the last post?

Well.

Lots.

Since Our Last Correspondence…

  • Some number of patches landed
    • zink is now conformant for the enhanced layouts tests in CTS (KHR-GL46.enhanced_layouts*)
  • Mesa 21.0 is going out (has already gone out? I’m bad at time) with GL 4.1 enabled in zink
    • enjoy crashes and GPU hangs in style with a driver that nobody else is using!
  • Definitely other stuff that’s good and useful but isn’t in the scope of this post

Easing Back In

Tomorrow’s post is going to be back to the usual. In fact, I’m going to be posting about some tc-related stuff, and now that I’ve said it, I can’t back out.

March 10, 2021

Over the last year and then some, we at Collabora have been working with Microsoft on their D3D12 mapping layer, which I announced in my previous blog post. In July, Louis-Francis Ratté-Boulianne wrote an update on the status on the Collabora blog, but a lot has happened since then, so it’s time for another update.

Two major things have happened since then: we have passed the OpenGL 3.3 conformance tests, and we have upstreamed the code into Mesa 3D.

Photoshop support

It might not be a big surprise, but one of the motivations for this work was to be able to run applications like Photoshop on Windows devices without full OpenGL support.

I’m happy to report that Microsoft has released their compatibility pack that uses our work to provide OpenGL (and OpenCL) support, and Photoshop can now run on Windows on ARM CPUs! It is pretty exciting to see high-profile applications like that benefit from our work!

OpenGL 3.3 Conformance Test Suite

First of all, I would like to point out that having passed the OpenGL CTS isn’t necessarily the same as being formally conformant. There are some complicated details about formally becoming conformant with a layered implementation, and I’ll leave the question of formal conformance up to Microsoft and Khronos.

Instead, I want to talk a bit about passing the OpenGL CTS.

Challenges for layered implementations

A problem we face with layered implementations is that we are subject to a few more sources of issues, some of which are entirely out of our control. A normal, non-layered OpenGL implementation has two primary sources of issues:

  1. The OpenGL driver
  2. The hardware

Issues with the OpenGL driver itself that lead to tests failing are always required to be fixed before results are submitted. Issues with the hardware generally require software workarounds, but this is not always feasible, so Khronos has a system where a vendor can file a waiver for a hardware issue; if approved, they can mark the test-failures as waived, and the appropriate failures will be ignored.

So far so good.

But for our layered implementations, our world looks a bit different. We don’t really see the hardware, but instead we see D3D12 and the D3D12 driver. This means our sources of issues are:

  1. The OpenGL driver
  2. The D3D12 run-time
  3. The D3D12 vendor-driver
  4. The hardware

For the OpenGL driver, the story is the same as for a non-layered implementation, but from there on things start changing.

Problems in the D3D12 run-time must also be fixed before submitting results. We work together with Microsoft to get these issues fixed as appropriate. Such fixes can take a while to trickle all the way into a Windows build and to end-users, but they will eventually show up.

But for the D3D12 vendor-driver and below, things get complicated. First of all, it’s not always possible for us to tell vendor-driver issues and hardware issues apart. And worse, as these are developed by third party companies, we have little insight there. We can’t affect their priorities, so it’s hard to know when or even if an issue gets resolved.

It’s also not really a good idea to work around such issues, because if they turn out to be fixable software problems, we don’t know when they will be fixed, so we can’t really tell when to disable the work-around. We also don’t know exactly what combination of hardware and software these issues apply to.

But there’s one case where we have full insight, and that’s when the D3D12 vendor-driver is WARP, a high-performance software rasterizer. Because that component is developed by Microsoft, and we have channels to report issues and even make sure they get resolved!

Bugs, bugs, bugs

When developing something new, there’s always going to be bugs. But usually also when using something existing in a new way. We encountered a lot of bugs on our way, and here’s a quick overview over some of them. This is in no way exhaustive, and most of our own bugs are not that interesting. So this is mostly about problems unique to layered implementations.

64-bit shifts

It turned out early on that the DXIL validator in D3D12 had a requirement when parsing the LLVM bitcode that shift amounts were always 32-bit values. While this seems fine by itself, LLVM requires that all operands to binops have the same bit-size. This obviously meant that only 32-bit shifts could pass the validator.

Microsoft quickly removed this requirement once we figured out what was going on.

Aligned block-compressed textures

In D3D12, one requirement for block-compressed textures is that the base-level is aligned to the block-size. This requirement does not apply to mip-levels, and OpenGL has no such requirement. This isn’t technically speaking a bug, but a documented limitation in DirectX.

It turns out that this limitation was an artificial historical left-over, and after a bunch of testing (and fixing of WARP), we got this limitation lifted. Great :)

D3D12 vendor-driver bugs

Something that has been much more frustrating is bugs in the vendor-drivers. The problem here is that even though we have channels to file bugs, we don’t have any influence or even insight into their prioritization and release schedule.

I think it suffices to say that there have been several bugs reported to all vendors we’ve actively been running the OpenGL CTS on top of. We believe fixes are underway for at least some of these, but we can’t make any promises here.

Current status

Right now, the only configurations we’re cleanly passing the OpenGL 3.3 CTS on are WARP (which became conformant on November 24th, 2020), and NVIDIA (which became conformant on February 26th, 2021).

Having these multiple independent implementations of DirectX drivers passing in conjunction with the Mesa/D3D12 layer shows that we are able to implement GLon12 in a vendor-neutral way, which allowed us to bring the layer to conformance. Many thanks to Khronos for their assistance through this process.

We’ve also submitted results on top of an Intel GPU, but that submission has been halted due to failures, and will as far as I know be updated as soon as Intel publish updated drivers.

The conformance tests have been run against our downstream fork, which is no longer actively maintained, because:

Upstreaming

The D3D12 driver was upstreamed in Mesa in Merge-Request 7477, and the OpenCL compiler followed in Merge-Request 7565. There have been a lot more merge-requests since then, and even more are expected in the future.

The process of upstreaming the driver into Mesa3D went relatively smoothly, but there were quite a lot of regressions that happened quickly after we upstreamed the code, so to keep this from becoming a big problem we’ve added the D3D12 driver to Mesa’s set of GitLab CI tests. We now build and test the D3D12 driver on top of WARP on CI, as well as running some basic sanity-tests for the OpenCL compiler.

All in all, this seems to work very well right now, and we’re looking forward to the future. Next step, WSL support!

I’m no longer working full-time on this project; instead I’m trying to take some of the lessons learned and apply them to Zink. I’m sure there’s even more room for code-reuse than what we currently have, but it will probably take some time to figure out.

March 08, 2021

b4 was created by Konstantin Ryabitsev and has become a very frequently used tool for me.

It supports a lot of different ways for interacting with the Linux Kernel mailing lists. Of these the b4 am subcommand is what I primarily use. This subcommand downloads all of the patches belonging to a patch series and drops them into a .mbox file. But! It doesn't apply them to the repository we're currently in, and herein lies the itch that I would like to scratch.

The inspiration for this post is the script that @stellarhopper authored and @widawsky pointed out to me.

The Good, the Bad & the Ugly

After first publishing this post, people on the twittersphere suggested some alternative approaches, and it would seem that there …

March 03, 2021

One of the best decisions I made in my life was joining Igalia in 2012. Inside Igalia, I have been working on different open-source projects, most of the time related to graphics technologies, interacting with different communities, giving talks, organizing conferences and, more importantly, contributing to free software as my daily job.

Now I’m thrilled to announce that we are hiring for our Graphics team!

Igalia

Right now we have two open positions:

  • Graphics Developer.

    We are looking for candidates that would like to contribute to open-source OpenGL/Vulkan graphics drivers (Mesa), or other areas of the open-source graphics stack such as X11 or Wayland, among others. If you have experience with them, or you are very motivated to become an expert there, just send us your CV!

  • Kernel Developer.

    We are looking for candidates that either have experience with kernel development or can ramp up quickly to contribute to Linux kernel drivers. Although no specific subsystem is mentioned in the job position, I encourage you to apply if you have DRM experience and/or ARM[64]/MIPS related knowledge.

Graphics technologies are not your cup of tea? We have positions in other areas like browsers, compilers, multimedia… Just check out our job offers on our website!

What we offer is to work in an open-source consultancy in which you can participate equally in the management and decision-making process of the company via our democratic, consensus-based assembly structure. As all of our positions are remote-friendly, we welcome submissions from any part of the world.

Are you still a student? We have launched the 2021 edition of our Coding Experience program. Check it out!

Igalia's office

February 27, 2021

It is extremely rare that a hobby software project of mine gets completed, but now it has happened. Behold! Fourbyfour!

Have you ever had to implement a mathematical algorithm, say, matrix inversion? You want it to be fast and measuring the speed is fairly simple, right. But what about correctness? Or precision? Behavior around inputs that are on the edge? You can hand-pick a few example inputs, put those into your test suite, and verify the result is what you expect. If you do not pick only trivial inputs, this is usually enough to guarantee your algorithm does not have fundamental mistakes. But what about those almost invalid inputs, can you trust your algorithm to not go haywire on them? How close to invalid can your inputs be before things break down? Does your algorithm know when it stops working and tell you?

Inverting a square matrix requires that the inverse matrix exists to begin with. Matrices that do not mathematically have an inverse matrix are called singular. Can your matrix inversion algorithm tell you when you are trying to invert a matrix that cannot be inverted, or does it just give you a bad result pretending it is ok?

Working with computers often means working with floating-point numbers. With floating-point, the usual mathematics is not enough; it can actually break down. You calculate something and the result a computer gives you is total nonsense, like 1+2=2 in spirit. In the case of matrix inversion, it's not enough that the input matrix is not singular mathematically; it needs to be "nice enough" numerically as well. How do you test your matrix inversion algorithm with this in mind?

These questions I tried to answer with Fourbyfour. The README has the links to the sub-pages discussing how I solved this, so I will not repeat it here. However, as the TL;DR, if there is one thing you should remember, it is this:

    Do not use the matrix determinant to test if a matrix is invertible!

Yes, the determinant is zero for a singular matrix. No, close to zero determinant does not tell you how close to singular the matrix is. There are better ways.
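To make that concrete, here is a tiny illustration of mine (not from Fourbyfour): scaling a perfectly well-conditioned matrix shrinks its determinant arbitrarily without bringing it any closer to singular.

#include <stdio.h>

int main(void)
{
    /* A = 0.001 * I (4x4) is about as far from singular as a matrix gets:
     * its condition number is exactly 1 and its inverse is simply 1000 * I.
     * Yet det(A) = 0.001^4 = 1e-12, so a "determinant is close to zero"
     * check would flag it as (nearly) singular. */
    double s = 1e-3;
    double det = s * s * s * s;

    printf("det = %g, yet the matrix is perfectly invertible\n", det);
    return 0;
}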

However, the conclusion I came to is that if you want a clear answer for a specific input matrix (is it invertible?), the only way to know for sure is to actually invert it, multiply the input matrix with the inverse you computed, and measure how far off from the identity matrix the product is. Of course, you also need to set a threshold for how close to the identity matrix is close enough for your application, because with numerical algorithms you will almost never get the exact answer. Also, pick an appropriate matrix norm for the matrix difference.
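A minimal sketch of that verification step (my code, not Fourbyfour's; it assumes a 4x4 row-major layout and uses the element-wise maximum purely as an example of picking a norm):

#include <math.h>
#include <stdbool.h>

/* Multiply two 4x4 row-major matrices: out = a * b */
static void mat4_mul(double out[16], const double a[16], const double b[16])
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++) {
            double sum = 0.0;
            for (int k = 0; k < 4; k++)
                sum += a[r * 4 + k] * b[k * 4 + c];
            out[r * 4 + c] = sum;
        }
}

/* Check an inversion result by measuring how far a * a_inv is from the
 * identity matrix. 'threshold' is application-specific. */
static bool inverse_is_good_enough(const double a[16], const double a_inv[16],
                                   double threshold)
{
    double prod[16];
    double err = 0.0;

    mat4_mul(prod, a, a_inv);
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++) {
            double ideal = (r == c) ? 1.0 : 0.0;
            double d = fabs(prod[r * 4 + c] - ideal);
            if (d > err)
                err = d;
        }
    return err <= threshold;
}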

The reason for this conclusion is what one of the tools I wrote tells me about a matrix that would be typical for a display server with two full-HD monitors. The matrix is simply the pixel offset of the second monitor on the desktop. The analysis of the matrix is the example I used to demonstrate fourbyfour-analyse. If you read through it, you should be shocked. The mathematics, as far as I can understand, seems to tell us that if you use 32-bit floating-point, inverting this matrix gives us a result that leads to no correct digits at all. Obviously this is nonsense, the inverse is trivial and algorithms should give the exact correct result. However, the math does not lie (unless I did). If I did my research right, then what fourbyfour-analyse tells us is true, with an important detail: it is the upper error bound. It guarantees that we cannot get errors larger than that (heh, zero correct digits is pretty hard to make much worse). But I also read that there is no better error bound possible for a generic matrix inversion algorithm. (If you take the obvious-to-human constraints into account that those elements must be one and those must be zero, the analysis would likely be very different.) Therefore the only thing left to do is to actually go on with the matrix inversion and then verify the result.

Here is a list of the cool things the Fourbyfour project does or has:

  • Generates random matrices arbitrarily close to singular in a controlled way. If you simply generated random matrices for testing, they would almost never be close to singular. With this code, you can define how close to singular you want the matrices to be, to really torture your inversion algorithms.
  • Generates random matrices with a given determinant value. This is orthogonal to choosing how close to singular the generated matrices are. You can independently pick the determinant value and the condition number, and have the random matrices have both simultaneously.
  • Plot a graph about a matrix inversion algorithm's behavior when inputs get closer to singular, to see exactly when it breaks down.
  • A tutorial on how mathematical matrix notation works and how it relates to row- vs. column-major layouts (spoiler: it does not).
  • A comparison between Graphene and Weston matrix inversion algorithms.
In this project I also tried out several project quality assurance features:

  • Use Gitlab CI to run the test suite for the main branch, tags, and all merge requests, but not for other git branches.
  • Use Freedesktop ci-templates to easily generate the Docker image in CI under which CI testing will run.
  • Generate LCOV test code coverage report from CI.
  • Use reuse lint tool in CI to ensure every single file has a defined, machine-readable license. Using well-known licenses clearly is important if you want your code to be attractive. Fourbyfour also uses the Developer Certificate of Origin.
  • Use ci-fairy to ensure every commit has Signed-off-by and every merge request allows maintainer pushes.
  • Good CI test coverage. Test even the pseudo-random number generator in the test suite that it roughly follows the intended distribution.
  • CONTRIBUTING file. I believe that every open source project regardless of size needs this to set people's expectations when they see your project, whether you expect or accept contributions or not.
I'm really happy this project is now "done", well, version 1.0.0 so to say. One thing I have realized is still missing is a determinant sweep mode. The precision testing mode sweeps over condition numbers and allows plotting the inversion behavior. It should have another mode where the sweep controls the determinant value, with some fixed condition number for the random test matrices. This determinant mode could point out inversion algorithms that use the determinant value for matrix singularity testing and show how it leads to completely arbitrary results.

If you want to learn about numerical methods for matrices, I recommend the book Gene H. Golub, Charles F. van Loan, Matrix Computations. The Johns Hopkins University Press. I used the third edition, 1996, when implementing the Weston matrix inversion years ago.

February 26, 2021

Introduction

In a now pretty well established tradition on my part, I am posting on things I no longer work on!

I gave a talk on modifiers at XDC 2017 and at Linux Plumbers 2017 (audio only). It was always my goal to have a blog post accompany the work. Relatively shortly after the talks, I ended up leaving graphics, and so it dropped down the priority list.

I'm splitting this up into two posts. This post will go over the problem, and solutions. The next post will go over the implementation details.

Modifiers

Each 3d computational unit in an Intel GPU is called an Execution Unit (EU). Aside from what you might expect them to do, like execute shaders, they may be used for copy operations (itself a shader), or compute operations (also, shaders). All of these things require memory bandwidth in order to complete their task in a timely manner.

Modifiers were the chosen solution in order to allow end to end renderbuffer [de]compression to work, which is itself designed to reduce memory bandwidth needs in the GPU and display pipeline. End to end renderbuffer compression simply means that through all parts of the GPU and display pipeline, assets are read and written to in a compression scheme that is capable of reducing bandwidth (more on this later).

Modifiers are a relatively simple concept. They are modifications that are applied to a buffer's layout. Typically a buffer has a few properties, width, height, and pixel format to name a few. Modifiers can be thought of as ancillary information that is passed along with the pixel data. It will impact how the data is processed or displayed. One such example might be to support tiling, which is a mechanism to change how pixels are stored (not sequentially) in order for operations to make better use of locality for caching and other similar reasons. Modifiers were primarily designed to help negotiate modified buffers between the GPU rendering engine and the display engine (usually by way of the compositor). In addition, other uses can crop up such as the video decode/encode engines.
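For a feel of what that ancillary layout information looks like in practice, here is roughly how a modifier rides along with a buffer when registering a KMS framebuffer through libdrm. This is a sketch on my part, not code from this post; the fd, handle and pitch are placeholders for a real DRM device and buffer object.

#include <stdint.h>
#include <drm_fourcc.h>
#include <xf86drmMode.h>

/* Register a framebuffer whose layout is described by a modifier. */
static int add_fb_with_modifier(int fd, uint32_t width, uint32_t height,
                                uint32_t handle, uint32_t pitch,
                                uint32_t *fb_id)
{
    uint32_t handles[4] = { handle };
    uint32_t pitches[4] = { pitch };
    uint32_t offsets[4] = { 0 };
    /* Same width, height and pixel format as before -- the modifier is the
     * extra piece of layout information, e.g. Y-tiled instead of linear. */
    uint64_t modifiers[4] = { I915_FORMAT_MOD_Y_TILED };

    return drmModeAddFB2WithModifiers(fd, width, height, DRM_FORMAT_XRGB8888,
                                      handles, pitches, offsets, modifiers,
                                      fb_id, DRM_MODE_FB_MODIFIERS);
}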

A Waste of Time and Gates

My understanding is that even now, 3 years later, full modifier support isn't readily available across all corners of the graphics ecosystem. Many hardware features are being entirely unrealized. Upstreaming sweeping graphics features like this one can be very time consuming and I seriously would advise hardware designers to take that into consideration (or better yet, ask your local driver maintainer) before they spend the gates. If you can make changes that don't require software, just do it. If you need software involvement, the longer you wait, the worse it will be.

They weren't new even when I made the presentation 3.5 years ago.

commit e3eb3250d84ef97b766312345774367b6a310db8
Author: Rob Clark <robdclark@gmail.com>
Date:   6 years ago

    drm: add support for tiled/compressed/etc modifier in addfb2

I managed to land some stuff:

commit db1689aa61bd1efb5ce9b896e7aa860a85b7f1b6
Author: Ben Widawsky <ben@bwidawsk.net>
Date:   3 years, 7 months ago

    drm: Create a format/modifier blob

Admiring the Problem

A back-of-the-envelope bandwidth requirement for a midrange Skylake GPU from the time can be calculated relatively easily. 4 years ago, at the frequencies we ran our GPUs and given their ISA, we could expect roughly 1GBs for each of the 24 EUs.

A 4k display:

3840px × 2160rows × 4Bpp × 60HZ = 1.85GBs

24GBs + 1.85GBs = 25.85GBs

This by itself will oversaturate single channel DDR4 bandwidth (which was what was around at the time) at the fastest possible clock. As it turns out, it gets even worse with compositing. Most laptops sporting a SKL of this range wouldn't have a 4k display, but you get the idea.

The picture (click for larger SVG) is a typical "flow" for a composited desktop using direct rendering with X or a Wayland compositor using EGL. In this case, drawing a Rubik's cube looking thing into a black window.

Admiring the problem

Using this simple Rubik's cube example I'll explain each of the steps so that we can understand where our bandwidth is going and how we might mitigate that. This is just the overview, so feel free to move on to the next section. Since the example will be trivial, and the window is small (and only a singleton) it won't saturate the bandwidth, but it will demonstrate where the bandwidth is being consumed, and open up a discussion on how savings can be achieved.

Rendering and Texturing

For the example, no processing happens other than texturing. In a simple world, the processing of the shader instructions doesn't increase the memory bandwidth cost. As such, we'll omit that from the details.

The main steps on how you get this Rubik's cube displayed are

  • upload a static texture
  • read from static texture
  • write to offscreen buffer
  • copy to output frame
  • scanout from output

More details below...

Texture Upload

Getting the texture from the application, usually from disk, into main memory, is what I'm referring to as texture upload. In terms of memory bandwidth, you are using write bandwidth to write into the memory.

Assets are transferred from persistent storage to memory

Textures may either be generated by the 3d application, which would be trivial for this example, or they may be authored using a set of offline tools and baked into the application. For any consequential use, the latter is predominantly used. Certain surface types are often dynamically generated though, for example, the shadow mapping technique will generate depth maps. Those dynamically generated surfaces actually will benefit even more (more later).

This is pseudo code (but close to real) to upload the texture in OpenGL:

const unsigned height = 128;
const unsigned width = 64;
const void *data = ... // rubik's cube
GLuint tex;

glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, width, height, 0, GL_RGB, GL_UNSIGNED_BYTE, data);
glGenerateMipmap(GL_TEXTURE_2D);

I'm going to punt on explaining mipmaps, which are themselves a mechanism to conserve memory bandwidth. If you have no understanding, I'd recommend reading up on mipmaps. This wikipedia article looks decent to me.

Texture Sampling

Once the texture is bound, the graphics runtime can execute shaders which can reference those textures. When the shader requests a color value (also known as sampling) from the texture, it's likely that the calculated coordinate within the texture will be in between pixels. The hardware will have to return a single color value for the sample point, and the way it interpolates is chosen by the graphics runtime. This is referred to as a filter.

Texture Fetch/Filtering
  • Nearest: Take the value of the single closest pixel; no interpolation.
  • Bilinear: Take the surrounding 4 pixels and interpolate based on the distance to the texture coordinate (see the sketch after this list for what that costs in fetches).
  • Trilinear: Bilinear, but also interpolate between the two closest mipmaps. (I skipped discussing mipmaps, but part of the texture fetch involves finding the nearest miplevel.)
  • Anisotropic: It's complicated. Let's say 16x trilinear.
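To see why the filter modes multiply the fetch bandwidth, here's a toy CPU-side sketch of bilinear filtering (my illustration, not from the original post); note the four texel reads for a single sample:

#include <stdint.h>

struct texture {
    unsigned width, height;
    const uint8_t *texels;      /* RGBA8, row-major */
};

static float texel_r(const struct texture *t, unsigned x, unsigned y)
{
    /* one memory read per texel touched */
    return t->texels[(y * t->width + x) * 4] / 255.0f;
}

/* Bilinear sample of the red channel at a fractional texel coordinate:
 * four texel fetches, blended by distance. Nearest would be one fetch,
 * trilinear eight (two mip levels), anisotropic many more. */
static float sample_bilinear_r(const struct texture *t, float u, float v)
{
    unsigned x0 = (unsigned)u, y0 = (unsigned)v;
    unsigned x1 = x0 + 1 < t->width  ? x0 + 1 : x0;
    unsigned y1 = y0 + 1 < t->height ? y0 + 1 : y0;
    float fx = u - (float)x0, fy = v - (float)y0;

    float top = texel_r(t, x0, y0) * (1.0f - fx) + texel_r(t, x1, y0) * fx;
    float bot = texel_r(t, x0, y1) * (1.0f - fx) + texel_r(t, x1, y1) * fx;
    return top * (1.0f - fy) + bot * fy;
}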

Here's the GLSL to fetch the texture:

#version 330

uniform sampler2D tex;
in vec2 texCoord;
out vec4 fragColor;

void main() {
    vec4 temp = texelFetch(tex, ivec2(texCoord), 0);
    fragColor = temp;
}

The above actually does something that's perhaps not immediately obvious. fragColor = temp;. This actually instructs the fragment shader to write out that value to a surface which is bound for output (usually a framebuffer). In other words, there are two steps here, read and filter a value from the texture, write it back out.

The part of the overall diagram that represents this step:

Composition

In the old days of X, and even still when not using the composite extension, the graphics applications could be given a window to write the pixels directly into the resulting output. The X window manager would mediate resize and move events, letting the client update as needed. This has a lot of downsides which I'll say are out of scope here. There is one upside that is in scope though, there's no extra copy needed to create the screen composition. It just is what it is, tearing and all.

If you don't know if you're currently using a compositor, you almost certainly are using one. Wayland only composites, and the number of X window managers that don't composite is very few. So what exactly is compositing? Simply put, it's a window manager that marshals frame updates from clients and is responsible for drawing them on the final output. Often the compositor may add its own effects such as the infamous wobbly windows. Those effects themselves may use up bandwidth!

Compositor
Simplified compositor block diagram

Applications will write their output into what's referred to as an offscreen buffer. 👋👋 The compositor will read the output and copy it into what will become the next frame. What this means from a bandwidth consumption perspective is that the compositor will need to use both read and write bandwidth just to build the final frame. 👋👋
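A naive CPU model of that composition copy (purely illustrative; a real compositor does this on the GPU) makes the read-plus-write cost explicit:

#include <stdint.h>
#include <string.h>

/* Copy a client's offscreen buffer into the frame being assembled.
 * Every pixel is read once from the client buffer and written once to the
 * frame, so composition costs read bandwidth plus write bandwidth. */
static void composite_window(uint32_t *frame, unsigned frame_stride,
                             const uint32_t *window, unsigned win_w,
                             unsigned win_h, unsigned dst_x, unsigned dst_y)
{
    for (unsigned row = 0; row < win_h; row++) {
        const uint32_t *src = window + row * win_w;                   /* read  */
        uint32_t *dst = frame + (dst_y + row) * frame_stride + dst_x; /* write */
        memcpy(dst, src, win_w * sizeof(*dst));
    }
}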

Display

It's the mundane part of this whole thing. Pixels are fetched from memory and pushed out onto whatever display protocol.

Display Engine

Perhaps the interesting thing about the display engine is it has fairly isochronous timing requirements and can't tolerate latency very well. As such, it will likely have a dedicated port into memory that bypasses arbitration with other agents in the system that are generating memory traffic.

Out of scope here but I'll briefly mention, this also gets a bit into tiling. Display wants to read things row by row, whereas rendering works a bit different. In short this is the difference between X-tiling (good for display), and Y-tiling (good for rendering). Until Skylake, the display engine couldn't even understand Y-tiled buffers.
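As a purely illustrative aside (this is a toy layout, not Intel's actual X/Y tile geometry), the difference between linear and tiled addressing looks something like this:

#include <stdint.h>

/* Byte offset of pixel (x, y) in a linear 4Bpp surface: consecutive x's are
 * adjacent, so scanning row by row (what display wants) streams nicely. */
static uint32_t linear_offset(uint32_t x, uint32_t y, uint32_t stride)
{
    return y * stride + x * 4;
}

/* Byte offset in a toy 32x32-pixel tiled layout: all pixels of a tile are
 * stored together, so a rendering access pattern that bounces around a small
 * 2D region stays within one tile and its cachelines. */
static uint32_t tiled_offset(uint32_t x, uint32_t y, uint32_t width_in_tiles)
{
    const uint32_t tile_px = 32;                    /* 32x32 pixels per tile */
    const uint32_t tile_bytes = tile_px * tile_px * 4;
    uint32_t tile_x = x / tile_px, tile_y = y / tile_px;
    uint32_t in_x = x % tile_px, in_y = y % tile_px;

    return (tile_y * width_in_tiles + tile_x) * tile_bytes +
           (in_y * tile_px + in_x) * 4;
}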

Summing the Bandwidth Cost

Running through our 64x64 example...

Operation | Color Depth | Description | Bandwidth | R/W
Texture Upload | 1Bpc (RGBX8) | File to DRAM | 16KB (64 × 64 × 4) | W
Texel Fetch (nearest) | 1Bpc | DRAM to Sampler | 16KB (64 × 64 × 4) | R
FB Write | 1Bpc | GPU to DRAM | 16KB (64 × 64 × 4) | W
Compositing | 1Bpc | DRAM to DRAM | 32KB (64 × 64 × 4 × 2) | R+W
Scanout | 1Bpc | DRAM to PHY | 16KB (64 × 64 × 4) | R

Total = (16 + 16 + 16 + 32 + 16) × 60Hz = 5.625MBs

But actually, Display Engine will always scanout the whole thing, so really with a 4k display:

Total = (16 + 16 + 16 + 32 + 32400) × 60Hz = 1.9GBs
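If you want to reproduce those totals, here's a throwaway calculation of mine using the same 1024-based units as the tables:

#include <stdio.h>

int main(void)
{
    const double hz = 60.0;
    /* per-frame costs in KB, from the table above */
    const double upload = 16, fetch = 16, fb_write = 16, composite = 32;
    const double scanout_64 = 16;                               /* 64x64x4   */
    const double scanout_4k = 3840.0 * 2160.0 * 4.0 / 1024.0;   /* ~32400 KB */

    double small = (upload + fetch + fb_write + composite + scanout_64) * hz;
    double fullscreen = (upload + fetch + fb_write + composite + scanout_4k) * hz;

    printf("small scanout: %.3f MBs\n", small / 1024.0);               /* 5.625 */
    printf("4k scanout:    %.2f GBs\n", fullscreen / 1024.0 / 1024.0); /* ~1.9  */
    return 0;
}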

Don't forget about those various filter modes though!

Filter Mode | Multiplier (texel fetch) | Total Bandwidth
Bilinear | 4x | 11.25MBs
Trilinear | 8x | 18.75MBs
Aniso 4x | 32x | 63.75MBs
Aniso 16x | 128x | 243.75MBs

Proposing some solutions

Without actually doing the math, I think cache is probably the biggest win you can get. One spot where caching could help is that the framebuffer write step followed by the composition step could avoid the trip to main memory. Another is the texture upload and fetch. Assuming you don't blow out your cache, you can avoid the main memory trip.

While caching can buy you some relief, ultimately you have to flush your caches to get the display engine to be able to read your buffer. At least as of 2017, I was unaware of an architecture that had a shared cache between display and 3d.

Also, cache sizes are limited...

Wait for DRAM to get faster

Instead of doing anything, why not just wait until memory gets higher bandwidth?

Here's a quick breakdown of the progression at the high end of the specs. For the DDR memory types, I took a swag at the number of populated channels. For a fair comparison, the expectation with DDR is you'll have at least dual channel nowadays.

Bandwidth

Looking at the graph it seems like the memory vendors aren't hitting Moore's Law any time soon, and if they are, they're fooling me. A similar chart should be made for execution unit counts, but I'm too lazy. A Tigerlake GT2 has 96 EUs. If you go back to our back-of-the-envelope calculation we had a mid range GPU at 24 EUs, so that has quadrupled. In other words, the system architects will use all the bandwidth they can get.

Improving memory technologies is vitally important, it just isn't enough.

TOTAL SAVINGS = 0%

Hardware Composition

One obvious place we want to try to reduce bandwidth is composition. It was after all the biggest individual consumer of available memory bandwidth.

With composition as we described earlier, there was presumed to be a single plane. Software would arrange the various windows onto the plane, which if you recall from the section on composition added quite a bit to the bandwidth consumption, then the display engine could display from that plane.

Hardware composition is the notion that each of those windows could have a separate display plane, directly write into that, and all the compositor would have to do is make sure those display planes occupied the right part of the overall screen. It's conceptually similar to the direct scanout we described earlier in the section on composition.
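At the KMS level, putting a client's buffer on its own plane boils down to something like the following libdrm call (a rough sketch of mine; the plane, CRTC and framebuffer IDs would come from real enumeration):

#include <stdint.h>
#include <xf86drmMode.h>

/* Put a client's framebuffer directly on a hardware plane at (x, y) on the
 * CRTC, so no compositor copy is needed for that window. The source
 * coordinates are in 16.16 fixed point. */
static int show_window_on_plane(int fd, uint32_t plane_id, uint32_t crtc_id,
                                uint32_t fb_id, int32_t x, int32_t y,
                                uint32_t w, uint32_t h)
{
    return drmModeSetPlane(fd, plane_id, crtc_id, fb_id, 0 /* flags */,
                           x, y, w, h,            /* destination on the CRTC */
                           0, 0, w << 16, h << 16 /* source rect, 16.16 */);
}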

Operation | Color Depth | Description | Bandwidth | R/W
Compositing | 1Bpc | DRAM to DRAM | 32KB (64 × 64 × 4 × 2) | R+W

TOTAL SAVINGS = 1.875MBs (33% savings)

Hardware Composition Verdict

33% savings is really really good, and certainly if you have hardware with this capability, the driver should enable it, but there are some problems that come along with this that make it not so appealing.

  1. Hardware has a limited number of planes.
  2. Formats. One thing I left out about the compositor earlier is that one of the things it may opt to do is convert the application's window into a format that the display hardware understands. This means some amount of negotiation has to take place so the application knows about this. Prior to this work, that wasn't in place.
  3. Doesn't reduce any other parts of the process, i.e. a full screen application wouldn't benefit at all.

Texture Compression

So far, in order to solve the not-enough-bandwidth problem, we've tried to add more bandwidth and to reduce usage with hardware composition. The next place to go is to try to tackle the bandwidth consumed by texturing.

If you recall, we split texturing up into two stages: texture upload and texture fetch. This third proposed solution attempts to reduce bandwidth by storing a compressed texture in memory. Texture upload will compress it while uploading, and texture sampling can understand the compression scheme and avoid doing all the lookups. Compressing the texture usually comes with some nearly imperceptible degradation. In terms of the sampling, it's a bit handwavy to say you reduce by the compression factor, but for simplicity's sake, let's say that's what it does.

Some common formats at the time of the original materials were

Format | Compression Ratio
DXT1 | 8:1
ETC2 | 4:1
ASTC | Variable, 6:1
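In API terms, opting into one of these formats is just a different upload path. Here's a rough sketch of mine mirroring the earlier upload pseudo-code, using DXT1 (it assumes the EXT_texture_compression_s3tc extension is available, and a zeroed buffer stands in for data that would really be compressed offline by the asset pipeline):

const unsigned height = 128;
const unsigned width = 64;
/* DXT1 stores each 4x4 block of pixels in 8 bytes, hence the 8:1 ratio
 * versus RGBA8. */
static const unsigned char dxt1_data[(64 / 4) * (128 / 4) * 8];
const unsigned dxt1_size = sizeof(dxt1_data);
GLuint tex;

glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glCompressedTexImage2D(GL_TEXTURE_2D, 0, GL_COMPRESSED_RGB_S3TC_DXT1_EXT,
                       width, height, 0, dxt1_size, dxt1_data);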

Using DXT1 as an example of the savings:

Operation | Color Depth | Bandwidth | R/W
Texture Upload | DXT1 | 2KB (64 × 64 × 4 / 8) | W
Texel Fetch (nearest) | DXT1 | 2KB (64 × 64 × 4 / 8) | R
FB Write | 1Bpc | 16KB (64 × 64 × 4) | W
Compositing | 1Bpc | 32KB (64 × 64 × 4 × 2) | R+W
Scanout | 1Bpc | 16KB (64 × 64 × 4) | R

Here's an example with the simple DXT1 format:

Texture Compression Verdict

Texture compression solves a couple of the limitations that hardware composition left. Namely it can work for full screen applications, and if your hardware supports it, there isn't a limit to how many applications can make use of it. Furthermore, it scales a bit better because an application might use many many textures but only have 1 visible window.

There are of course some downsides.

Click for SVG

For comparison, here is the same cube scaled down with an 8:1 ratio. As you can see DXT1 does a really good job.

Scaled cube

We can't ignore the degradation though as certain rendering may get very distorted as a result.

  • Lossy (perhaps).
  • Hardware compatibility. Application developers need to be ready and able to compress to different formats and select the right things at runtime based on what the hardware supports. This takes effort both in the authoring, as well as the programming.
  • Patents.
  • Display doesn't understand this, so you need to decompress before the display engine will read from a surface that has this compression.
  • Doesn't work well for all filtering types, like anisotropic.

*TOTAL SAVINGS (DXT1) = 1.64MBs (30% savings)

*total savings here is kind of a theoretical max

End to end lossless compression

So what if I told you there was a way to reduce your memory bandwidth consumption without having to modify your application, without being subject to hardware limits on planes, and without having to wait for new memory technologies to arrive?

End to end lossless compression attempts to provide both "end to end" and "lossless" compression transparently to software. Explanation coming up.

End to End

As mentioned in the previous section on texture compression, one of the pitfalls is that you'd have to decompress the texture in order for it to be used outside of your 3d engine. Typically this would mean the display engine scanning out from it, but you could also envision a case where perhaps you'd like to share these surfaces with the hardware video encoder. The nice thing about this "end to end" attribute is every stage we mentioned in previous sections that required bandwidth can get the savings just by running on hardware and drivers that enable this.

Lossless

Now, because this is all transparent to the running application, a lossless compression scheme has to be used so that there aren't any unexpected results. While lossless might sound great on the surface (why would you want to lose quality?), it reduces the potential savings because lossless compression algorithms are always less efficient than lossy ones, but it's still a pretty big win.

What's with the box, bro?

I want to provide an example of how this can be possible. Going back to our original image of the full picture, everything looks sort of the same. The only difference is there is a little display engine decompression step, and all of the sampler and framebuffer write steps now have a little purple box accompanying them.

One sort of surprising aspect of this compression is it reduces bandwidth, not overall memory usage (that's also true of the Intel implementation). In order to store the compression information, hardware carves off a little bit of extra memory which is referenced for each operation on a texture (yes, that might use bandwidth too if it's not cached).

Here's a made-up implementation which tracks state in a similar way to Skylake era hardware, but the rest is entirely made up by me. It shows that even a naive implementation can get up to a lossless 2:1 compression ratio. Remember though, this comes at the cost of adding gates to the design and so you'd probably want something better performing than this.

2:1 compression

Everything is tracked as cacheline pairs. In this example we have state called "CCS". For every pair of cachelines in the image, 2b are in this state to track the current compression. When the pair of cachelines uses 12 or fewer colors (which is surprisingly often in real life), we're able to compress the data into a single cacheline (state becomes '01'). When the data is compressed, we can reassemble the image losslessly from a single cacheline, this is 2:1 compression because 1 cacheline gets us back 2 cachelines worth of pixel data.

Walking through the example we've been using of the Rubik's cube.

  1. As the texture is being uploaded, the hardware observes all the runs of the same color and stores them in this compressed manner by building the lookup table. On doing this it modifies the state bits in the CCS to be 01 for those cachelines.
  2. On texture fetch, the texture sampler checks the CCS. If the encoding is 01, then the hardware knows to use the LUT mechanism instead for all the color values.
  3. Throughout the rest of rendering, steps 1 & 2 are repeated as needed.
  4. When display is ready to scanout the next frame, it too can look at the CCS to determine if there is compression, and decompress as it's doing the scanout.

The memory consumed is minimal, which also means that any bandwidth usage overhead is minimal. In the example we have a 64x128 image. In total that's 512 cachelines. At 2 bits per 2 cachelines, the CCS for the example would fit in a single cacheline: 512 / 2 × 2b = 512b = 64B
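Here's a tiny sketch of how a driver might read that per-pair CCS state, continuing the made-up scheme above (my code, not Intel's actual layout):

#include <stdint.h>

/* 2 bits of CCS state per pair of 64B cachelines; '01' means the pair was
 * compressed into a single cacheline, '00' is the clear-color case from the
 * footnote below, anything else is uncompressed in this made-up scheme. */
static unsigned ccs_state(const uint8_t *ccs, unsigned cacheline)
{
    unsigned pair = cacheline / 2;   /* which cacheline pair we're in    */
    unsigned byte = pair / 4;        /* four 2-bit entries fit in a byte */
    unsigned shift = (pair % 4) * 2;

    return (ccs[byte] >> shift) & 0x3;
}

/* For the 64x128 RGBA8 example: 64 * 128 * 4 bytes = 512 cachelines,
 * so 256 pairs * 2 bits = 512 bits = 64 bytes of CCS for a 32KB surface. */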

* Unless you really want to understand how hardware might actually work, ignore the 00 encoding for clear color.

* There's a caveat here that we assume texture upload and fetch use the sampler. At the time of the original presentation, this was not usually the case and so until the FB write occurred, you didn't actually get compression.

Theoretical best savings would compress everything:

Operation | Color Depth | Description | Bandwidth | R/W
Texture Upload | 1Bpc compressed | File to DRAM | 8KB (64 × 64 × 4) / 2 | W
Texel Fetch (nearest) | 1Bpc compressed | DRAM to Sampler | 8KB (64 × 64 × 4) / 2 | R
FB Write | 1Bpc compressed | GPU to DRAM | 8KB (64 × 64 × 4) / 2 | W
Compositing | 1Bpc compressed | DRAM to DRAM | 16KB (64 × 64 × 4 × 2) / 2 | R+W
Scanout | 1Bpc compressed | DRAM to PHY | 8KB (64 × 64 × 4) / 2 | R

TOTAL SAVINGS = 2.8125MBs (50% savings)

And if you use HW compositing in addition to this...

TOTAL SAVINGS = 3.75MBs (66% savings)

Ending Notes

Hopefully it's somewhat clear how 3d applications are consuming memory bandwidth, and how quickly the consumption grows when adding more applications, textures, screen size, and refresh rate.

End to end lossless compression isn't always going to be a huge win, but in many cases it can really chip away at the problem enough to be measurable. The challenge as it turns out is actually getting it hooked up in the driver and the rest of the graphics software stack. As I said earlier, just because a feature seems good doesn't necessarily mean it would be worth the software effort to implement it. End to end lossless compression is one feature that you cannot just turn on by setting a bit, and the fact that it's still not enabled anywhere, to me, is an indication that effort and gates may have been better spent elsewhere.

However, the next section will be all about how we got it hooked up through the graphics stack.

If you've made it this far, you probably could use a drink. I know I can.

February 25, 2021

A Losing Battle

For a long time, I’ve tried very, very, very, very hard to work around problems with NIR variables when it comes to UBOs and SSBOs.

Really, I have.

But the bottom line is that, at least for gallium-based drivers, they’re unusable. They’re so unreliable that it’s only by sheer luck (and a considerable amount of it) that zink has worked at all until this point.

Don’t believe me? Here’s a list of just some of the hacks that are currently in use by zink to handle support for these descriptor types, along with the reason(s) why they’re needed:

Hack | Reason It’s Used | Bad Because?
iterating the list of variables backwards | this indexing vaguely matches the value used by shaders to access the descriptor | only works coincidentally for as long as nothing changes this ordering and explodes entirely with GL-SPIRV
skipping non-array variables with data.location > 0 | these are (usually) explicit references to components of a block BO object in a shader | sometimes they’re the only reference to the BO, and skipping them means the whole BO interface gets skipped
using different indexing for SSBO variables depending on whether data.explicit_binding is set | this (sometimes) works to fix indexing for SSBOs with random bindings and also atomic counters | the value is set randomly by other optimization passes and so it isn’t actually reliable
atomic counters are identified by using !strcmp(glsl_get_type_name(var->interface_type), "counters") | counters get converted to SSBOs, but they require different indexing in order to be accessed correctly | c’mon.
runtime arrays (array[]) are randomly tacked onto SPIRV SSBO variables based on the variable type | fixes atomic counter array access and SSBO length() method | not actually needed most of the time

And then there’s this monstrosity that’s used for linking up SSBO variable indices with their instruction’s access value (comments included for posterity):

unsigned ssbo_idx = 0;
if (!is_ubo_array && var->data.explicit_binding &&
    (glsl_type_is_unsized_array(var->type) || glsl_get_length(var->interface_type) == 1)) {
    /* - block ssbos get their binding broken in gl_nir_lower_buffers,
     *   but also they're totally indistinguishable from lowered counter buffers which have valid bindings
     *
     * hopefully this is a counter or some other non-block variable, but if not then we're probably fucked
     */
    ssbo_idx = var->data.binding;
} else if (base >= 0)
   /* we're indexing into a ssbo array and already have the base index */
   ssbo_idx = base + i;
else {
   if (ctx->ssbo_mask & 1) {
      /* 0 index is used, iterate through the used blocks until we find the first unused one */
      for (unsigned j = 1; j < ctx->num_ssbos; j++)
         if (!(ctx->ssbo_mask & (1 << j))) {
            /* we're iterating forward through the blocks, so the first available one should be
             * what we're looking for
             */
            base = ssbo_idx = j;
            break;
         }
   } else
      /* we're iterating forward through the ssbos, so always assign 0 first */
      base = ssbo_idx = 0;
   assert(ssbo_idx < ctx->num_ssbos);
}
assert(!ctx->ssbos[ssbo_idx]);
ctx->ssbos[ssbo_idx] = var_id;
ctx->ssbo_mask |= 1 << ssbo_idx;
ctx->ssbo_vars[ssbo_idx] = var;

Does it work?

Amazingly, yes, it does work the majority of the time.

But is this really how we should live our lives?

A Methodology To Live By

As the great compiler-warrior Jasonus Ekstrandimus once said, “Just Delete All The Code”.

Truly this is a pivotal revelation, one that can induce many days of deep thinking, but how can it be applied to this scenario?

Today I present the latest in zink code deletion: a NIR pass that deletes all the broken variables and makes new ones.

bender.jpg

Let’s get into it.

uint32_t ssbo_used = 0;
uint32_t ubo_used = 0;
uint64_t max_ssbo_size = 0;
uint64_t max_ubo_size = 0;
bool ssbo_sizes[PIPE_MAX_SHADER_BUFFERS] = {false};

if (!shader->info.num_ssbos && !shader->info.num_ubos && !shader->num_uniforms)
   return false;
nir_function_impl *impl = nir_shader_get_entrypoint(shader);
nir_foreach_block(block, impl) {
   nir_foreach_instr(instr, block) {
      if (instr->type != nir_instr_type_intrinsic)
         continue;

      nir_intrinsic_instr *intrin = nir_instr_as_intrinsic(instr);
      switch (intrin->intrinsic) {
      case nir_intrinsic_store_ssbo:
         ssbo_used |= BITFIELD_BIT(nir_src_as_uint(intrin->src[1]));
         break;

      case nir_intrinsic_get_ssbo_size: {
         uint32_t slot = nir_src_as_uint(intrin->src[0]);
         ssbo_used |= BITFIELD_BIT(slot);
         ssbo_sizes[slot] = true;
         break;
      }
      case nir_intrinsic_ssbo_atomic_add:
      case nir_intrinsic_ssbo_atomic_imin:
      case nir_intrinsic_ssbo_atomic_umin:
      case nir_intrinsic_ssbo_atomic_imax:
      case nir_intrinsic_ssbo_atomic_umax:
      case nir_intrinsic_ssbo_atomic_and:
      case nir_intrinsic_ssbo_atomic_or:
      case nir_intrinsic_ssbo_atomic_xor:
      case nir_intrinsic_ssbo_atomic_exchange:
      case nir_intrinsic_ssbo_atomic_comp_swap:
      case nir_intrinsic_ssbo_atomic_fmin:
      case nir_intrinsic_ssbo_atomic_fmax:
      case nir_intrinsic_ssbo_atomic_fcomp_swap:
      case nir_intrinsic_load_ssbo:
         ssbo_used |= BITFIELD_BIT(nir_src_as_uint(intrin->src[0]));
         break;
      case nir_intrinsic_load_ubo:
      case nir_intrinsic_load_ubo_vec4:
         ubo_used |= BITFIELD_BIT(nir_src_as_uint(intrin->src[0]));
         break;
      default:
         break;
      }
   }
}

The start of the pass iterates over the instructions in the shader. All UBOs and SSBOs that are used get tagged into a bitfield of their index, and any SSBOs which have the length() method called are similarly tagged.


nir_foreach_variable_with_modes(var, shader, nir_var_mem_ssbo | nir_var_mem_ubo) {
   const struct glsl_type *type = glsl_without_array(var->type);
   if (type_is_counter(type))
      continue;
   unsigned size = glsl_count_attribute_slots(type, false);
   if (var->data.mode == nir_var_mem_ubo)
      max_ubo_size = MAX2(max_ubo_size, size);
   else
      max_ssbo_size = MAX2(max_ssbo_size, size);
   var->data.mode = nir_var_shader_temp;
}
nir_fixup_deref_modes(shader);
NIR_PASS_V(shader, nir_remove_dead_variables, nir_var_shader_temp, NULL);
optimize_nir(shader);

Next, the existing SSBO and UBO variables get iterated over. A maximum size is stored for each type, and then the variable mode is set to temp so it can be deleted. These variables aren’t actually used by the shader anymore, so this is definitely okay.

Boom.

if (!ssbo_used && !ubo_used)
   return false;

Early return if it turns out that there’s not actually any UBO or SSBO use in the shader, and all the variables are gone to boot.

struct glsl_struct_field *fields = rzalloc_array(shader, struct glsl_struct_field, 2);
fields[0].name = ralloc_strdup(shader, "base");
fields[1].name = ralloc_strdup(shader, "unsized");

The new variables are all going to be the same type, one which matches what’s actually used during SPIRV translation: a simple struct containing an array of uints, aka base. SSBO variables which need the length() method will get a second struct member that’s a runtime array, aka unsized.

if (ubo_used) {
   const struct glsl_type *ubo_type = glsl_array_type(glsl_uint_type(), max_ubo_size * 4, 4);
   fields[0].type = ubo_type;
   u_foreach_bit(slot, ubo_used) {
      char buf[64];
      snprintf(buf, sizeof(buf), "ubo_slot_%u", slot);
      nir_variable *var = nir_variable_create(shader, nir_var_mem_ubo, glsl_struct_type(fields, 1, "struct", false), buf);
      var->interface_type = var->type;
      var->data.driver_location = slot;
   }
}

If there’s a valid bitmask of UBOs that are used by the shader, the index slots get iterated over, and a variable is created for that slot using the same type for each one. The size is determined by the size of the biggest UBO variable that previously existed, which ensures that there won’t be any errors or weirdness with access past the boundary of the variable. All the GLSL compilation and NIR passes to this point have already handled bounds detection, so this is also fine.

if (ssbo_used) {
   const struct glsl_type *ssbo_type = glsl_array_type(glsl_uint_type(), max_ssbo_size * 4, 4);
   const struct glsl_type *unsized = glsl_array_type(glsl_uint_type(), 0, 4);
   fields[0].type = ssbo_type;
   u_foreach_bit(slot, ssbo_used) {
      char buf[64];
      snprintf(buf, sizeof(buf), "ssbo_slot_%u", slot);
      if (ssbo_sizes[slot])
         fields[1].type = unsized;
      else
         fields[1].type = NULL;
      nir_variable *var = nir_variable_create(shader, nir_var_mem_ssbo,
                                              glsl_struct_type(fields, 1 + !!ssbo_sizes[slot], "struct", false), buf);
      var->interface_type = var->type;
      var->data.driver_location = slot;
   }
}

SSBOs are almost the same, but as previously mentioned, they also get a bonus member if they need the length() method. The GLSL compiler has already pre-computed the adjustment for the value that will be returned by length(), so it doesn’t actually matter what the size of the variable is anymore.

And that’s it! The entire encyclopedia of hacks can now be removed, and I can avoid ever having to look at any of this again.

February 23, 2021

Linaro has been working together with Qualcomm to enable camera support on their platforms since 2017. The Open Source CAMSS driver was written to support the ISP IP-block with the same name that is present on Qualcomm SoCs coming from the smartphone space.

The first development board targeted by this work was the DragonBoard 410C, which was followed in 2018 by DragonBoard 820C support. Recently support for the Snapdragon 660 SoC was added to the driver, which will be part of the v5.11 Linux Kernel release. These SoCs all contain the CAMSS (Camera SubSystem) version of the ISP architecture.

Currently, support for the ISP found in the Snapdragon 845 SoC and the DragonBoard 845C is in the process of being upstreamed to the mailinglists. Having …

February 19, 2021

Quickly: ES 3.2

I’ve been getting a lot of pings over the past week or two about ES 3.2 support.

Here’s the deal.

It’s not happening soon. Probably.

Zink currently supports every 3.2 extension except for photoshop. There’s two ways to achieve support for that extension at present:

  • the nice, simple, VK_EXT_blend_operation_advanced which nobody* supports
  • the difficult, excruciating fbfetch method using shader rewrites, which also requires extensions that nobody supports

* Yes, I know that Nvidia supports advanced blend, but zink+nvidia is not currently capable of doing ES of any version, so that’s not testable.

So in short, it’s going to be a while.

But there’s not really a technical reason to rush towards full ES 3.2 anyway other than to fill out a box on mesamatrix. If you have an app that requires 3.2, chances are that it probably doesn’t really require it; few apps actually use the advanced blend extension, and so it should be possible for the app to require only 3.1 and then verify the presence of whatever 3.2-based extensions it may use in order to be more compatible.

Of course, this is unlikely to happen. It’s far easier for app developers to just say “give me 3.2” if maybe they just want geometry shaders, and I wouldn’t expect anyone is going to be special-casing things just to run on zink.

Nonetheless, it’s really not a priority for me given the current state of the Vulkan ecosystem. As time moves on and various extensions/features become more common that may change, but for now I’m focusing on things that are going to be the most useful.