planet.freedesktop.org
May 06, 2021
As you may know, I've been doing a lot of hw-enablement work on Bay- and Cherry-Trail tablets as a side project for the last couple of years.

Some of these tablets have one interesting feature, intended for "flashing" Android onto them. When turned on with both the volume-up and the volume-down buttons pressed at the same time, they enter something called DNX mode (which also gets printed to the LCD panel); this is really just a variant of the Android fastboot protocol built into the BIOS. Quite a few models support this, although on Bay Trail it sometimes only appears to be supported: DNX mode gets shown on the screen but does not work, because many models which shipped only with Windows lack the external device/gadget-mode PHY which the Bay Trail SoC needs to be able to work in device/gadget mode (on Cherry Trail the gadget PHY has been integrated into the SoC).

So on to the topic of this blog post: I recently used DNX mode to unbrick a tablet which was dead because its BIOS settings had become corrupted in a way where it would not boot and it was also impossible to enter the BIOS setup. After some duckduckgo-ing I found a thread about how, in DNX mode, you can upload a replacement for the efilinux.efi bootloader normally used for "fastboot boot <android-bootimg.img>", and how you can use this to upload a binary to flash the BIOS. I did not have a BIOS image for this tablet, so that approach did not work for me. But it did point me in the direction of a different, safer (no BIOS flashing involved) solution to unbrick the tablet.

If you run the following two commands on a PC with a Bay- or Cherry-Trail device connected in DNX mode:

fastboot flash osloader some-efi-binary.efi
fastboot boot some-android-boot.img

Then the tablet will execute the some-efi-binary.efi. At first I tried getting an EFI shell this way, but this failed because the EFI binary gets passed some arguments about where in RAM it can find the some-android-boot.img. Then I tried booting a grubx64.efi file, and that resulted in a grub commandline. But I had no way to interact with it: replacing the USB connection to the PC with an OTG / USB-host cable with a keyboard attached did not result in working input.

So my next step was to build a new grubx64.efi with "-c grub.cfg" added to the commandline of the final grub2-mkimage step, embedding a grub.cfg with a single line in it: "fwsetup". This will cause the tablet to reboot into its BIOS setup menu. Note that on some tablets you still will not have keyboard input if you just let the tablet sit there while it is rebooting. But during the reboot there is enough time to swap the USB cable for an OTG adapter with a keyboard attached before the reboot completes, and then you will have working keyboard input. At this point you can select "load setup defaults" and then "save and exit", and voilà, the tablet works again.

For your convenience I've uploaded a grubia32.efi and a grubx64.efi with the necessary "fwsetup" grub.cfg here. These are built from this branch at this commit (this was just a random branch which I had checked out while working on this).

Note that the image used for the "fastboot boot some-android-boot.img" command does not matter much, but it must be a valid android boot.img-format file, otherwise fastboot will refuse to try to boot it.
May 05, 2021

In Turnips in the wild (Part 1) we walked through two issues, one in TauCeti Benchmark and the other in Genshin Impact. Today, I have an update about the one I didn’t plan to fix, and a showcase of two remaining issues I encountered in Genshin Impact.

Genshin Impact

Gameplay – Disco Water

In the previous post I said that I wasn’t planning to fix the broken water effect since it relied on undefined behavior.

Screenshot of the gameplay with body of water that has large colorful artifacts

However, I was notified that the same issue had been fixed in the OpenGL driver for Adreno (Freedreno), and that the fix is rather easy. Even though for Vulkan it is clearly undefined behavior, with other APIs it might not be so clear. Thus, given that we want to support translation from other APIs, that there are already apps which rely on this behavior, and that it would be just a bit more performant - I made a fix for it.

Screenshot of the gameplay with body of water without artifacts

The issue was fixed by “tu: do not corrupt unwritten render targets (!10489)”

Login Screen

The login screen welcomes us with not-so-healthy colors:

Screenshot of a login screen in Genshin Impact which has wrong colors - columns and road are blue and white

And with a few failures to allocate registers in the logs. A failure to allocate registers isn’t good and may cause some important shader not to run, but let’s hope it’s not that. So, once again, we should take a closer look at the frame.

Once the frame is loaded I’m staring at an empty image at the end of the frame… Not a great start.

Such things mostly happen due to a GPU hang. Since I’m inspecting frames on Linux I took a look at dmesg and confirmed the hang:

 [drm:a6xx_irq [msm]] *ERROR* gpu fault ring 0 fence ...

Fortunately, after walking through draw calls, I found that the mis-rendering happens before the call which hangs. Let’s look at it:

Correct draw call right before the wrong one, inspected in RenderDoc
Draw call that draws the wrong colors, inspected in RenderDoc

It looks like some fullscreen effect. As in the previous case, the inputs are fine; the only image input is a depth buffer. Also, there are always uniforms passed to the shaders, but when there is only a single problematic draw call they are rarely an issue (and they are easily comparable with the proprietary driver if I spot some nonsensical values among them).

Now it’s time to look at the shader: ~150 assembly instructions, nothing fancy, nothing obvious, and a lonely kill near the top. Before going into the most “fun” part, it’s a good idea to make sure that the issue is 99% in the shader. RenderDoc has a cool feature which allows debugging a shader (its SPIRV code) at a certain fragment (or vertex, or CS invocation); it does the evaluation on the CPU, so I can use it as a kind of reference implementation. In our case the output between RenderDoc and the actual shader evaluation on the GPU differs:

Evaluation on CPU by RenderDoc: color = vec4(0.17134, 0.40289, 0.69859, 0.00124)
On GPU: color = vec4(3.1875, 4.25, 5.625, 0.00061)

Knowing the above, there is only one thing left to do - reduce the shader until we find the problematic instruction(s). Fortunately there is a proprietary driver which renders the scene correctly, therefore instead of relying on intuition, luck, and persistence, we can quickly bisect to the issue by editing the shader and comparing the result with the reference driver. Actually, it’s possible to do this with shader debugging in RenderDoc, but I had problems with it at that moment and it’s not that easy to do.

The process goes like this:

  1. Decompile the SPIRV into GLSL and check that it compiles back (sometimes this requires some editing)
  2. Remove half of the code, write the most promising temporary variable as the color, and take a look at the results
  3. Copy the edited code to a RenderDoc instance which runs on the proprietary driver
  4. Compare the results
  5. If there is a difference - restore the deleted code; now we know that the issue is probably in it, so bisect it by returning to step 2.

This way I bisected to this fragment:

_243 = clamp(_243, 0.0, 1.0);
_279 = clamp(_279, 0.0, 1.0);
float _290;
if (_72.x) {
  _290 = _279;
} else {
  _290 = _243;
}

color0 = vec4(_290);
return;

Writing _279 or _243 to color0 produced reasonable results, but writing _290 produced nonsense. The only difference was the presence of the condition. Now, having a minimal change which reproduces the issue, it’s possible to compare the native assembly.

Bad:

mad.f32 r0.z, c0.y, r0.x, c6.w
sqrt r0.y, r0.y
mul.f r0.x, r1.y, c1.z
(ss)(nop2) mad.f32 r1.z, c6.x, r0.y, c6.y
(nop3) cmps.f.ge r0.y, r0.x, r1.w
(sat)(nop3) sel.b32 r0.w, r0.z, r0.y, r1.z

Good:

(sat)mad.f32 r0.z, c0.y, r0.x, c6.w
sqrt r0.y, r0.y
(ss)(sat)mad.f32 r1.z, c6.x, r0.y, c6.y
(nop2) mul.f r0.y, r1.y, c1.z
add.f r0.x, r0.z, r1.z
(nop3) cmps.f.ge r0.w, r0.y, r1.w
cov.u r1.w, r0.w
(rpt2)nop
(nop3) add.f r0.w, r0.x, r1.w

By running them in my head I reasoned that they should produce the same results, so something was not working as expected. After a few more changes in the GLSL, it became apparent that something was wrong with clamp(x, 0, 1), which is translated into a (sat) modifier on instructions. A bit more digging and I found out that the hardware doesn’t understand a saturation modifier placed on a sel instruction (sel is a selection between two values based on a third).

Preventing the compiler from placing a saturation modifier on sel instructions resolved the bug:

Login screen after the fix

The issue was fixed by “ir3: disallow .sat on SEL instructions (!9666)”
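
The fix itself is small in spirit: when the compiler wants to fold a clamp(x, 0.0, 1.0) into a (sat) modifier, it first has to ask whether the producing instruction can carry that modifier at all. Here is a minimal sketch of that kind of check, with made-up helper naming rather than the actual code from !9666:

/* Hypothetical sketch, not the actual ir3 patch: gate folding of
 * clamp(x, 0.0, 1.0) into a (sat) modifier on the producing opcode. */
static bool
opcode_supports_sat(opc_t opc)
{
   switch (opc) {
   case OPC_SEL_B16:
   case OPC_SEL_B32:
      /* the hardware mis-executes a saturate modifier on sel.* */
      return false;
   default:
      return true;
   }
}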

Gameplay – Where did the trees go?

Screenshot of the gameplay with trees and grass being almost black

The trees and grass seem to be rendered incorrectly. After looking through the trace and not finding where they were actually rendered, I studied the trace on the proprietary driver and found them. However, there weren’t any such draw calls on Turnip!

The answer was simple: the shaders failed to compile due to the register allocation failures I mentioned earlier… The general solution would be to implement register spilling. However, in this case there is a pending merge request that implements a new register allocator, which will later help us implement register spilling. With it, the shaders can now be compiled!

Screenshot of the gameplay with trees and grass being rendered correctly

More Turnip adventures to come!

wew

After a traumatic, tearful goodbye to weird glxgears which took more out of me than expected, I’m back and blogging again.

It’s been…

Well, I guess it’s been a while since my last post, huh.

So what’s happened since then?

Lots? Yeah, definitely lots. It’s May now. And there’s even been a new Mesa release since then.

Quick zink roundup in case you missed any of these:

  • shader clocks are in
  • sparse buffer support is in
  • 16bit ALU support is in
  • GPU memory queries are in

Cool.

But What’s Really Been Going On?

The truth is that I’ve been taking some time off from zink in a completely futile attempt to make progress on something else while zink-wip continues to land. Inspired by this ticket describing issues getting CS:GO working, I decided to tackle part of Mesa that I haven’t worked on much and that hasn’t seen much work in a long time:

PBOs.

PBO TIME

Pixel Buffer Objects are used when an application needs to perform a transfer of an image from host memory to the GPU or vice versa. A PBO is said to be downloaded when it is copied from the GPU into a host memory buffer, and it is uploaded when it is copied from a memory buffer into GPU memory. PBO uploads are a common way to load assets from disk (e.g., textures), and PBO downloads are common for…ideally nothing performance related, but RPCS3 uses them in such a way.
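
For reference, this is roughly what a PBO texture upload looks like from the application’s side. A minimal sketch, assuming a GL 4.5 context, an epoxy-style loader, and an already-created GL_TEXTURE_2D; none of this is Mesa-internal code:

#include <epoxy/gl.h>

/* Stage tightly-packed RGBA8 pixel data in a buffer object, then let the
 * driver pull it into the texture via one of the codepaths described below. */
static void upload_with_pbo(GLuint tex, const void *pixels, int w, int h)
{
   GLuint pbo;
   glGenBuffers(1, &pbo);
   glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
   glBufferData(GL_PIXEL_UNPACK_BUFFER, (GLsizeiptr)w * h * 4, pixels, GL_STREAM_DRAW);

   /* With a buffer bound to GL_PIXEL_UNPACK_BUFFER, the last argument is an
    * offset into that buffer rather than a client memory pointer. */
   glBindTexture(GL_TEXTURE_2D, tex);
   glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, (const void *)0);

   glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
   glDeleteBuffers(1, &pbo);
}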

Texture uploads in Mesa can take a number of different codepaths depending on various factors, using this priority chain:

  • If the format and data type of the pixels is compatible with the format of the texture used for the upload, they can be directly copied by the driver. this is the fastest method
  • If the driver supports shader images and the texture format is supported, Gallium will generate a fragment shader which binds the data-to-be-uploaded as a bufferview, the texture as a framebuffer attachment, and then samples to the framebuffer. this is pretty fast
  • If the format of the host memory buffer (the data being uploaded) is supported by the driver, Gallium will create a staging texture, memcpy the pixel data into it, and then blit it to the target texture. now we’re slowing down a bit
  • Finally, if all else fails, Mesa will just demand the driver maps the target texture so it can dump the pixel data in on the CPU, usually resulting in what looks like a frozen screen because of all the staging textures, flushing, and stalling going on in order to get the pixels where they need to go. this is the bad place

The reason CS:GO takes forever to start with zink is because it performs texture uploads of formats which zink does not support, namely alpha and luminance formats that cannot be emulated in Vulkan, during startup in order to load assets onto the GPU. This hits the CPU copy path 100% of the time, which is actually going to be slower than using something like llvmpipe because GPU-based graphics drivers are not intended to be doing everything on the CPU.

Set PBOs To Ludicrous Speed

I spent a while considering the problem at hand, then decided to start in the place where I could understand what was going on: namely texture downloads. In contrast to uploads, downloads have fewer and simpler codepaths to choose from:

  • If the format and data type of the texture matches the requested pixel format, the data is copied directly with memcpy. this is the fastest method
  • If the format and data type of the requested pixel format are supported by the driver and the texture format is also compatible with the requested format and type, Gallium creates a staging texture to blit to and then memcpy. this is still pretty okay
  • Software fallback. oh no

So to effectively improve this with a new mechanism, all I had to do was meet two criteria:

  • Ensure that all formats are supported so that the software fallback is not needed
  • Ensure that all format conversions are supported so that the software fallback is not needed

There was also, of course, the additional checklist item of Don’t Be Slower Than The Existing Semi-Fastpath Implementation, but I figured I could get to that later.

The implementation I settled on, after much trial and error, was to set up an SSBO descriptor and then use the source texture as a sampler to texelFetch from in a compute shader. A small, vec4-sized constant buffer is passed to the shader, and then everything is done on the GPU to convert the image to the requested pixel format. The SSBO can then be copied to the output memory buffer, and boom, done.
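
To make the idea concrete, here’s a rough application-level GL analogue of that approach, not Mesa’s internal implementation; it only handles RGBA8 output (the real code generates the conversion for whatever format/type pair was requested) and assumes a GL 4.5 context plus a compute program prog built from the shader string below:

static const char *download_cs_src =
   "#version 430\n"
   "layout(local_size_x = 8, local_size_y = 8) in;\n"
   "layout(binding = 0) uniform sampler2D src;\n"
   "layout(std430, binding = 0) writeonly buffer Dst { uint texels[]; };\n"
   "uniform ivec2 size;\n"
   "void main() {\n"
   "   ivec2 p = ivec2(gl_GlobalInvocationID.xy);\n"
   "   if (p.x >= size.x || p.y >= size.y) return;\n"
   "   /* fetch from the source texture and pack to the requested format */\n"
   "   texels[p.y * size.x + p.x] = packUnorm4x8(texelFetch(src, p, 0));\n"
   "}\n";

static void download_rgba8(GLuint prog, GLuint tex, GLuint ssbo, int w, int h, void *out)
{
   glUseProgram(prog);
   glBindTextureUnit(0, tex);                            /* source texture as a sampler */
   glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo);  /* destination SSBO */
   glUniform2i(glGetUniformLocation(prog, "size"), w, h);
   glDispatchCompute((w + 7) / 8, (h + 7) / 8, 1);
   glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT);
   glGetNamedBufferSubData(ssbo, 0, (GLsizeiptr)w * h * 4, out);
}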

It took some time to get right, but the results are already quite promising. Here’s some results from pbobench, a tool I’m working on to help measure PBO performance. At present, it populates a 1024x1024 pixel texture using GL_R32F data, then downloads (glGetTextureSubImage) it as fast as possible and measures the number of calls per second, iterating over a number of different formats and types for the download.
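
The measurement itself is conceptually just a tight loop around the download call. A simplified sketch of the idea (this is not pbobench’s actual harness), assuming a PBO is bound to GL_PIXEL_PACK_BUFFER so the final argument of glGetTextureSubImage is an offset into it:

#include <time.h>

static double now_seconds(void)
{
   struct timespec ts;
   clock_gettime(CLOCK_MONOTONIC, &ts);
   return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Count how many downloads of a w x h region complete in one second. */
static int downloads_per_second(GLuint tex, int w, int h,
                                GLenum format, GLenum type, GLsizei pbo_size)
{
   int calls = 0;
   double start = now_seconds();
   while (now_seconds() - start < 1.0) {
      glGetTextureSubImage(tex, 0, 0, 0, 0, w, h, 1,
                           format, type, pbo_size, (void *)0);
      calls++;
   }
   return calls;
}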

Here’s the latest results that I’ve run on IRIS using GL_PACK_ALIGNMENT==4 and GL_PACK_SWAP_BYTES==1 for variety.

32x32

# Format Type calls/s current calls/s compute
1 GL_R32F GL_FLOAT 458692 20171
2 GL_RGBA8 GL_UNSIGNED_BYTE 22263 20857
3 GL_RGB5_A1 GL_UNSIGNED_BYTE 22234 20211
4 GL_RGBA4 GL_UNSIGNED_BYTE 22301 20247
5 GL_SRGB8_ALPHA8 GL_UNSIGNED_BYTE 22088 20324
6 GL_RGBA8_SNORM GL_BYTE 22320 20315
7 GL_RGBA4 GL_UNSIGNED_SHORT_4_4_4_4 22644 20094
8 GL_RGB5_A1 GL_UNSIGNED_SHORT_5_5_5_1 23189 20229
9 GL_RGB10_A2 GL_UNSIGNED_INT_2_10_10_10_REV 23678 19763
10 GL_RGB5_A1 GL_UNSIGNED_INT_2_10_10_10_REV 23618 19914
11 GL_RGBA16F GL_HALF_FLOAT 208212 19306
12 GL_RGBA32F GL_FLOAT 231616 19411
13 GL_RGBA16F GL_FLOAT 227953 18887
14 GL_RGB8 GL_UNSIGNED_BYTE 22917 19691
15 GL_RGB565 GL_UNSIGNED_BYTE 23000 20002
16 GL_SRGB8 GL_UNSIGNED_BYTE 22852 20011
17 GL_RGB8_SNORM GL_BYTE 26064 20006
18 GL_RGB565 GL_UNSIGNED_SHORT_5_6_5 24226 19813
19 GL_R11F_G11F_B10F GL_UNSIGNED_INT_10F_11F_11F_REV 140785 19755
20 GL_R11F_G11F_B10F GL_HALF_FLOAT 221776 17852
21 GL_R11F_G11F_B10F GL_FLOAT 193169 19660
22 GL_RGB9_E5 GL_UNSIGNED_INT_5_9_9_9_REV 18616 18715
23 GL_RGB9_E5 GL_HALF_FLOAT 207255 18441
24 GL_RGB9_E5 GL_FLOAT 221998 19171
25 GL_RGB16F GL_HALF_FLOAT 228688 18704
26 GL_RGB32F GL_FLOAT 215211 19294
27 GL_RGB16F GL_FLOAT 233145 18858
28 GL_RG8 GL_UNSIGNED_BYTE 21850 18832
29 GL_RG8_SNORM GL_BYTE 19445 18902
30 GL_RG16F GL_HALF_FLOAT 248270 18413
31 GL_RG32F GL_FLOAT 270652 19426
32 GL_RG16F GL_FLOAT 286874 18964
33 GL_R8 GL_UNSIGNED_BYTE 22093 20270
34 GL_R8_SNORM GL_BYTE 21951 20154
35 GL_R16F GL_HALF_FLOAT 300217 19514
36 GL_R16F GL_FLOAT 454349 19784
37 GL_RGBA GL_UNSIGNED_BYTE 21023 19926
38 GL_RGBA GL_UNSIGNED_SHORT_4_4_4_4 21408 19664
39 GL_RGBA GL_UNSIGNED_SHORT_5_5_5_1 22183 19090
40 GL_RGB GL_UNSIGNED_BYTE 21791 20054
41 GL_RGB GL_UNSIGNED_SHORT_5_6_5 23290 19164
42 GL_LUMINANCE_ALPHA GL_UNSIGNED_BYTE 21032 20300
43 GL_LUMINANCE GL_UNSIGNED_BYTE 21897 20565
44 GL_ALPHA GL_UNSIGNED_BYTE 21941 20049

The 32x32 size is not very favorable for the new implementation. Drivers are already extremely optimized for small PBO operations, so at best, my work here manages to keep up with the existing codebase for some cases.

Let’s go up a power of two to 64x64.

# Format Type calls/s current calls/s compute
45 GL_R32F GL_FLOAT 187911 18895
46 GL_RGBA8 GL_UNSIGNED_BYTE 21094 18852
47 GL_RGB5_A1 GL_UNSIGNED_BYTE 20121 18515
48 GL_RGBA4 GL_UNSIGNED_BYTE 19995 18532
49 GL_SRGB8_ALPHA8 GL_UNSIGNED_BYTE 20829 19453
50 GL_RGBA8_SNORM GL_BYTE 21167 18914
51 GL_RGBA4 GL_UNSIGNED_SHORT_4_4_4_4 5547 19544
52 GL_RGB5_A1 GL_UNSIGNED_SHORT_5_5_5_1 5686 19762
53 GL_RGB10_A2 GL_UNSIGNED_INT_2_10_10_10_REV 5934 18229
54 GL_RGB5_A1 GL_UNSIGNED_INT_2_10_10_10_REV 5917 18035
55 GL_RGBA16F GL_HALF_FLOAT 69373 17646
56 GL_RGBA32F GL_FLOAT 66885 18507
57 GL_RGBA16F GL_FLOAT 69091 17522
58 GL_RGB8 GL_UNSIGNED_BYTE 5529 18198
59 GL_RGB565 GL_UNSIGNED_BYTE 5489 19197
60 GL_SRGB8 GL_UNSIGNED_BYTE 5735 19455
61 GL_RGB8_SNORM GL_BYTE 6496 19538
62 GL_RGB565 GL_UNSIGNED_SHORT_5_6_5 5971 19152
63 GL_R11F_G11F_B10F GL_UNSIGNED_INT_10F_11F_11F_REV 40364 19262
64 GL_R11F_G11F_B10F GL_HALF_FLOAT 76933 18650
65 GL_R11F_G11F_B10F GL_FLOAT 67688 18642
66 GL_RGB9_E5 GL_UNSIGNED_INT_5_9_9_9_REV 4957 18894
67 GL_RGB9_E5 GL_HALF_FLOAT 73775 17660
68 GL_RGB9_E5 GL_FLOAT 69293 18703
69 GL_RGB16F GL_HALF_FLOAT 74131 17808
70 GL_RGB32F GL_FLOAT 67877 18735
71 GL_RGB16F GL_FLOAT 68759 17787
72 GL_RG8 GL_UNSIGNED_BYTE 21194 19150
73 GL_RG8_SNORM GL_BYTE 20644 19174
74 GL_RG16F GL_HALF_FLOAT 90086 19010
75 GL_RG32F GL_FLOAT 88349 19285
76 GL_RG16F GL_FLOAT 89450 19041
77 GL_R8 GL_UNSIGNED_BYTE 21215 19813
78 GL_R8_SNORM GL_BYTE 21280 19457
79 GL_R16F GL_HALF_FLOAT 107419 19180
80 GL_R16F GL_FLOAT 189485 19045
81 GL_RGBA GL_UNSIGNED_BYTE 20784 19454
82 GL_RGBA GL_UNSIGNED_SHORT_4_4_4_4 5645 19375
83 GL_RGBA GL_UNSIGNED_SHORT_5_5_5_1 5625 19078
84 GL_RGB GL_UNSIGNED_BYTE 5753 19196
85 GL_RGB GL_UNSIGNED_SHORT_5_6_5 5917 17889
86 GL_LUMINANCE_ALPHA GL_UNSIGNED_BYTE 5553 19014
87 GL_LUMINANCE GL_UNSIGNED_BYTE 21420 18528
88 GL_ALPHA GL_UNSIGNED_BYTE 21506 19944

64x64 starts to show some interesting results, with cases like GL_SRGB8 where the compute implementation has a dominant performance lead.

# Format Type calls/s current calls/s compute
89 GL_R32F GL_FLOAT 55577 16096
90 GL_RGBA8 GL_UNSIGNED_BYTE 17315 15264
91 GL_RGB5_A1 GL_UNSIGNED_BYTE 17735 15541
92 GL_RGBA4 GL_UNSIGNED_BYTE 17191 15486
93 GL_SRGB8_ALPHA8 GL_UNSIGNED_BYTE 17343 14831
94 GL_RGBA8_SNORM GL_BYTE 17362 15710
95 GL_RGBA4 GL_UNSIGNED_SHORT_4_4_4_4 1367 15981
96 GL_RGB5_A1 GL_UNSIGNED_SHORT_5_5_5_1 1367 16372
97 GL_RGB10_A2 GL_UNSIGNED_INT_2_10_10_10_REV 1475 15181
98 GL_RGB5_A1 GL_UNSIGNED_INT_2_10_10_10_REV 1482 14789
99 GL_RGBA16F GL_HALF_FLOAT 19254 14916
100 GL_RGBA32F GL_FLOAT 18465 13109
101 GL_RGBA16F GL_FLOAT 18141 13075
102 GL_RGB8 GL_UNSIGNED_BYTE 1439 16143
103 GL_RGB565 GL_UNSIGNED_BYTE 1441 16252
104 GL_SRGB8 GL_UNSIGNED_BYTE 1407 16106
105 GL_RGB8_SNORM GL_BYTE 1583 16071
106 GL_RGB565 GL_UNSIGNED_SHORT_5_6_5 1520 16246
107 GL_R11F_G11F_B10F GL_UNSIGNED_INT_10F_11F_11F_REV 10888 16137
108 GL_R11F_G11F_B10F GL_HALF_FLOAT 22128 14779
109 GL_R11F_G11F_B10F GL_FLOAT 18807 13986
110 GL_RGB9_E5 GL_UNSIGNED_INT_5_9_9_9_REV 1251 15138
111 GL_RGB9_E5 GL_HALF_FLOAT 22464 14482
112 GL_RGB9_E5 GL_FLOAT 18562 14075
113 GL_RGB16F GL_HALF_FLOAT 22072 15038
114 GL_RGB32F GL_FLOAT 18801 14100
115 GL_RGB16F GL_FLOAT 18774 13864
116 GL_RG8 GL_UNSIGNED_BYTE 18446 16890
117 GL_RG8_SNORM GL_BYTE 18493 17353
118 GL_RG16F GL_HALF_FLOAT 26391 15989
119 GL_RG32F GL_FLOAT 25502 15230
120 GL_RG16F GL_FLOAT 25498 15027
121 GL_R8 GL_UNSIGNED_BYTE 18754 17213
122 GL_R8_SNORM GL_BYTE 16275 17254
123 GL_R16F GL_HALF_FLOAT 31097 16525
124 GL_R16F GL_FLOAT 54923 16005
125 GL_RGBA GL_UNSIGNED_BYTE 17905 15956
126 GL_RGBA GL_UNSIGNED_SHORT_4_4_4_4 1434 16266
127 GL_RGBA GL_UNSIGNED_SHORT_5_5_5_1 1462 16251
128 GL_RGB GL_UNSIGNED_BYTE 1465 16370
129 GL_RGB GL_UNSIGNED_SHORT_5_6_5 1525 16550
130 GL_LUMINANCE_ALPHA GL_UNSIGNED_BYTE 1416 15919
131 GL_LUMINANCE GL_UNSIGNED_BYTE 18876 16872
132 GL_ALPHA GL_UNSIGNED_BYTE 17523 16271

By the 128x128 texture size, the existing glGetTextureSubImage handling’s massive performance lead has dwindled to 20-50% in some cases, while the compute shader is now outperforming it by a factor of ten or more in others.

# Format Type calls/s current calls/s compute
133 GL_R32F GL_FLOAT 14768 8850
134 GL_RGBA8 GL_UNSIGNED_BYTE 10980 9142
135 GL_RGB5_A1 GL_UNSIGNED_BYTE 11034 9063
136 GL_RGBA4 GL_UNSIGNED_BYTE 11104 9160
137 GL_SRGB8_ALPHA8 GL_UNSIGNED_BYTE 11196 9085
138 GL_RGBA8_SNORM GL_BYTE 10843 9139
139 GL_RGBA4 GL_UNSIGNED_SHORT_4_4_4_4 500 10001
140 GL_RGB5_A1 GL_UNSIGNED_SHORT_5_5_5_1 496 9775
141 GL_RGB10_A2 GL_UNSIGNED_INT_2_10_10_10_REV 484 8868
142 GL_RGB5_A1 GL_UNSIGNED_INT_2_10_10_10_REV 508 8994
143 GL_RGBA16F GL_HALF_FLOAT 5025 7952
144 GL_RGBA32F GL_FLOAT 4731 6635
145 GL_RGBA16F GL_FLOAT 4722 6519
146 GL_RGB8 GL_UNSIGNED_BYTE 497 9356
147 GL_RGB565 GL_UNSIGNED_BYTE 499 9181
148 GL_SRGB8 GL_UNSIGNED_BYTE 479 9067
149 GL_RGB8_SNORM GL_BYTE 784 9704
150 GL_RGB565 GL_UNSIGNED_SHORT_5_6_5 527 9569
151 GL_R11F_G11F_B10F GL_UNSIGNED_INT_10F_11F_11F_REV 1396 8938
152 GL_R11F_G11F_B10F GL_HALF_FLOAT 5697 8283
153 GL_R11F_G11F_B10F GL_FLOAT 4760 6599
154 GL_RGB9_E5 GL_UNSIGNED_INT_5_9_9_9_REV 444 8123
155 GL_RGB9_E5 GL_HALF_FLOAT 5836 8305
156 GL_RGB9_E5 GL_FLOAT 4782 6944
157 GL_RGB16F GL_HALF_FLOAT 5669 8313
158 GL_RGB32F GL_FLOAT 4759 6819
159 GL_RGB16F GL_FLOAT 4779 6912
160 GL_RG8 GL_UNSIGNED_BYTE 11772 10298
161 GL_RG8_SNORM GL_BYTE 11771 10555
162 GL_RG16F GL_HALF_FLOAT 6900 9324
163 GL_RG32F GL_FLOAT 6601 7928
164 GL_RG16F GL_FLOAT 6461 7965
165 GL_R8 GL_UNSIGNED_BYTE 12249 10936
166 GL_R8_SNORM GL_BYTE 12423 11080
167 GL_R16F GL_HALF_FLOAT 8790 10254
168 GL_R16F GL_FLOAT 15005 7751
169 GL_RGBA GL_UNSIGNED_BYTE 11094 8086
170 GL_RGBA GL_UNSIGNED_SHORT_4_4_4_4 506 9767
171 GL_RGBA GL_UNSIGNED_SHORT_5_5_5_1 494 9790
172 GL_RGB GL_UNSIGNED_BYTE 497 9689
173 GL_RGB GL_UNSIGNED_SHORT_5_6_5 532 9917
174 GL_LUMINANCE_ALPHA GL_UNSIGNED_BYTE 487 10353
175 GL_LUMINANCE GL_UNSIGNED_BYTE 12297 10944
176 GL_ALPHA GL_UNSIGNED_BYTE 12310 10940

256x256 is the point at which the difference starts to become even more pronounced. The cases where the compute implementation is definitively worse are few and far between, and many of the cases in which it was previously worse are now neck and neck.

# Format Type calls/s current calls/s compute
177 GL_R32F GL_FLOAT 3739 3348
178 GL_RGBA8 GL_UNSIGNED_BYTE 4533 3409
179 GL_RGB5_A1 GL_UNSIGNED_BYTE 4327 3354
180 GL_RGBA4 GL_UNSIGNED_BYTE 4609 3274
181 GL_SRGB8_ALPHA8 GL_UNSIGNED_BYTE 4520 3333
182 GL_RGBA8_SNORM GL_BYTE 4296 3437
183 GL_RGBA4 GL_UNSIGNED_SHORT_4_4_4_4 236 3614
184 GL_RGB5_A1 GL_UNSIGNED_SHORT_5_5_5_1 237 3512
185 GL_RGB10_A2 GL_UNSIGNED_INT_2_10_10_10_REV 235 2715
186 GL_RGB5_A1 GL_UNSIGNED_INT_2_10_10_10_REV 230 3331
187 GL_RGBA16F GL_HALF_FLOAT 1291 2694
188 GL_RGBA32F GL_FLOAT 1127 1020
189 GL_RGBA16F GL_FLOAT 1127 1000
190 GL_RGB8 GL_UNSIGNED_BYTE 223 3648
191 GL_RGB565 GL_UNSIGNED_BYTE 225 3550
192 GL_SRGB8 GL_UNSIGNED_BYTE 227 3487
193 GL_RGB8_SNORM GL_BYTE 358 3586
194 GL_RGB565 GL_UNSIGNED_SHORT_5_6_5 252 3567
195 GL_R11F_G11F_B10F GL_UNSIGNED_INT_10F_11F_11F_REV 661 3084
196 GL_R11F_G11F_B10F GL_HALF_FLOAT 1509 3055
197 GL_R11F_G11F_B10F GL_FLOAT 1164 1222
198 GL_RGB9_E5 GL_UNSIGNED_INT_5_9_9_9_REV 209 2636
199 GL_RGB9_E5 GL_HALF_FLOAT 1499 2571
200 GL_RGB9_E5 GL_FLOAT 1182 1234
201 GL_RGB16F GL_HALF_FLOAT 1510 2955
202 GL_RGB32F GL_FLOAT 1179 1112
203 GL_RGB16F GL_FLOAT 1172 1247
204 GL_RG8 GL_UNSIGNED_BYTE 5019 3572
205 GL_RG8_SNORM GL_BYTE 5043 4201
206 GL_RG16F GL_HALF_FLOAT 1796 3471
207 GL_RG32F GL_FLOAT 1677 2701
208 GL_RG16F GL_FLOAT 1668 2638
209 GL_R8 GL_UNSIGNED_BYTE 5374 4084
210 GL_R8_SNORM GL_BYTE 5409 2985
211 GL_R16F GL_HALF_FLOAT 2222 880
212 GL_R16F GL_FLOAT 3689 2904
213 GL_RGBA GL_UNSIGNED_BYTE 4490 3179
214 GL_RGBA GL_UNSIGNED_SHORT_4_4_4_4 220 3366
215 GL_RGBA GL_UNSIGNED_SHORT_5_5_5_1 237 3405
216 GL_RGB GL_UNSIGNED_BYTE 228 3392
217 GL_RGB GL_UNSIGNED_SHORT_5_6_5 253 3180
218 GL_LUMINANCE_ALPHA GL_UNSIGNED_BYTE 229 3763
219 GL_LUMINANCE GL_UNSIGNED_BYTE 5400 4099
220 GL_ALPHA GL_UNSIGNED_BYTE 5473 3543

512x512 is even better for the compute-based implementation, which remains extremely consistent across nearly all cases. It now exhibits a commanding lead of almost 1500% for some cases, while at worst, it maintains a consistent rate and achieves only 60% of baseline performance for a few cases.

# Format Type calls/s current calls/s compute
221 GL_R32F GL_FLOAT 831 711
222 GL_RGBA8 GL_UNSIGNED_BYTE 839 724
223 GL_RGB5_A1 GL_UNSIGNED_BYTE 856 707
224 GL_RGBA4 GL_UNSIGNED_BYTE 831 644
225 GL_SRGB8_ALPHA8 GL_UNSIGNED_BYTE 869 679
226 GL_RGBA8_SNORM GL_BYTE 827 730
227 GL_RGBA4 GL_UNSIGNED_SHORT_4_4_4_4 77 709
228 GL_RGB5_A1 GL_UNSIGNED_SHORT_5_5_5_1 77 718
229 GL_RGB10_A2 GL_UNSIGNED_INT_2_10_10_10_REV 75 664
230 GL_RGB5_A1 GL_UNSIGNED_INT_2_10_10_10_REV 75 662
231 GL_RGBA16F GL_HALF_FLOAT 304 550
232 GL_RGBA32F GL_FLOAT 253 403
233 GL_RGBA16F GL_FLOAT 253 358
234 GL_RGB8 GL_UNSIGNED_BYTE 74 522
235 GL_RGB565 GL_UNSIGNED_BYTE 74 594
236 GL_SRGB8 GL_UNSIGNED_BYTE 74 618
237 GL_RGB8_SNORM GL_BYTE 156 575
238 GL_RGB565 GL_UNSIGNED_SHORT_5_6_5 80 593
239 GL_R11F_G11F_B10F GL_UNSIGNED_INT_10F_11F_11F_REV 160 585
240 GL_R11F_G11F_B10F GL_HALF_FLOAT 353 489
241 GL_R11F_G11F_B10F GL_FLOAT 282 366
242 GL_RGB9_E5 GL_UNSIGNED_INT_5_9_9_9_REV 67 518
243 GL_RGB9_E5 GL_HALF_FLOAT 352 488
244 GL_RGB9_E5 GL_FLOAT 279 365
245 GL_RGB16F GL_HALF_FLOAT 356 472
246 GL_RGB32F GL_FLOAT 282 352
247 GL_RGB16F GL_FLOAT 269 322
248 GL_RG8 GL_UNSIGNED_BYTE 936 273
249 GL_RG8_SNORM GL_BYTE 954 354
250 GL_RG16F GL_HALF_FLOAT 434 494
251 GL_RG32F GL_FLOAT 393 384
252 GL_RG16F GL_FLOAT 392 284
253 GL_R8 GL_UNSIGNED_BYTE 1398 624
254 GL_R8_SNORM GL_BYTE 1415 630
255 GL_R16F GL_HALF_FLOAT 535 530
256 GL_R16F GL_FLOAT 819 490
257 GL_RGBA GL_UNSIGNED_BYTE 861 504
258 GL_RGBA GL_UNSIGNED_SHORT_4_4_4_4 76 540
259 GL_RGBA GL_UNSIGNED_SHORT_5_5_5_1 78 685
260 GL_RGB GL_UNSIGNED_BYTE 74 676
261 GL_RGB GL_UNSIGNED_SHORT_5_6_5 82 706
262 GL_LUMINANCE_ALPHA GL_UNSIGNED_BYTE 75 825
263 GL_LUMINANCE GL_UNSIGNED_BYTE 1431 997
264 GL_ALPHA GL_UNSIGNED_BYTE 1399 962

1024x1024 is the true gauntlet for PBO downloads, and it’s really not something that is likely to be seen in many places. But pbobench covers it because why not, so here we are. The results here are much less pronounced just because the amount of time spent in memcpy on the CPU ends up being so large for both implementations that the GPU doesn’t get much time to do work.

Improvements are ongoing for compute-accelerated PBOs, and performance continues to rise. I’m looking forward to tackling uploads next, as that should directly improve load times for GL-based games across the board.

Also, Graphics.

As part of the ongoing saga of working for a company that has an interest in gaming, games have been played recently.

One of those games is Tomb Raider (2013), and until very recently it had some issues when run on zink:

tombraider.png

Here we see the wild triangle in its native habitat, foraging for sustenance among its siblings.

All of that changed today, however, when I rebased zink-wip from a couple weeks ago and then hucked it back into the repo with a new snapshot name and zero testing.

I was subsequently informed that I had broken our beautiful triangle princess, and Senior Misrendering Connoisseur Witold Baryluk has provided us with actual gameplay footage with all settings set to EXTREME:

May 02, 2021
glxgears rendered on an Apple M1

After beginning a compiler for the Apple M1 GPU, the next step is to develop a graphics driver exercising the compiler. Since the last post two weeks ago, I’ve begun a Gallium driver for the M1, implementing much of the OpenGL 2.1 and ES 2.0 specifications. With the compiler and driver together, we’re now able to run OpenGL workloads like glxgears and scenes from glmark2 on the M1 with an open source stack. We are passing about 75% of the OpenGL ES 2.0 tests in the drawElements Quality Program used to establish Khronos conformance. To top it off, the compiler and driver are now upstreamed in Mesa!

Gallium is a driver framework inside Mesa. It splits drivers into frontends, like OpenGL and OpenCL, and backends, like Intel and AMD. In between, Gallium has a common caching system for graphics and compute state, reducing the CPU overhead of every Gallium driver. The code sharing, central to Gallium’s design, allows high-performance drivers to be written at a low cost. For us, that means we can focus on writing a Gallium backend for Apple’s GPU and pick up OpenGL and OpenCL support “for free”.

More sharing is possible. A key responsibility of the Gallium backend is to translate Gallium’s state objects into hardware packets, so we need a good representation of hardware packets. While packed bitfields can work, C’s bitfields have performance and safety (overflow) concerns. Hand-coded C structures lack pretty-printing needed for efficient debugging. Finally, while reverse-engineering, hand-coded C structures tend to accumulate random magic numbers in driver code, which is undesirable. These issues are not new; systems like Intel’s GenXML and Nouveau’s envytools solve them by allowing the hardware packets to be described as XML while all necessary C code is auto-generated. For Asahi, I’ve opted to use GenXML, providing a concise description of my reverse-engineering results and an ergonomic API for the driver.

The XML contains recently reverse-engineered packets, like those describing textures and samplers. Fortunately, these packets map closely to both Metal and Gallium, allowing a natural translation. Coupled with Dougall Johnson’s latest notes on texture instructions, adding texture support to the Gallium driver and NIR compiler was a cinch.

The resulting XML is somewhat smaller than that of other reverse-engineered drivers of similar maturity. As discussed in the previous post, Apple relies on shader code in lieu of fixed-function graphics hardware for tasks like vertex attribute fetch and blending. Apple’s design reduces the hardware surface area, in turn reducing the number of packets in the XML at the expense of extra driver code to produce the needed shader variants. Here is yet another win for code sharing: the most complex code needed is for blending and logic operations, for which Mesa already has lowering code. Mali (Panfrost) needs some blending lowered to shader code, and all Mesa drivers need advanced blending equations lowered (modes like overlay, colour dodge, and screen). As a result, it will be a simple task to wire in the “load tilebuffer to register” instruction and Mesa’s blending code and see a thousand more tests pass.

Although Apple culled a great deal of legacy functionality from their GPU, some features are retained to support older APIs like compatibility contexts of desktop OpenGL, even when the features are inaccessible from Metal. Index buffers and primitive types exhibit this phenomenon. In Metal, an application can draw using a 16-bit or 32-bit index buffer, selecting primitives like triangles and triangle strips, with primitive restart always enabled. Most graphics developers want to use this fast path; by only supporting the fast path, Metal prevents a game developer from accidentally introducing slow code. However, this limits our ability to implement Khronos APIs like OpenGL and Vulkan on top of the Apple hardware… or does it?

In addition to the subset supported by Metal, the Khronos APIs also support 8-bit index buffers, additional primitive types like triangle fans, and an option to disable primitive restart. True, some of this functionality is unnecessary. Real geometry usually requires more than 256 vertices, so 8-bit index buffers are a theoretical curiosity. Primitives like triangle fans can bring a hardware penalty relative to triangle strips due to poor cache locality while offering no generality over indexed triangles. Well-written apps generally require primitive restart, and it’s almost free for apps that don’t need it. Even so, real applications (and the Khronos conformance tests) do use these features, so we need to support them in our OpenGL and Vulkan drivers. The issue does not only affect us – drivers layered on top of Metal like MoltenVK struggle with these exact gaps.

If Apple left these features in the hardware but never wired them into Metal, we’d like to know. But clean room reverse-engineering the hardware requires observing the output of the proprietary driver and looking for patterns, so if the proprietary (Metal) driver doesn’t use the functionality, how can we ever know it exists?

This is a case for an underappreciated reverse-engineering technique: guesswork. Hardware designers are logical engineers. If we can understand how Apple approached the hardware design, we can figure out where these hidden features should be in the command stream. We’re looking for conspicuous gaps in our understanding of the hardware, like looking for black holes by observing the absence of light. Consider index buffers, which are configured by an indexed draw packet with a field describing the size. After trying many indexed draws in Metal, I was left with the following reverse-engineered fragment:

<enum name="Index size">
  <value name="U16" value="1"/>
  <value name="U32" value="2"/>
</enum>

<struct name="Indexed draw" size="32">
...
  <field name="Index size" size="2" start="2:17" type="Index size"/>
...
</struct>

If the hardware only supported what Metal supports, this fragment would be unusual. Note 16-bit and 32-bit index buffers are encoded as 1 and 2. In binary, that’s 01 and 10, occupying two different bits, so we know the index size is at least 2 bits wide. That leaves 00 (0) and 11 (3) as possible unidentified index sizes. Now observe index sizes are multiples of 8 bits and powers of two, so there is a natural encoding as the base-2 logarithm of the index size in bytes. This matches our XML fragment, since log2(16 / 8) = 1 and log2(32 / 8) = 2. From here, we make our leap of faith: if the hardware supports 8-bit index buffers, it should be encoded with value log2(8 / 8) = 0. Sure enough, if we try out this guess, we’ll find it passes the relevant OpenGL tests. Success.
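
In code form, the guess is just the base-2 logarithm of the index size in bytes. A tiny sketch with a made-up function name; U16 = 1 and U32 = 2 are the observed encodings, U8 = 0 is the leap of faith:

/* Map an index size in bits to the guessed hardware field:
 * 8 -> 0, 16 -> 1, 32 -> 2 (log2 of the size in bytes). */
static unsigned index_size_field(unsigned bits)
{
   return __builtin_ctz(bits / 8);
}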

Finding the missing primitive types works the same way. The primitive type field is 4 bits wide in the hardware, allowing 16 primitive types to be encoded, while Metal only uses 5, leaving only 11 to brute force with the tests in hand. Likewise, few bits vary between an indexed draw of triangles (no primitive restart) and an indexed draw of triangle strips (with primitive restart), giving us a natural candidate for a primitive restart enable bit. Our understanding of the hardware is coming together.

One outstanding difficulty is ironically specific to macOS: the IOGPU interface with the kernel. Traditionally, open source drivers on Linux use simple kernel space drivers together with complex userspace drivers. The userspace (Mesa) handles all 3D state, and the kernel simply handles memory management and scheduling. However, macOS does not follow this model; the IOGPU kernel extension, common to all GPUs on macOS, is made aware of graphics state like surface dimensions and even details about mipmapping. Many of these mechanisms can be ignored in Mesa, but there is still an uncomfortably large volume of “magic” to interface with the kernel, like the memory mapping descriptors. The good news is many of these elements can be simplified when we write a Linux kernel driver. The bad news is that they do need to be reverse-engineered and implemented in Mesa if we would like native Vulkan support on Macs. Still, we know enough to drive the GPU from macOS… and hey, soon enough, we’ll all be running Linux on our M1 machines anyway :-)

April 26, 2021

You Will Be Missed

glxgears.png

RIP weird glxgears 2018-2021.

April 21, 2021

The Meson Build System provides support for running on Microsoft Windows, including support for Microsoft Visual Studio C++. GitHub Actions provides public access to CI machines running Microsoft Windows. But trying to tie both together is not as straightforward as it sounds.

Sometimes you stumble over a task you never thought you would have to deal with. This story is about one of those times. In particular, I was faced with running CI tests for a simple C library on Microsoft Visual Studio C++ (MSVC). Gladly, GitHub already provides simple access to machines running Microsoft Windows Server 2016 and 2019, so this sounded like a straightforward task. Unfortunately, my infinite ignorance of anything Windows made this harder than it should have been.

The root of this problem is that the Meson Build System needs to run in the MSVC Developer Shell. This shell has all the necessary environment variables prepared for a particular install of MSVC. Since you can have multiple versions installed in parallel, Meson cannot know which install to use if run outside of such a shell. Unfortunately, GitHub Actions has no simple way to enter this shell. Therefore, running Meson on GitHub Actions will end up using GCC rather than MSVC, since this is what it detects by default in the GitHub Actions Environment. This is not what we wanted, so adjustments are needed.

Luckily, Microsoft provides a tool called vswhere which finds MSVC installs on a Windows system. We can use this to find the required setup scripts and then import the environment variables into our GitHub Actions setup. This tool is pre-deployed on GitHub Actions, so we can simply invoke it to find a suitable MSVC install. From there on, we look for DevShell.dll, which provides the required integration. We load it into PowerShell and invoke the provided Enter-VsDevShell function. By comparing our own environment variables before and after that call, we can extract the changes and export them into the GitHub Actions environment. Thus, the following workflow-steps will have access to those variables as well.

I plugged this into a re-usable GitHub Action using the new composite type. To use it in a GitHub Actions workflow, simply use:

- name: Prepare MSVC
  uses: bus1/cabuild/action/msdevshell@v1
  with:
    architecture: x64

This queries the MSVC environment and exports it to your GitHub Actions job. Following steps will thus run as if in an MSVC Developer Shell. A full example is appended at the bottom, which shows how to get Meson to compile and test a project on MSVC for both Windows Server 2016 and 2019.

If you'd rather import the code into your own project, you can find it on GitHub. Note that this uses PowerShell syntax, so it might look alien to Linux developers.

While this is only roughly 50 lines of PowerShell scripting, it still feels a bit too hacky. The Meson developers are aware of this, but so far no patches have found their way upstream. Let's hope that this workaround will one day become obsolete and Meson will invoke vswhere itself.


Following is a full example workflow:

name: Continuous Integration

on: [push, pull_request]

jobs:
  ci-msvc:
    name: CI with MSVC
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [windows-2016, windows-latest]

    steps:
    - name: Fetch Sources
      uses: actions/checkout@v2
    - name: Setup Python
      uses: actions/setup-python@v2
      with:
        python-version: '3.x'
    - name: Install Python Dependencies
      run: pip install meson ninja
    - name: Prepare MSVC
      uses: bus1/cabuild/action/msdevshell@v1
      with:
        architecture: x64
    - name: Prepare Build
      run: meson setup build
    - name: Run Build
      run: meson compile -v -C build
    - name: Run Test Suite
      run: meson test -v -C build
April 20, 2021

Running games and benchmarks is much more exciting than trying to fix a handful of remaining synthetic tests. Turnip, which is an open-source Vulkan driver for recent Adreno GPUs, should already be capable of running real-world applications, and they always have a way of breaking the driver in new, unexpected ways.

TauCeti Vulkan Technology Benchmark

The benchmark greeted me with this wonderful field which looked a little blocky:

Screenshot of the benchmark with grass/dirt field which looks more like a patchwork

It’s not a crash, it’s not a hang, there is something wrong either with textures or with a shader. So, let’s take a closer look at a frame. Here is the first major draw call which looks wrong:

Screenshot of a bad draw call being inspected in RenderDoc

Now, let’s take a look at textures used by the draw:

More than twelve textures used by the draw call

That’s a lot of textures! But all of them look fine; yes, there are some textures that look blocky, but these textures are just small, so nothing wrong here.

The next stop is the fragment shader; I recently implemented an extension which helps with that. VK_KHR_pipeline_executable_properties allows the reporting of additional information about shaders in a pipeline. In the case of Turnip that means statistics about registers and the assembly of the shader:

Excerpt from the problematic shader (a few instructions have a warning printed next to them):
sam.base0 (f32)(xy)r0.x, r0.z, a1.x	; no field 'HAS_SAMP', WARNING: unexpected bits[0:7] in #cat5-samp-s2en-bindless-a1: 0x6 vs 0x0

Bingo! Usually when the issue is in a shader I have to either make educated guesses and/or reduce the shader until the issue is apparent. However, here we can immediately spot instructions with a warning. They may not be the cause of the ground mis-rendering, but these instructions do sample from textures, soooo.

After fixing the encoding, which was broken due to a mistake in the XML definition, the ground is now rendered correctly:

Screenshot of the benchmark with a field which now has proper grass and dirt

After looking a bit more at the frame I saw that the same issue plagued all rock formations, and now that is gone too. The final fix can be seen in

ir3/isa,parser: fix encoding and parsing of bindless s2en SAM (!9628)

Genshin Impact

Now it’s time for a more heavyweight opponent - Genshin Impact, one of the most popular mobile games at the moment. Genshin Impact supports both GLES and Vulkan, and defaults to GLES on all but a few devices. In any case, Vulkan can be enabled by editing a config file.

There are several mis-renderings in the game; however, here I will describe the one which shows why it is important to use all available validation tooling for an API as complex as Vulkan. The other mis-renderings will be a matter for the second post.

Gameplay – Disco Water

Proceeding to the gameplay and running around for a bit revealed major artifacts on the water surface:

Screenshot of the gameplay with body of water that has large colorful artifacts

This one was a multiframe effect, so I had to capture a trace of all frames and then find the one where the issue began. The draw call I found:

GIF showing before and after problematic draw call

It adds water to the first attachment and a lot of artifacts to the second one. Looking at the framebuffer configuration, I could see that the fragment shader doesn’t write to the second attachment. Is that allowed? Yes. Is the behavior defined? No! VK_LAYER_KHRONOS_validation warns us about it, if warnings are enabled:

UNASSIGNED-CoreValidation-Shader-InputNotProduced(WARN / SPEC): Validation Warning:
Attachment 1 not written by fragment shader; undefined values will be written to attachment

“undefined values will be written to attachment” may mean that nothing is written into it, as the game expects, or that random values are written to the second attachment, as Turnip does. Such behavior should not be relied upon, so Turnip does not contradict the specification here and there is nothing to fix. Case closed.

More mis-renderings and investigations to come in the next post(s).

Turnips in the wild (Part 2)

April 19, 2021

Last year I worked on implementing support in Turnip for a HW feature present in Qualcomm Adreno GPUs: the low-resolution Z buffer (aka LRZ). This is a HW feature already supported in Freedreno, which is the open-source OpenGL driver for these GPUs.

What is low-resolution Z buffer

The low-resolution Z buffer is very similar to a depth prepass: it helps the HW avoid executing the fragment shader on fragments that would subsequently be discarded by the depth test (hidden surface removal). This feature comes with some limitations though, such as the fragment shader not being allowed to have side effects (writes to SSBOs, atomic operations, etc.), among others.

The interesting part of this feature is that it allows applications to submit vertices in any order (saving the CPU time that would otherwise be spent ordering them); the HW processes them in the binning pass, as explained below, detects which ones are occluded, and this gives a performance increase in some specific use cases.

Tiled-rendering

To understand better how LRZ works, we need to talk a bit about tiled-based rendering. This is a way of rendering based on subdividing the framebuffer in tiles and rendering each tile separately. The advantage of this design is that the amount of memory and bandwidth is reduced compared to immediate mode rendering systems that draw the entire frame at once. Tile-based rendering is very popular on embedded GPUs, including Qualcomm Adreno.

Entering into more details, the graphics pipeline is divided into three different passes executed per tile of the frame.

Tiled-rendering architecture diagram
Tiled-rendering architecture diagram.
  • The binning pass. This pass processes the geometry of the scene and records in a table on which tiles a primitive will be rendered. By doing this, the HW only needs to render the primitives that affect a specific tile when it is processed.

  • The rendering pass. This pass gets the rasterized primitives and executes all the fragment related processes of the pipeline (fragment shader execution, depth pass, stencil pass, blending, etc). Once it finishes, the resolve pass starts.

  • The resolve pass. It first resolves the tile buffer (GMEM) if it is multisampled, and copies the final color and depth values for all tile pixels back to system memory. If it is the last tile of the framebuffer, it swaps buffers and starts the binning pass for the next frame.

Where is LRZ used then? Well, in both the binning and rendering passes. In the binning pass, it is possible to store the depth value of each vertex of the scene's geometry in a buffer, as the HW has that data available. That is the depth buffer used internally for LRZ. It has a lower resolution, as too much detail is not needed, which helps to save bandwidth while transferring its contents to system memory.

Thanks to LRZ, the rendering pass is only executed on the fragments that are going to be visible at the end. However, there are some limitations as mentioned before: if a fragment shader has side effects (writes to SSBOs, atomics, etc.), if blending is enabled, or if the fragment shader could modify the fragment’s depth… then LRZ cannot be used, as the results may be wrong.

However, LRZ brings a couple of things to the table that make it interesting. One is that applications don’t need to reorder their primitives before submission to be more efficient; that is done by the HW with LRZ automatically. Another is a performance improvement in some use cases. For example, imagine a fragment shader that discards parts of fragments but doesn’t have any other side effects. In that case, although we cannot do early depth testing, we can do early LRZ, as we know that some fragments won’t pass the depth test even if they are not discarded by the fragment shader.

Turnip implementation

Talking about the LRZ implementation, I took Freedreno’s code as a starting point to implement LRZ on Turnip. After some months of work, it finally landed in Mesa master.

Last week, more patches related to LRZ landed in Mesa master: the ones fixing LRZ interactions with VK_EXT_extended_dynamic_state, as with this extension the application can change some states in command buffer time that could affect LRZ state and, therefore, we need to track them accordingly.

I also implemented some LRZ improvements that were under review and have now also landed (thanks Eric Anholt!), such as support for the early-LRZ-late-depth test that I mentioned before, which could bring a performance improvement in some applications.

LRZ improvements
Left: original vulkan tutorial demo implementation. Right: same demo modified to discard fragments with red component lower than 0.5f.

For instance, I did some measurements in a vulkan-tutorial.com implementation of my own that I modified to discard a significant amount of fragments (see the previous figure). This is one of the cases where the early-LRZ-late-depth test helps to improve performance.

When running the modified demo with these patches, I found a performance improvement of 13-16%.

Acknowledgments

All this LRZ work was my first big contribution to this open-source reverse-engineered driver! I don’t want to finish this post without publicly thanking Rob Clark for the original Freedreno implementation and his reviews of my work, as well as Jonathan Marek and Connor Abbott for their insightful reviews, advice and tips to make it work. Edited: Many thanks to Eric Anholt for his reviews of the last two patch series!

Happy hacking!

For the fun of it I decided to run some real apps on lavapipe.

Talos Principle is still randomly crashing on startup; occasionally whatever magic value ends up being right in uninitialized memory and it suddenly runs fine.

I started Rise of the Tomb Raider, and it renders really slowly up to the menu.

Then I gave DOOM 2016 with the Vulkan renderer a go, and with a few lavapipe hacks to enable some feature bits, I managed to get it to load a game image. It's taking 5-6s per frame to render. However, most of the slowness in the frame is the BPTC texture loading, which is a path that I've done no tuning for, so it's definitely running very slowly. I think RoTR is also hitting that slow path, so I guess I have some incentive to look at cleaning it up.

 


April 18, 2021
Shaded cube rendered on an Apple M1

After a few weeks of investigating the Apple M1 GPU in January, I was able to draw a triangle with my own open source code. Although I began dissecting the instruction set, the shaders there were specified as machine code. A real graphics driver needs a compiler from high-level shading languages (GLSL or Metal) to a native binary. Our understanding of the M1 GPU’s instruction set has advanced over the past few months. Last week, I began writing a free and open source shader compiler targeting the Apple GPU. Progress has been rapid: at the end of its first week, it can compile both basic vertex and fragment shaders, sufficient to render 3D scenes. The spinning cube pictured above has its shaders written in idiomatic GLSL, compiled with the nascent free software compiler, and rendered with native code like the first triangle in January. No proprietary blobs here!

Over the past few months, Dougall Johnson has investigated the instruction set in-depth, building on my initial work. His findings on the architecture are outstanding, focusing on compute kernels to complement my focus on graphics. Armed with his notes and my command stream tooling, I could chip away at a compiler.

The compiler’s design must fit into the development context. Asahi Linux aims to run a Linux desktop on Apple Silicon, so our driver should follow Linux’s best practices like upstream development. That includes using the New Intermediate Representation (NIR) in Mesa, the home for open source graphics drivers. NIR is a lightweight library for shader compilers, with a GLSL frontend and backend targets including Intel and AMD. NIR is an alternative to LLVM, the compiler framework used by Apple. Just because Apple prefers LLVM doesn’t mean we have to. A team at Valve famously rewrote AMD’s LLVM backend as a NIR compiler, improving performance. If it’s good enough for Valve, it’s good enough for me.

Supporting NIR as input does not dictate our compiler’s own intermediate representation, which reflects the hardware’s design. The instruction set of AGX2 (Apple’s GPU) has:

  • Scalar arithmetic
  • Vectorized input/output
  • 16-bit types
  • Free conversions between 16-bit and 32-bit
  • Free floating-point absolute value, negate, saturate
  • 256 registers (16-bits each)
  • Register usage / thread occupancy trade-off
  • Some form of multi-issue or out-of-order (superscalar) execution

Each hardware property induces a compiler property:

  • Scalar sources. Don’t complicate the compiler by allowing unrestricted vectors.
  • Vector values at the periphery separated with vector combine and extract pseudo-instructions, optimized out during register allocation.
  • 16-bit units.
  • Sources and destinations are sized. The optimizer folds size conversion instructions into uses and definitions.
  • Sources have absolute value and negate bits; instructions have a saturate bit. Again, the optimizer folds these away.
  • A large register file means running out of registers is rare, so don’t optimize for register spilling performance.
  • Minimizing register pressure is crucial. Use static single assignment (SSA) form to facilitate pressure estimates, informing optimizations.
  • The scheduler simply reorders instructions without leaking details to the rest of the backend. Scheduling is feasible both before and after register allocation.

Putting it together, a design for an AGX compiler emerges: a code generator translating NIR to an SSA-based intermediate representation, optimized by instruction combining passes, scheduled to minimize register pressure, register allocated while going out of SSA, scheduled again to maximize instruction-level parallelism, and finally packed to binary instructions.

These decisions reflect the hardware traits visible to software, which are themselves “shadows” cast by the hardware design. Investigating these traits offers insight into the hardware itself. Consider the register file. While every thread can access up to 256 half-word registers, there is a performance penalty: the more registers used, the fewer concurrent threads possible, since threads share a register file. The number of threads allowed in a given shader is reported in Metal as the maxTotalThreadsPerThreadgroup property. So, we can study the register pressure versus occupancy trade-off by varying the register pressure of Metal shaders (confirmed via our disassembler) and correlating with the value of maxTotalThreadsPerThreadgroup:

Registers Threads
<= 104 1024
112 896
120, 128 832
136 768
144 704
152, 160 640
168-184 576
192-208 512
216-232 448
240-256 384

From the table, it’s clear that up until a threshold, it doesn’t matter how many registers the program uses; occupancy is unaffected. Most well-written shaders fall in this bracket and need not worry. After hitting the threshold, other GPUs might spill registers to memory, but Apple doesn’t need to spill until more than 256 registers are required. Between 112 and 256 registers, the number of threads decreases in an almost linear fashion, in increments of 64 threads. Carefully considering rounding, it’s easy to recover the formula Metal uses to map register usage to thread count.

What’s less obvious is that we can infer the size of the machine’s register file. On one hand, if 256 registers are used, the machine can still support 384 threads, so the register file must be at least 256 half-words * 2 bytes per half-word * 384 threads = 192 KiB large. Likewise, to support 1024 threads at 104 registers requires at least 104 * 2 * 1024 = 208 KiB. If the file were any bigger, we would expect more threads to be possible at higher pressure, so we guess each threadgroup has exactly 208 KiB in its register file.
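
Combining the 208 KiB guess with the rounding behaviour gives a formula that reproduces every row of the table above. This is my reconstruction from the observed values, with a made-up helper name, not anything published by Apple:

/* threads = register file size / per-thread register footprint,
 * rounded down to a multiple of 64 and capped at 1024.
 * regs is the number of 16-bit (2-byte) registers used. */
static unsigned max_threads_for_registers(unsigned regs)
{
   unsigned threads = (208 * 1024) / (regs * 2);
   threads &= ~63u;                /* round down to a multiple of 64 */
   return threads > 1024 ? 1024 : threads;
}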

The story does not end there. From Apple’s public specifications, the M1 GPU supports 24576 = 1024 * 24 simultaneous threads. Since the table shows a maximum of 1024 threads per threadgroup, we infer 24 threadgroups may execute in parallel across the chip, each with its own register file. Putting it together, the GPU has 208 KiB * 24 = 4.875 MiB of register file! This size puts it in league with desktop GPUs.

For all the visible hardware features, it’s equally important to consider what hardware features are absent. Intriguingly, the GPU lacks some fixed-function graphics hardware ubiquitous among competitors. For example, I have not encountered hardware for reading vertex attributes or uniform buffer objects. The OpenGL and Vulkan specifications assume dedicated hardware for each, so what’s the catch?

Simply put – Apple doesn’t need to care about Vulkan or OpenGL performance. Their only properly supported API is their own Metal, which they may shape to fit the hardware rather than contorting the hardware to match the API. Indeed, Metal de-emphasizes vertex attributes and uniform buffers, favouring general constant buffers, a compute-focused design. The compiler is responsible for translating the fixed-function attribute and uniform state to shader code. In theory, this has a slight runtime cost; conventional wisdom says dedicated hardware is faster and lower power than software emulation. In practice, the code is so simple it may make no difference, although application developers should be mindful of the vertex formats used in case conversion code is inserted. As always, there is a trade-off: omitting features allows Apple to squeeze more arithmetic logic units (or register file!) onto the chip, speeding up everything else.

The more significant concern is the increased time on the CPU spent compiling shaders. If changing fixed-function attribute state can affect the shader, the compiler could be invoked at inopportune times during random OpenGL calls. Here, Apple has another trick: Metal requires the layout of vertex attributes to be specified when the pipeline is created, allowing the compiler to specialize formats at no additional cost. The OpenGL driver pays the price of the design decision; Metal is exempt from shader recompile tax.

The silver lining is that there is nothing to reverse-engineer for “missing” features like attributes and uniform buffers. As long as we know how to access memory in compute kernels, we can write the lowering code ourselves with no hardware mysteries. So far, I’ve implemented enough to spin a cube.

At present, the in-progress compiler supports most arithmetic and input/output instructions found in OpenGL ES 2.0, with a simple optimizer and native instruction packing. Support for control flow, textures, scheduling, and register allocation will come further down the line as we work towards a real driver.

Get the code while it’s hot!

April 16, 2021
A while ago I worked on improving Logitech G15 LCD-screen support under Linux. I recently got an email from someone who wanted to add support for the LCD panel in the Logitech Z-10 speakers to lcdproc, asking me to describe the process I went through to improve G15 support in lcdproc and how I made it work without requiring the unmaintained g15daemon code.

So I wrote a long email describing the steps I went through and we both thought this could be interesting for others too, so here it is:

1. For some reason I decided that I did not have enough projects going on at the same time already and I started looking into improving support for the G15 family of keyboards.

2. I started studying the g15daemon code and did a build of it to check that it actually works. I believe making sure that you have a known-to-work codebase as a starting point, even if it is somewhat crufty and ugly, is important. This way you know you have code which speaks the right protocol and you can try to slowly morph it into what you want it to become (making small changes, testing every step). Even if you decide to mostly re-implement from scratch, you will likely use the old code as protocol documentation, so it is important to know it actually works.

3. There were a number of things which I did not like about g15daemon:

3.1 It uses libusb, taking control of the entire USB-interface used for the gaming functionality away from the kernel. 

3.2 As part of this it was not just dealing with the LCD, it was also acting as a dispatcher for G-key key-presses. IMHO the key-press handling clearly belonged in the kernel. These keys are just extra keys, all macro functionality is handled inside software on the host/PC side. So as a kernel dev I found that these really should be handled as normal keys, emitting normal evdev events with KEY_FOO codes from a /dev/input/event# kernel node.

3.3 It is a daemon, rather than a library; and most other code which would deal with the LCD, such as lcdproc, was itself a daemon too, so now we would have lcdproc's LCDd talking to g15daemon to get to the LCD, which felt rather roundabout.

So I set about tackling all of these:

4. Kernel changes: I wrote a new drivers/hid/hid-lg-g15.c kernel driver:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hid/hid-lg-g15.c
which sends KEY_MACRO1 .. KEY_MACRO18, KEY_MACRO_PRESET1 .. KEY_MACRO_PRESET3, KEY_MACRO_RECORD_START, KEY_KBD_LCD_MENU1 .. KEY_KBD_LCD_MENU4 keypresses for all the special keys. Note this requires that the kernel HID driver is left attached to the USB interface, which required changes on the g15daemon/lcdproc side.

This kernel driver also offers a /sys/class/leds/g15::kbd_backlight/ interface which allows controlling the backlight intensity through e.g. the GNOME kbd-backlight slider in the power pane of the control-center. It also offers a bunch of other LED interfaces under /sys/class/leds/ for controlling things like the LEDs under the M1 - M3 preset selection buttons. The idea here being that the kernel abstracts the USB protocol for these gaming keyboards away and that a single userspace daemon for gaming-keyboard macro functionality can be written which relies only on the kernel interfaces and will thus work with any keyboard which has kernel support.

5. lcdproc changes

5.1 lcdproc already had a g15 driver, which talked to g15daemon. So at first I started testing with this (at this point all my kernel work would not work, since g15daemon would take full control of the USB interface, unbinding my kernel driver). I did a couple of bug-fixes / cleanups in this setup to get the code to a starting point where I could run it and it would not show any visible rendering bugs in any of the standard lcdproc screens.

5.2 I wrote a little lcdproc helper library, lib_hidraw, which can be used by lcdproc drivers to open a /dev/hidraw device for them. The main feature is, you give the lib_hidraw_open() helper a list of USB-ids you are interested in; and it will then find the right /dev/hidraw# device node (which may be different every boot) and open it for you.
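
To illustrate the idea (this is a simplified sketch, not the actual lcdproc lib_hidraw code, which takes a list of USB-ids and handles more corner cases), finding the right node basically comes down to probing each /dev/hidraw# with the HIDIOCGRAWINFO ioctl and comparing the vendor/product ID:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/hidraw.h>

/* Simplified sketch of a lib_hidraw-style lookup: walk /dev/hidraw0..63 and
 * return an fd for the first node whose USB vendor/product ID matches. */
int open_hidraw_by_id(unsigned short vendor, unsigned short product)
{
   char path[32];

   for (int i = 0; i < 64; i++) {
      struct hidraw_devinfo info;

      snprintf(path, sizeof(path), "/dev/hidraw%d", i);
      int fd = open(path, O_RDWR);
      if (fd < 0)
         continue;

      if (ioctl(fd, HIDIOCGRAWINFO, &info) == 0 &&
          (unsigned short)info.vendor == vendor &&
          (unsigned short)info.product == product)
         return fd; /* caller can now write LCD frames to this fd */

      close(fd);
   }
   return -1; /* no matching device found */
}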

5.3 The actual sending of the bitmap to the LCD screen is quite simple, but the data does need to be in a very specific format. The rendering into this format is done by libg15render. So now it was time to replace the code talking to g15daemon with code to directly talk to the LCD through a /dev/hidraw# interface. I kept the libg15render dependency since that was fine. After a bit of refactoring, the actual change-over to directly sending the data to the LCD was not that big.

5.4 The change to stop using the g15daemon meant that the g15 driver also lost support for detecting presses on the 4 buttons directly under the LCD which are meant for controlling the menu on the LCD. But now that the code is using /dev/hidraw# the kernel driver I wrote would actually work and report KEY_KBD_LCD_MENU1 .. KEY_KBD_LCD_MENU4 keypresses. So I did a bunch of cleanup / refactoring of lcdproc's linux_input driver and made it take over the reporting of those button presses.

5.5 I wanted everything to just work out of the box, so I wrote some udev rules which automatically generate a lcdproc.conf file configuring the g15 + linux_input drivers (including the key-mapping for the linux_input driver) when a G15 keyboard gets plugged in and the lcdproc.conf file does not exist yet.

All this together means that users under Fedora, where I also packaged all this, can now do "dnf install lcdproc", plug in their G15 keyboard and everything will just work.



p.s.

After the email exchange I got curious and found a pair of these speakers 2nd hand for a nice price. The author of the initial email was happy with me doing the work on this. So I added support for the Z-10 speakers to lcdproc (easy) and wrote a set of kernel-patches to support the display and 1-4 keys on the speaker as LCD menu keys.

I've also prepared an update to the Fedora lcdproc packages so that they will now support the Z-10 speakers OOTB, if you have these and are running Fedora, then (once the update has reached the repos) a "sudo dnf install lcdproc" followed by unplugging + replugging the Z-10 speakers should make the display come alive and show the standard lcdproc CPU/mem usage screens.
April 15, 2021

Given that the new RDNA2 GPUs provide some support for hardware accelerated raytracing and there is even a new shiny Vulkan extension for it, it may not be a surprise that we’re working on implementing raytracing support in RADV.

Already some time ago I wrote documentation for the hardware raytracing support. As these GPUs contain quite minimal hardware to implement things there is a large software and shader side to implementing this.

And that is what I’ve been up to for the last couple of weeks. And I now have achieved my first personal milestones for the implementation:

  1. A fully recursive Fibonacci shader
  2. And a raytraced cube:

Raytraced cube

This involves writing initial versions for a lot of the software infrastructure needed, so really shows that the basis is getting there.

At the same time we’re quite a ways off from really testing using CTS or running our first real demos. In particular we are missing things like

  • GPU-side BVH building
  • any-hit and intersection shaders
  • Supporting BVH instances, geometry transforms etc.
  • pipeline libraries

and much more, in addition to some of these initial implementations likely not really being performant.

TFW Long Game Loads

I’m typing this up between loads and runs of various games I’m testing, since bug reports for games are somehow already a thing, and there’s a lot of them.

The worst part about testing games is the unbelievably long load times (and startup videos) most of them have, not to mention those long, panning camera shots at the start of the game before gameplay begins and I can start crashing.

But this isn’t a post about games.

No, no, there’s plenty of time for such things.

This is a combo post: part roundup because blogging has been sporadic the past couple weeks, and part feature.

The Roundup

The big Mesa 21.1 branchpoint happened yesterday, and I’m pretty pleased with the state of zink in this upcoming release.

Things you should expect to see:

  • GL 4.6
  • ES 3.1
  • Reasonable performance in many cases

Things you should not expect to see:

  • Most (any?) AAA games working; I’ve kept GL compat contexts clamped to 3.0 in a certainly-futile attempt to cut down on the absolute deluge of bug tickets I’m expecting once everyone tries to run their favorite games/emulators/whathaveyou with the shipped version
    • This functionality remains enabled in zink-wip and will be dumped into mainline soon
  • ???
  • Tough to say, honestly, since this is effectively a version of zink that is, to me, 5-6 months old with a few other patches sprinkled in here and there

And here’s the zink-wip roundup from the past however long since I did the last one:

  • I doubled blending performance in many cases by fixing an incredibly old TODO item regarding using linear tiled images for scanout; this got rushed into the 21.1 release solely to avoid super embarrassing numbers on Phoronix benchmarks.

Yeah, I’m talking to you.

  • I fixed a ton of bugs. Like, actually a ton. Tomb Raider went from an all-you-can-crash buffet to…well, I’ll leave it as a fun surprise for anyone feeling especially intrepid, but it’s definitely playable. So are things that use queries. Don’t ask.

  • There’s totally some other stuff, but I’m too fried to remember it.

The Real Post

Everything up there was just fluff, but this is where the post gets real.

Zink supports formats with alpha-to-one, e.g., RGBX, BGRX, and even (on zink-wip, very illegal) XRGB and XBGR. This is handy for 24bit visuals, such as (probably) all your windows in Xorg. But the method by which these formats are supported comes with its own issues, part of which I’d fixed some time ago in my quest to reduce GPU overhead, and the other part I discovered more recently.

In zink, an alpha-to-one format is just the equivalent format with alpha. So RGBX is just RGBA. This is due to Vulkan not having format equivalents for these types; when sampling from them, a swizzle is applied to force the alpha channel to the maximum value, which yields the correct result.
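
For example (a minimal sketch of the idea, not zink's actual code; image here stands for a previously created VkImage), the sampler view for an RGBX resource can simply force the alpha component to one via the image view's component mapping:

/* Expose VK_FORMAT_R8G8B8A8_UNORM as "RGBX" by forcing alpha to 1 when sampling. */
VkImageViewCreateInfo ivci = {
   .sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO,
   .image = image,
   .viewType = VK_IMAGE_VIEW_TYPE_2D,
   .format = VK_FORMAT_R8G8B8A8_UNORM,
   .components = {
      .r = VK_COMPONENT_SWIZZLE_R,
      .g = VK_COMPONENT_SWIZZLE_G,
      .b = VK_COMPONENT_SWIZZLE_B,
      .a = VK_COMPONENT_SWIZZLE_ONE, /* reads always see alpha == 1.0 */
   },
   .subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 },
};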

But what happens when an RGBX framebuffer attachment is used in a blending operation?

Let’s look at VK_BLEND_OP_ADD as a very simple example. The spec defines this operation as:

As0 × Sa + Ad × Da

That’s alpha-of-src times src-alpha-blend-factor plus alpha-of-dest times dest-alpha-blend-factor yielding the resulting pixel color that gets written to the attachment.

But what if the dest value is expected to always be one, while the actual buffer value is always zero because its alpha channel is never written?

Such is the case with RGBX and the like, and so more steps are required here for full emulation.

The Real Roundup

Here’s how I went about solving the issue.

First, framebuffer attachments have to be monitored when they’re updated, and any time an alpha-to-one attachment is bound, I set a bitflag for it in the pipeline state. This then triggers a pipeline update for the blend state at the time of draw, where I apply clamping to the appropriate blend factors like so:

if (state->zero_alpha_attachments) {
   for (unsigned i = 0; i < state->num_attachments; i++) {
      blend_att[i] = state->blend_state->attachments[i];
      if (state->zero_alpha_attachments & BITFIELD_BIT(i)) {
         blend_att[i].dstAlphaBlendFactor = VK_BLEND_FACTOR_ZERO;
         blend_att[i].srcColorBlendFactor = clamp_zero_blend_factor(blend_att[i].srcColorBlendFactor);
         blend_att[i].dstColorBlendFactor = clamp_zero_blend_factor(blend_att[i].dstColorBlendFactor);
      }
   }
   blend_state.pAttachments = blend_att;
} else
   blend_state.pAttachments = state->blend_state->attachments;

For any of the attachments in the bitfield, three clamps are performed:

  • dstAlphaBlendFactor is clamped to zero, because there will never be any contribution from the dest component of alpha blending
  • srcColorBlendFactor and dstColorBlendFactor are both clamped using the following check:
if (f == VK_BLEND_FACTOR_ONE_MINUS_DST_ALPHA)
   return VK_BLEND_FACTOR_ZERO;
if (f == VK_BLEND_FACTOR_DST_ALPHA)
   return VK_BLEND_FACTOR_ONE;
return f;

Thus, alpha blending is a passthrough operation from the src component, and for color blending, the dest component is always one or zero. This yields correct results in piglit’s spec@arb_texture_float@fbo-blending-formats test, and also potentially enables the hardware to employ some optimizations to reduce the burden of blending.

Exciting.

April 14, 2021

With dbus-broker we have introduced the resource-accounting of bus1 into the D-Bus world. We believe it greatly improves and strengthens the resource distribution of the D-Bus messages bus, and we have already found a handful of resource leaks that way. However, it can be a daunting task to solve resource exhaustion bugs, so I decided to describe the steps we took to resolve a recent resource-leak in the openQA package.

A few days ago, Adam Williamson approached me 1 with a bug in the openQA package, where he saw the log stream filled with messages like:

dbus-broker[<pid>]: Peer :1.<id> is being disconnected as it does not have the resources to receive a reply or unicast signal it expects.
dbus-broker[<pid>]: UID <uid> exceeded its 'bytes' quota on UID <uid>.

This is the typical sign of a resource exhaustion in dbus-broker. When the message broker generates or forwards messages to an individual client, it will queue them as outgoing-messages and push them into the unix-socket of the client. If this client does not dequeue messages, this queue might fill up. If a limit is reached, something needs to be done. Since D-Bus is not a lossy protocol, dropping messages is not an option. Instead, the message broker will either refuse new incoming operations or disconnect a client. All resources are accounted on UIDs; this means multiple clients of the same user will share the same resource limits.

Depending on what message is sent, it is accounted either on the receiver or sender. Furthermore, some messages can be refused by the broker, others cannot. The exact rules are described in the wiki 2.

In the case of openQA, the first step was to query the accounting information of the running message broker:

sudo dbus-send --system --dest=org.freedesktop.DBus --type=method_call --print-reply /org/freedesktop/DBus org.freedesktop.DBus.Debug.Stats.GetStats

(Replace --system with --session to query the session or user bus.)

While preferably this query is performed when the resource exhaustion happens, it will often yield useful information under normal operation as well. Resources are often consumed slowly, so the accumulation will still show up.

The output 3 of this query shows a list of all D-Bus clients with their accounting information. Furthermore, it lists all UIDs that have clients connected to this message bus, again with all accounting information. The challenge is to find suspicious entries in this huge data dump. The most promising solution so far was to search for "OutgoingBytes" and check for big numbers. This shows the number of bytes queued in the message broker for a particular client. It is usually 0, since the kernel queues are big enough to hold most normal messages. Even if it is not 0, it is usually just a couple of KiB.

In this case, we checked for "OutgoingBytes", and found:

dict entry(
    string "OutgoingBytes"
    uint32 62173024
)

62 MiB of messages are waiting to be delivered to that client. Expanding the logs to show the surrounding block, we see:

struct {
    string ":1.211366"
    array [
        dict entry(
            string "UnixUserID"
            variant                            uint32 991
        )
        dict entry(
            string "ProcessID"
            variant                            uint32 674968
        )
        [...]
    ]

    array [
        [...]
        dict entry(
            string "Matches"
            uint32 1
        )
        [...]
        dict entry(
            string "OutgoingBytes"
            uint32 62173024
        )
        [...]
    ]
}

This tells us the PID 674968 of user 991 has roughly 62 MiB of data queued, and it is likely not dequeuing the data. Furthermore, we see it has 1 message filter (D-Bus match rule) installed. D-Bus message filters will cause matching D-Bus signals to be delivered to a client. So a likely problem is that this client keeps receiving signals, but does not dispatch its client socket.

We dug further, and the data dump includes more such clients. Matching back the PIDs to processes via ps auxf, we found that each and every one of those suspicious entries was /usr/bin/isotovideo: backend. The code of this process is part of the os-autoinst repository, in this case qemu.pm. A quick look showed only a single use of D-Bus 4. At a first glance, this looks alright. It creates a system-bus connection via the Net::DBus perl module, dispatches a method-call, and returns the result. However, we know this process has a match-rule installed (assuming the dbus-broker logs are correct), so we checked further and found that the Net::DBus module always installs a match-rule on NameOwnerChanged. Furthermore, it caches the system-bus connection in a global variable, sharing it across users in the same code-base.

Long story short, the os-autoinst qemu module created a D-Bus connection which was idle in the background and never dispatched by any code. However, the connection has a match-rule installed, and the message broker kept sending matching signals to that connection. This data accumulated and eventually exceeded the resource quota of that client. A workaround was quickly provided, and it will hopefully resolve this problem 5.

Hopefully, this short recap will be helpful to debug other similar situations. You are always welcome to message us on bus1-devel@googlegroups or on the dbus-broker GitHub issue tracker if you need help.

April 08, 2021

In RADV we just added an option to speed up rendering by rendering less pixels.

These kinds of techniques have become more common over the past decade with techniques such as checkerboarding, TAA based upscaling and recently DLSS. Fundamentally all they do is trading off rendering quality for rendering cost and many of them include some amount of postprocessing to try to change the curve of that tradeoff. Most notably DLSS has been wildly successful at that to the point many people claim it is barely a quality regression.

Of course increasing GPU performance by up to 50% or so with barely any quality regression seems like a must-have, and I think it would be pretty cool if we could have the same improvements on Linux. I think it has the potential to be a game changer, making games playable on APUs or playing with really high resolutions or framerates on desktops.

And today we took our first baby steps in RADV by allowing users to force Variable Rate Shading (VRS) with an experimental environment variable:

RADV_FORCE_VRS=2x2

VRS is a hardware capability that allows us to reduce the number of fragment shader invocations per pixel rendered. So you could, say, configure the hardware to use one fragment shader invocation per 2x2 pixels. The hardware still renders the edges of geometry exactly, but the inner area of each triangle is rendered with a reduced number of fragment shader invocations.

There are a couple of ways this capability can be configured:

  1. On a per-draw level
  2. On a per-primitive level (e.g. per triangle)
  3. Using an image to configure on a per-region level

This is a new feature for AMD on RDNA2 hardware.
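
For reference, this is roughly what the per-draw variant looks like at the Vulkan API level (a sketch of application-side VK_KHR_fragment_shading_rate usage, with cmdbuf standing in for your command buffer; RADV_FORCE_VRS instead applies a similar rate inside the driver without the application asking for it):

/* Request one fragment shader invocation per 2x2 pixels for subsequent draws. */
VkExtent2D fragment_size = { 2, 2 };
VkFragmentShadingRateCombinerOpKHR combiners[2] = {
   VK_FRAGMENT_SHADING_RATE_COMBINER_OP_KEEP_KHR, /* combine with per-primitive rate */
   VK_FRAGMENT_SHADING_RATE_COMBINER_OP_KEEP_KHR, /* combine with attachment rate */
};
vkCmdSetFragmentShadingRateKHR(cmdbuf, &fragment_size, combiners);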

With RADV_FORCE_VRS we use this to improve performance at the cost of visual quality. Since we did not implement any postprocessing, the quality loss can be pretty bad, so we do not reduce the shading rate when we detect one of the following:

  1. Something is rendered in 2D, as that is likely some UI where you’d really want some crispness
  2. When the shader can discard pixels, as this implicitly introduces geometry edges that the hardware doesn’t see but that significantly impact the visual quality.

As a result there are some games where this has barely any effect but you also don’t notice the quality regression and there are games where it really improves performance by 30%+ but you really notice the quality regression.

VRS is by far the easiest thing to make work in almost all games. Most alternatives like checkerboarding, TAA and DLSS need modified render target size, significant shader fixups, or even a proprietary integration with games. Making changes that deeply is getting more complicated the more advanced the game is.

If we want to reduce render resolution (which would be a key thing in e.g. checkerboarding or DLSS) it is very hard to confidently tie all resolution dependent things together. For example a big cost for some modern games is raytracing, but the information flow to the main render targets can be very hard to track automatically and hence such a thing would require a lot of investigation or a bunch of per game customizations.

And hence we decided to introduce this first baby step. Enjoy!

April 07, 2021

The lavapipe vulkan software rasterizer in Mesa is now reporting Vulkan 1.1 support.

It passes all the CTS tests for the new features in 1.1, but it still fails the same 1.0 tests, so it isn't that close to conformant (line/point rendering are the main areas of issue).

There are also a bunch of the 1.2 features implemented, so that might not be too far away, though 16-bit shader ops and depth resolve are looking a bit tricky.

If there are any specific features anyone wants to see or any crazy places/ideas for using lavapipe out there, please either file a gitlab issue or hit me up on twitter @DaveAirlie


Buffering

The great thing about tomorrow is that it never comes.

Let’s talk about sparse buffers.

What is a sparse buffer? A sparse buffer is a buffer that is not required to be contiguously or fully backed. This means that a buffer larger than the GPU’s available memory can be created, and only some parts of it are utilized at any given time. Because of the non-resident nature of the backing memory, they can never be mapped, instead needing to go through a staging buffer for any host read/write.

In a gallium-based driver, provided that an effective implementation for staging buffers exists, sparse buffer implementation goes almost exclusively through the pipe_context::resource_commit hook, which manages residency of a sparse resource’s backing memory, passing a range to change residency for and an on/off switch.

In zink(-wip), the hook looks like this:

static bool
zink_resource_commit(struct pipe_context *pctx, struct pipe_resource *pres, unsigned level, struct pipe_box *box, bool commit)
{
   struct zink_context *ctx = zink_context(pctx);
   struct zink_resource *res = zink_resource(pres);
   struct zink_screen *screen = zink_screen(pctx->screen);

   /* if any current usage exists, flush the queue */
   if (zink_batch_usage_matches(&res->obj->reads, ctx->curr_batch) ||
       zink_batch_usage_matches(&res->obj->writes, ctx->curr_batch))
      zink_flush_queue(ctx);

   VkBindSparseInfo sparse;
   sparse.sType = VK_STRUCTURE_TYPE_BIND_SPARSE_INFO;
   sparse.pNext = NULL;
   sparse.waitSemaphoreCount = 0;
   sparse.bufferBindCount = 1;
   sparse.imageOpaqueBindCount = 0;
   sparse.imageBindCount = 0;
   sparse.signalSemaphoreCount = 0;

   VkSparseBufferMemoryBindInfo sparse_bind;
   sparse_bind.buffer = res->obj->buffer;
   sparse_bind.bindCount = 1;
   sparse.pBufferBinds = &sparse_bind;

   VkSparseMemoryBind mem_bind;
   mem_bind.resourceOffset = box->x;
   mem_bind.size = box->width;
   mem_bind.memory = commit ? res->obj->mem : VK_NULL_HANDLE;
   mem_bind.memoryOffset = box->x;
   mem_bind.flags = 0;
   sparse_bind.pBinds = &mem_bind;
   VkQueue queue = util_queue_is_initialized(&ctx->batch.flush_queue) ? ctx->batch.thread_queue : ctx->batch.queue;

   VkResult ret = vkQueueBindSparse(queue, 1, &sparse, VK_NULL_HANDLE);
   if (!zink_screen_handle_vkresult(screen, ret)) {
      check_device_lost(ctx);
      return false;
   }
   return true;
}

Naturally there’s a need to enjoy the verbosity of Vulkan structs here, but there are two key takeaways.

The first is that this implementation is likely suboptimal; it should be making better use of semaphores to avoid having to flush the queue if the resource has current-batch usage. That’s complex to implement, however, so I took the same shortcut that RadeonSI does here.

The second is that this is just copying the pipe_box struct to the VkSparseMemoryBind struct. The reason this works with a 1:1 mapping is because the backing resource is allocated with a 1:1 range mapping, so the values can be directly used.

Other than that, the only changes required for this implementation were to add a bunch of checks for the sparse flag on resources during map/unmap to force staging buffers and to use device-local memory instead of host-visible.

Sometimes zink can be simple!

April 06, 2021
For a long time Logitech produced wireless keyboards using 27 MHz as communications band. Although these have not been produced for a while now these are still pretty common and a lot of them are still perfectly serviceable.

But when using them under Linux there is one downside: since the communication is one-way, the wireless link is unencrypted by default, which is kinda bad from a security PoV. These keyboards do support using an encrypted link, but this requires a one-time setup where the user manually enters a key on the keyboard.

I've written a small utility to do this setup under Linux, which should help give these keyboards an extra lease on life and stop them from unnecessarily becoming e-waste. Sometimes these keyboards appear to be broken, while the only problem is that the key in the keyboard and receiver are out of sync; the README also contains instructions on how to reset the keyboard without needing the utility, restoring (unencrypted) functionality.

The 'lg-27MHz-keyboard-encryption-setup' utility is available on Fedora in the 'logitech-27mhz-keyboard-encryption-setup' package.
April 05, 2021

The crocus project was recently mentioned in a phoronix article. The article covered most of the background for the project.

Crocus is a gallium driver to cover the gen4-gen7 families of Intel GPUs. The basic GPU list is 965, GM45, Ironlake, Sandybridge, Ivybridge and Haswell, with some variants thrown in. This hardware currently uses the Intel classic 965 driver. This hardware is all gallium capable, and since we'd like to put the classic drivers out to pasture and remove support for the old infrastructure, it would be nice to have these generations supported by a modern gallium driver.

The project was initiated by Ilia Mirkin last year, and I've spent some time in small bursts moving it forward. There have been some other small contributions from the community. The basis of the project is a fork of the iris driver with the old relocation-based batchbuffer and state management added back in. I started my focus mostly on the older gen4/5 hardware since it was simpler and only supported GL 2.1 in the current drivers. I've tried to clean up support for Ivybridge along the way.

The current status of the driver is in my crocus branch.

Ironlake is the best supported, it runs openarena and supertuxkart, and piglit has only around 100 tests delta vs i965 (mostly edgeflag related) and there is only one missing feature (vertex shader push constants). 

Ivybridge just stopped hanging on the second batch submission, and glxgears runs on it. Openarena starts to the menu but is misrendering, and a piglit run completes with some GPU hangs and quite a large delta. I expect IVB to move faster now that I've solved the worst hang.

Haswell runs glxgears as well.

I think once I take a closer look at Ivybridge/Haswell and can get Ilia (or anyone else) to do some rudimentary testing on Sandybridge, I will start taking a closer look at upstreaming it into Mesa proper.


Woosh

After last week’s post touting the “final” features being added to the upcoming Mesa release, naturally now that this is a new week, I have to outdo myself.

I’ve heard some speculation about zink’s future regarding features. Specifically regarding all the mesamatrix features that aren’t green-ified for zink yet.

So you want features is what you’re saying.

Let’s see where things stand in today’s zink-wip snapshot:

  • GL_OES_tessellation_shader, GL_OES_gpu_shader5 - this is a mesamatrix bug; zink can’t reach GL 4.0 without supporting them, so obviously they are supported
  • GL_ARB_bindless_texture - the final boss
  • GL_ARB_cl_event - not (yet) supported by mesa
  • GL_ARB_compute_variable_group_size - done
  • GL_ARB_ES3_2_compatibility - missing advanced blend from ES3.2
  • GL_ARB_fragment_shader_interlock - done
  • GL_ARB_gpu_shader_int64 - done
  • GL_ARB_parallel_shader_compile - done
  • GL_ARB_post_depth_coverage - done (thanks ajax)
  • GL_ARB_robustness_isolation - not supported by mesa
  • GL_ARB_sample_locations - done
  • GL_ARB_seamless_cubemap_per_texture - needs a new Vulkan extension
  • GL_ARB_shader_ballot - done
  • GL_ARB_shader_clock - done
  • GL_ARB_shader_stencil_export - done
  • GL_ARB_shader_viewport_layer_array - done
  • GL_ARB_shading_language_include - done
  • GL_ARB_sparse_buffer - done
  • GL_ARB_sparse_texture - not supported by mesa
  • GL_ARB_sparse_texture2 - not supported by mesa
  • GL_ARB_sparse_texture_clamp - not supported by mesa
  • GL_ARB_texture_filter_minmax - done
  • GL_EXT_memory_object - TODO
  • GL_EXT_memory_object_fd - TODO
  • GL_EXT_memory_object_win32 - not supported by mesa
  • GL_EXT_render_snorm - done
  • GL_EXT_semaphore - TODO
  • GL_EXT_semaphore_fd - TODO
  • GL_EXT_semaphore_win32 - not supported by mesa
  • GL_EXT_sRGB_write_control - TODO
  • GL_EXT_texture_norm16 - done
  • GL_EXT_texture_sRGB_R8 - TODO
  • GL_KHR_blend_equation_advanced_coherent - same as regular advanced blend
  • GL_KHR_texture_compression_astc_hdr - TODO
  • GL_KHR_texture_compression_astc_sliced_3d - TODO
  • GL_OES_depth_texture_cube_map - done
  • GL_OES_EGL_image - done
  • GL_OES_EGL_image_external - done
  • GL_OES_EGL_image_external_essl3 - done
  • GL_OES_required_internalformat - done
  • GL_OES_surfaceless_context - done
  • GL_OES_texture_compression_astc - TODO
  • GL_OES_texture_float - done
  • GL_OES_texture_float_linear - done
  • GL_OES_texture_half_float - done
  • GL_OES_texture_half_float_linear - done
  • GL_OES_texture_view - same mesamatrix bug since this is a GL 4.3 extension
  • GL_OES_viewport_array - done
  • GLX_ARB_context_flush_control - not supported by mesa
  • GLX_ARB_robustness_application_isolation - not supported by mesa
  • GLX_ARB_robustness_share_group_isolation - not supported by mesa
  • GL_EXT_shader_group_vote - done
  • GL_EXT_multisampled_render_to_texture - TODO
  • GL_EXT_color_buffer_half_float - TODO
  • GL_EXT_depth_bounds_test - done

By my calculations, that’s 11 TODO, 10 not supported, 2 advanced blend, and 1 final boss, a total of 24 out-of-version-extensions not yet implemented out of 54, meaning that 30 are done, tying with i965 and second only to RadeonSI at 33.

New in today’s snapshot: GL_ARB_fragment_shader_interlock, GL_ARB_sparse_buffer, GL_ARB_sample_locations, GL_ARB_shader_ballot, GL_ARB_shader_clock, GL_ARB_texture_filter_minmax

Cross-referencing

Writing blog posts like this is easy, but you know what’s not easy?

Writing good blog posts.

And new to the blogging game is the one, the only, Bas Nieuwenhuizen of RADV founding fame! If you’re at all curious about how drivers actually work, his is definitely a site to follow, as he’s already gone much deeper into explaining my RPCS3 memcpy fail than I ever did.

Tomorrow

Is sparse buffer implementation 101. I’ve said it, so now the blog post has to happen.

April 03, 2021

In this article I show how reading from VRAM can be a catastrophe for game performance and why.

To illustrate I will go back to fall 2015. AMDGPU was just released, it didn’t even have re-clocking yet and I was just a young student trying to play Skyrim on my new AMD R9 285.

Except it ran slowly. 10-15 FPS slowly. Now one might think that is no surprise as due to lack of re-clocking the GPU ran with a shader clock of 300 MHz. However the real surprise was that the game was not at all GPU bound.

As usual with games of that era there was a single thread doing a lot of the work and that thread was very busy doing something inside the game binary. After a bunch of digging with profilers and gdb, it turned out that the majority of time was spent in a single function that accessed less than 1 MiB from a GPU buffer each frame.

At the time DXVK was not a thing yet and I ran the game with wined3d on top of OpenGL. In OpenGL an application does not specify the location of GPU buffers directly, but specifies some properties about how it is going to be used and the driver decides. Poorly in this case.

There was a clear tweak to make to the driver heuristics that choose the memory location, and with it the frame rate of the game more than doubled and the game was now properly GPU bound.

Some Data

After the anecdote above you might be wondering how slow reading from VRAM can really be? 1 MiB is not a lot of data so even if it is slow it cannot be that bad right?

To show you how bad it can be I ran some benchmarks on my system (Threadripper 2990WX, 4-channel DDR4-3200 and a RX 6800 XT). I checked read/write performance using a 16 MiB buffer (512 MiB for system memory, to avoid the test being contained in the L3 cache).

We look into three allocation types that are exposed by the amdgpu Linux kernel driver:

  • VRAM. This lives on the GPU and is mapped with Uncacheable Speculative Write Combining (USWC) on the CPU. This means that accesses from the CPU are not cached, but writes can be write-combined.

  • Cacheable system memory. This is system memory that has caching enabled on the CPU and there is cache snooping to ensure the memory is coherent between the CPU and GPU (up to the top-level caches; the GPU caches do not participate in the coherence).

  • USWC system memory. This is system memory that is mapped with Uncacheable Speculative Write Combining on the CPU. This can lead to slight performance benefits compared to cacheable system memory due to lack of cache snooping.

For context, in Vulkan this would roughly correspond to the following memory types:

Hardware: Vulkan memory property flags
VRAM: VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT
Cacheable system memory: VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT
USWC system memory: VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT

The benchmark resulted in the following throughput numbers:

method (MiB/s)       VRAM   Cacheable System Memory   USWC System Memory
read via memcpy        15                     11488                  137
write via memcpy    10028                     18249                11480

I furthermore tested handwritten for-loops accessing 8,16,32 and 64-bit elements at a time and those got similar performance.
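
For the curious, such a throughput measurement needs nothing fancy. A simplified sketch (not the exact benchmark used here; src stands for a pointer obtained by mapping a VRAM, cacheable or USWC buffer, e.g. via vkMapMemory):

#define _POSIX_C_SOURCE 199309L
#include <string.h>
#include <time.h>

/* Time a single large memcpy out of the mapped buffer and report MiB/s.
 * A real benchmark would average several iterations. */
static double read_mib_per_s(const void *src, size_t size, void *scratch)
{
   struct timespec t0, t1;

   clock_gettime(CLOCK_MONOTONIC, &t0);
   memcpy(scratch, src, size);          /* the "read via memcpy" row above */
   clock_gettime(CLOCK_MONOTONIC, &t1);

   double seconds = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
   return (size / (1024.0 * 1024.0)) / seconds;
}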

This clearly shows that reads from VRAM using memcpy are ~766x slower than memcpy reads from cacheable system memory, and that even reads from non-cacheable (USWC) system memory are ~84x slower than from cacheable system memory. Reading even small amounts from these can cause severe performance degradations.

Writes show a difference as well, but the difference is not nearly as significant. So if an application does not select the best memory location for their data for CPU access it is still likely to result in a reasonable experience.

APUs Are Affected Too

Even though APUs do not have VRAM they still are affected by the same issue. Typically the GPU gets a certain amount of memory pre-allocated at boot time as a carveout. There are some differences in how this is accessed from the GPU so from the perspective of the GPU this memory can be faster.

At the same time the Linux kernel only gives uncached access to that region from the CPU, so one could expect similar performance issues to crop up.

I did the same test as above on a laptop with a Ryzen 5 2500U (Raven Ridge) APU, and got results that are not dissimilar from my workstation.

method (MiB/s)       Carveout   Snooped System Memory   USWC System Memory
read via memcpy           108                   10426                  108
write via memcpy        11797                   20743                11821

The carveout performance is virtually identical to the uncached system memory now, which is still ~97x slower than cacheable system memory. So even though it is all system memory on an APU, care still has to be taken in how the memory is allocated.

What To Do Instead

Since the performance cliff is so large it is recommended to avoid this issue if at all possible. The following three methods are good ways to avoid the issue:

  1. If the data is only written from the CPU, it is advisable to use a shadow buffer in cacheable system memory (can even be outside of the graphics API, e.g. malloc) and read from that instead (see the sketch after this list).

  2. If this is written by the GPU but not frequently, one could consider putting the buffer in snooped system memory. This makes the GPU traffic go over the PCIE bus though, so it has a trade-off.

  3. Let the GPU copy the data to a buffer in snooped system memory. This is basically an extension of the previous item by making sure that the GPU accesses the data exactly once in system memory. The GPU roundtrip can take a non-trivial wall-time though (up to ~0.5 ms measured on some low end APUs), some of which is size-independent, such as command submission. Additionally this may need to wait till the hardware unit used for the copy is available, which may depend on other GPU work. The SDMA unit (Vulkan transfer queue) is a good option to avoid that.
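
A minimal sketch of the shadow-buffer idea from item 1 (all names here are made up for illustration; gpu_ptr is assumed to be a persistently mapped VRAM buffer):

#include <string.h>

struct shadowed_buffer {
   void  *gpu_ptr;   /* mapped VRAM (USWC): write-only from the CPU         */
   void  *cpu_copy;  /* malloc'ed, cacheable: all CPU reads go through this */
   size_t size;
};

static void shadowed_write(struct shadowed_buffer *buf, size_t offset,
                           const void *data, size_t len)
{
   memcpy((char *)buf->cpu_copy + offset, data, len); /* keep the shadow up to date */
   memcpy((char *)buf->gpu_ptr + offset, data, len);  /* write-combined, still fast */
}

static const void *shadowed_read(const struct shadowed_buffer *buf, size_t offset)
{
   /* never memcpy out of gpu_ptr: that is the slow path measured above */
   return (const char *)buf->cpu_copy + offset;
}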

Other Limitations

Another problem with CPU access from VRAM is the BAR size. Typically only the first 256 MiB of VRAM is configured to be accessible from the CPU and for anything else one needs to use DMA.

If the working set of what is allocated in VRAM and accessed from the CPU is large enough the kernel driver may end up moving buffers frequently in the page fault handler. System memory would be an obvious target, but due to the GPU performance trade-off that is not always the decision that gets made.

Luckily, due to the recent push from AMD for Smart Access Memory, large BARs that encompass the entire VRAM are now much more common on consumer platforms.

April 02, 2021

This is the first post of this blog and with it being past midnight I couldn’t be bothered making one about a technical topic. So instead here is an explanation of my plans with the blog.

I got inspired by the prolific blogging of Mike Blumenkrantz and some discussion on the VKx discord that some actually written updates can be very useful, and that I don’t need to make a paper out of each one.

At the same time I have been involved in some longer running things on the driver side which I think could really use some updates as progress is made. Consider for example raytracing, DRM format modifiers, RGP support and more.

I have no plans at all to be as prolific as Mike by a long shot, but I think the style of articles is probably a good template of what to expect from this blog.

April 01, 2021

I’m Trying

Blogging is tough, but I’m getting the posts out there one way or another.

Today marks what is likely to be the last of the “big” changes to zink in Mesa 21.1 before the merge window closes in less than two weeks, and what changes they were.

Threaded context support is now implemented, which, on its own, makes zink vaguely competitive against native GL drivers. I’d expect that for many scenarios, people should start seeing upwards of 60-70% native perf when previously the numbers were much lower, excepting things like furmark, where a weird problem with alpha blending is still causing a massive perf hit.

If that wasn’t enough, my timeline semaphore handling also snuck in, providing a reduction in CPU overhead for asynchronous queue-related operations where supported. Special thanks to Vulkan crash test dummy and uninitialized variable enthusiast Lionel Landwerlin for tripping and falling over basically every line of code in this implementation to help get it to the finish line for your consumption.

And if that still wasn’t enough, my RADV draw dispatch refactor also landed yesterday, both paving the way for some totally secret future work of mine and also bringing a 3-4% reduction in CPU overhead for draws that will make your gaming feel faster now that you’re aware of it but realistically won’t have any discernible effect. Basically racing stripes.

March 31, 2021

Do As I Say, Not As Zink Does

Today, a brief post and lamentation.

I’m sure everyone is well aware of Vulkan semantics regarding array vs non-array image types: when using an array type, e.g., VK_IMAGE_TYPE_2D with VkImageCreateInfo::arrayLayers > 1, always use array members for accessing/copying/blitting, and when using a 3D type, always use depth members.

This means array types should use baseArrayLayer and layerCount for copying/blitting and arrayPitch for accessing subresource regions. Non-array types, specifically 3D types, should use VkExtent3D::depth and VkSubresourceLayout::depthPitch.
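
Concretely, a copy of N slices looks like this for the two cases (a sketch with made-up variable names; the point is which field carries N):

/* 2D array image: the slices go in layerCount, depth stays 1. */
VkImageCopy array_copy = {
   .srcSubresource = { VK_IMAGE_ASPECT_COLOR_BIT, 0, /*baseArrayLayer*/ 0, /*layerCount*/ N },
   .dstSubresource = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, N },
   .extent = { width, height, 1 },
};

/* 3D image: the slices go in extent.depth, layerCount stays 1. */
VkImageCopy volume_copy = {
   .srcSubresource = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, /*layerCount*/ 1 },
   .dstSubresource = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, 1 },
   .extent = { width, height, /*depth*/ N },
};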

This is really important, as I’ve found out over the past week, given that this has not been handled as it should’ve in many places throughout the zink stack. Some drivers were cool and didn’t make a big deal about it. Other drivers were more accurate and have just been failing all along.

And Before I Forget

I was recently interviewed by Boiling Steam, a small Linux gaming-oriented news site focused on creating original content and interviewing only the most important figures within the community (like me). If you’ve ever wanted to know more about the rich open source pedigree of Super Good Code, the interview goes deep into the back catalogue of how things got to this point.

March 30, 2021

Yeah, Again

It’s been a while since I blogged about descriptors, so let’s fix that since this site may as well be called Super Good Descriptors.

Last time, I talked a bit about descriptors 3.0: lazy descriptors. The idea I settled on here was to do templated updates, and to do the absolute minimal amount of work possible for each draw while still never reusing any written-to descriptor sets once they’d been cycled out.

Lazy descriptors worked out great, and I’m sure many of you have grown to enjoy the ZINK: USING LAZY DESCRIPTORS log message that got spammed on startup over the past couple months of zink-wip.

Well, times have changed, and this message will no longer fill your terminal by default after today’s (20210330) zink-wip snapshot.

Modes

New today is the ZINK_DESCRIPTORS environment variable, supporting three modes of operation:

  • auto - the default, which attempts to detect system capabilities and use caching with templated updates
  • lazy - the mode we all know and love from the past few months
  • notemplates - this is the old-style caching mechanism

auto mode moves zink one step closer to my eventual goal, which is to be able to use driconf to do application-based mode changes to disable caching for apps which never reuse resources. Also potentially useful would be the ability to dynamically disable caching on a pipeline-by-pipeline basis while an application is running if too many cache misses are detected.

Necessary?

With that said, I’ve come to the conclusion that any form of caching may actually be, at best, equivalent to uncached mode for the general desktop user, and it may only be worthwhile for special cases, like Vulkan drivers which can’t do descriptor templates or embedded devices. In my latest testing (on desktop systems), I have yet to see any scenarios where lazy mode fails to provide the best performance.

ARM in particular seems to gain a lot from it, as the post shows a ~40% perf improvement. It’s unclear to me, however, whether any benchmarking was done against a highly optimized uncached implementation like I’ve done. The overhead from doing basic descriptor updating without templates is definitely significant, so it might just be that things are different now that more functionality is available. On Linux systems, at least, every Vulkan driver that matters supports descriptor templates, so this is functionality that can be relied upon.
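
To make "templated updates" concrete, here is roughly what they look like at the API level (a generic Vulkan sketch, not zink's implementation; device, set, set_layout and ubo are assumed to already exist): the template records once where each descriptor lives in a CPU-side blob, and updating a whole set is then a single call plus a pointer to that blob.

/* Describe the layout of the CPU-side data blob once... */
VkDescriptorUpdateTemplateEntry entry = {
   .dstBinding = 0,
   .dstArrayElement = 0,
   .descriptorCount = 1,
   .descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,
   .offset = 0,                              /* where the VkDescriptorBufferInfo sits in the blob */
   .stride = sizeof(VkDescriptorBufferInfo),
};
VkDescriptorUpdateTemplateCreateInfo tci = {
   .sType = VK_STRUCTURE_TYPE_DESCRIPTOR_UPDATE_TEMPLATE_CREATE_INFO,
   .descriptorUpdateEntryCount = 1,
   .pDescriptorUpdateEntries = &entry,
   .templateType = VK_DESCRIPTOR_UPDATE_TEMPLATE_TYPE_DESCRIPTOR_SET,
   .descriptorSetLayout = set_layout,
};
VkDescriptorUpdateTemplate tmpl;
vkCreateDescriptorUpdateTemplate(device, &tci, NULL, &tmpl);

/* ...then each update is one call instead of building VkWriteDescriptorSet arrays. */
VkDescriptorBufferInfo info = { .buffer = ubo, .offset = 0, .range = VK_WHOLE_SIZE };
vkUpdateDescriptorSetWithTemplate(device, set, tmpl, &info);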

Is zink-wip slow now?

No.

The goal of zink-wip is to provide an optimal testing environment with the absolute bleeding edge in terms of performance and features. The auto mode should provide that, and the cases I’ve seen where its performance is noticeably worse number exactly one, and it’s a subtest for drawoverhead. If anyone finds any other cases where auto is worse than lazy, I’m interested, but it shouldn’t be a concern.

With that said, it might be worth doing some benchmarking between the two for some extremely high CPU usage scenarios, as that’s the only case where it may be possible to detect a difference. Gone are the days of zink(-wip) hogging the whole CPU, so probably this is just useless pontificating to fill more of a blog page.

But also, if you’re doing any kind of benchmarking on a high-end CPU, I’d probably recommend going with the lazy mode for now.

Game-changers

I’m pleased with the current state of descriptor caching, but it does bother me that it isn’t dramatically better than the uncached mode on my desktop. I think this ultimately just comes down to the current cache implementation being split into two steps:

  • compute the descriptor cache key
  • lookup the set

This effectively sits on top of the lazy mode, moving work out of the Vulkan driver and into the cache lookup any time there’s a cache hit. As such, I’ve been considering working to shift some of this work into threads, though this is somewhat challenging given the current gallium API. Specifically, only SSBOs and shader images can be per-stage updated immediately after bind, as both UBOs and samplers bind only a single descriptor slot at a time, meaning there’s no way to know when the “final” one is bound.

But then again, I’ve certainly reached the point of diminishing returns now. Most applications that I test have minimal CPU usage for the zink driver thread (e.g., Unigine Superposition is only at about 10% utilization on a i7-6700K), and are instead bottlenecking hard in the GPU, so I think it’s time to call things “good enough” here unless things change in a significant way.

March 28, 2021

A restrictive end-user license agreement is one way a company can exert power over the user. When the free software movement was founded thirty years ago, these restrictive licenses were the primary user-hostile power dynamic, so permissive and copyleft licenses emerged as synonyms to software freedom. Licensing does matter; user autonomy is lost with subscription models, revocable licenses, binary-only software, and onerous legal clauses. Yet these issues pertinent to desktop software do not scratch the surface of today’s digital power dynamics.

Today, companies exert power over their users by: tracking, selling data, psychological manipulation, intrusive advertising, planned obsolescence, and hostile Digital “Rights” Management (DRM) software. These issues affect every digital user, technically inclined or otherwise, on desktops and smartphones alike.

The free software movement promised to right these wrongs via free licenses on the source code, with adherents arguing free licenses provide immunity to these forms of malware since users could modify the code. Unfortunately most users lack the resources to do so. While the most egregious violations of user freedom come from companies publishing proprietary software, these ills can remain unchecked even in open source programs, and not all proprietary software exhibits these issues. The modern browser is nominally free software containing the trifecta of telemetry, advertisement, and DRM; a retro video game is proprietary software but relatively harmless.

As such, it’s not enough to look at the license. It’s not even enough to consider the license and a fixed set of issues endemic to proprietary software; the context matters. Software does not exist in a vacuum. Just as proprietary software tends to integrate with other proprietary software, free software tends to integrate with other free software. Software freedom in context demands a gentle nudge towards software in user interests, rather than corporate interests.

How then should we conceptualize software freedom?

Consider the three adherents to free software and open source: hobbyists, corporations, and activists. Individual hobbyists care about tinkering with the software of their choice, emphasizing freely licensed source code. These concerns do not affect those who do not make a sport out of modifying code. There is nothing wrong with this, but it will never be a household issue.

For their part, large corporations claim to love “open source”. No, they do not care about the social movement, only the cost reduction achieved by taking advantage of permissively licensed software. This corporate emphasis on licensing is often to the detriment of software freedom in the broader context. In fact, it is this irony that motivates software freedom beyond the license.

It is the activist whose ethos must apply to everyone regardless of technical ability or financial status. There is no shortage of open source software, often of corporate origin, but this is insufficient – it is the power dynamic we must fight.

We are not alone. Software freedom is intertwined with contemporary social issues, including copyright reform, privacy, sustainability, and Internet addiction. Each issue arises as a hostile power dynamic between a corporate software author and the user, with complicated interactions with software licensing. Disentangling each issue from licensing provides a framework to address nuanced questions of political reform in the digital era.

Copyright reform generalizes the licensing approaches of the free software and free culture movements. Indeed, free licenses empower us to freely use, adapt, remix, and share media and software alike. However, proprietary licenses micromanaging the core of human community and creativity are doomed to fail. Proprietary licenses have had little success preventing the proliferation of the creative works they seek to “protect”, and the rights to adapt and remix media have long been exercised by dedicated fans of proprietary media, producing volumes of fanfiction and fan art. The same observation applies to software: proprietary end-user license agreements have stopped neither file sharing nor reverse-engineering. In fact, a unique creative fandom around proprietary software has emerged in video game modding communities. Regardless of legal concerns, the human imagination and spirit of sharing persists. As such, we need not judge anyone for proprietary software and media in their life; rather, we must work towards copyright reform and free licensing to protect them from copyright overreach.

Privacy concerns are also traditional in software freedom discourse. True, secure communications software can never be proprietary, given the possibility of backdoors and impossibility of transparent audits. Unfortunately, the converse fails: there are freely licensed programs that inherently compromise user privacy. Consider third-party clients to centralized unencrypted chat systems. Although two users of such a client privately messaging one another are using only free software, if their messages are being data mined, there is still harm. The need for context is once more underscored.

Sustainability is an emergent concern, tying to software freedom via the electronic waste crisis. In the mobile space, where deprecating smartphones after a few short years is the norm and lithium batteries are hanging around in landfills indefinitely, we see the paradox of a freely licensed operating system with an abysmal social track record. A curious implication is the need for free device drivers. Where proprietary drivers force devices into obsolescence shortly after the vendor abandons them in favour of a new product, free drivers enable long-term maintenance. As before, licensing is not enough; the code must also be upstreamed and mainlined. Simply throwing source code over a wall is insufficient to resolve electronic waste, but it is a prerequisite. At risk is the right of a device owner to continue use of a device they have already purchased, even after the manufacturer no longer wishes to support it. Desired by climate activists and the dollar conscious alike, we cannot allow software to override this right.

Beyond copyright, privacy, and sustainability concerns, no software can be truly “free” if the technology itself shackles us, dumbing us down and driving us to outrage for clicks. Thanks to television culture spilling onto the Internet, the typical citizen has less to fear from government wiretaps than from themselves. For every encrypted message broken by an intelligence agency, thousands of messages are willingly broadcast to the public, seeking instant gratification. Why should a corporation or a government bother snooping into our private lives, if we present them on a silver platter? Indeed, popular open source implementations of corrupt technology do not constitute success, an issue epitomized by free software responses to social media. No, even without proprietary software, centralization, or cruel psychological manipulation, the proliferation of social media still endangers society.

Overall, focusing on concrete software freedom issues provides room for nuance, rather than the traditional binary view. End-users may make more informed decisions, with awareness of technologies’ trade-offs beyond the license. Software developers gain a framework to understand how their software fits into the bigger picture, as a free license is necessary but not sufficient for guaranteeing software freedom today. Activists can divide-and-conquer.

Many outside of our immediate sphere understand and care about these issues; long-term success requires these allies. Claims of moral superiority by licenses are unfounded and foolish; there is no success backstabbing our friends. Instead, a nuanced approach broadens our reach. While abstract moral philosophies may be intellectually valid, they are inaccessible to all but academics and the most dedicated supporters. Abstractions are perpetually on the political fringe, but these concrete issues are already understood by the general public. Furthermore, we cannot limit ourselves to technical audiences; understanding network topology cannot be a prerequisite to private conversations. Overemphasizing the role of source code and under-emphasizing the power dynamics at play is a doomed strategy; for decades we have tried and failed. In a post-Snowden world, there is too much at stake for more failures. Reforming the specific issues paves the way to software freedom. After all, social change is harder than writing code, but with incremental social reform, licenses become the easy part.

The nuanced analysis even helps individual software freedom activists. Purist attempts to refuse non-free technology categorically are laudable, but outside a closed community, going against the grain leads to activist burnout. During the day, employers and schools invariably mandate proprietary software, sometimes used to facilitate surveillance. At night, popular hobbies and social connections today are mediated by questionable software, from the DRM in a video game to the surveillance of a chat with a group of friends. Cutting ties with friends and abandoning self-care as a prerequisite to fighting powerful organizations seems noble, but is futile. Even without politics, there remain technical challenges to using only free software. Layering in other concerns, or perhaps foregoing a mobile smartphone, only amplifies the risk of software freedom burnout.

As an application, this approach to software freedom brings to light disparate issues with the modern web raising alarm in the free software community. The traditional issue is proprietary JavaScript, a licensing question, yet considering only JavaScript licensing prompts both imprecise and inaccurate conclusions about web “applications”. Deeper issues include rampant advertising and tracking; the Internet is the largest surveillance network in human history, largely for commercial aims. To some degree, these issues are mitigated by script, advertisement, and tracker blockers; these may be pre-installed in a web browser for harm reduction in pursuit of a gentler web. However, the web’s fatal flaw is yet more fundamental. By design, when a user navigates to a URL, their browser executes whatever code is piped on the wire. Effectively, the web implies an automatic auto-update, regardless of the license of the code. Even if the code is benign, it grows more expensive to run each year, forcing a hardware upgrade cycle that deprecates old hardware which would still work if only the web weren’t bloated by corporate interests. A subtler point is the “attention economy” tied into the web. While it’s hard to become addicted to reading in a text-only browser, binge-watching DRM-encumbered television is a different story. Half-hearted advances like “Reading Mode” are limited by the ironic distribution of documents over an app store. On the web, disparate issues of DRM, forced auto-update, privacy, sustainability, and psychological dark patterns converge into a single worst-case scenario for software freedom. The licenses were only the beginning.

Nevertheless, there is cause for optimism. Framed appropriately, the fight for software freedom is winnable. To fight for software freedom, fight for privacy. Fight for copyright reform. Fight for sustainability. Resist psychological dark patterns. At the heart of each is a software freedom battle – keep fighting and we can win.

See also

Declaration of Digital Autonomy

Local-first software: You own your data, in spite of the cloud

The WWWorst App Store

March 25, 2021

 Mike, the zink dev, mentioned that swiftshader seemed slow at some stuff and I realised I've never expended much effort in checking swiftshader vs llvmpipe in benchmarks.

The thing is that CPU rendering is going to top out on memory bandwidth pretty quickly, but I decided to do some rough napkin benchmarks using the Vulkan samples from Sascha Willems.

I'd also assumed llvmpipe would be the slower of the two, since swiftshader has a few dedicated devs and is used by Google instead of Mesa for lots of things, whereas llvmpipe hasn't really gotten dedicated development resources.

I picked a random smattering of Vulkan samples and ran them on my Ryzen workstation without doing anything else, in their default window size.

The first number is lavapipe fps, the second swiftshader fps.

  • gears: 336 309
  • instancing: 3 3
  • ssao: 19 9
  • deferredmultisampling:  11 4
  • computeparticles: 9 8
  • computeshader: 73 57
  • computeshader sharpen: 54 34

I guess Swift is just a good marketing name. Now I'm not sure why llvmpipe/lavapipe isn't more of a development target for those devs; imagine how much better it could be if it had full-time dedicated devs on it.

Enhance Your Pipe!

It’s no secret that CPU renderers are slower than GPU renderers. But at the same time, CPU renderers are crucial for things like CI and also not hanging your current session by testing on a live GPU when you’re deep into Critical Rewrites.

So I’ve spent some time today doing some rewrites in the *pipe section of mesa, and let’s just say that the pipe was good with zink before, but it’s much, much better now.

Here’s where I started on piglit’s drawoverhead test under Lavapipe:

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                 1517, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                 1519, 100.2%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                 1112, 73.3%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                  524, 34.5%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                  583, 38.4%

Nothing too incredible. Modern CPUs with a discrete GPU should be pulling upwards of 15,000k draws/s, and hitting 10% of that is sort of okay.jpg.

Pipe Faster!

But what if I implemented my Mesa multidraw extensions in Lavapipe?

goku.jpg

The gist of this extension is that when no other draw parameters have changed except for the starting vertex/index and the number of vertices/indices, an array of those values can just be dumped into the Vulkan driver to optimize out a lot of draw dispatch/validation code, mirroring Marek Olšák’s multidraw work in Gallium.
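
To make the idea concrete, here’s a tiny, self-contained C sketch of it; every name in it (draw_entry, validate_draw_state, emit_draw, multi_draw) is made up for illustration and is not the actual Mesa/Gallium/Lavapipe interface:

#include <stdio.h>

struct draw_entry {
   unsigned start;   /* first vertex/index */
   unsigned count;   /* number of vertices/indices */
};

static void validate_draw_state(void)
{
   /* stands in for the expensive per-draw dispatch/validation work */
   printf("validating draw state once\n");
}

static void emit_draw(unsigned start, unsigned count)
{
   printf("draw: start=%u count=%u\n", start, count);
}

/* With multidraw, state is validated once and the array of draws is
 * replayed directly instead of paying the validation cost per draw. */
static void multi_draw(const struct draw_entry *draws, unsigned num_draws)
{
   validate_draw_state();
   for (unsigned i = 0; i < num_draws; i++)
      emit_draw(draws[i].start, draws[i].count);
}

int main(void)
{
   const struct draw_entry draws[] = { {0, 3}, {3, 3}, {6, 3} };
   multi_draw(draws, 3);
   return 0;
}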

The results, just from doing basic support in Lavapipe without adding any handling in LLVMpipe underneath, speak for themselves:

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                 2945, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                 2626, 89.2%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                 1702, 57.8%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                 2881, 97.8%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                 2888, 98.1%

Yup, that’s a 100% perf increase.

Pipe Beyond Space And Time!

But then I thought to myself: What if LLVMpipe could also handle me dumping all this info in?

Indeed, by spending way too much time rewriting and refactoring deep Mesa internals, I have gone beyond:

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                 5273, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                 4804, 91.1%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                 2941, 55.8%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                 5052, 95.8%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                 5057, 95.9%

A 3.5x performance boost in drawoverhead seemed pretty good, and it also cut my GLES3 CTS runtime by about 20%, so I think I’m done here for now.

Pipe Fight!

After my recent Swiftshader misadventure, wherein I discovered that it’s missing (at least) transform feedback and conditional render extension support, meaning that it can’t be used to create a legitimate GL 3.0+ context, esoteric rendering expert Dave Airlie took benchmarking matters into his own hands today to battle it out against the other name-brand Vulkan CPU renderer.

The results were shocking.

March 23, 2021

Iago recently wrote a blog post about performance improvements on the v3d compiler, and introduced our plans to improve pipeline caching (specifically of the compiled shaders) (full blog here). We merged some improvements recently, so let’s talk about that work.

Pipeline cache improvements

While analysing the performance of RBDOOM-3-BFG, we noticed that significant CPU time was being spent every frame on linking shaders.

RBDOOM-3-BFG screenshot

After some investigation, we found that the game was calling ClearAttachment twice every frame. The implementation of those ClearAttachments was relying on a full job with a graphics pipeline. On v3dv, by default, any pipeline is created with a pipeline cache (provided by the user, or a default one). On v3dv (and in general in any Vulkan driver) the main cached data are the compiled shaders, so the main objective of the pipeline cache is to avoid full shader re-compilation on compatible pipelines that are used really often. So why was that time spent on linking shaders?
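
For reference, the application-side call being made here is vkCmdClearAttachments; a rough sketch of such a call follows, with the cleared aspect, clear value and rectangle invented for the example rather than taken from the game:

#include <vulkan/vulkan.h>

/* Clear the depth aspect of the current subpass; on v3dv this can end up
 * implemented as a full draw job with its own graphics pipeline. */
static void
clear_depth_attachment(VkCommandBuffer cmd, VkExtent2D extent)
{
   const VkClearAttachment att = {
      .aspectMask = VK_IMAGE_ASPECT_DEPTH_BIT,
      .clearValue.depthStencil = { .depth = 1.0f, .stencil = 0 },
   };
   const VkClearRect rect = {
      .rect = { .offset = { 0, 0 }, .extent = extent },
      .baseArrayLayer = 0,
      .layerCount = 1,
   };
   /* must be called inside a render pass */
   vkCmdClearAttachments(cmd, 1, &att, 1, &rect);
}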

The issue was that for each pipeline lookup on the pipeline cache we were doing two cache lookups. The first one was against a cache with the shaders in NIR, which is the main intermediate representation for shaders in Mesa (more info about intermediate representations here). Then we used those shaders to fill up the key for a second cache lookup which, if successful, returns the compiled shader in Broadcom (QPU) assembly format.

The reason for this two-step lookup is that to compile a shader we call the common (for both OpenGL and Vulkan) Broadcom compiler, and we use some data structures that contain info that will affect the compilation (like whether blending is enabled). When we implemented the pipeline cache support, for simplicity, we used the same data structures as part of the cache key. But as such keys were filled with info coming from the NIR shaders, those needed to be linked together in the case of graphics pipelines.

When we analyzed how to improve this, we realized that in order to identify the compiled shader we don’t really need the NIR shaders, as the info derived from them is implicit in the SPIR-V shaders provided to create the pipeline. The NIR shaders are only really needed to compile the shader. The improvement here was using a different data structure as part of the cache key, and replacing the two-cache lookup with a one-cache lookup. We needed to do some additional changes, as there were parts of the code that assumed the NIR shaders would be available, but now we skip getting them when possible.

Numbers

Let’s start showing the improvement with a synthetic test. We took one CTS test using a complex shader, and forced the pipeline to be re-created 1000 times. We get the following times:

  • Before this work, disabling pipeline cache: 125.79 seconds
  • Before this work, enabling pipeline cache (default): 11.41 seconds
  • With this work, enabling pipeline cache: 0.87 seconds

That’s a clear improvement. So how about real applications? As mentioned, we started this work after analyzing RBDOOM-3-BFG’s use of ClearAttachment. That game got an improvement of ~1fps. But when we tested other games we didn’t get any improvement in that case. This is because using a full job for ClearAttachment isn’t the preferred option (it is in fact the slowest path), and because the other games are GPU limited.

But we got improvements on other cases. As mentioned the advantage of creating the pipeline with a hot cache is that we avoid the shader recompilation, so it is far faster. And there are some apps that create pipelines at runtime. We found that this happens with the Unreal Engine 4 demos, specifically the Shooter Game. So, for example, we start the demo like this:

UE4 demo screenshot, before shooting gun

and then we decide to shoot for the first time:

UE4 demo screenshot, while shooting gun

in order to render the shooting effect, new pipelines are created and several shaders are compiled. Due to all that extra work we experience a noticeable FPS hiccup. We can visualize it with this graph:

On that graph we can see how we go from ~25fps to ~2fps. That is the shooting moment.

For these cases, the ideal would be to start the game with a pipeline cache loaded with the outcome of previous executions of the game. So, adding a hack to simulate that situation, and focusing on the relevant stat in this case, which is the minimum FPS, we get the following:

  • Cold cache: ~2fps
  • Hot cache, before this work: ~6fps
  • Hot cache, with this work: ~8fps

on-disk-cache

As mentioned, the ideal would be for applications to use a pipeline cache when creating their pipelines, store its contents on disk, and load it on following executions. Among other things, because the application has more control over what to store and load at each moment (for example: store/load the pipeline cache for a given level of a game).
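
A minimal sketch of that application-side flow with the standard Vulkan pipeline cache API could look like this (error handling is omitted and the file name "pipeline_cache.bin" is just an example):

#include <stdio.h>
#include <stdlib.h>
#include <vulkan/vulkan.h>

/* Seed a VkPipelineCache from a file saved by a previous run. */
static VkPipelineCache
load_pipeline_cache(VkDevice device)
{
   void *data = NULL;
   size_t size = 0;
   FILE *f = fopen("pipeline_cache.bin", "rb");
   if (f) {
      fseek(f, 0, SEEK_END);
      size = ftell(f);
      fseek(f, 0, SEEK_SET);
      data = malloc(size);
      if (fread(data, 1, size, f) != size)
         size = 0;
      fclose(f);
   }

   const VkPipelineCacheCreateInfo info = {
      .sType = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO,
      .initialDataSize = size,
      .pInitialData = data,
   };
   VkPipelineCache cache;
   vkCreatePipelineCache(device, &info, NULL, &cache);
   free(data);
   return cache;
}

/* Write the cache contents back to disk, typically at shutdown or after a
 * loading screen, so the next run starts with a hot cache. */
static void
save_pipeline_cache(VkDevice device, VkPipelineCache cache)
{
   size_t size = 0;
   vkGetPipelineCacheData(device, cache, &size, NULL);
   void *data = malloc(size);
   vkGetPipelineCacheData(device, cache, &size, data);

   FILE *f = fopen("pipeline_cache.bin", "wb");
   if (f) {
      fwrite(data, 1, size, f);
      fclose(f);
   }
   free(data);
}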

The reality is that not all applications do that. As mentioned, the numbers in the previous section were simulating that ideal situation, but the application was not doing it. In order to mitigate that, we added support for an on-disk cache. Mesa provides a framework to store and load shaders on disk, so basically for any pipeline cache lookup we now have an extra fallback. In this case, for the UE4 Shooter demo, with a hot on-disk cache we get a minimum of ~7fps, which, as expected, is better than the situation before our work, but worse than the ideal case of the application handling the store/load of the pipeline cache on disk.

Some conclusions

So, some conclusions from this work, which apply to any Vulkan driver, not only v3dv in particular:

  • If possible, pre-create all the pipelines that you would use during your application runtime. Creating pipelines during runtime, even if cached, could lead to performance hiccups.
  • If that is not possible (for example if the number of possible combinations is too high), use a pipeline cache and store/load its contents on disk. This will reduce loading times and make the runtime performance hiccups less noticeable. Note that v3dv having support for a general on-disk cache doesn’t mean that it will do this for you.
March 18, 2021

Where We At

The blogging is happening once again, and it feels good.

I did a brief recap a couple posts ago, but I’ve gotten some sleep since then, and now it’s time for the real deal. Here’s what you missed while you were asleep and not following the Mesa RSS feed like all the coolest people I know:

  • zink-wip is shrinking. no, really. it’s down from close to 600 patches to around 200 after the past month
  • descriptor caching has landed, likely yielding around an 80-100% performance increase across the board for applications which were CPU-bound. huge, huge thanks to legendary reviewer, RADV developer, and all around great guy, Bas Nieuwenhuizen for tackling this way-too-fucking-big, 50+ patch MR in an amazingly short amount of time
  • no, zink still can’t render glxgears properly
  • zink now has lavapipe-based CI jobs going to prevent regressions!
  • Erik Faye-Lund has managed a one-liner MR that fixed over a hundred unit tests in CI
  • buffer invalidation is in, which lets zink avoid stalling the GPU in a lot of cases, also improving performance
  • with all this said, another giant thanks to l*v*pipe guru and part-time motivational graphics coach, Dave Airlie, who has reviewed nearly 200 of my patches in this release cycle
  • lavapipe is ramping up towards Vulkan 1.1 support, which will, other than being a great milestone and feat of software engineering, enable zink CI testing through GL 4.5
  • less than a hundred patches to go before zink’s threaded context support is up for merging

Good times.

March 17, 2021
I spent quite a bit of time on getting a Sierra Wireless EM7345-LTE modem to work under Linux. So here are some quick instructions to help other people who may hit the same problem.

These modems are somewhat notorious for shipping with broken firmware. They work fine after a firmware upgrade, but under Windows they will only upgrade to "carrier approved" firmware versions, which requires being connected to the mobile network first so that the tool can identify the carrier. And with some carriers, connecting to the network does not work due to the broken firmware (ugh). There are a ton of forum threads on how to work around this under Windows, but they all require that you are at least able to register with the mobile network.

Luckily someone has figured out how to update these under Linux and posted instructions for this. The procedure is actually much more straightforward under Linux. The hardest part is extracting the firmware from the Windows driver download.

One problem is that the necessary Intel FlashTool is no longer available for download from Intel. I needed this tool a while ago for something else, and back then I used the PhoneFlashTool_5.8.4.0.rpm file from https://androiddatahost.com/nm466 . The rpm-file in the zip there has a sha256sum of: c5b964ed4fae470d1234c9bf24e0beb15a088a73c7e8e6b9369a68697020c17a

I now see that Intel seems to again be offering this for download itself; you can find it here: https://github.com/projectceladon/tools and projectceladon seems to be an official Intel project. Note this is not the version which I used, I used the PhoneFlashTool_5.8.4.0.rpm version.

Once you have the Intel FlashTool installed, just follow the posted instructions and after that your modem should start working under Linux.

Last year, I worked on Turnip driver development as my daily job at Igalia. One of those tasks was implementing support for the VK_KHR_depth_stencil_resolve extension.

VK_KHR_depth_stencil_resolve

This extension adds support for automatically resolving multisampled depth/stencil attachments in a subpass in a similar manner as for color attachments. This extension, however, does not add support for resolving MSAA depth/stencil images with the vkCmdResolveImage() command.

As you can imagine, this extension is easy for any application to use. Unless you are using a driver that supports Vulkan 1.2 or higher (VK_KHR_depth_stencil_resolve was promoted to core in Vulkan 1.2), you should first check that it is supported. Once that is done, ask the driver which features are supported by extending VkPhysicalDeviceProperties2 with a VkPhysicalDeviceDepthStencilResolveProperties structure when calling vkGetPhysicalDeviceProperties2().

struct VkPhysicalDeviceDepthStencilResolveProperties {
    VkStructureType       sType;
    void*                 pNext;
    VkResolveModeFlags    supportedDepthResolveModes;
    VkResolveModeFlags    supportedStencilResolveModes;
    VkBool32              independentResolveNone;
    VkBool32              independentResolve;
}

This structure will be filled in by the driver to indicate the different depth/stencil resolve modes it supports (more info about their meaning and possible values in the spec).
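
For example, such a query could look like this (a sketch that assumes an existing VkPhysicalDevice handle; it uses the Vulkan 1.2 core names, with the *_KHR aliases applying when only the extension is available):

#include <stdbool.h>
#include <vulkan/vulkan.h>

static bool
supports_sample_zero_resolve(VkPhysicalDevice physical_device)
{
   VkPhysicalDeviceDepthStencilResolveProperties ds_resolve = {
      .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_DEPTH_STENCIL_RESOLVE_PROPERTIES,
   };
   VkPhysicalDeviceProperties2 props = {
      .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2,
      .pNext = &ds_resolve,
   };
   vkGetPhysicalDeviceProperties2(physical_device, &props);

   return (ds_resolve.supportedDepthResolveModes & VK_RESOLVE_MODE_SAMPLE_ZERO_BIT) &&
          (ds_resolve.supportedStencilResolveModes & VK_RESOLVE_MODE_SAMPLE_ZERO_BIT);
}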

Next step: just fill a VkSubpassDescriptionDepthStencilResolve struct with the resolve mode wanted and the depth/stencil attachment used for resolve. Then, extend the VkSubpassDescription2 struct with it. And that’s all.

struct VkSubpassDescriptionDepthStencilResolve {
    VkStructureType                  sType;
    const void*                      pNext;
    VkResolveModeFlagBits            depthResolveMode;
    VkResolveModeFlagBits            stencilResolveMode;
    const VkAttachmentReference2*    pDepthStencilResolveAttachment;
}
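
Putting it together could look roughly like this (attachment index 1 and the sample-zero resolve modes are illustrative; the usual color and multisampled depth/stencil attachments are omitted for brevity):

#include <vulkan/vulkan.h>

static const VkAttachmentReference2 ds_resolve_ref = {
   .sType = VK_STRUCTURE_TYPE_ATTACHMENT_REFERENCE_2,
   .attachment = 1,   /* the single-sample depth/stencil resolve target */
   .layout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL,
   .aspectMask = VK_IMAGE_ASPECT_DEPTH_BIT | VK_IMAGE_ASPECT_STENCIL_BIT,
};

static const VkSubpassDescriptionDepthStencilResolve ds_resolve = {
   .sType = VK_STRUCTURE_TYPE_SUBPASS_DESCRIPTION_DEPTH_STENCIL_RESOLVE,
   .depthResolveMode = VK_RESOLVE_MODE_SAMPLE_ZERO_BIT,
   .stencilResolveMode = VK_RESOLVE_MODE_SAMPLE_ZERO_BIT,
   .pDepthStencilResolveAttachment = &ds_resolve_ref,
};

static const VkSubpassDescription2 subpass = {
   .sType = VK_STRUCTURE_TYPE_SUBPASS_DESCRIPTION_2,
   .pNext = &ds_resolve,
   .pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS,
   /* pColorAttachments / pDepthStencilAttachment set up as usual */
};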

Turnip implementation

Implementing this extension in Turnip was more or less straightforward, although there were some issues to fix.

Support was added for all depth/stencil formats in both resolve paths used by the driver, one for sysmem (system memory) and the other for gmem (tile buffer), including flushing the depth cache when needed. However, for the VK_FORMAT_D32_SFLOAT_S8_UINT format, which has the stencil part in a separate plane, specific code was needed to extend the 2D path used in the driver for sysmem resolves.

However, the main issue appeared when testing the extension implementation on the Adreno 630 GPU. It turns out that the VK-GL-CTS tests exercising this extension for MSAA VK_FORMAT_D24_UNORM_S8_UINT formats were always failing, except when disabling UBWC via the TU_DEBUG=noubwc environment variable. UBWC (Universal Bandwidth Compression) is a HW feature designed to improve throughput to system memory by minimizing the bandwidth of data (which also gives some power savings). The problem was that UBWC support for the MSAA VK_FORMAT_D24_UNORM_S8_UINT format is known to be failing on Adreno 630 and Adreno 618 (see the merge request for the freedreno driver). I just needed to disable it in Turnip to fix these failures.

I also found other VK_KHR_depth_stencil_resolve CTS tests failing: the ones testing the format compatibility for the VK_FORMAT_D32_SFLOAT_S8_UINT and VK_FORMAT_D24_UNORM_S8_UINT formats. For the VK_FORMAT_D32_SFLOAT_S8_UINT failures, we needed to take into account the particularity that it has a separate plane for the stencil part when resolving it to VK_FORMAT_S8_UINT. In the VK_FORMAT_D24_UNORM_S8_UINT failures, the problem was that we were setting the resolve mode used by the HW incorrectly: it was doing a sample average when we wanted to use the value of sample 0. This merge request fixed both issues.

And that’s all. This was an extension that allowed me to dive into the different resolve paths used by Turnip and learn one or two things about the HW ;-) Thanks a lot to Jonathan Marek for his reviews and suggestions to improve the implementation of this extension.

Happy hacking!

March 16, 2021

Lately, I have been looking at improving performance of the V3DV Vulkan driver for the Raspberry Pi 4. So far we had been toying a lot with some Vulkan ports of the Quake trilogy but we wanted to have a look at more modern games as well, and for that we started to look at some Unreal Engine 4 samples, particularly the Shooter demo.


Unreal Engine 4 Shooter Demo

In our initial tests with “High” settings, even at 480p we were running the sample at 8-15 fps, with 720p being mostly in the 5-10 fps range. Not great, obviously, but a good opportunity to guide and focus our performance efforts.

One aspect of the UE4 sample that was immediately obvious compared to the Quake games is that the shading is a lot more expensive in general, and more specifically, it involves more texture lookups and UBO loads, which require expensive accesses to memory from the shader cores, so this was the first area we targeted. The good thing about this is that because our Vulkan and OpenGL drivers share the compiler stack, any optimizations we do here benefit both drivers.

What follows is a brief discussion of some of the optimizations we did recently to our backend compiler and the results we have observed from this work.

Optimizing the backend compiler

So the first thing we tackled was better managing the latency of texture lookups. Interestingly, the NIR scheduler was setting this up so that we would try to put instructions that consumed the result of a texture lookup as far away as possible from the instruction that triggered the lookup, but then the backend compiler was not fully taking advantage of this and would still end up waiting on lookup results sooner than it should.

Fixing this helped performance by 1%-3%, although it could go a bit above that in some cases. It has a caveat though: doing this extends the liveness of our lookup sequences, and that makes spilling more difficult (we can’t emit spills/unspills in the middle of an outstanding memory lookup), so when register pressure is high enough that we need to inject register spills to compile the shader, we would typically be a lot more constrained and end up producing significantly worse code, or even worse, failing to register allocate for the shader completely. To avoid this, we recompile the shader without the optimization if we detect that we need to do any spilling. One can use V3D_DEBUG=perf to detect if this is happening for any shaders, looking for messages like this:

Falling back to strategy 'disable TMU pipelining' for MESA_SHADER_FRAGMENT.

While the above optimization was useful, I was expecting it to make a larger impact for this particular demo, so I kept looking for ways to make our memory lookups more efficient. One thing that is relevant to this analysis is that we were using the same hardware unit for both texture and UBO lookups, but for the latter we could really use a different strategy by handling our UBOs as uniform streams. This has some caveats, but to make a long story short: given that many UBO loads use uniform addresses, that we usually read a bunch of consecutive scalars from a UBO load (such as a full vec4), and that applications usually emit UBO loads for nearby addresses together, we can emit fairly optimal code for many of these, leading to more efficient memory access in general.

Eventually, this turned out to be a big win, yielding 20%-30% improvements for the Shooter demo even with the initial basic implementation, which we would then tune and optimize further.

Again related to memory accesses, I have also been improving how we schedule the instructions involved with setting up memory lookups. Our scheduler was more restrictive here than it needed to be, and the extra flexibility can help reduce instruction counts, which will affect these more modern games most, as they are more likely to emit a larger number of texture/image operations in their shaders.

Another thing we did was to improve instruction packing. The V3D GPU is able to emit multiple instructions in the same cycle so long as the instructions meet some requirements. Turns out our assembly scheduler was being too restrictive with this and we could do better. Going by shader-db results, this led to ~5% less instructions on our programs and added another modest 1%-2% performance improvement for the Shooter demo.

Another issue we noticed is that a lot of the shaders were passing a lot of varyings from the vertex to the fragment shaders, and our setup for fragment shader inputs was not optimal. The issue here was that there is a specific instruction involved in this process that writes two registers, one of them with an instruction of delay, and our dependency tracking was not able to handle this properly, effectively assuming that both registers are written in the same instruction, which then had an impact on how we scheduled instructions. Fixing this required handling this case specially in our scheduler so we could be more effective at scheduling these instructions in a way that enables optimal pipelining of the instructions involved with varying setups for fragment shaders.

Another aspect we improved was related to our uniform handling. Generally, there are many instances in which we need to emit duplicate uniforms. There are reasons for that related to how the GPU works, but in some cases, for example with consecutive UBO loads, we would emit the uniform with the base address of the UBO multiple times very close to each other. We can obviously do better by trying to track previous uses of a uniform/constant value and reusing them in nearby instructions. Of course, this comes at the expense of increasing register pressure (for reasons beyond the scope of this post our shaders typically require a lot of uniforms), so there is a balancing game to play here too. Reducing the size of our uniform streams this way also plays an important role in lowering some of the CPU overhead of the driver, since these streams need to be rebuilt often when certain pipeline states change.
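
To illustrate just the general idea (the window size, names and data layout below are invented for the example, they are not the actual v3d compiler code), reusing a recently emitted uniform value looks conceptually like this:

#include <stdio.h>

#define REUSE_WINDOW 4   /* invented look-back window */

/* Return the slot to use for `value`: a recently emitted slot that already
 * holds it, or a new slot appended to the stream. */
static unsigned
get_uniform_slot(unsigned *stream, unsigned *count, unsigned value)
{
   unsigned start = *count > REUSE_WINDOW ? *count - REUSE_WINDOW : 0;
   for (unsigned i = start; i < *count; i++) {
      if (stream[i] == value)
         return i;   /* reuse instead of emitting a duplicate */
   }
   stream[*count] = value;
   return (*count)++;
}

int main(void)
{
   unsigned stream[16], count = 0;
   unsigned ubo_base = 0x1000;
   /* two consecutive UBO loads would previously emit the base address twice */
   printf("slot %u\n", get_uniform_slot(stream, &count, ubo_base));
   printf("slot %u\n", get_uniform_slot(stream, &count, ubo_base));
   return 0;
}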

Finally, an optimization that was more targeted at reducing register pressure than improving performance: we noticed that we would sometimes put some instructions far away from their consumers with no obvious benefit. This was sometimes bad enough that it would even cause us to be unable to compile some shaders. Mesa has a handy NIR pass for this called nir_opt_sink, which proved to be helpful here. This allowed us to get a few more shaders from shader-db to compile and to reduce spilling for a bunch of shaders. For the Shooter demo, this took a large compute shader involved with histogram post-processing from 48 spills and 50 fills down to only 8 spills and 15 fills. While the impact on performance of this is probably very small since the game only runs this pass once per frame, it made a notable difference in loading time, since compiling a shader with this much spilling is very slow at present. I have a mental note to improve this some day; I know Intel managed to fix this for their compiler, but for the time being, this alone managed to make the loading time much more reasonable.

Results

First, here are some shader-db stats, which describe how these optimizations change various statistics for a large collections of real world shaders:

total instructions in shared programs: 14992758 -> 13731927 (-8.41%)
instructions in affected programs: 14003658 -> 12742827 (-9.00%)
helped: 80448
HURT: 4297

total threads in shared programs: 407932 -> 412242 (1.06%)
threads in affected programs: 4514 -> 8824 (95.48%)
helped: 2189
HURT: 34

total uniforms in shared programs: 4069524 -> 3790401 (-6.86%)
uniforms in affected programs: 2267834 -> 1988711 (-12.31%)
helped: 40588
HURT: 1186

total max-temps in shared programs: 2388462 -> 2322009 (-2.78%)
max-temps in affected programs: 897803 -> 831350 (-7.40%)
helped: 30598
HURT: 2464

total spills in shared programs: 6241 -> 5940 (-4.82%)
spills in affected programs: 3963 -> 3662 (-7.60%)
helped: 75
HURT: 24

total fills in shared programs: 14587 -> 13372 (-8.33%)
fills in affected programs: 11192 -> 9977 (-10.86%)
helped: 90
HURT: 14

total sfu-stalls in shared programs: 28106 -> 31489 (12.04%)
sfu-stalls in affected programs: 16039 -> 19422 (21.09%)
helped: 4382
HURT: 6429

total inst-and-stalls in shared programs: 15020864 -> 13763416 (-8.37%)
inst-and-stalls in affected programs: 14028723 -> 12771275 (-8.96%)
helped: 80396
HURT: 4305

Less is better for all stats, except threads. We can see significant improvements across the board: we generally produce shaders with fewer instructions and lower maximum register pressure, we reduce spills and uniform counts, and we can run with more threads. We are only worse at stalls, but that is generally because we now produce more compact code with fewer instructions, so more stalls are expected.

Another good thing in these stats is the large number of helped shaders compared to hurt shaders, meaning that it is very likely that these optimizations will help most applications to some extent.

But enough of boring compiler statistics; most people won’t care about that, and what they want to know is how this impacts performance on actual games and applications, which is what the following graph shows (these results were obtained by replaying specific traces with gfx-reconstruct). Keep in mind that while I am using a collection of Vulkan samples/games here, it is expected that these optimizations apply to OpenGL too.


Before vs After

Framerate improvement after optimization (in %)

As can be observed from the graph above, this optimization work made a significant impact on the observed framerate in all cases. It is not surprising that the UE4 demo is the one that sees the most improvement, considering it is the one we used to guide most of the optimization work.

Other optimizations and future work

In this post I have been focusing exclusively on compiler optimizations, but we have also been improving other parts of the Vulkan driver. While I won’t go into details to avoid making this post too long, we have also been improving aspects of the driver involved with buffer to image copies, depth buffer clears, dirty descriptor state management, usage of the TFU unit for transfer operations and more.

Finally, there is one other aspect of this UE4 demo that is pretty obvious as soon as you start a game: it can compile a lot of shaders in the middle of the gameplay loop, which can lead to significant stutter. There is not much we can do about this on the driver side, but adding support for a shader cache on disk should eliminate the problem on sessions after the first, so this is something that we may work on in the future.

We will certainly continue to look at improving the driver performance in the future so stay tuned for further updates on our progress or maybe join us at #videocore on Freenode.

Superposition

I got a weird bug report the other day. Apparently Unigine Superposition, the final boss of Unigine benchmarks, was broken in zink(-wip). And it was a regression, which meant that at some point it had worked, but because testing is hard, it then stopped working for a while.

The Problem (superposition wtf u doin?)

I had no idea what the problem was other than a vague impression it might be queue-related given the perf output in the bug report. The first step here was to figure out what I was dealing with. Off I went to the two horsemen of all queue-related bugs: zink_flush() and zink_fence_finish().

Here’s the latter of the two:

static bool
zink_fence_finish(struct zink_screen *screen, struct pipe_context *pctx, struct zink_tc_fence *mfence,
                  uint64_t timeout_ns)
{
   pctx = threaded_context_unwrap_sync(pctx);
   struct zink_context *ctx = zink_context(pctx);

   if (screen->device_lost)
      return true;

   if (pctx && mfence->deferred_ctx == pctx && mfence->deferred_id == ctx->curr_batch) {
      zink_context(pctx)->batch.has_work = true;
      /* this must be the current batch */
      pctx->flush(pctx, NULL, !timeout_ns ? PIPE_FLUSH_ASYNC : 0);
      if (!timeout_ns)
         return false;
   }

   /* need to ensure the tc mfence has been flushed before we wait */
   bool tc_finish = tc_fence_finish(ctx, mfence, &timeout_ns);
   struct zink_fence *fence = mfence->fence;
   if (!tc_finish || (fence && !fence->submitted))
      return fence ? p_atomic_read(&fence->completed) : false;

   /* this was an invalid flush, just return completed */
   if (!mfence->fence)
      return true;
   /* if the zink fence has a different batch id then it must have completed and been recycled already */
   if (mfence->fence->batch_id != mfence->batch_id)
      return true;

   return zink_vkfence_wait(screen, fence, timeout_ns);
}

In short, the control flow goes:

  • if the GPU rejected a previous cmdbuf, fail
  • detect and then submit deferred flushes (i.e., the app has requested a sync object and at some later point decided to wait on it)
  • verify that the cmdbuf represented by this fence has actually been submitted
  • detect cases where the internal fence (mfence->fence) no longer belongs to the gallium sync object (mfence)
  • finally, attempt to actually wait on the fence

So this is all fine and good, and it’s been working terrifically for many months. But I put a printf at the top, and it turns out this was being spammed nonstop during load with the app constantly checking the same sync object to see if it was finished.

It wasn’t finished.

Specifically, the return fence ? p_atomic_read(&fence->completed) : false; case was being hit, which led me to dig deeper into what was going on.

Another printf in zink_flush() (function contents omitted to spare the eyes of anyone under the legal drinking age) revealed that the sync object reaching zink_fence_finish was, in fact, the second one created by the app, which was then being waited upon twenty-something flushes later.

So in this case, mfence->deferred_id was something like 2 and the current batch id in ctx->curr_batch was around 20. The last-completed batch id was also around 20.

Revision

A slight modification here resolved the issue:

if (pctx && mfence->deferred_ctx == pctx) {
   if (mfence->deferred_id == ctx->curr_batch) {
      zink_context(pctx)->batch.has_work = true;
      /* this must be the current batch */
      pctx->flush(pctx, NULL, !timeout_ns ? PIPE_FLUSH_ASYNC : 0);
      if (!timeout_ns)
         return false;
   }
   /* this batch is known to have finished */
   if (mfence->deferred_id <= screen->last_finished)
      return true;
}

This way, cases where an app is randomly waiting on something that has finished so far in the past that there’s no longer any trace of it can just become a no-op success, and everything is happy.

Probably.

Testing is ongoing, but it seems like everything is happy now.

March 15, 2021

As we are heading towards April and the release of Fedora Workstation 34 I wanted to post an update on what we are working on for this release and what we are looking at going forward. 2020 was a year where we focused a lot on polishing what we had and getting things past the finish line and Fedora Workstation 34 is going to be the culmination of that effort in many ways.

Wayland:
The big ticket item we have wanted to close off on was Wayland, because while Wayland has been production ready for most of us for a while, there were still some cases it didn’t cover as well as X.org. The biggest of these was of course the lack of accelerated XWayland support with the binary NVidia driver. Fixing that issue wasn’t something we could do ourselves, but we have been working diligently with our friends at NVidia to help ensure everything was in place for them to enable that support in their driver, so I have been very happy to see the public reports confirming that NVidia will have accelerated 3D in the summer release of their driver. The other Wayland area we have put a lot of effort into has been the work undertaken by Jonas Ådahl to get headless display support working with Wayland. This is a critical feature for people who, for instance, want a desktop instance on their servers or in the cloud that they access through things like VNC or RDP for sysadmin related tasks. Jonas spent a lot of time laying the groundwork for this over the course of last year and we are now in the final stages of merging the patches to enable this feature in GNOME and Wayland in preparation for Fedora Workstation 34. Once those two items are out we consider our Wayland rampup/rollout to be complete, so while there will of course continue to be bugfixes and new features implemented, that will be part of a natural evolution of Wayland and not part of a ‘close gaps with X11’ effort like now.

PipeWire
Another big ticket item we are hoping to release fully in Fedora Workstation 34 is PipeWire. PipeWire, as most of you know, is the engine we use to deal with handling video streams in a secure and shareable way in Fedora Workstation, so when you interact with your web camera(s) or do screen casting or make screenshots it is all routed and handled by PipeWire. But in Fedora Workstation 34 we are aiming to also switch to using PipeWire for audio, to replace both PulseAudio and Jack. For those of you who have read any previous blog posts from me, you will know what an important step forward this will be, as we would finally be making the pro-audio community first class citizens in Fedora Workstation and Linux in general. When we decided to try to switch to PipeWire in Fedora Workstation 34 I have to admit I was a little skeptical about whether we would be able to get everything ready in time, as there are so many things that need to be tested and fixed when you switch out such a critical component. Up to that point we had a lot of people interested in PipeWire, but only limited community involvement. However, I feel the announcement of bringing in PipeWire for Fedora Workstation 34 galvanized the community around the project and we now have a very active community around PipeWire in #pipewire on the freenode IRC network. Not only is Wim Taymans getting a ton of help with testing and verification, but we also see a steady stream of patches coming in, with for instance improved Bluetooth audio support being contributed. In fact I believe that PipeWire will be able to usher in better Bluetooth audio support in Fedora than we ever had before, with great support for high quality Bluetooth audio codecs like LDAC.

I am especially happy to see so many of the key members of the pro-audio Linux community taking part in this effort, and am of course also happy to see many pro-audio folks testing Fedora Workstation for the first time due to this effort. The community is working closely with Wim to test and verify as many important ProAudio applications as possible and to update Fedora packaging as needed to ensure they can transition from Jack to PipeWire without dependency conflicts or issues. One last item to mention here is that you might have seen that Red Hat is getting into the automotive space. I can’t share a lot of details about that effort, but one thing I can say is that PipeWire will be a core part of it and thus we will soon be looking to hire more engineers to work on PipeWire, so if that is of interest to you be sure to track my twitter feed or blog as I will announce our job openings there as they become available. For the community at large this should be great too, as it means that we can get a lot of synergy between automotive and the desktop around audio and video handling.

It is still somewhat of an open question if we end up actually switching to PipeWire in Fedora Workstation 34, but things are looking good at this point in time and worst case scenario it will be in place for Fedora Workstation 35.

Toolbox
Toolbox is another effort that is in a great spot now. Toolbox is our tool for making working with pet containers a breeze for developers. The initial version was prototyped quickly by writing it as a shell script, but we spent time last year getting it rewritten in Go in order to make it possible to keep expanding the project and allow us to implement all the things we envision for it. With that done, feature work is now in focus again and Ondřej Michal has done some great work making it possible to set up RHEL containers in Toolbox. This means that you can run Fedora on your laptop and get the latest and greatest features that way, but do your development in a RHEL pet container, so you get an environment identical to what your applications will see once they are deployed into the cloud or onto company hardware. This gives you the best of both worlds in my opinion: the fast moving Fedora Workstation that brings the most out of your laptop and desktop hardware, but still easy access to the RHEL platform for development and testing. You can even test this today on Fedora Workstation 33, just open a terminal and type ‘toolbox create --distro rhel --release 8.3’. The resulting toolbox will then be based on RHEL and not Fedora and thus perfect for doing RHEL targeted development. You will need to use the subscription-manager tool to register it (be sure to register on developer.redhat.com for your free RHEL developer subscription). Over time we hope to integrate this into GNOME Online Accounts like we do for the RHEL virtual machines you can set up with GNOME Boxes, so that once you set up your RHEL account you can create RHEL virtual machines and RHEL containers easily on Fedora Workstation.

Toolbox with RHEL

Toolbox pet container with RHEL UBI

Flatpak
Owen Taylor has been doing some incredible work behind the scenes for the last year trying to ensure the infrastructure we have in RHEL and Fedora provides a great integrated Flatpak experience. As we move forward we expect Flatpaks to be the primary packaging format that Fedora users consume their applications in, but to make that a reality we needed to ensure the experience is good both for Fedora maintainers and for the end users. So one of the big ticket items Owen has been working on is getting incremental updates working in Fedora. If you have used applications from Flathub you probably noticed that their updates are small and nimble despite being packaged as Flatpak containers, while the Fedora flatpaks cause big updates each time. The reason for this is that the Fedora flatpaks are shipping as OCI (Open Container Initiative) images, while the Flatpaks on Flathub are shipping as OStree repositories (if you don’t know OStree, think of it as git for binaries). Shipping the Flatpaks as OCI images has advantages in the form of being the same format we at Red Hat use for our kubernetes/docker/openshift containers, and thus it allows us to reuse a lot of the work that Red Hat has put into ensuring we can provide and keep such containers up to date and secure, but the downside until now has been that these containers were shipped in a way which caused each update, no matter how small the change, to be a full re-download of the whole image. Well, Owen Taylor and Alex Larsson worked together to resolve this and came up with a method to allow incremental updates of such containers and thus bring the update sizes in line with what you see on Flathub for Flatpaks. This should be deployed in time for Fedora Workstation 34 and we also hope to eventually deploy it for kubernetes/docker containers too. Finally, to make even more applications available, we are doing work to enable people to get access to Flathub.org out of the box in Fedora when you enable 3rd party repositories in initial setup, so that your out of the box application selection will be even bigger.

Flathub frontpage

Flathub webpage

GNOME 40
Another major change in Fedora Workstation 34 is GNOME 40, which contains a revamp of the GNOME 3 user interface. This was a collaborative effort between a lot of GNOME 3 stakeholders, with Allan Day representing Red Hat. This was also an effort by the GNOME design community to up their game, and thus as part of the development process the GNOME Foundation paid a professional company to do user testing on the proposed changes and some of the alternatives. This means that the changes were verified to actually be experienced as an improvement by the experienced GNOME user participants and to feel intuitive for new users of GNOME. One advantage we have in Fedora is that we don’t do major tweaking of the GNOME user interface, which means once Fedora Workstation 34 ships you are set to enjoy the new GNOME experience from day one. For long time GNOME users I hope and expect that the updates will be a welcome refresh, and at the same time that the changes provide an easier onramp for new GNOME and Fedora Workstation users. Some of the early versions did lead some long term fans of how multimonitor support in GNOME 3 worked to be a bit concerned, but be assured that multi monitor is a critical usecase in our opinion and something we have been looking at and will keep improving. In fact Allan Day wrote a great blog post about GNOME 40 multimonitor support recently to explain what we are doing and how we see it evolving going forward.

Input
Another area where we keep putting in a lot of effort is input. Thanks to Peter Hutterer and Benjamin Tissoires we keep making sure Fedora Workstation and the world of Linux have access to the newest and best in input. The latest effort they are working on has been to enable haptic touchpads. Haptic touchpads should be familiar to people who have tried Apple hardware, but they are expected to appear in force on laptops in general this year, so we have been putting in the effort to ensure that we can support this new type of device as they come out. So if you see laptops you want with haptic touchpads, then Fedora Workstation should be ready for them, but of course until these devices are commonplace and we have had a chance to test and verify, I can make no guarantees.

Another major effort that we undertook in relation to input was moving the GNOME input handling to a separate thread. Carlos Garnacho worked on this patch to make that happen. This should provide a smoother experience with Fedora Workstation 34, as it means the mouse should not stall due to the main thread running Wayland being busy. This was done as part of the overall performance work we have been continuously doing over the last years to address performance issues and make Fedora and GNOME have the best performance and performance related behaviour possible.

Lenovo Laptops
So one of the major announcements of last year was Lenovo laptops with Fedora Linux pre-installed. There are currently two models available with Linux, the X1 Carbon and the Lenovo P1. Between them they cover the two most common requests we see: an ultralight laptop with the X1 and a more powerful ‘portable workstation’ model with the P1. We are working on a couple more models and also on getting them sold globally, which was a key goal of the effort. Be aware that both models are on sale as I am writing this (hopefully still true when you read this), so it is a good time to grab a great laptop with a great OS.

Lenovo P1

Vision
So one thing I wanted to do is tie the work we do in Fedora Workstation together by articulating what we are trying to achieve. Fedora has for the longest time been the place where Linux as an operating system is evolving and being developed. There are very few major innovations that have come to Linux that haven’t been developed and incubated in Fedora by Fedora contributors, including of course Red Hat. This includes things like the Linux Vendor Firmware Service, Wayland, Flatpak, Silverblue, PipeWire, systemd, flicker free boot, HiDPI support, gaming mouse support and so much more. We have always done this work in close cooperation with the upstreams we are collaborating with, which is why the patch delta in any given Fedora release is small. We work hard to get improvements into the upstream kernel and into GNOME and so on right away, to avoid needing to ship downstream patches in Fedora. That of course saves us from having to maintain temporary forks, but more importantly it is the right way to collaborate with an open source community.

Looking back to when we launched Fedora Workstation, we realized that being at the front like that had come at the cost of not being stable and user friendly. So the big question we tried to ask ourselves when launching Fedora Workstation, and the question that still drives a lot of our decision making and focus, is: how can we preserve being the heart and center of Linux OS development, but at the same time provide end users with a stable and well functioning system? To achieve that we have made a lot of changes over the last years, ranging from policy changes in terms of how and when we bring changes into Fedora to, maybe even more importantly, a lot of focus on figuring out ways to reduce the challenges caused by a rapidly evolving OS, like the introduction of Flatpaks to allow applications to be developed and released without strong ties to the host system libraries, the concepts we are maturing in Silverblue around image based operating systems, and how we are looking at pet container development with Toolbox. All of these things combined remove a lot of the fragility we have seen in Linux up to this point and instead let us treat the rapidly evolving Linux landscape as a strength.

So where we are today is that I think we are very close to realizing the vision of letting Fedora be the place where exciting new stuff happens, while at the same time providing the robustness and polish that end users need to be able to use it as their daily driver. It has been my daily driver for many years now, and judging by the rapid growth of users we have seen in Fedora over the last 5 years I think that is true for a lot of other people too. The goal is to allow the wider community around Linux, especially the developers, sysadmins and creators relying on Linux to do their work, to come to Fedora and interact and collaborate with the developers working on the OS itself to the benefit of all. You are all probably better judges than me as to whether we are succeeding with that, but I do take the increased chatter about and adoption of Fedora by a lot of people doing Linux related podcasts, news sites and so on as a sign that we are succeeding. And PipeWire for me is a perfect example of how this can look, where we want to bring the pro-audio creators in to Fedora Workstation and let them interact and work closely with Wim Taymans and the PipeWire development team to make the experience even better for themselves and their fellow creators, and at the same time give them a great stable platform to create their music on.

March 12, 2021

A New Post

Blogging is hard.

There’s a big switch that a brain needs to do in order to go from ramming code into a project at full speed to carefully crafting an internet post that potentially people might want to read, and some days the effort required to make that switch exceeds the available energy.

Such is (at least one of) the reason(s) why more graphics driver-y people don’t blog more, since it’s taxing enough for me to try and get posts out and I mostly just post memes.

But this time is definitely going to be a return to the normalcy that is posting more than once per week. The key is just to start posting and keep doing it.

So what’s been going on since the last post?

Well.

Lots.

Since Our Last Correspondence…

  • Some number of patches landed
    • zink is now conformant for the enhanced layouts tests in CTS (KHR-GL46.enhanced_layouts*)
  • Mesa 21.0 is going out (has already gone out? I’m bad at time) with GL 4.1 enabled in zink
    • enjoy crashes and GPU hangs in style with a driver that nobody else is using!
  • Definitely other stuff that’s good and useful but isn’t in the scope of this post

Easing Back In

Tomorrow’s post is going to be back to the usual. In fact, I’m going to be posting about some tc-related stuff, and now that I’ve said it, I can’t back out.

March 10, 2021

Over the last year and then some, we at Collabora have been working with Microsoft on their D3D12 mapping layer, which I announced in my previous blog post. In July, Louis-Francis Ratté-Boulianne wrote an update on the status on the Collabora blog, but a lot has happened since then, so it’s time for another update.

There are two major things that have happened since then: we have passed the OpenGL 3.3 conformance tests, and we have upstreamed the code into Mesa 3D.

Photoshop support

It might not be a big surprise, but one of the motivations for this work was to be able to run applications like Photoshop on Windows devices without full OpenGL support.

I’m happy to report that Microsoft has released their compatibility pack that uses our work to provide OpenGL (and OpenCL) support, and Photoshop can now run on Windows on ARM CPUs! It is pretty exciting to see high-profile applications like that benefit from our work!

OpenGL 3.3 Conformance Test Suite

First of all, I would like to point out that having passed the OpenGL CTS isn’t necessarily the same as being formally conformant. There are some complicated details about how layered implementations formally become conformant, and I’ll leave the question of formal conformance up to Microsoft and Khronos.

Instead, I want to talk a bit about passing the OpenGL CTS.

Challenges for layered implementations

A problem we face with layered implementations is that we are subject to a few more sources of issues, some of which are entirely out of our control. A normal, non-layered OpenGL implementation has two primary sources of issues:

  1. The OpenGL driver
  2. The hardware

Issues in the OpenGL driver itself that lead to test failures must always be fixed before results are submitted. Issues with the hardware generally require software workarounds, but this is not always feasible, so Khronos has a system where a vendor can file a waiver for a hardware issue; if approved, the vendor can mark the test failures as waived, and the appropriate failures will be ignored.

So far so good.

But for our layered implementations, our world looks a bit different. We don’t really see the hardware, but instead we see D3D12 and the D3D12 driver. This means our sources of issues are:

  1. The OpenGL driver
  2. The D3D12 run-time
  3. The D3D12 vendor-driver
  4. The hardware

For the OpenGL driver, the story is the same as for a non-layered implementation, but from there on things start changing.

Problems in the D3D12 run-time must also be fixed before submitting results. We work together with Microsoft to get these issues fixed as appropriate. Such fixes can take a while to trickle all the way into a Windows build and to end-users, but they will eventually show up.

But for the D3D12 vendor-driver and below, things get complicated. First of all, it’s not always possible for us to tell vendor-driver issues and hardware issues apart. And worse, as these are developed by third-party companies, we have little insight there. We can’t affect their priorities, so it’s hard to know when or even if an issue gets resolved.

It’s also not really a good idea to work around such issues, because if they turn out to be fixable software problems, we don’t know when they will be fixed, so we can’t really tell when to disable the work-around. We also don’t know exactly what combination of hardware and software these issues apply to.

But there’s one case where we have full insight, and that’s when the D3D12 vendor-driver is WARP, a high-performance software rasterizer. That component is developed by Microsoft, so we have channels to report issues and can even make sure they get resolved!

Bugs, bugs, bugs

When developing something new, there’s always going to be bugs. But usually also when using something existing in a new way. We encountered a lot of bugs on our way, and here’s a quick overview over some of them. This is in no way exhaustive, and most of our own bugs are not that interesting. So this is mostly about problems unique to layered implementations.

64-bit shifts

It turned out early on that the DXIL validator in D3D12 required, when parsing the LLVM bitcode, that shift amounts were always 32-bit values. While this seems fine by itself, LLVM itself requires that all operands to binops have the same bit-size. This obviously meant that only 32-bit shifts could pass the validator.

Microsoft quickly removed this requirement once we figured out what was going on.

Aligned block-compressed textures

In D3D12, one requirement for block-compressed textures is that the base-level is aligned to the block-size. This requirement does not apply to mip-levels, and OpenGL has no such requirement. This isn’t technically speaking a bug, but a documented limitation in DirectX.

It turns out that this limitation was an artificial historical left-over, and after a bunch of testing (and fixing of WARP), we got this limitation lifted. Great :)

D3D12 vendor-driver bugs

Something that has been much more frustrating is bugs in the vendor-drivers. The problem here is that even though we have channels to file bugs, we don’t have any influence or even insight into their prioritization and release schedule.

I think it suffices to say that there have been several bugs reported to all the vendors whose drivers we’ve actively been running the OpenGL CTS on top of. We believe fixes are underway for at least some of these, but we can’t make any promises here.

Current status

Right now, the only configurations we’re cleanly passing the OpenGL 3.3 CTS on are WARP (which became conformant on November 24th, 2020), and NVIDIA (which became conformant on February 26th, 2021).

Having these multiple independent implementations of DirectX drivers passing in conjunction with the Mesa/D3D12 layer shows that we are able to implement GLon12 in a vendor-neutral way, which allowed us to bring the layer to conformance. Many thanks to Khronos for their assistance through this process.

We’ve also submitted results on top of an Intel GPU, but that submission has been halted due to failures, and will as far as I know be updated as soon as Intel publish updated drivers.

The conformance tests have been run against our downstream fork, which is no longer actively maintained, because:

Upstreaming

The D3D12 driver was upstreamed in Mesa in Merge-Request 7477, and the OpenCL compiler followed in Merge-Request 7565. There’s been a lot more merge-requests since then, and even more is expected in the future.

The process of upstreaming the driver into Mesa3D went relatively smoothly, but there were quite a lot of regressions that happened quickly after we upstreamed the code, so to keep this from becoming a big problem we’ve added the D3D12 driver to Mesa’s set of GitLab CI tests. We now build and test the D3D12 driver on top of WARP in CI, as well as running some basic sanity-tests for the OpenCL compiler.

All in all, this seems to work very well right now, and we’re looking forward to the future. Next step, WSL support!

I’m no longer working full-time on this project, instead I’m trying to take some of the lessons learned and apply them to Zink. I’m sure there’s even more room for code-reuse than what we currently have, but it will probably take some time to figure out.

March 08, 2021

b4 was created by Konstantin Ryabitsev and has become a very frequently used tool for me.

It supports a lot of different ways for interacting with the Linux Kernel mailing lists. Of these the b4 am subcommand is what I primarily use. This subcommand downloads all of the patches belonging to a patch series and drops them into a .mbox file. But! It doesn't apply them to the repository we're currently in, and herein lies the itch that I would like to scratch.

The inspiration for this post is the script that @stellarhopper authored and @widawsky pointed out to me.

The Good, the Bad & the Ugly

After first publishing this post, people on the twittersphere suggested some alternative approaches, and it would seem that there …

March 03, 2021

One of the best decisions I have made in my life was joining Igalia in 2012. Inside Igalia, I have been working on different open-source projects, most of the time related to graphics technologies, interacting with different communities, giving talks, organizing conferences and, more importantly, contributing to free software as my daily job.

Now I’m thrilled to announce that we are hiring for our Graphics team!

Igalia

Right now we have two open positions:

  • Graphics Developer.

    We are looking for candidates that would like to contribute to open-source OpenGL/Vulkan graphics drivers (Mesa), or other areas of the open-source graphics stack such as X11 or Wayland, among others. If you have experience with them, or you are very motivated to become an expert there, just send us your CV!

  • Kernel Developer.

    We are looking for candidates that either have experience with kernel development or can ramp up quickly to contribute to Linux kernel drivers. Although no specific subsystem is mentioned in the job position, I encourage you to apply if you have DRM experience and/or ARM[64]/MIPS related knowledge.

Graphics technologies are not your cup of tea? We have positions in other areas like browsers, compilers, multimedia… Just check out our job offers on our website!

What we offer is to work in an open-source consultancy in which you can participate equally in the management and decision-making process of the company via our democratic, consensus-based assembly structure. As all of our positions are remote-friendly, we welcome submissions from any part of the world.

Are you still a student? We have launched the 2021 edition of our Coding Experience program. Check it out!

Igalia's office

February 27, 2021

It is extremely rare that a hobby software project of mine gets completed, but now it has happened. Behold! Fourbyfour!

Have you ever had to implement a mathematical algorithm, say, matrix inversion? You want it to be fast and measuring the speed is fairly simple, right. But what about correctness? Or precision? Behavior around inputs that are on the edge? You can hand-pick a few example inputs, put those into your test suite, and verify the result is what you expect. If you do not pick only trivial inputs, this is usually enough to guarantee your algorithm does not have fundamental mistakes. But what about those almost invalid inputs, can you trust your algorithm to not go haywire on them? How close to invalid can your inputs be before things break down? Does your algorithm know when it stops working and tell you?

Inverting a square matrix requires that the inverse matrix exists to begin with. Matrices that do not mathematically have an inverse matrix are called singular. Can your matrix inversion algorithm tell you when you are trying to invert a matrix that cannot be inverted, or does it just give you a bad result pretending it is ok?

Working with computers often means working with floating-point numbers. With floating-point, the usual mathematics is not enough, it can actually break down. You calculate something and the result a computer gives you is total nonsense, like 1+2=2 in spirit. In the case of matrix inversion, it's not enough that the input matrix is not singular mathematically, it needs to be "nice enough" numerically as well. How do you test your matrix inversion algorithm with this in mind?

These questions I tried to answer with Fourbyfour. The README has the links to the sub-pages discussing how I solved this, so I will not repeat it here. However, as the TL;DR, if there is one thing you should remember, it is this:

    Do not use the matrix determinant to test if a matrix is invertible!

Yes, the determinant is zero for a singular matrix. No, close to zero determinant does not tell you how close to singular the matrix is. There are better ways.

However, the conclusion I came to is that if you want a clear answer to whether a specific input matrix is invertible, the only way to know for sure is to actually invert it, multiply the input matrix with the inverse you computed, and measure how far off from the identity matrix the result is. Of course, you also need to set a threshold for how close to the identity matrix is close enough for your application, because with numerical algorithms you will almost never get the exact answer. Also, pick an appropriate matrix norm for the matrix difference.
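To make that concrete, here is a minimal sketch (my own illustration, not code from Fourbyfour) of that check for a 4x4 matrix: multiply the input by the computed inverse, take the max-norm of the difference from the identity, and compare against an application-chosen threshold.

#include <math.h>
#include <stdbool.h>

/* Returns true if inv looks like a good enough inverse of m: the max-norm
 * distance of m * inv from the identity matrix is within the threshold. */
static bool
inverse_is_good_enough(const double m[4][4], const double inv[4][4],
                       double threshold)
{
    double max_err = 0.0;

    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            /* r = (m * inv)[i][j] */
            double r = 0.0;
            for (int k = 0; k < 4; k++)
                r += m[i][k] * inv[k][j];

            /* compare against the corresponding identity element */
            double err = fabs(r - (i == j ? 1.0 : 0.0));
            if (err > max_err)
                max_err = err;
        }
    }

    return max_err <= threshold;
}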

The reason for this conclusion is what one of the tools I wrote tells me about a matrix that would be typical for a display server with two full-HD monitors. The matrix is simply the pixel offset of the second monitor on the desktop. The analysis of the matrix is the example I used to demonstrate fourbyfour-analyse. If you read through it, you should be shocked. The mathematics, as far as I can understand, seems to tell us that if you use 32-bit floating-point, inverting this matrix gives us a result that leads to no correct digits at all. Obviously this is nonsense, the inverse is trivial and algorithms should give the exact correct result. However, the math does not lie (unless I did). If I did my research right, then what fourbyfour-analyse tells us is true, with an important detail: it is the upper error bound. It guarantees that we cannot get errors larger than that (heh, zero correct digits is pretty hard to make much worse). But I also read that there is no better error bound possible for a generic matrix inversion algorithm. (If you take the obvious-to-human constraints into account that those elements must be one and those must be zero, the analysis would likely be very different.) Therefore the only thing left to do is to actually go on with the matrix inversion and then verify the result.

Here is a list of the cool things the Fourbyfour project does or has:

  • Generates random matrices arbitrarily close to singular in a controlled way. If you simply generated random matrices for testing, they would almost never be close to singular. With this code, you can define how close to singular you want the matrices to be, to really torture your inversion algorithms.
  • Generates random matrices with a given determinant value. This is orthogonal to choosing how close to singular the generated matrices are. You can independently pick the determinant value and the condition number, and have the random matrices have both simultaneously (see the sketch after this list).
  • Plot a graph about a matrix inversion algorithm's behavior when inputs get closer to singular, to see exactly when it breaks down.
  • A tutorial on how mathematical matrix notation works and how it relates to row- vs. column-major layouts (spoiler: it does not).
  • A comparison between Graphene and Weston matrix inversion algorithms.
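To illustrate the determinant-plus-condition-number point above, here is a rough sketch (again my own, much simpler than what Fourbyfour actually does) of building a 2x2 matrix with an independently chosen determinant and condition number: put the chosen singular values on a diagonal and sandwich it between two random rotations, which change neither the determinant nor the singular values.

#include <math.h>
#include <stdlib.h>

static void rotation(double angle, double r[2][2])
{
    r[0][0] = cos(angle); r[0][1] = -sin(angle);
    r[1][0] = sin(angle); r[1][1] =  cos(angle);
}

static void matmul(const double a[2][2], const double b[2][2], double out[2][2])
{
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            out[i][j] = a[i][0] * b[0][j] + a[i][1] * b[1][j];
}

/* det must be non-zero and cond >= 1; the result has exactly that
 * determinant and that 2-norm condition number. */
void random_matrix_with(double det, double cond, double m[2][2])
{
    const double two_pi = 6.283185307179586;
    double s1 = sqrt(fabs(det) * cond);
    double s2 = sqrt(fabs(det) / cond);
    /* flip one diagonal entry to get the sign of the determinant right */
    double d[2][2] = { { s1, 0.0 }, { 0.0, det < 0.0 ? -s2 : s2 } };
    double u[2][2], v[2][2], tmp[2][2];

    rotation(two_pi * drand48(), u);
    rotation(two_pi * drand48(), v);
    matmul(u, d, tmp);
    matmul(tmp, v, m);   /* m = U * D * V */
}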
In this project I also tried out several project quality assurance features:

  • Use Gitlab CI to run the test suite for the main branch, tags, and all merge requests, but not for other git branches.
  • Use Freedesktop ci-templates to easily generate the Docker image in CI under which CI testing will run.
  • Generate LCOV test code coverage report from CI.
  • Use reuse lint tool in CI to ensure every single file has a defined, machine-readable license. Using well-known licenses clearly is important if you want your code to be attractive. Fourbyfour also uses the Developer Certificate of Origin.
  • Use ci-fairy to ensure every commit has Signed-off-by and every merge request allows maintainer pushes.
  • Good CI test coverage. Even the pseudo-random number generator in the test suite is tested to check that it roughly follows the intended distribution.
  • CONTRIBUTING file. I believe that every open source project regardless of size needs this to set up the people's expectations when they see your project, whether you expect or accept contributions or not.
I'm really happy this project is now "done", well, version 1.0.0 so to say. One thing I have realized is still missing is a determinant sweep mode. The precision testing mode sweeps over condition numbers and allows plotting the inversion behavior. It should have another mode where the sweep controls the determinant value, with some fixed condition number for the random test matrices. This determinant mode could point out inversion algorithms that use the determinant value for matrix singularity testing and show how it leads to completely arbitrary results.

If you want to learn about numerical methods for matrices, I recommend the book Gene H. Golub, Charles F. van Loan, Matrix Computations. The Johns Hopkins University Press. I used the third edition, 1996, when implementing the Weston matrix inversion years ago.

February 26, 2021

Introduction

In a now pretty well established tradition on my part, I am posting on things I no longer work on!

I gave a talk on modifiers at XDC 2017 and at Linux Plumbers 2017 (audio only). It was always my goal to have a blog post accompany the work. Relatively shortly after the talks, I ended up leaving graphics and so it dropped on the priority list.

I'm splitting this up into two posts. This post will go over the problem, and solutions. The next post will go over the implementation details.

Modifiers

Each 3d computational unit in an Intel GPU is called an Execution Unit (EU). Aside from what you might expect them to do, like execute shaders, they may be used for copy operations (itself a shader), or compute operations (also, shaders). All of these things require memory bandwidth in order to complete their task in a timely manner.

Modifiers were the chosen solution in order to allow end to end renderbuffer [de]compression to work, which is itself designed to reduce memory bandwidth needs in the GPU and display pipeline. End to end renderbuffer compression simply means that through all parts of the GPU and display pipeline, assets are read and written to in a compression scheme that is capable of reducing bandwidth (more on this later).

Modifiers are a relatively simple concept. They are modifications that are applied to a buffer's layout. Typically a buffer has a few properties: width, height, and pixel format, to name a few. Modifiers can be thought of as ancillary information that is passed along with the pixel data. It will impact how the data is processed or displayed. One such example might be to support tiling, which is a mechanism to change how pixels are stored (not sequentially) in order for operations to make better use of locality for caching and other similar reasons. Modifiers were primarily designed to help negotiate modified buffers between the GPU rendering engine and the display engine (usually by way of the compositor). In addition, other uses can crop up such as the video decode/encode engines.
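As a concrete illustration of where a modifier shows up in an API, here is a hedged sketch using libdrm's drmModeAddFB2WithModifiers (roughly the userspace counterpart of the addfb2 support in the commit below); the handle, pitch, and modifier values are placeholders that would really come from whatever allocated the buffer:

#include <stdint.h>
#include <xf86drmMode.h>
#include <drm_fourcc.h>

/* Wrap a buffer object up as a KMS framebuffer, passing the layout
 * modifier along with the usual width/height/format properties so the
 * display engine knows how to interpret the pixels at scanout time. */
uint32_t create_fb(int drm_fd, uint32_t width, uint32_t height,
                   uint32_t bo_handle, uint32_t pitch, uint64_t modifier)
{
    uint32_t handles[4] = { bo_handle };
    uint32_t pitches[4] = { pitch };
    uint32_t offsets[4] = { 0 };
    uint64_t modifiers[4] = { modifier };   /* e.g. I915_FORMAT_MOD_Y_TILED */
    uint32_t fb_id = 0;

    drmModeAddFB2WithModifiers(drm_fd, width, height, DRM_FORMAT_XRGB8888,
                               handles, pitches, offsets, modifiers,
                               &fb_id, DRM_MODE_FB_MODIFIERS);
    return fb_id;
}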

A Waste of Time and Gates

My understanding is that even now, 3 years later, full modifier support isn't readily available across all corners of the graphics ecosystem. Many hardware features are being entirely unrealized. Upstreaming sweeping graphics features like this one can be very time consuming and I seriously would advise hardware designers to take that into consideration (or better yet, ask your local driver maintainer) before they spend the gates. If you can make changes that don't require software, just do it. If you need software involvement, the longer you wait, the worse it will be.

They weren't new even when I made the presentation 3.5 years ago.

commit e3eb3250d84ef97b766312345774367b6a310db8
Author: Rob Clark <robdclark@gmail.com>
Date:   6 years ago

    drm: add support for tiled/compressed/etc modifier in addfb2

I managed to land some stuff:

commit db1689aa61bd1efb5ce9b896e7aa860a85b7f1b6
Author: Ben Widawsky <ben@bwidawsk.net>
Date:   3 years, 7 months ago

    drm: Create a format/modifier blob

Admiring the Problem

A back-of-the-envelope bandwidth requirement for a midrange Skylake GPU of the time can be calculated relatively easily. Four years ago, at the frequencies we ran our GPUs and given their ISA, we could expect roughly 1GBs for each of the 24 EUs.

A 4k display:

3840px × 2160rows × 4Bpp × 60HZ = 1.85GBs

24GBs + 1.85GBs = 25.85GBs

This by itself will oversaturate single channel DDR4 bandwidth (which was what was around at the time) at the fastest possible clock. As it turns out, it gets even worse with compositing. Most laptops sporting a SKL of this range wouldn't have a 4k display, but you get the idea.

The picture (click for larger SVG) is a typical "flow" for a composited desktop using direct rendering with X or a Wayland compositor using EGL. In this case, drawing a Rubik's cube looking thing into a black window.

Admiring the problem

Using this simple Rubik's cube example I'll explain each of the steps so that we can understand where our bandwidth is going and how we might mitigate that. This is just the overview, so feel free to move on to the next section. Since the example will be trivial, and the window is small (and only a singleton) it won't saturate the bandwidth, but it will demonstrate where the bandwidth is being consumed, and open up a discussion on how savings can be achieved.

Rendering and Texturing

For the example, no processing happens other than texturing. In a simple world, the processing of the shader instructions doesn't increase the memory bandwidth cost. As such, we'll omit that from the details.

The main steps on how you get this Rubik's cube displayed are

  • upload a static texture
  • read from static texture
  • write to offscreen buffer
  • copy to output frame
  • scanout from output

More details below...

Texture Upload

Getting the texture from the application, usually from disk, into main memory, is what I'm referring to as texture upload. In terms of memory bandwidth, you are using write bandwidth to write into the memory.

Assets are transferred from persistent storage to memory

Textures may either be generated by the 3d application, which would be trivial for this example, or they may be authored using a set of offline tools and baked into the application. For any consequential use, the latter is predominantly used. Certain surface types are often dynamically generated though, for example, the shadow mapping technique will generate depth maps. Those dynamically generated surfaces actually will benefit even more (more later).

This is pseudo code (but close to real) to upload the texture in OpenGL:

const unsigned height = 128;
const unsigned width = 64;
const void *data = ... // rubik's cube
GLuint tex;

glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, width, height, 0, GL_RGB, GL_UNSIGNED_BYTE, data);
glGenerateMipmap(GL_TEXTURE_2D);

I'm going to punt on explaining mipmaps, which are themselves a mechanism to conserve memory bandwidth. If you have no understanding, I'd recommend reading up on mipmaps. This wikipedia article looks decent to me.

Texture Sampling

Once the texture is bound, the graphics runtime can execute shaders which can reference those textures. When the shader requests a color value (also known as sampling) from the texture, it's likely that the calculated coordinate within the texture will be in between pixels. The hardware will have to return a single color value for the sample point, and the way it interpolates is chosen by the graphics runtime. This is referred to as a filter.

Texture Fetch
Texture Fetch/Filtering
  • Nearest: Take the value of the single closest pixel; no interpolation.
  • Bilinear: Take the surrounding 4 pixels and interpolate based on the distance to the texture coordinate.
  • Trilinear: Bilinear, but also interpolate between the two closest mipmaps. (I skipped discussing mipmaps, but part of the texture fetch involves finding the nearest miplevel.)
  • Anisotropic: It's complicated. Let's say 16x trilinear.

Here's the GLSL to fetch the texture:

#version 330

uniform sampler2D tex;
in vec2 texCoord;
out vec4 fragColor;

void main() {
    vec4 temp = texelFetch(tex, ivec2(texCoord), 0);
    fragColor = temp;
}

The above actually does something that's perhaps not immediately obvious: fragColor = temp;. This actually instructs the fragment shader to write out that value to a surface which is bound for output (usually a framebuffer). In other words, there are two steps here: read and filter a value from the texture, then write it back out.

The part of the overall diagram that represents this step:

Composition

In the old days of X, and even still when not using the composite extension, the graphics applications could be given a window to write the pixels directly into the resulting output. The X window manager would mediate resize and move events, letting the client update as needed. This has a lot of downsides which I'll say are out of scope here. There is one upside that is in scope though, there's no extra copy needed to create the screen composition. It just is what it is, tearing and all.

If you don't know if you're currently using a compositor, you almost certainly are using one. Wayland only composites, and the number of X window managers that don't composite is very few. So what exactly is compositing? Simply put it's a window manager that marshals frame updates from clients and is responsible for drawing them on the final output. Often the compositor may add its own effects such as the infamous wobbly windows. Those effects themselves may use up bandwidth!

Compositor
Simplified compositor block diagram

Applications will write their output into what's referred to as an offscreen buffer. 👋👋 The compositor will read the output and copy it into what will become the next frame. What this means from a bandwidth consumption perspective is that the compositor will need to use both read and write bandwidth just to build the final frame. 👋👋

Display

It's the mundane part of this whole thing. Pixels are fetched from memory and pushed out onto whatever display protocol.

Display Engine

Perhaps the interesting thing about the display engine is it has fairly isochronous timing requirements and can't tolerate latency very well. As such, it will likely have a dedicated port into memory that bypasses arbitration with other agents in the system that are generating memory traffic.

Out of scope here but I'll briefly mention, this also gets a bit into tiling. Display wants to read things row by row, whereas rendering works a bit different. In short this is the difference between X-tiling (good for display), and Y-tiling (good for rendering). Until Skylake, the display engine couldn't even understand Y-tiled buffers.

Summing the Bandwidth Cost

Running through our 64x64 example...

Operation Color Depth Description Bandwidth R/W
Texture Upload 1Bpc (RGBX8) File to DRAM 16KB (64 × 64 × 4) W
Texel Fetch (nearest) 1Bpc DRAM to Sampler 16KB (64 × 64 × 4) R
FB Write 1Bpc GPU to DRAM 16KB (64 × 64 × 4) W
Compositing 1Bpc DRAM to DRAM 32KB (64 × 64 × 4 × 2) R+W
Scanout 1Bpc DRAM to PHY 16KB (64 × 64 × 4) R

Total = (16 + 16 + 16 + 32 + 16) × 60Hz = 5.625MBs

But actually, Display Engine will always scanout the whole thing, so really with a 4k display:

Total = (16 + 16 + 16 + 32 + 32400) × 60Hz = 1.9GBs

Don't forget about those various filter modes though!

Filter Mode Multiplier (texel fetch) Total Bandwidth
Bilinear 4x 11.25MBs
Trilinear 8x 18.75MBs
Aniso 4x 32x 63.75MBs
Aniso 16x 128x 243.75MBs

Proposing some solutions

Without actually doing math, I think cache is probably the biggest win you can get. One spot where caching could help is that the framebuffer write step followed by the composition step could avoid the trip to main memory. Another is the texture upload and fetch. Assuming you don't blow out your cache, you can avoid the main memory trip.

While caching can buy you some relief, ultimately you have to flush your caches to get the display engine to be able to read your buffer. At least as of 2017, I was unaware of an architecture that had a shared cache between display and 3d.

Also, cache sizes are limited...

Wait for DRAM to get faster

Instead of doing anything, why not just wait until memory gets higher bandwidth?

Here's a quick breakdown of the progression at the high end of the specs. For the DDR memory types, I took a swag at the number of populated channels. For a fair comparison, the expectation with DDR is you'll have at least dual channel nowadays.

Bandwidth

Looking at the graph it seems like the memory vendors aren't hitting Moore's Law any time soon, and if they are, they're fooling me. A similar chart should be made for execution unit counts, but I'm too lazy. A Tigerlake GT2 has 96 EUs. If you go back to our back-of-the-envelope calculation we had a midrange GPU at 24 EUs, so that has quadrupled. In other words, the system architects will use all the bandwidth they can get.

Improving memory technologies is vitally important, it just isn't enough.

TOTAL SAVINGS = 0%

Hardware Composition

One obvious place we want to try to reduce bandwidth is composition. It was after all the biggest individual consumer of available memory bandwidth.

With composition as we described earlier, there was presumed to be a single plane. Software would arrange the various windows onto the plane, which if you recall from the section on composition added quite a bit to the bandwidth consumption, then the display engine could display from that plane.

Hardware composition is the notion that each of those windows could have a separate display plane, directly write into that, and all the compositor would have to do is make sure those display planes occupied the right part of the overall screen. It's conceptually similar to the direct scanout we described earlier in the section on composition.
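Mechanically, on Linux this maps onto the KMS plane API; a hedged sketch (with placeholder plane/CRTC/framebuffer IDs obtained elsewhere) of handing a client's buffer directly to a display plane instead of copying it into a composited frame:

#include <stdint.h>
#include <xf86drmMode.h>

/* Put the client's framebuffer on its own display plane at (x, y).
 * The display engine does the "composition" at scanout time, so no
 * read+write pass over the pixels is needed. */
int show_client_window(int drm_fd, uint32_t plane_id, uint32_t crtc_id,
                       uint32_t client_fb_id,
                       int32_t x, int32_t y, uint32_t w, uint32_t h)
{
    /* source coordinates are 16.16 fixed point; scan out the whole buffer */
    return drmModeSetPlane(drm_fd, plane_id, crtc_id, client_fb_id, 0,
                           x, y, w, h,              /* destination on the CRTC */
                           0, 0, w << 16, h << 16); /* source rectangle */
}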

Operation Color Depth Description Bandwidth R/W
Compositing 1Bpc DRAM to DRAM 32KB (64 × 64 × 4 × 2) R+W

TOTAL SAVINGS = 1.875MBs (33% savings)

Hardware Composition Verdict

33% savings is really really good, and certainly if you have hardware with this capability, the driver should enable it, but there are some problems that come along with this that make it not so appealing.

  1. Hardware has a limited number of planes.
  2. Formats. One thing I left out about the compositor earlier is that one of the things it may opt to do is convert the application's window into a format that the display hardware understands. This means some amount of negotiation has to take place so the application knows about this. Prior to this work, that wasn't in place.
  3. Doesn't reduce any other parts of the process, i.e. a full screen application wouldn't benefit at all.

Texture Compression

So far, in order to solve the not-enough-bandwidth problem, we've tried adding more bandwidth and reducing usage with hardware composition. The next place to go is to try to tackle the bandwidth consumed by texturing.

If you recall, we split the texturing up into two stages: texture upload and texture fetch. This third proposed solution attempts to reduce bandwidth by storing a compressed texture in memory. Texture upload will compress it while uploading, and texture sampling can understand the compression scheme and avoid doing all the lookups. Compressing the texture usually comes with some imperceptible degradation. In terms of the sampling, it's a bit handwavy to say you reduce by the compression factor, but let's say for simplicity's sake that's what it does.
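On the API side, uploading an already-compressed texture is just a different entry point. Here is a hedged sketch using DXT1 (one of the formats listed below), assuming the data was compressed offline and that the EXT_texture_compression_s3tc extension is available:

#define GL_GLEXT_PROTOTYPES 1
#include <GL/gl.h>
#include <GL/glext.h>

/* Upload pre-baked DXT1 blocks; the data stays compressed in memory,
 * so the upload itself already uses roughly 1/8th of the write
 * bandwidth of the equivalent RGBA upload. */
void upload_dxt1(const void *data, unsigned width, unsigned height)
{
    GLuint tex;
    /* DXT1 packs each 4x4 block of pixels into 8 bytes */
    GLsizei image_size = ((width + 3) / 4) * ((height + 3) / 4) * 8;

    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glCompressedTexImage2D(GL_TEXTURE_2D, 0, GL_COMPRESSED_RGB_S3TC_DXT1_EXT,
                           width, height, 0, image_size, data);
}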

Some common formats at the time of the original materials were

Format Compression Ratio
DXT1 8:1
ETC2 4:1
ASTC Variable, 6:1

Using DXT1 as an example of the savings:

Operation Color Depth Bandwidth R/W
Texture Upload DXT1 2KB (64 × 64 × 4 / 8) W
Texel Fetch (nearest) DXT1 2KB (64 × 64 × 4 / 8) R
FB Write 1Bpc 16KB (64 × 64 × 4) W
Compositing 1Bpc 32KB (64 × 64 × 4 × 2) R+W
Scanout 1Bpc 16KB (64 × 64 × 4) R

Here's an example with the simple DXT1 format:

Texture Compression Verdict

Texture compression solves a couple of the limitations that hardware composition left. Namely it can work for full screen applications, and if your hardware supports it, there isn't a limit to how many applications can make use of it. Furthermore, it scales a bit better because an application might use many many textures but only have 1 visible window.

There are of course some downsides.

Click for SVG

For comparison, here is the same cube scaled down with an 8:1 ratio. As you can see DXT1 does a really good job.

Scaled cube

We can't ignore the degradation though as certain rendering may get very distorted as a result.

  • Lossy (perhaps).
  • Hardware compatibility. Application developers need to be ready and able to compress to different formats and select the right things at runtime based on what the hardware supports. This takes effort both in the authoring, as well as the programming.
  • Patents.
  • Display doesn't understand this, so you need to decompress before the display engine will read from a surface that has this compression.
  • Doesn't work well for all filtering types, like anisotropic.

*TOTAL SAVINGS (DXT1) = 1.64MBs (30% savings)

*total savings here is kind of a theoretical max

End to end lossless compression

So what if I told you there was a way to reduce your memory bandwidth consumption without having to modify your application, without being subject to hardware limits on planes, and without having to wait for new memory technologies to arrive?

End to end lossless compression attempts to provide both "end to end" and "lossless" compression transparently to software. Explanation coming up.

End to End

As mentioned in the previous section on texture compression, one of the pitfalls is that you'd have to decompress the texture in order for it to be used outside of your 3d engine. Typically this would mean the display engine scanning out from it, but you could also envision a case where perhaps you'd like to share these surfaces with the hardware video encoder. The nice thing about this "end to end" attribute is every stage we mentioned in previous sections that required bandwidth can get the savings just by running on hardware and drivers that enable this.

Lossless

Now because this is all transparent to the application running, a lossless compression scheme has to be used so that there aren't any unexpected results. While lossless might sound great on the surface (why would you want to lose quality?), it reduces the potential savings because lossless compression algorithms are always less efficient than lossy ones, but it's still a pretty big win.

What's with the box, bro?

I want to provide an example of how this can be possible. Going back to our original image of the full picture, everything looks sort of the same. The only difference is there is a little display engine decompression step, and all of the sampler and framebuffer write steps now have a little purple box accompanying them.

One sort of surprising aspect of this compression is it reduces bandwidth, not overall memory usage (that's also true of the Intel implementation). In order to store the compression information, hardware carves off a little bit of extra memory which is referenced for each operation on a texture (yes, that might use bandwidth too if it's not cached).

Here's a made-up implementation which tracks state in a similar way to Skylake era hardware, but the rest is entirely made up by me. It shows that even a naive implementation can get up to a lossless 2:1 compression ratio. Remember though, this comes at the cost of adding gates to the design and so you'd probably want something better performing than this.

2:1 compression

Everything is tracked as cacheline pairs. In this example we have state called "CCS". For every pair of cachelines in the image, 2b are in this state to track the current compression. When the pair of cachelines uses 12 or fewer colors (which is surprisingly often in real life), we're able to compress the data into a single cacheline (state becomes '01'). When the data is compressed, we can reassemble the image losslessly from a single cacheline; this is 2:1 compression because 1 cacheline gets us back 2 cachelines worth of pixel data.
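To make the bookkeeping concrete, here is a toy model of that state tracking (entirely my own illustration, not the real hardware layout): 2 bits of CCS state per cacheline pair, packed four pairs to a byte.

#include <stdint.h>

enum ccs_state {
    CCS_CLEAR        = 0x0, /* the clear-color case; ignore it here      */
    CCS_COMPRESSED   = 0x1, /* one cacheline + LUT rebuilds the pair     */
    CCS_UNCOMPRESSED = 0x2, /* made-up value: read both cachelines as-is */
};

/* 2 bits of state per pair of cachelines, 4 pairs per CCS byte */
static enum ccs_state ccs_get(const uint8_t *ccs, unsigned pair)
{
    return (ccs[pair / 4] >> ((pair % 4) * 2)) & 0x3;
}

static void ccs_set(uint8_t *ccs, unsigned pair, enum ccs_state state)
{
    unsigned shift = (pair % 4) * 2;
    ccs[pair / 4] = (ccs[pair / 4] & ~(0x3 << shift)) | (state << shift);
}

/* e.g. the sampler would check ccs_get(ccs, cacheline / 2) to decide
 * whether to fetch one cacheline and decode it, or fetch two */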

Walking through the example we've been using of the Rubik's cube.

  1. As the texture is being uploaded, the hardware observes all the runs of the same color and stores them in this compressed manner by building the lookup table. On doing this it modifies the state bits in the CCS to be 01 for those cachelines.
  2. On texture fetch, the texture sampler checks the CCS. If the encoding is 01, then the hardware knows to use the LUT mechanism instead for all the color values.
  3. Throughout the rest of rendering, steps 1 & 2 are repeated as needed.
  4. When display is ready to scanout the next frame, it too can look at the CCS to determine if there is compression, and decompress as it's doing the scanout.

The memory consumed is minimal, which also means that any bandwidth usage overhead is minimal. In the example we have a 64x128 image. In total that's 512 cachelines. At 2 bits per 2 cachelines, the CCS size for the example would not even be a full cacheline: 512 / 2 / 2 = 128b = 16B

* Unless you really want to understand how hardware might actually work, ignore the 00 encoding for clear color.

* There's a caveat here that we assume texture upload and fetch use the sampler. At the time of the original presentation, this was not usually the case and so until the FB write occurred, you didn't actually get compression.

Theoretical best savings would compress everything:

Operation Color Depth Description Bandwidth R/W
Texture Upload 1Bpc compressed File to DRAM 8KB (64 × 64 × 4) / 2 W
Texel Fetch (nearest) 1Bpc compressed DRAM to Sampler 8KB (64 × 64 × 4) / 2 R
FB Write 1Bpc compressed GPU to DRAM 8KB (64 × 64 × 4) / 2 W
Compositing 1Bpc compressed DRAM to DRAM 16KB (64 × 64 × 4 × 2) / 2 R+W
Scanout 1Bpc compressed DRAM to PHY 8KB (64 × 64 × 4) / 2 R

TOTAL SAVINGS = 2.8125MBs (50% savings)

And if you use HW compositing in addition to this...

TOTAL SAVINGS = 3.75MBs (66% savings)

Ending Notes

Hopefully it's somewhat clear how 3d applications are consuming memory bandwidth, and how quickly the consumption grows when adding more applications, textures, screen size, and refresh rate.

End to end lossless compression isn't always going to be a huge win, but in many cases it can really chip away at the problem enough to be measurable. The challenge as it turns out is actually getting it hooked up in the driver and the rest of the graphics software stack. As I said earlier, just because a feature seems good doesn't necessarily mean it would be worth the software effort to implement it. End to end lossless compression is one feature that you cannot just turn on by setting a bit, and the fact that it's still not enabled anywhere, to me, is an indication that effort and gates may have been better spent elsewhere.

However, the next post will be all about how we got it hooked up through the graphics stack.

If you've made it this far, you probably could use a drink. I know I can.

February 25, 2021

A Losing Battle

For a long time, I’ve tried very, very, very, very hard to work around problems with NIR variables when it comes to UBOs and SSBOs.

Really, I have.

But the bottom line is that, at least for gallium-based drivers, they’re unusable. They’re so unreliable that it’s only by sheer luck (and a considerable amount of it) that zink has worked at all until this point.

Don’t believe me? Here’s a list of just some of the hacks that are currently in use by zink to handle support for these descriptor types, along with the reason(s) why they’re needed:

  • Hack: iterating the list of variables backwards
    Why it’s used: this indexing vaguely matches the value used by shaders to access the descriptor
    Bad because: only works coincidentally for as long as nothing changes this ordering and explodes entirely with GL-SPIRV
  • Hack: skipping non-array variables with data.location > 0
    Why it’s used: these are (usually) explicit references to components of a block BO object in a shader
    Bad because: sometimes they’re the only reference to the BO, and skipping them means the whole BO interface gets skipped
  • Hack: using different indexing for SSBO variables depending on whether data.explicit_binding is set
    Why it’s used: this (sometimes) works to fix indexing for SSBOs with random bindings and also atomic counters
    Bad because: the value is set randomly by other optimization passes and so it isn’t actually reliable
  • Hack: atomic counters are identified by using !strcmp(glsl_get_type_name(var->interface_type), "counters")
    Why it’s used: counters get converted to SSBOs, but they require different indexing in order to be accessed correctly
    Bad because: c’mon.
  • Hack: runtime arrays (array[]) are randomly tacked onto SPIRV SSBO variables based on the variable type
    Why it’s used: fixes atomic counter array access and SSBO length() method
    Bad because: not actually needed most of the time

And then there’s this monstrosity that’s used for linking up SSBO variable indices with their instruction’s access value (comments included for posterity):

unsigned ssbo_idx = 0;
if (!is_ubo_array && var->data.explicit_binding &&
    (glsl_type_is_unsized_array(var->type) || glsl_get_length(var->interface_type) == 1)) {
    /* - block ssbos get their binding broken in gl_nir_lower_buffers,
     *   but also they're totally indistinguishable from lowered counter buffers which have valid bindings
     *
     * hopefully this is a counter or some other non-block variable, but if not then we're probably fucked
     */
    ssbo_idx = var->data.binding;
} else if (base >= 0)
   /* we're indexing into a ssbo array and already have the base index */
   ssbo_idx = base + i;
else {
   if (ctx->ssbo_mask & 1) {
      /* 0 index is used, iterate through the used blocks until we find the first unused one */
      for (unsigned j = 1; j < ctx->num_ssbos; j++)
         if (!(ctx->ssbo_mask & (1 << j))) {
            /* we're iterating forward through the blocks, so the first available one should be
             * what we're looking for
             */
            base = ssbo_idx = j;
            break;
         }
   } else
      /* we're iterating forward through the ssbos, so always assign 0 first */
      base = ssbo_idx = 0;
   assert(ssbo_idx < ctx->num_ssbos);
}
assert(!ctx->ssbos[ssbo_idx]);
ctx->ssbos[ssbo_idx] = var_id;
ctx->ssbo_mask |= 1 << ssbo_idx;
ctx->ssbo_vars[ssbo_idx] = var;

Does it work?

Amazingly, yes, it does work the majority of the time.

But is this really how we should live our lives?

A Methodology To Live By

As the great compiler-warrior Jasonus Ekstrandimus once said, “Just Delete All The Code”.

Truly this is a pivotal revelation, one that can induce many days of deep thinking, but how can it be applied to this scenario?

Today I present the latest in zink code deletion: a NIR pass that deletes all the broken variables and makes new ones.

bender.jpg

Let’s get into it.

uint32_t ssbo_used = 0;
uint32_t ubo_used = 0;
uint64_t max_ssbo_size = 0;
uint64_t max_ubo_size = 0;
bool ssbo_sizes[PIPE_MAX_SHADER_BUFFERS] = {false};

if (!shader->info.num_ssbos && !shader->info.num_ubos && !shader->num_uniforms)
   return false;
nir_function_impl *impl = nir_shader_get_entrypoint(shader);
nir_foreach_block(block, impl) {
   nir_foreach_instr(instr, block) {
      if (instr->type != nir_instr_type_intrinsic)
         continue;

      nir_intrinsic_instr *intrin = nir_instr_as_intrinsic(instr);
      switch (intrin->intrinsic) {
      case nir_intrinsic_store_ssbo:
         ssbo_used |= BITFIELD_BIT(nir_src_as_uint(intrin->src[1]));
         break;

      case nir_intrinsic_get_ssbo_size: {
         uint32_t slot = nir_src_as_uint(intrin->src[0]);
         ssbo_used |= BITFIELD_BIT(slot);
         ssbo_sizes[slot] = true;
         break;
      }
      case nir_intrinsic_ssbo_atomic_add:
      case nir_intrinsic_ssbo_atomic_imin:
      case nir_intrinsic_ssbo_atomic_umin:
      case nir_intrinsic_ssbo_atomic_imax:
      case nir_intrinsic_ssbo_atomic_umax:
      case nir_intrinsic_ssbo_atomic_and:
      case nir_intrinsic_ssbo_atomic_or:
      case nir_intrinsic_ssbo_atomic_xor:
      case nir_intrinsic_ssbo_atomic_exchange:
      case nir_intrinsic_ssbo_atomic_comp_swap:
      case nir_intrinsic_ssbo_atomic_fmin:
      case nir_intrinsic_ssbo_atomic_fmax:
      case nir_intrinsic_ssbo_atomic_fcomp_swap:
      case nir_intrinsic_load_ssbo:
         ssbo_used |= BITFIELD_BIT(nir_src_as_uint(intrin->src[0]));
         break;
      case nir_intrinsic_load_ubo:
      case nir_intrinsic_load_ubo_vec4:
         ubo_used |= BITFIELD_BIT(nir_src_as_uint(intrin->src[0]));
         break;
      default:
         break;
      }
   }
}

The start of the pass iterates over the instructions in the shader. All UBOs and SSBOs that are used get tagged into a bitfield of their index, and any SSBOs which have the length() method called are similarly tagged.


nir_foreach_variable_with_modes(var, shader, nir_var_mem_ssbo | nir_var_mem_ubo) {
   const struct glsl_type *type = glsl_without_array(var->type);
   if (type_is_counter(type))
      continue;
   unsigned size = glsl_count_attribute_slots(type, false);
   if (var->data.mode == nir_var_mem_ubo)
      max_ubo_size = MAX2(max_ubo_size, size);
   else
      max_ssbo_size = MAX2(max_ssbo_size, size);
   var->data.mode = nir_var_shader_temp;
}
nir_fixup_deref_modes(shader);
NIR_PASS_V(shader, nir_remove_dead_variables, nir_var_shader_temp, NULL);
optimize_nir(shader);

Next, the existing SSBO and UBO variables get iterated over. A maximum size is stored for each type, and then the variable mode is set to temp so it can be deleted. These variables aren’t actually used by the shader anymore, so this is definitely okay.

Boom.

if (!ssbo_used && !ubo_used)
   return false;

Early return if it turns out that there’s not actually any UBO or SSBO use in the shader, and all the variables are gone to boot.

struct glsl_struct_field *fields = rzalloc_array(shader, struct glsl_struct_field, 2);
fields[0].name = ralloc_strdup(shader, "base");
fields[1].name = ralloc_strdup(shader, "unsized");

The new variables are all going to be the same type, one which matches what’s actually used during SPIRV translation: a simple struct containing an array of uints, aka base. SSBO variables which need the length() method will get a second struct member that’s a runtime array, aka unsized.

if (ubo_used) {
   const struct glsl_type *ubo_type = glsl_array_type(glsl_uint_type(), max_ubo_size * 4, 4);
   fields[0].type = ubo_type;
   u_foreach_bit(slot, ubo_used) {
      char buf[64];
      snprintf(buf, sizeof(buf), "ubo_slot_%u", slot);
      nir_variable *var = nir_variable_create(shader, nir_var_mem_ubo, glsl_struct_type(fields, 1, "struct", false), buf);
      var->interface_type = var->type;
      var->data.driver_location = slot;
   }
}

If there’s a valid bitmask of UBOs that are used by the shader, the index slots get iterated over, and a variable is created for that slot using the same type for each one. The size is determined by the size of the biggest UBO variable that previously existed, which ensures that there won’t be any errors or weirdness with access past the boundary of the variable. All the GLSL compilation and NIR passes to this point have already handled bounds detection, so this is also fine.

if (ssbo_used) {
   const struct glsl_type *ssbo_type = glsl_array_type(glsl_uint_type(), max_ssbo_size * 4, 4);
   const struct glsl_type *unsized = glsl_array_type(glsl_uint_type(), 0, 4);
   fields[0].type = ssbo_type;
   u_foreach_bit(slot, ssbo_used) {
      char buf[64];
      snprintf(buf, sizeof(buf), "ssbo_slot_%u", slot);
      if (ssbo_sizes[slot])
         fields[1].type = unsized;
      else
         fields[1].type = NULL;
      nir_variable *var = nir_variable_create(shader, nir_var_mem_ssbo,
                                              glsl_struct_type(fields, 1 + !!ssbo_sizes[slot], "struct", false), buf);
      var->interface_type = var->type;
      var->data.driver_location = slot;
   }
}

SSBOs are almost the same, but as previously mentioned, they also get a bonus member if they need the length() method. The GLSL compiler has already pre-computed the adjustment for the value that will be returned by length(), so it doesn’t actually matter what the size of the variable is anymore.

And that’s it! The entire encyclopedia of hacks can now be removed, and I can avoid ever having to look at any of this again.

February 23, 2021

Linaro has been working together with Qualcomm to enable camera support on their platforms since 2017. The Open Source CAMSS driver was written to support the ISP IP-block with the same name that is present on Qualcomm SoCs coming from the smartphone space.

The first development board targeted by this work was the DragonBoard 410C, which was followed in 2018 by DragonBoard 820C support. Recently support for the Snapdragon 660 SoC was added to the driver, which will be part of the v5.11 Linux Kernel release. These SoCs all contain the CAMSS (Camera SubSystem) version of the ISP architecture.

Currently, support for the ISP found in the Snapdragon 845 SoC and the DragonBoard 845C is in the process of being upstreamed to the mailing lists. Having …

February 19, 2021

Quickly: ES 3.2

I’ve been getting a lot of pings over the past week or two about ES 3.2 support.

Here’s the deal.

It’s not happening soon. Probably.

Zink currently supports every 3.2 extension except for photoshop. There are two ways to achieve support for that extension at present:

  • the nice, simple, VK_EXT_blend_operation_advanced which nobody* supports
  • the difficult, excruciating fbfetch method using shader rewrites, which also requires extensions that nobody supports

* Yes, I know that Nvidia supports advanced blend, but zink+nvidia is not currently capable of doing ES of any version, so that’s not testable.

So in short, it’s going to be a while.

But there’s not really a technical reason to rush towards full ES 3.2 anyway other than to fill out a box on mesamatrix. If you have an app that requires 3.2, chances are that it probably doesn’t really require it; few apps actually use the advanced blend extension, and so it should be possible for the app to require only 3.1 and then verify the presence of whatever 3.2-based extensions it may use in order to be more compatible.
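The check on the app side is cheap; here’s a hedged sketch of probing for the advanced-blend extension at runtime from an ES 3.1 context instead of demanding a 3.2 context up front:

#include <string.h>
#include <GLES3/gl31.h>

/* Scan the extension list of the current (ES 3.1) context. */
static GLboolean has_extension(const char *name)
{
    GLint i, n = 0;
    glGetIntegerv(GL_NUM_EXTENSIONS, &n);
    for (i = 0; i < n; i++) {
        const char *ext = (const char *)glGetStringi(GL_EXTENSIONS, i);
        if (ext && strcmp(ext, name) == 0)
            return GL_TRUE;
    }
    return GL_FALSE;
}

/* Only take the advanced-blend path when the driver actually exposes it. */
GLboolean can_use_advanced_blend(void)
{
    return has_extension("GL_KHR_blend_equation_advanced");
}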

Of course, this is unlikely to happen. It’s far easier for app developers to just say “give me 3.2” if maybe they just want geometry shaders, and I wouldn’t expect anyone is going to be special-casing things just to run on zink.

Nonetheless, it’s really not a priority for me given the current state of the Vulkan ecosystem. As time moves on and various extensions/features become more common that may change, but for now I’m focusing on things that are going to be the most useful.

Background

After a lot of effort over short stints in the last several months, I have completed my blog migration to Lektor in the hopes that when I migrate again in the future, it won't be as painful.

Despite my efforts, many old posts might not be perfect. This is a job for the wayback machine.

In case you're curious I did this primarily for one reason (and lots of smaller ones). I wanted my data back. Wordpress is an open source blogging platform with huge adoption. It has a very large plugin ecosystem and is very actively updated and maintained. While security issues have come up here and there, at some point automatic updates became an option and that helped a bit. In 2010 it was the obvious choice.

If you've gained anything from my blog posts, you should thank Wordpress. Wordpress' ease of setup and relative ease of use is a big reason I was able to author things as well as I did.

So what happened - plugins

I wanted my data back. It was a self hosted instance and I had all my information stored in a SQL database. Obviously I never lost my data, but...

Plugins.

I used plugins for my tables (multiple plugins). I used plugins for code highlighting. Plugins for LaTeX. Plugins for table of contents, social media integration, post tagging, image captioning and formatting, spelling. You get the idea. The result of all this was I ended up with a blog post that was entirely useless in its text only form. Plugins storing the data in non-standard places so it can be processed and look fancy.

The WYSIWYG editor interface was a huge plus for me. I spent all day in front of a terminal breaking graphics and display (meaning I really was in front of an 80x24 terminal at times). I didn't want to have to deal with fanciful layout engines or styles. Those plugins ended up destroying the WYSIWYG editor experience and I ended up doing everything in quasi markdown anyway.

Plugins themselves introduced security issues, when they weren't intentionally malicious to begin with.

What was next?

These static site generators seemed really appealing as a solution to this problem. Everything in markdown. Assets stored together in the filesystem. Jekyll is obviously hugely popular. Hugo, Pelican, Gatsby, and Sphinx are all generators I considered. The number of static site generators is staggering. I wish I could remember what made me choose Lektor, but I can't - python based was my only requirement.

Python because I wanted a platform that did most of what I wanted but was extensible by me if absolutely necessary.

Migrating was definitely a lot of work. I was tempted several times to abort the effort and just rely on the wayback machine. Ultimately I decided that migrating the posts would be a good way to learn how well the platform would meet my needs (that being an annual blog post or so).

There are definitely some features I miss that I may or may not get to.

  1. Comments. There is disqus integration. I'm not convinced this is what I want.
  2. Post grouping. There are categories. It was too complicated for me to figure out in a short time, so I'm punting on it for now.
  3. I'd really like to not have to learn CSS and jinja2. I can scrape by a bit, but changing anything drastic takes a lot of effort for me.

Migration

I followed this. I did have to make some minor changes specific to my needs and posts did still require some touchups, in large part due to plugins and my obsessive use of SVG.

See you soon

Now that I'm back, I hope to post more often. Next up will be a recap of some of the pathfinding projects I worked on after FreeBSD enabling.

February 18, 2021

Last year I wrote about how to create a user-specific XKB layout, followed by a post explaining that this won't work in X. But there's a pandemic going on, which is presumably the only reason people haven't all switched to Wayland yet. So it was time to figure out a workaround for those still running X.

This Merge Request (scheduled for xkeyboard-config 2.33) adds a "custom" layout to the evdev.xml and base.xml files. These XML files are parsed by the various GUI tools to display the selection of available layouts. An entry in there will thus show up in the GUI tool.

Our rulesets, i.e. the files that convert a layout/variant configuration into the components to actually load, already have wildcard matching [1]. So the custom layout will resolve to the symbols/custom file in your XKB data dir - usually /usr/share/X11/xkb/symbols/custom.

This file is not provided by xkeyboard-config. It can be created by the user though and whatever configuration is in there will be the "custom" keyboard layout. Because xkeyboard-config does not supply this file, it will not get overwritten on update.

From XKB's POV it is just another layout and it thus uses the same syntax. For example, to override the +/* key on the German keyboard layout with a key that produces a/b/c/d on the various Shift/Alt combinations, use this:


default
xkb_symbols "basic" {
    include "de(basic)"
    key <AD12> { [ a, b, c, d ] };
};

This example includes the "basic" section from the symbols/de file (i.e. the default German layout), then overrides the 12th alphanumeric key from left in the 4th row from bottom (D) with the given symbols. I'll leave it up to the reader to come up with a less useful example.

There are a few drawbacks:

  • If the file is missing and the user selects the custom layout, the results are... undefined. For run-time configuration like GNOME it doesn't really matter - the layout compilation fails and you end up with the one the device already had (i.e. the default one built into X, usually the US layout).
  • If the file is missing and the custom layout is selected in the xorg.conf, the results are... undefined. I tested it and ended up with the US layout but that seems more by accident than design. My recommendation is to not do that.
  • No variants are available in the XML files, so the only accessible section is the one marked default.
  • If a commandline tool uses a variant of custom, the GUI will not reflect this. If the GUI goes boom, that's a bug in the GUI.

So overall, it's a hack[2]. But it's a hack that fixes real user issues and given we're talking about X, I doubt anyone notices another hack anyway.

[1] If you don't care about GUIs, setxkbmap -layout custom -variant foobar has been possible for years.
[2] Sticking with the UNIX principle, it's a hack that fixes the issue at hand, is badly integrated, and weird to configure.

What’s Next

It’s been a busy week. The CTS fixes and patch drops are coming faster and faster, and progress is swift. Here’s a quick note on some things that are on the horizon.

Features Landing Soon

Zink’s in a tough spot right now in master. GL 4.6 is available, but there’s still plenty of things that won’t work, e.g., running anything at 60fps. These are things I expect (hope) to see land in the repo in the next month or so:

  • improved barrier support, which frees up some opportunities with queue refactoring
  • removing explicit pre-fencing on every frame (and sometimes multiple times per frame)
  • descriptor caching
  • various bugfixes which weren’t feasible due to architectural issues

All told, just as an example, Unigine Heaven (which can now run in color!) should see roughly a 100% performance improvement (possibly more) once this is in, and I’d expect substantial performance gains across the board.

Will you be able to suddenly play all your favorite GL-based Steam games?

No.

I can’t even play all your favorite GL-based Steam games yet, so it’s a long ways off for everyone else.

But you’ll probably be able to get surprisingly good speed on what things you can run considering the amount of time that will pass between hitting 4.6 and these patchsets merging.

Features I’m Working On

I spent some time working on Wolfenstein over the past week, but there’s some non-zink issues in the way, so that’s on the backburner for a while. Instead, I’ve turned my attention to CTS and begun unloading a dumptruck of resulting fixes into the codebase.

There comes a time when performance is “good enough” for a while, and, after some intense optimizing since the start of the year, that time has come. So now it’s back to stabilization mode, and I’m aiming to have a vaguely decent pass rate in the near term.

Hopefully I’ll find some time to post some of the crazy bugs I’ve been hunting, but maybe not. Time will tell.

February 11, 2021

By Now

…or in the very near future, the ol’ bumperino will have landed, putting zink at GL 4.5.

But that’s boring, so let’s check out something very slightly more interesting.

Steam Games

What are they and how do they work?

I’m not going to answer these questions, but I am going to be looking into getting them working on zink.

To that end, as I hinted at yesterday, I began with Wolfenstein: The New Order, as chosen by Daniel Schuermann, the lucky winner of the What Steam Game Should Zink Use As Its Primary Test Case And Benchmark contest that was recently held.

Early tests of this game were unimpressive. That is to say I got an immediate crash. It turns out that having the GL compatibility context restricted to 3.0 is bad for getting AAA games running, so zink-wip now enables 4.6 compat contexts.

But then I was still getting a crash without any clear error message. Suddenly, I was back in 2004 trying to figure out how to debug wine apps.

Things are much simpler now, however. PROTON_DUMP_DEBUG_COMMANDS enables dumping debugging scripts from Steam, including one which attaches a debugger to the game. This solved the problem of getting a debugger in before the almost-immediate crash, but it didn’t get me closer to a resolution.
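For reference, enabling this is just a launch-option tweak. A rough sketch, assuming Proton's usual default dump location under /tmp:

# set as the game's launch options in Steam
PROTON_DUMP_DEBUG_COMMANDS=1 %command%
# after starting the game once, the generated debug scripts
# (including the debugger-attach one) land under /tmp/proton_$USER/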

The problem now is that I’d attached a debugger to the in-wine process, which is just a sandbox for the Windows API. What I actually wanted was to attach to the wine process itself so I could see what was going on in the driver.

gdb --pid=$(pidof WolfNewOrder_x64.exe) ended up being what I needed, but this was complicated by the fact that I had to attach before the game crashed and without triggering the Steam error reporter. So in the end, I had to attach using the proton script, then, while it was paused, attach to the outer process for driver debugging. But I also had to attach to the outer process after zink was loaded, so it was a real struggle.

Then, as per usual, another problem: I had no symbols loaded because proton runs a static binary. After cluelessly asking around in the DXVK discord, @Herbert helpfully provided a gdb python script for proton in-process debugging that I was able to repurpose for my needs. The gist (haha) of the script is that it scans /proc/$pid/maps and then manually loads the required library files.
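Expressed as shell, the rough idea the script automates looks something like this (the library name here is purely illustrative):

# find where a library of interest got mapped into the wine process...
grep -m1 libGL /proc/$(pidof WolfNewOrder_x64.exe)/maps
# ...then hand that address to gdb's add-symbol-file so symbols resolve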

At last, I had attached to the game, I had symbols, and I could see that I was hitting a zink assert I’d added to catch int overflows. A quick one-liner to change the order of a calculation fixed that, and now I’m on to an entirely new class of bugs.

February 10, 2021

I recently joined Igalia and as a way to familiarize myself with Adreno GPUs it was decided to get Freedreno up to OpenGL 3.3.

Until recently, Freedreno exposed only OpenGL 3.0. The big jump in version required only two small extensions and a few fixes to get rid of most crashes in Piglit, since almost all features were already supported.

ARB_blend_func_extended

I decided to start with dual-source-blending. Fittingly, it was the cause of the first issue I investigated in Mesa two and a half years ago =)

Turnip already had it implemented, so it was a good first task. Most of the time was spent getting myself familiarized with Freedreno and squashing the hangs that happened due to my lack of experience with this driver. Shortly after, I had a working MR:

freedreno/a6xx: add support for dual-source blending (!7708)

However, Freedreno didn’t have Piglit tests running on CI and I missed the failure of arb_blend_func_extended-fbo-extended-blend-pattern_gles2 test. I did test the _gles3 version though. Thus the issue with gles2 was not found immediately.

What’s different in the gles2 test? It uses gl_SecondaryFragColorEXT for the second color of dual-source blending. gl_FragColor and gl_SecondaryFragColorEXT are different from the other ways to specify color in that they broadcast the color to all enabled draw buffers. In Mesa they both translate to the slot FRAG_RESULT_COLOR, which in Freedreno didn’t expect a second source for blending. The solution was simple: if a shader uses dual-source blending, remap FRAG_RESULT_COLOR to FRAG_RESULT_DATA0 + dual_src_index.

freedreno/ir3: remap FRAG_RESULT_COLOR to _DATA* for dual-src blending (!8245)

While looking at how other drivers handle FRAG_RESULT_COLOR I saw that Zink also has a similar issue but in a NIR lowering path, so I fixed it too since it was simple enough.

nir/lower_fragcolor: handle dual source blending (!8247)

ARB_shader_stencil_export

While implementing ARB_blend_func_extended I saw in Turnip that RB_FS_OUTPUT_CNTL0 also could enable stencil export:

tu_cs_emit_pkt4(cs, REG_A6XX_RB_FS_OUTPUT_CNTL0, 2);
tu_cs_emit(cs, COND(fs->writes_pos, A6XX_RB_FS_OUTPUT_CNTL0_FRAG_WRITES_Z) |
               COND(fs->writes_smask, A6XX_RB_FS_OUTPUT_CNTL0_FRAG_WRITES_SAMPMASK) |
               COND(fs->writes_stencilref, A6XX_RB_FS_OUTPUT_CNTL0_FRAG_WRITES_STENCILREF) |
               COND(dual_src_blend, A6XX_RB_FS_OUTPUT_CNTL0_DUAL_COLOR_IN_ENABLE));

Meanwhile, in Freedreno there was just a cryptic 0xfc000000:

OUT_PKT4(ring, REG_A6XX_SP_FS_OUTPUT_CNTL0, 1);
OUT_RING(ring, A6XX_SP_FS_OUTPUT_CNTL0_DEPTH_REGID(posz_regid) |
        A6XX_SP_FS_OUTPUT_CNTL0_SAMPMASK_REGID(smask_regid) |
        COND(fs_has_dual_src_color,
                A6XX_SP_FS_OUTPUT_CNTL0_DUAL_COLOR_IN_ENABLE) |
        0xfc000000);

I was quickly told that this is the shifted register r63.x, which has a value of 0xfc and signifies an unused value.

Again, Turnip already supported stencil export and the compiler had it wired up; it’s hard to find a simpler extension to implement.

freedreno/a6xx: add support for ARB_shader_stencil_export (!7810)

Running tests for GL 3.1 - 3.3

After enabling ARB_blend_func_extended and ARB_shader_stencil_export, Freedreno had everything needed for a jump to GL 3.3. It was time to run the tests.

KHR-GL31.texture_size_promotion.functional

The first observation was that the test passes with FD_MESA_DEBUG=nogmem. Next, I compared traces with and without nogmem and saw a difference in the texture layout (I should have just printed the layouts with FD_MESA_DEBUG=layout). That didn’t give me an answer, so I checked whether I could bisect the failure. To my delight it was bisectable:

freedreno: reduce extra height alignment in a6xx layout (e4974852)

The change is only one line of code, and without the context the first thing that comes to mind is that maybe the alignment is wrong:

         if (level == mip_levels - 1)
-            nblocksy = align(nblocksy, 32);
+            height = align(nblocksy, 4);

         uint32_t nblocksx =

But it’s even more trivial: nblocksy was erroneously changed to height, and height just wasn’t used afterwards, which hid the issue. Thus it was fixed in:

freedreno/a6xx: Fix typo in height alignment calculation in a6xx layout (!7792)

gl-3.2-layered-rendering-clear-color-all-types 2d_multisample_array single_level

It was quickly found that clearing a framebuffer in Freedreno didn’t take into account that the framebuffer could have several layers, so only the first one was cleared.

Initially, I thought that we would need number_of_layers draw calls to clear all layers, but looking at a generic blitter in gallium suggested a better answer: an instanced draw where each instance draws a fullscreen quad into a distinct layer. This could be accomplished just by:

gl_Layer = gl_InstanceID;

in the vertex shader. However, this required the support of GL_AMD_vertex_shader_layer which Freedreno didn’t have…

Implementing GL_AMD_vertex_shader_layer

Fortunately, Turnip again had it implemented. As with stencil export, it was just a matter of finding the register for VARYING_SLOT_LAYER and plugging it into the respective control structure.

freedreno/a6xx: add support for gl_Layer in vertex shader

Simple enough? Yes! However, it failed the only tests available for the extension:

amd_vertex_shader_layer-layered-2d-texture-render    // Fails
amd_vertex_shader_layer-layered-depth-texture-render // Crashes

The crash when clearing a depth texture is a known one, but why does 2d-texture-render fail? Let’s look at the output:

layered-2d-texture-render without noubwc

The first layer and its mipmaps are cleared, but on the other layers most mipmaps are not! Mipmaps are nowhere near gl_Layer, so it seems that the implementation is indeed correct; however, the single test we expected to pass fails. Which is bad, because we don’t have other tests to exercise gl_Layer.

Fixing amd_vertex_shader_layer-layered-2d-texture-render

A good starting point is always to check whether one of the FD_MESA_DEBUG options helps. In this case noubwc changed the result of the test to:

layered-2d-texture-render with noubwc

Not by a lot but still a lead. Something wrong with image layouts…
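For reference, these debug switches are plain environment variables, so reproducing the comparison looks roughly like this (the piglit binary name and the -auto flag follow piglit's usual conventions and are assumptions here):

./bin/amd_vertex_shader_layer-layered-2d-texture-render -auto
FD_MESA_DEBUG=noubwc ./bin/amd_vertex_shader_layer-layered-2d-texture-render -auto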

I searched for similar tests for Vulkan and found them in vktDrawShaderLayerTests.cpp. The tests passed on Turnip, which told me that most likely the issue is not in the common layout calculation code. After some hacking of the tests I made one close enough to amd_vertex_shader_layer-layered-2d-texture-render. However, directly comparing traces between GL and Vulkan isn’t productive, so I decided to see what changes with different mip levels in Freedreno and Turnip separately.

So I made 4 tests:

  • GL with clearing level 1
  • GL with clearing level 2
  • VK drawing to level 1
  • VK drawing to level 2

With that, comparing “GL level 1” with “GL level 2” and “VK level 1” with “VK level 2” yielded just a few differences. For GL the only difference which could have been relevant was:

Level 1:

RB_MRT[0].ARRAY_PITCH: 8192

Level 2:

RB_MRT[0].ARRAY_PITCH: 4096

However, in VK this value didn’t change; there was no difference in the traces barring window/scissor/blit sizes!

RB_MRT[0].ARRAY_PITCH: 45056

In VK it was just a constant 45056 regardless of mip level. Hardcoding this value in Freedreno fixed the test. After that it was a matter of comparing how the field is calculated in Turnip and Freedreno, resulting in:

freedreno/a6xx: fix array pitch for layer-first layouts

The layered clear

With the above changes the layered clear was simple enough:

freedreno/a6xx: support layered framebuffers in blitter_clear

gl-3.2-minmax or bumping the maximum count of varyings

Before, Freedreno supported only 16 vec4 varyings, or 64 components, between shader stages. GL 3.2, on the other hand, mandates a minimum of 128 components. After a couple of trivial fixes I thought it would be easy, but Marek Olšák pointed out that I didn’t cover all cases with my tests. And of course one of them, the passing of varyings between VS and TCS, hung the GPU.

I ran a few tests, only changing the varyings count, and observed that the GPU hung starting from 17 vec4s. By comparing the traces I didn’t find anything suspicious in the shaders, so I decided to check a few states that depend on the varyings count. After poking a few of them I found that setting SP_HS_UNKNOWN_A831 above 64 resulted in a guaranteed hang.

In Turnip, SP_HS_UNKNOWN_A831 had a special case for A650 which made me a bit suspicious. However, using A650’s formula for A630 didn’t change anything. It was time to see what the blob driver does, so I wrote a script to iterate from 1 to 32 vec4s, push the test to the device, run it, copy the trace back, extract the A831 value, and append it to a CSV. Here is the result:

vec4 count A831 (Hex) A831 (Dec) PC_HS_INPUT_SIZE
1 0x10 16 0xc
2 0x14 20 0xf
3 0x18 24 0x12
4 0x1c 28 0x15
5 0x20 32 0x18
6 0x24 36 0x1b
7 0x28 40 0x1e
8 0x2c 44 0x21
9 0x30 48 0x24
10 0x34 52 0x27
11 0x38 56 0x2a
12 0x3c 60 0x2d
13 0x3f 63 0x30
14 0x40 64 0x33
15 0x3d 61 0x36
16 0x3d 61 0x39
17 0x40 64 0x3c
18 0x3f 63 0x3f
19 0x3e 62 0x42
20 0x3d 61 0x45
21 0x3f 63 0x48
22 0x3d 61 0x4b
23 0x40 64 0x4e
24 0x3d 61 0x51
25 0x3f 63 0x54
26 0x3c 60 0x57
27 0x3e 62 0x5a
28 0x40 64 0x5d
29 0x3c 60 0x60
30 0x3e 62 0x63
31 0x40 64 0x66

So A831 is indeed capped at 64 and after the first time A831 reaches 64, the relation ceases to be linear. Ugh…

Without having any good ideas I posted the results on Gitlab asking Jonathan Marek, who added a special case for A650 some time ago, for help. The previous A650 formula was:

prims_per_wave = wavesize // tcs_vertices_out
total_size = output_size * patch_control_points * prims_per_wave
unknown_a831 = DIV_ROUND_UP(total_size, wavesize)

After looking at the table above Jonathan Marek suggested the following formula:

prims_per_wave = wavesize // tcs_vertices_out
while True:
    total_size = output_size * patch_control_points * prims_per_wave
    unknown_a831 = DIV_ROUND_UP(total_size, wavesize)
    if unknown_a831 <= 0x40:
        break
    prims_per_wave -= 1

And it worked! What the new formula means is that there is A831 * wavesize of space available per wave for varyings; the more varyings we use in a shader, the fewer threads the wave will have. For TCS we use a wave size of 64, so the total size available would be:

64 (wave size) * 64 (max A831) * 4 = 16k

freedreno/a6xx: bump varyings limit (!7917)

Bumping OpenGL support to 3.3

There was still a considerable number of Piglit failures, but they were either already known or not important enough to prevent the version increase, which was done in:

freedreno: Enable GLSL 3.30, updating us to GL 3.3 contexts (!8270)

This article is part of a series on how to setup a bare-metal CI system for Linux driver development. Check out part 1 where we expose the context/high-level principles of the whole CI system, and make the machine fully controllable remotely (power on, OS to boot, keyboard/screen emulation using a serial console).

In this article, we will start demystifying the boot process, and discuss different ways to generate and boot an OS image along with a kernel for your machine. Finally, we will introduce boot2container, a project that makes running containers on bare metal a breeze!

This work is sponsored by the Valve Corporation.

Generating a kernel & rootfs for your Linux-based testing

To boot your test environment, you will need to generate the following items:

  • A kernel, providing all the necessary drivers for your test;
  • A userspace, containing all the dependencies of your test (rootfs);
  • An initramfs (optional), containing the drivers/firmwares needed to access the userspace image, along with an init script performing the early boot sequence of the machine;

The initramfs is optional because the drivers and their firmware can be built into the kernel directly.

Let's not generate these items just yet, but instead let's look at the different ways one could generate them, depending on their experience.

The embedded way

Buildroot's logo

If you are used to dealing with embedded devices, you are already familiar with projects such as Yocto or Buildroot. They are well-suited to generating a tiny rootfs, which can be useful for netbooted systems such as the one we set up in part 1 of this series. They usually allow you to describe everything you want in your rootfs, then configure, compile, and install all the wanted programs into it.

If you are wondering which one to use, I suggest you check out the presentation from Alexandre Belloni / Thomas Petazzoni, which will give you an overview of both projects and help you decide on what you need.

Pros:

  • Minimal size: Only what is needed is included
  • Complete: Configures and compiles the kernel for you

Cons:

  • Slow to generate: Everything is compiled from source
  • Small selection of software/libraries: Adding build recipes is however relatively easy

The Linux distribution way

Debian Logo, www.debian.org

If you are used to installing Linux distributions, your first instinct might be to install your distribution of choice in a chroot or a Virtual Machine, install the packages you want, and package the folder/virtual disk into a tarball.

Some tools such as debos or virt-builder make this process relatively painless, although they will not compile an initramfs or a kernel for you.

Fortunately, building the kernel is relatively simple, and there are plenty of tutorials on the topic (see ArchLinux's wiki). Just make sure to compile the modules and firmware into the kernel, to avoid the complication of using an initramfs. Don't forget to also compress your kernel if you decide to netboot it!
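If you have never done it, the happy path looks roughly like this (a sketch for x86 only; which options you need to enable depends entirely on your hardware):

make defconfig
make menuconfig          # set the drivers/firmware you need to built-in (=y)
make -j$(nproc) bzImage
# the compressed kernel ends up in arch/x86/boot/bzImage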

Pros:

  • Relatively fast: No compilation necessary (except for the kernel)
  • Familiar environment: Closer to what users/developers use in the wild

Cons:

  • Larger: Packages tend to bring a lot of unwanted dependencies, drastically increasing the size of the image
  • Limited choice of distros: Not all distributions are easy to install in a chroot
  • Insecure: Requires root rights to generate the image, which may accidentally trash your distro
  • Poor reproducibility: Distributions get updates continuously, leading to different outcomes when running the same command
  • No caching: all the steps to generate the rootfs are re-done every time
  • Incomplete: does not generate a kernel or initramfs for you

The refined distribution way: containers

Docker and the Docker logo are trademarks or registered trademarks of Docker, Inc.

Containers are an evolution of the old chroot trick, made secure thanks to the addition of multiple namespaces to Linux. Containers and their runtimes have been addressing pretty much all the cons of the "Linux distribution way", and have become a standard way to share applications.

On top of generating a rootfs, containers also allow setting environment variables and controlling the command line of the program, and have a standardized transport mechanism which simplifies sharing images.

Finally, container images are made up of cacheable layers, which can be used to share base images between containers, and also speed up the generation of a container image by only re-computing the layer that changed and the layers applied on top of it.

The biggest drawback of containers is that they are usually meant to be run on pre-configured hosts. This means that if you want to run the container directly, you will need to include an initscript or install systemd in your container, and set it as the entrypoint of the container. It is however possible to perform these tasks before running the container, as we'll explain in the following sections.

Pros:

  • Fastest: No compilation necessary (except for the kernel), and layers cached
  • Familiar: Shared environment between developers and the test system
  • Flexible: Full choice of distro
  • Secure: No root rights needed, everything is done in a user namespace
  • Shareable: Containers come with a transport/storage mechanism (registries)
  • Reproducible: Easily run the exact same userspace on your dev and test machines

Cons:

  • Larger: Packages tend to bring a lot of dependencies, drastically increasing the size of the image
  • Incomplete: does not generate a kernel or initramfs for you

Deploying and booting a rootfs

Now we know how we could generate a rootfs, so the next step is to be able to deploy and boot it!

Challenge #1: Deploying the Kernel / Initramfs

There are multiple ways to deploy an operating system:

  • Flash and reboot: Typical on ARM boards / Android phones;
  • Netboot: Typical in big organizations that manage thousands of machines.

The former solution is great at preventing the bricking of a device that depends on an Operating System to be flashed again, as it enables checking the deployment on the device itself before rebooting.

The latter solution enables diskless test machines, which is an effective way to reduce state (the enemy #1 of reproducible results). It also enables a faster deployment/boot time as the CI system would not have to boot the machine, flash it, then reboot. Instead, the machine simply starts up, requests an IP address through BOOTP/DHCP, downloads the kernel/initramfs, and executes the kernel. This was the solution we opted for in part 1 of this blog series.

Whatever solution you end up picking, you will now be presented with your next challenge: making sure the rootfs remains the same across reboots.

Challenge #2: Deploying the rootfs efficiently

If you have chosen the Flash and reboot deployment method, you may be prepared to re-flash the entire Operating System image every time you boot. This would make sure that the state of a previous boot won't leak into following boots.

This method can however become a big burden on your network when scaled to tens of machines, so you may be tempted to use a Network File System such as NFS to spread the load over a longer period of time. Unfortunately, using NFS brings its own set of challenges (how deep is this rabbit hole?):

  • The same rootfs directory cannot be shared across machines without duplication unless mounted read-only, as machines should not be able to influence each-other's execution;
  • The NFS server needs to remain available as long as at least one test machine is running;
  • Network congestion might influence the testing happening on the machine, which can affect functional testing, but will definitely affect performance testing.

So, instead of trying to spread the load, we could try to reduce the size of the rootfs by only sending the content that changed. For example, the rootfs could be split into the following layers:

  • The base Operating System needed for the testing;
  • The driver you want to test (if it wasn't in the kernel);
  • The test suite(s) you want to run.

Layers can be downloaded by the test machine, through a short-lived-state network protocol such as HTTP, as individual SquashFS images. Additionally, SquashFS provides compression which further reduces the storage/network bandwidth needs.

The layers can then be directly combined by first mounting the layers to separate folders in read-only mode (only mode supported by SquashFS), then merging them using OverlayFS. OverlayFS will store all the writes done to this file system into the workdir directory. If this work directory is backed up by a ramdisk (tmpfs) or a never-reused temporary directory, then this would guarantee that no information from previous boots would impact the new boots!
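Here is a minimal sketch of that layering, with made-up image names (run as root; in the real system this would be done by the init script):

# build each layer as a compressed, read-only SquashFS image
mksquashfs base-rootfs/ base.squashfs -comp xz
mksquashfs test-suite/ tests.squashfs -comp xz

# mount the layers read-only, then merge them with OverlayFS,
# keeping all writes in a tmpfs so nothing survives a reboot
mkdir -p /layers/base /layers/tests /overlay /rootfs
mount -t squashfs -o ro,loop base.squashfs /layers/base
mount -t squashfs -o ro,loop tests.squashfs /layers/tests
mount -t tmpfs tmpfs /overlay
mkdir -p /overlay/upper /overlay/work
mount -t overlay overlay \
      -o lowerdir=/layers/tests:/layers/base,upperdir=/overlay/upper,workdir=/overlay/work \
      /rootfs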

If you are familiar with containers, you may have recognized this approach as what is used by containers: layers + overlay2 storage driver. The only difference is that container runtimes depend on tarballs rather than SquashFS images, probably because this is a Linux-only filesystem.

If you are anything like me, you should now be pretty tempted to simply use containers for the rootfs generation, transport, and boot! That would be a wise move, given that thousands of engineers have been working on them over the last decade or so, and whatever solution you may come up with will inevitably have even more quirks than these industry standards.

I would thus recommend using containers to generate your rootfs, as there are plenty of tools that will generate them for you, with varying degrees of complexity. Check out buildah if Docker or Podman are too high-level for your needs!
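As an illustration, a tiny test image could be put together with buildah along these lines (the base image and package choice are just placeholders):

ctr=$(buildah from docker.io/library/alpine)
buildah run "$ctr" -- apk add --no-cache mesa-demos   # pull in whatever your test needs
buildah config --entrypoint '["glxinfo", "-B"]' "$ctr"
buildah commit "$ctr" my-test-image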

Let's now brace for the next challenge, deploying a container runtime!

Challenge #3: Deploying a container runtime to run the test image

In the previous challenge, we realized that a great way to deploy a rootfs efficiently was to simply use a container runtime to do everything for us, rather than re-inventing the wheel.

This would enable us to create an initramfs which would be downloaded along with the kernel through the usual netboot process, and would be responsible for initializing the machine, connecting to the network, mounting the layer cache partition, setting the time, downloading a container, then executing it. The last two steps would be performed by the container runtime of our choice.

Generating an initramfs is way easier than one might expect. Projects like dracut are meant to simplify their creation, but my favourite has been u-root, coming from the LinuxBoot project. I generated my first initramfs in less than 5 minutes, so I was incredibly hopeful to achieve the outlined goals in no time!

Unfortunately, the first setback came quickly: container runtimes (Docker or Podman) are huge (~150 to 300 MB), if we are to believe the size of their respective Alpine Linux packages and dependencies! While this may not be a problem for the Flash and reboot method, it is definitely a significant issue for the Netboot method, which would need to download them on every boot.

Challenge #3.5: Minifying the container runtime

After spending a significant amount of time studying container runtimes, I identified the following functions:

  • Transport / distribution: Downloading a container image from a container registry to the local storage (spec);
  • De-layer the rootfs: Unpack the layers' tarball, and use OverlayFS to merge them (default storage driver, but there are many other ways);
  • Generate the container manifest: A JSON-based config file specifying how the container should be run;
  • Executing the container

Thus started my quest to find lightweight solutions that could do all of these steps... and wonder just how deep is this rabbit hole??

The usual executor found in the likes of Podman and Docker is runc. It is written in Golang, which compiles everything statically and leads to giant binaries; in this case, runc clocks in at ~12MB. Fortunately, a knight in shining armour came to the rescue, re-implemented runc in C, and named it crun. The final binary size is ~400 KB, and it is fully compatible with runc. That's good enough for me!

To download and unpack the rootfs from the container image, I found genuinetools/img which supports that out of the box! Its size was however much bigger than expected, at ~28.5MB. Fortunately, compiling it ourselves, stripping the symbols, then compressing it using UPX led to a much more manageable ~9MB!

What was left was to generate the container manifest according to the runtime spec. I started by hardcoding it to verify that I could indeed run the container. I was relieved to see it work on my development machine, even though it failed in my initramfs. After spending a couple of hours diffing straces, poking a couple of sysfs/config files, and realizing that pivot_root does not work in an initramfs, I finally managed to run the container with crun run --no-pivot!
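For the curious, running an unpacked rootfs this way boils down to something like the following sketch (in a real setup the config.json would be derived from the image manifest rather than from crun's default template):

mkdir -p bundle/rootfs
# ...unpack the container's merged rootfs into bundle/rootfs, then:
cd bundle
crun spec                  # writes a default OCI config.json next to rootfs/
crun run --no-pivot test   # "test" is just an arbitrary container id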

I was over the moon, as the only thing left was to generate the container manifest by patching genuinetools/img to generate it according to the container image manifest (like docker or podman do). This is where I started losing my grip: lured by the prospect of a simple initramfs solving all my problems, and being so close to the goal, I started free-falling down what felt like the deepest rabbit hole of my engineering career... Fortunately, after a couple of weeks, I emerged, covered in mud but victorious! Cue the gory battle log :)

When trying to access the container image's manifest in img, I realized that it was re-creating the layers and manifest, and thus was losing information such as the entrypoint, environment variables, and other important parameters. After scouring through its source code and its 500 kLOC of dependencies, I came to the conclusion that it would be easier to start a project from scratch that would use Red Hat's image and storage libraries to download and store the container on the cache partition. I then needed to unpack the layers, generate the container manifest, and start runc. After a couple of days, ~250 lines of code, and tons of staring at straces to get it working, it finally did! Out went img, and the new runtime's size was under 10 MB \o/!

The last missing piece in the puzzle was performance-related: use OverlayFS to merge the layers, rather than unpacking them ourselves.

This is when I decided to have another look at Podman, saw that they have their own internal library for all the major functions, and decided to compile podman to try it out. The binary size was ~50 MB, but after removing some features, setting the -w -s LDFLAGS, and compressing it using upx --best, I got the final size to be ~14 MB! Of course, Podman is more than just one binary, so trying to run a container with it failed. However, after a bit of experimentation and stracing, I realized that running the container with --privileged --network=host would work using crun... provided we force-added the --no-pivot parameter to crun. My happiness was however short-lived, replaced by a MAJOR FACEPALM MOMENT:

After a couple of minutes of constant facepalming, I realized I was also relieved, as Podman is a battle-tested container runtime, and I would not need to maintain a single line of Go! Also, I now knew how deep the rabbit was, and we just needed to package everything nicely in an initramfs and we would be good. Success, at last!

Boot2container: Run your containers from an initramfs!

If you have managed to read through the article up to this point, congratulations! For the others who just gave up and jumped straight to this section, I forgive you for teleporting yourself to the bottom of the rabbit hole directly! In both cases, you are likely wondering: where is this breeze you were promised in the introduction?

     Boot2container enters the chat.

Boot2container is a lightweight (sub-20 MB) and fast initramfs I developed that will allow you to ignore the subtleties of operating a container runtime and focus on what matters, your test environment!

Here is an example of how to run boot2container, using SYSLINUX:

LABEL root
    MENU LABEL Run docker's hello world container, with caching disabled
    LINUX /vmlinuz-linux
    APPEND b2c.container=docker://hello-world b2c.cache_device=none b2c.ntp_peer=auto
    INITRD /initramfs.linux_amd64.cpio.xz

The hello-world container image will be run in privileged mode, with the host network, which is what you want when running the container for bare metal testing!

Make sure to check out the list of features and options before either generating the initramfs yourself or downloading it from the releases page. Try it out with your kernel, or the example one bundled in the release!
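If you do not have a test machine wired up yet, one quick way to poke at it could be QEMU; the flags below are plain QEMU options rather than anything prescribed by the project:

qemu-system-x86_64 -m 1G -nographic \
    -kernel ./vmlinuz -initrd ./initramfs.linux_amd64.cpio.xz \
    -append "console=ttyS0 b2c.container=docker://hello-world b2c.cache_device=none b2c.ntp_peer=auto"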

With this project mostly done, we pretty much conclude the work needed to set up the test machines, and the next articles in this series will be focusing on the infrastructure needed to support a fleet of test machines, and expose it to Gitlab/Github/...

That's all for now, thanks for reading that far!


February 09, 2021

If you don’t know what traces based rendering regression testing is, read the appendix before continuing.


The Mesa community has witnessed an explosion of interest in Continuous Integration in the last two years.

In addition to checking that the project builds properly, integrating the testing of its functional correctness has become a priority. The user space graphics drivers exhibit a wide variety of types of tests and test suites. One kind of those tests is traces based rendering regression testing.

The public effort to add this kind of tests into Mesa’s CI started with this mail from Alexandros Frantzis.

At some point, we had support for replaying OpenGL, Vulkan and D3D11 traces using apitrace, RenderDoc and GFXReconstruct with the in-tree tool tracie. However, it was a very custom solution made for the needs of Mesa, so I proposed to move this codebase and integrate it into the piglit test suite. It was a natural step forward.

This is how replayer was born into piglit.

replayer

The first step to test a trace is, actually, obtaining a trace. I won’t go into the details about how to create one from scratch. The process is well documented on each of the tools listed above. However, the Mesa community has been collecting publicly distributable traces for a while and placing them in traces-db whose CI is copying them to Freedesktop.org’s MinIO instance.

To make things simple, once we have built and installed piglit, if we would like to test an OpenGL trace created with apitrace, we can download it from there with:

$ replayer.py download \
 	 --download-url https://minio-packet.freedesktop.org/mesa-tracie-public/ \
 	 --db-path ./traces-db \
 	 --force-download \
 	 glxgears/glxgears-2.trace

The parameters are self explanatory. The downloaded trace will now exist at ./traces-db/glxgears/glxgears-2.trace.

The next step will be to dump an image from the trace. Since it is a .trace file we will need to have apitrace installed in the system. If we do not specify the call(s) from which to dump the image(s), we will just get the last frame of the trace:

$ replayer.py dump ./traces-db/glxgears/glxgears-2.trace

The dumped PNG image will be at ./results/glxgears-2.trace-0000001413.png. Notice, the number suffix is the snapshot id from the trace.

Dumping from a trace may result in a range of different possible images. One example is when the trace makes use of uninitialized values, leading to undefined behaviors.

However, since the original aim was performing pre-merge rendering regression testing in Mesa’s CI, the idea is that replaying any of the provided traces should be quick and the dumped image consistent. In other words, if we dump the same frame of a trace several times with the same GFX stack, the image will always be the same.

With this precondition, we can test whether 2 different images are the same just by hashing their contents. replayer can obtain the hash for the generated dumped image:

$ replayer.py checksum ./results/glxgears-2.trace-0000001413.png 
f8eba0fec6e3e0af9cb09844bc73bdc8

Now, if we build a different commit of Mesa, we can check the image generated at this new point against the previously generated reference image. If everything goes well, we will see something like:

$ replayer.py compare trace \
 	 --download-url https://minio-packet.freedesktop.org/mesa-tracie-public/ \
 	 --device-name gl-vmware-llvmpipe \
 	 --db-path ./traces-db \
 	 --keep-image \
 	 glxgears/glxgears-2.trace f8eba0fec6e3e0af9cb09844bc73bdc8
[dump_trace_images] Info: Dumping trace ./traces-db/glxgears/glxgears-2.trace...
[dump_trace_images] Running: apitrace dump --calls=frame ./traces-db/glxgears/glxgears-2.trace
// process.name = "/usr/bin/glxgears"
1384 glXSwapBuffers(dpy = 0x56060e921f80, drawable = 31457282)

1413 glXSwapBuffers(dpy = 0x56060e921f80, drawable = 31457282)

error: drawable failed to resize: expected 1515x843, got 300x300
[dump_trace_images] Running: eglretrace --headless --snapshot=1413 --snapshot-prefix=./results/trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace- ./blog-traces-db/glxgears/glxgears-2.trace
Wrote ./results/trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace-0000001413.png

OK
[check_image]
    actual: f8eba0fec6e3e0af9cb09844bc73bdc8
  expected: f8eba0fec6e3e0af9cb09844bc73bdc8
[check_image] Images match for:
  glxgears/glxgears-2.trace

PIGLIT: {"images": [{"image_desc": "glxgears/glxgears-2.trace", "image_ref": "f8eba0fec6e3e0af9cb09844bc73bdc8.png", "image_render": "./results/trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace-0000001413-f8eba0fec6e3e0af9cb09844bc73bdc8.png"}], "result": "pass"}

replayer's compare subcommand is the one that emits the piglit formatted test output.

Putting everything together

We can make the whole process way simpler by passing replayer a YAML test list file. For example:

$ cat testing-traces.yml
traces-db:
  download-url: https://minio-packet.freedesktop.org/mesa-tracie-public/

traces:
  - path: gputest/triangle.trace
    expectations:
      - device: gl-vmware-llvmpipe
        checksum: c8848dec77ee0c55292417f54c0a1a49
  - path: glxgears/glxgears-2.trace
    expectations:
      - device: gl-vmware-llvmpipe
        checksum: f53ac20e17da91c0359c31f2fa3f401e
$ replayer.py compare yaml \
 	 --device-name gl-vmware-llvmpipe \
 	 --yaml-file testing-traces.yml 
[check_image] Downloading file gputest/triangle.trace took 5s.
[dump_trace_images] Info: Dumping trace ./replayer-db/gputest/triangle.trace...
[dump_trace_images] Running: apitrace dump --calls=frame ./replayer-db/gputest/triangle.trace
// process.name = "/home/anholt/GpuTest_Linux_x64_0.7.0/GpuTest"
397 glXSwapBuffers(dpy = 0x7f0ad0005a90, drawable = 56623106)

510 glXSwapBuffers(dpy = 0x7f0ad0005a90, drawable = 56623106)


/home/anholt/GpuTest_Linux_x64_0.7.0/GpuTest
[dump_trace_images] Running: eglretrace --headless --snapshot=510 --snapshot-prefix=./results/trace/gl-vmware-llvmpipe/gputest/triangle.trace- ./replayer-db/gputest/triangle.trace
Wrote ./results/trace/gl-vmware-llvmpipe/gputest/triangle.trace-0000000510.png

OK
[check_image]
    actual: c8848dec77ee0c55292417f54c0a1a49
  expected: c8848dec77ee0c55292417f54c0a1a49
[check_image] Images match for:
  gputest/triangle.trace

[check_image] Downloading file glxgears/glxgears-2.trace took 5s.
[dump_trace_images] Info: Dumping trace ./replayer-db/glxgears/glxgears-2.trace...
[dump_trace_images] Running: apitrace dump --calls=frame ./replayer-db/glxgears/glxgears-2.trace
// process.name = "/usr/bin/glxgears"
1384 glXSwapBuffers(dpy = 0x56060e921f80, drawable = 31457282)

1413 glXSwapBuffers(dpy = 0x56060e921f80, drawable = 31457282)


/usr/bin/glxgears
error: drawable failed to resize: expected 1515x843, got 300x300
[dump_trace_images] Running: eglretrace --headless --snapshot=1413 --snapshot-prefix=./results/trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace- ./replayer-db/glxgears/glxgears-2.trace
Wrote ./results/trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace-0000001413.png

OK
[check_image]
    actual: f8eba0fec6e3e0af9cb09844bc73bdc8
  expected: f8eba0fec6e3e0af9cb09844bc73bdc8
[check_image] Images match for:
  glxgears/glxgears-2.trace

replayer also features the query subcommand, which is just a helper for reading the YAML files with the test configuration.

Testing the other kinds of supported 3D traces doesn’t change much from what’s shown here. Just make sure to have the needed tools installed: RenderDoc, GFXReconstruct, the VK_LAYER_LUNARG_screenshot layer, Wine and DXVK. A good reference for building, installing and configuring these tools are Mesa’s GL and VK test containers building scripts.

replayer also accepts several configuration options to tweak how it behaves and where to find the actual tracing tools needed for replaying the different types of traces. Make sure to check the replay section in piglit’s example configuration file.

replayer‘s README.md file is also a good read for further information.

piglit

replayer is a test runner in a similar fashion to shader_runner or glslparsertest. What we are missing now is how it integrates so we can do piglit runs which produce piglit formatted results.

This is done through the replay test profile.

This profile needs a couple configuration values. Easiest is just to set the PIGLIT_REPLAY_DESCRIPTION_FILE and PIGLIT_REPLAY_DEVICE_NAME env variables. They are self explanatory, but make sure to check the documentation for this and other configuration options for this profile.

The following example features a run similar to the one done above by invoking replayer directly, but with piglit integration, providing formatted results:

$ PIGLIT_REPLAY_DESCRIPTION_FILE=testing-traces.yml PIGLIT_REPLAY_DEVICE_NAME=gl-vmware-llvmpipe piglit run replay -n replay-example replay-results
[2/2] pass: 2   
Thank you for running Piglit!
Results have been written to replay-results

We can create some summary based on the results:

# piglit summary console replay-results/
trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace: pass
trace/gl-vmware-llvmpipe/gputest/triangle.trace: pass
summary:
       name: replay-example
       ----  --------------
       pass:              2
       fail:              0
      crash:              0
       skip:              0
    timeout:              0
       warn:              0
 incomplete:              0
 dmesg-warn:              0
 dmesg-fail:              0
    changes:              0
      fixes:              0
regressions:              0
      total:              2
       time:       00:00:00

Creating an HTML summary may also be interesting, especially when finding failures!
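It is just another summary mode; something like (the output directory name is arbitrary):

$ piglit summary html replay-summary replay-results/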

Wishlist

  • Through different backends, replayer supports running apitrace, RenderDoc and GFXReconstruct traces. We may want to support other tracing tools in the future. The dummy backend used for functional testing is a good starting point when writing a new backend.
  • The solution chosen for checking whether we detect a rendering regression is dependent on having consistent results, as said before. It’d be great if we could add a secondary testing method whenever the expected rendered image is variable. From the top of my head, using exclusion masks could be a quick single-run solution when we know which specific areas in a rendered scenario are the ones fluctuating. For more complex variations, a multi-run based solution seems to be the best option. EzBench has a great statistical approach for this!
  • The current syntax of the YAML test list files implies running the compare subcommand with the default behavior of checking against the last frame of the tested trace. This means figuring out which call number is the one of the last frame first. It would be great to support providing the call numbers directly in the YAML files to be able to test more than just the last frame and, additionally, cut down the time taken to run the test.
  • The HTML generated summary allows us to see the reference and generated image during a test run side to side when it fails. It’d be great to have also some easy way of checking its differences. Using Rembrandt.js could be a possible solution.

Thanks a lot to the whole Mesa community for helping with the creation of this tool. Alexandros Frantzis, Rohan Garg and Tomeu Vizoso did a lot of the initial development of the in-tree tracie tool, while Dylan Baker was very patient when reviewing my patches for the piglit integration.

Finally, thanks to Igalia for allowing me to work on this.


Appendix

In 3D computer graphics we say “traces”, for short, to refer to the files generated by 3D API capturing tools, which store not only the calls to the specific 3D API but also the internal state of the 3D program during the capturing process: shaders, textures, buffers, etc.

Being able to “record” the execution of a 3D program is very useful. Usually, it allows us to replay the execution without needing the original program from which we generated the trace; it also allows in-depth analysis for debugging and performance optimization; it is a very good solution for sharing with other developers; and, in some cases, it allows us to check how the replay behaves with different GPUs.

In this post, however, I focus on a specific usage: rendering regression testing.

When doing a regression test what we would do is compare a specific metric obtained by replaying the trace with a specific version of the GFX software stack against the same metric obtained from a different version of the GFX stack. If the value of the metric changes we have found a regression (or an improvement!).

To make things simpler, we would like to check changes happening just in one of the many elements of the software stack. The most relevant component is the user space driver. In particular, I care about the Mesa drivers and the GNU/Linux stack.

Mainly, there are two kinds of regression testing we can do with a trace: performance or rendering regression testing. When doing a performance one, the checked metric(s) are usually in terms of speed or memory usage. In the case of the rendering ones, what we do is compare the rendered output at one (or many) points during the trace replay. This output, a bitmap image, is the metric that we compare between two different points of the Mesa driver. If the images differ, we may have found a regression (artifacts, improper colors, etc.), or an enhancement, if the reference image is the one featuring any of these problems.

If you’re on Intel…

Your zink built from git master now has GL 4.3.

Turns out having actual hardware available when doing feature support is important, so I need to do some fixups there for stencil texturing before you can enjoy things.

February 08, 2021

Under contract work for Valve Corporation, I have been working with Charlie Turner and Andres Gomez from Igalia to develop a CI test farm for driver testing (mostly graphics).

This is now the fifth CI system I have worked with / on, and I am growing tired of not being able to re-use components from the previous systems due to how deeply integrated their components are, and how implementation details permeate from one component to another. Additionally, such designs limit the ability of the system to grow, as updating a component would impact a lot of other components, making it difficult or even impossible to do without a rewrite of the system, or taking the system down for multiple hours.

With this new system, I am putting emphasis on designing good interfaces between components in order to create an open source toolbox that CI systems can re-use freely and tailor to their needs, while not painting themselves in a corner.

I aim to blog about all the different components/interfaces we will be making for this test system, but in this article, I would like to start with the basics: proposing design goals, and setting up a machine to be controllable remotely by a test system.

Overall design principles

When designing a test system, it is important to keep in mind that test results need to be:

  • Stable: Re-executing the same test should yield the same result;
  • Reproducible: The test should be runnable on other machines with the same hardware, and yield the same result;

What this means is that we should use the default configuration as much as possible (no weird setup in CI). Additionally, we need to reduce the amount of state in the system to the absolute minimum. This can be achieved in the following way:

  • Power cycle the machine between each test cycle: this helps reset the hardware;
  • Go diskless if at all possible, or treat the disk as a cache that can be flushed when testing fails;
  • Pre-compute as much as possible outside of the test machine, to reduce the impact of the environment of the machine running the test.

Finally, the machine should not restrict which kernel / Operating System can be loaded for testing. An easy way to achieve this is to use netboot (PXE), which is a common BIOS feature allowing diskless machines to boot from the network.

Converting a machine for testing

Now that we have a pretty good idea about the design principles behind preparing a machine for CI, let's try to apply them to an actual machine.

Step 1: Powering up the machine remotely

In order to power up, a machine needs both power and a signal to start. The latter is usually provided by a power button, but additional ways exist (non-exhaustive):

  • Wake on LAN: An Ethernet frame sent to the network adapter triggers the boot;
  • Power on by Mouse/Keyboard: Any activity on the mouse or the keyboard will boot the computer;
  • Power on AC: Providing power to the machine will automatically turn it on;
  • Timer: Boot at a specified time.

An Intel motherboard's list of wakeup events

Unfortunately, none of these triggers can be used to also turn off the machine. The only way to guarantee that a machine will power down and reset its internal state completely is to cut its power supply for a significant amount of time. A safe way to provide/cut power is to use a remotely-switchable Power Distribution Unit (example), or simply some smart plug such as Ikea's TRÅDFRI. In any case, make sure you rely on as few services as possible (no cloud!), that you won't exceed the ratings (voltage, power, and cycles), and that you can read back the state to make sure the command was well received. If you opt for the industrial PDUs, make sure to check out PDU Gateway, our REST service to control the machines.

An example of a PDU

Now that we can reliably cut/provide power, we still need to control the boot signal. The difficulty here is that the signal needs to be received after the machine has received power and initialized enough to react to this event. The easiest approach is to configure the BIOS to boot as soon as power is brought to the computer. This is usually called "Boot on AC". If your computer does not support this feature, you may want to try the other ones, or use a microcontroller to press the power button for you when powering up (see the HELP! My machine can't ... Boot on AC section at the end of this article).

Step 2: Net booting

Net booting is quite commonly supported on x86 and ARM bootloaders. On x86 platforms, you can generally find this option in the boot option priorities under the name PXE boot or network boot. You may also need to enable the LAN option ROM, LAN controller, or the UEFI network stack. Reboot, and check that your machine is trying to get an IP!

The next step will be to set up a machine, called the Testing Gateway, that will provide a PXE service. This machine should have two network interfaces, one connected to a public network, and one connected to the test machines (through a switch). Setting up this machine will be the subject of an upcoming blog post, but if you are impatient, you may use our valve-infra container.

Step 3: Emulating your screen and keyboard using a serial console

Thanks to the previous steps, we can now boot in any Operating System we want, but we cannot interact with it...

One solution could be to run an SSH server on the Operating System, but until we could connect to it, there would be no way to know what is going on. Instead, we could use an ancient technology, a serial port, to drive a console. This solution is often called "Serial console" and is supported by most Operating Systems. Serial ports come in two types:

  • UART: voltage changing between 0 and VCC (TTL signalling), more common in the System-on-Chip (SoC) and microcontrollers world;
  • RS-232: voltage changing between a positive and negative voltage, more common in the desktop and datacenter world.

In any case, I suggest you find a serial-to-USB adapter suited to the computer you are trying to connect to.

On Linux, using a serial console is relatively simple. Just add the following to the kernel command line to get a console on your screen AND over the /dev/ttyS0 serial port running at 9600 bauds:

console=tty0 console=ttyS0,9600 earlyprintk=vga,keep
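On the gateway side, you can then attach any terminal emulator to the other end of the link, for example (device name and baud rate matching the example above):

screen /dev/ttyUSB0 9600
# or
picocom -b 9600 /dev/ttyUSB0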

If your machine does not have a serial port but has USB ports, which is more the norm than the exception in the desktop/laptop world, you may want to connect two RS-232-to-USB adapters together, using a Null modem cable:

Test Machine <-> USB <-> RS-232 <-> NULL modem cable <-> RS-232 <-> USB Hub <-> Gateway

And the kernel command line should use ttyACM0 / ttyUSB0 instead of ttyS0.

Putting it all together

Start by removing the internal battery if it has one (laptops), and any built-in wireless antenna. Then set the BIOS to boot on AC, and use netboot.

Steps for an AMD motherboard:

Steps for an Intel motherboard:

Finally, connect the test machine to the wider infrastructure in this way:

      Internet /   ------------------------------+
    Public network                               |
                                       +---------+--------+                USB
                                       |                  +-----------------------------------+
                                       |      Testing     | Private network                   |
Main power (240 V) ---------+          |      Gateway     +-----------------+                 |
                            |          +---------+--------+                 |                 |
                            |                    | Serial /                 |                 |
                            |                    | Ethernet                 |                 |
                            |                    |                          |                 |
                +-----------+--------------------+--------------+   +-------+--------+   +----+----+
                |                 Switchable PDU                |   |   RJ45 switch  |   | USB Hub |
                |  Port 0    Port 1        ...          Port N  |   |                |   |         |
                +----+------------------------------------------+   +---+------------+   +-+-------+
                     |                                                  |                  |
                Main |                                                  |                  |
                Power|                                                  |                  |
            +--------|--------+               Ethernet                  |                  |
            |                 +-----------------------------------------+   +----+----+    |
            |  Test Machine 1 |            Serial (RS-232 / TTL)            |  Serial |    |
            |                 +---------------------------------------------+  2 USB  +----+ USB
            +-----------------+                                             +---------+

If you managed to do all this, then congratulations, you are set! If you got some issues finding the BIOS parameters, brace yourself, and check out the following section!

HELP! My machine can't ...

Net boot

It's annoying, but it is super simple to work around. What you need is to install a bootloader which supports PXE on a drive or USB stick. I would recommend you look into SYSLINUX, and Arch Linux's wiki page about it.

Boot on AC

Well, that's a bummer, but that's not the end of the line either if you have some experience dealing with microcontrollers, such as Arduino. Provided you can find the following 4 wires, you should be fine:

  • Ground: The easiest to find;
  • Power rail: 3.3 or 5V depending on what your controller expects;
  • Power LED: A signal that will change when the computer turns on/off;
  • Power Switch: A signal to pull-up/down to start the computer.

On desktop PCs, all these wires can be easily found in the motherboard's manual. For laptops, you'll need to scour the motherboard for these signals using a multimeter. Pay extra attention when looking for the power rail, as it needs to be able to source enough current for your microcontroller. If you are struggling to find one, look for the VCC pins of some of the chips and you'll be set.

Next, you'll just need to figure out what voltage the power LED is at when the machine is ON or OFF. Make sure to check that this voltage is compatible with your microcontroller's input rating and plug it directly into a GPIO of your microcontroller.

Let's then do the same work for the power switch, except this time we also need to check how much current will flow through it when it is activated. To do that, just use a multimeter to check how much current is flowing when you connect the two wires of the power switch. Check that this amount of current can be sourced/sinked by the microcontroller, and then connect it to a GPIO.

Finally, we need to find power for the microcontroller that will be present as soon as we plug the machine in. For desktop PCs, you can find this on pin 9 of the ATX connector. For laptops, you will need to probe the motherboard until you find a pin with a voltage suitable for your microcontroller (5 or 3.3V). However, make sure it is able to source enough current without the voltage dropping below the minimum acceptable VCC of your microcontroller. The best way to make sure of that is to connect this rail to ground through a ~100 Ohm resistor, check the voltage at the leads of the resistor, and keep on trying until you find a suitable place (took me 3 attempts). Connect your microcontroller's VCC and ground to these pads.

The last step will be to edit this Arduino code for your needs, flash it to your microcontroller, and iterate until it works!

Here is a photo summary of all the above steps:

Thanks to Arkadiusz Hiler for giving me a couple of these BluePills, as I did not have any microcontroller small enough to fit in place of a laptop speaker. If you are a novice, I would suggest picking an Arduino Nano instead.

Oh, and if you want to create a board that would be generic enough for most motherboards, check out the schematics from my 8-year-old blog post about doing just that!

Boot without a battery

So far, I have never heard of any laptop that would completely refuse to boot with the battery disconnected. The worst I have heard of was a laptop taking 30s before starting to boot.

Let's be real though, your time is valuable, and I would suggest you buy/get another laptop. However, if this is the only model you can get, and you really want to test it, then it will be .... juuuuuust fine!

Your state of mind, right now!

There are multiple options here, depending on how far down the stack you want to go:

  1. Rework the Embedded Controller (EC) to drop this delay: Applicable when you have access to the EC's source code, like on Chromebooks;
  2. Impersonate the battery: Replace the battery with a microcontroller that responds to the EC's commands just like the real battery;
  3. Reuse the battery controller, but replace the battery cells with ... capacitors: The fastest way forward, but it can be really tricky without some knowledge about dealing with Li-ion cells.

I will not explain what needs to be done in option 1, as this is highly-dependent on your platform of choice, but it is by far the safest and the least hacky.

Option 2 is the next best option if the EC is not open or easy to flash. What you will want to do is figure out which lines in the battery's connector are the I2C lines, then attach a protocol analyser to them. Boot the machine, inspect the logs, and try to figure out the pattern. Re-implement as much of it as needed in a microcontroller, until the system boots reliably. Should be a good weekend project!
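To give an idea of the shape of such a project, here is a very rough sketch in Arduino-style C, using the Wire library as an I2C slave. The 0x0B address and the register numbers are the common Smart Battery ones, but treat every value as a placeholder to be replaced with whatever your protocol analyser shows the EC actually asking for; SMBus details such as PEC and block reads are not handled at all.

#include <Wire.h>

#define BATTERY_ADDR 0x0B  /* typical Smart Battery slave address (check your capture!) */

static volatile uint8_t last_cmd = 0;

static void on_receive(int count) {
  /* The EC writes the command/register byte before reading back a word. */
  if (count > 0)
    last_cmd = Wire.read();
  while (Wire.available())
    Wire.read();             /* drain anything else */
}

static void on_request(void) {
  uint16_t value;

  switch (last_cmd) {
  case 0x09: value = 12600; break; /* Voltage, in mV (example value)        */
  case 0x0D: value = 100;   break; /* RelativeStateOfCharge, in % (example) */
  default:   value = 0;     break; /* add more registers as the EC asks     */
  }

  /* SMBus read-word replies are sent low byte first. */
  Wire.write((uint8_t)(value & 0xFF));
  Wire.write((uint8_t)(value >> 8));
}

void setup(void) {
  Wire.begin(BATTERY_ADDR);
  Wire.onReceive(on_receive);
  Wire.onRequest(on_request);
}

void loop(void) {
}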

Option 3 is by far the hackiest and requires the most skill, even if it is the fastest to implement IF you have an oscilloscope and some super capacitors with a low discharge rate lying around (who doesn't?). What you'll need to do is open the battery, rip out the battery cells, and replace them with the super capacitors. They will simulate the battery cells well enough for most controllers, but beware that the controller might not like starting with the capacitors discharged, so you may need to force-charge them to somewhere in the [2, 3.6]V range (between a fully-discharged and a fully-charged cell); a 3.3V power rail is a convenient source for that. Also beware that the battery cells might be wired in series, so you should not connect their negative poles to ground, as that would short one or more cells! In my case, the controller was happy with seeing an empty battery, and it was fun to see the battery go from 50% to 100% in a second when booting :D

That's all, folks!

February 04, 2021

All In A Day’s Work

I posted briefly yesterday about zink on the nvidia blob, but what was actually necessary to get this working?

In short, there were three problem areas that needed to be worked on:

  • GLX
  • image creation
  • memory allocation

Let’s go over some of these in a bit more depth.

GLX

GLX is the layer which connects GL to X. In the context of this post and zink, this can be thought of as the method by which zink negotiates how it’s going to draw to the screen.

The way this works, at a very high level and with only the barest concern for accuracy and depth of subject matter, is that Mesa grabs a bunch of properties and features from the driver (zink) and then generates a series of configurations that can be used for X window outputs. GLX then compares these configurations against the one requested by the driver. If a match is found, the driver gets to draw. Otherwise, an error like this one I found on StackOverflow is printed:

X Error of failed request:  BadMatch (invalid parameter attributes)
Major opcode of failed request:  128 (GLX)
Minor opcode of failed request:  34 ()
Serial number of failed request:  39
Current serial number in output stream:  40

I didn’t do much digging into this. Here’s a visual representation of the process by which, with the assistance of GLX Hall of Famer Adam Jackson, I resolved the issue:

giphy.gif

In short, he found this problem some time ago and disabled some of the less relevant checks for configuration matching. All I had to do was apply the patch.

#teamwork

Image Creation

A core philosophy of Vulkan is that it’s an explicit API. This means that in the case of images, the exact format, tiling mode, target, usage, etc are specified at the time of creation. This also means that drivers have things they support and things they don’t support.

Historically, zink has just assumed that everything is supported and then either crashed horribly or succeeded because ANV is a pretty cool driver that makes a lot of stuff work when it shouldn’t.

To improve this situation, I’ve added two tiers of validation for image resource creation:

  • first, check all VkImageUsageFlags that are needed for an image are supported by the format being used
  • second, run the exact image creation params by the Vulkan driver before creation to see if it’ll work (and abort() if it doesn’t)

Roughly, these correspond to vkGetPhysicalDeviceFormatProperties (which previously had some, but not comprehensive use for this type of validation) and vkGetPhysicalDeviceImageFormatProperties (which has never been used previously in zink).
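As a rough illustration of the two tiers (this is not zink’s actual code, just the shape of it), the checks could look something like this:

#include <stdbool.h>
#include <vulkan/vulkan.h>

static bool
image_params_supported(VkPhysicalDevice pdev, VkFormat format,
                       VkImageType type, VkImageTiling tiling,
                       VkImageUsageFlags usage, VkFormatFeatureFlags needed)
{
   /* Tier 1: does the format expose the features this usage needs? */
   VkFormatProperties props;
   vkGetPhysicalDeviceFormatProperties(pdev, format, &props);
   VkFormatFeatureFlags features = tiling == VK_IMAGE_TILING_LINEAR ?
                                   props.linearTilingFeatures :
                                   props.optimalTilingFeatures;
   if ((features & needed) != needed)
      return false;

   /* Tier 2: run the exact creation parameters by the driver. */
   VkImageFormatProperties img_props;
   return vkGetPhysicalDeviceImageFormatProperties(pdev, format, type, tiling,
                                                   usage, 0, &img_props) == VK_SUCCESS;
}

The important difference between the two is that the second call lets the driver veto the exact combination of parameters rather than just the format in isolation.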

A major issue that I found along the way is that there are many cases where zink needs linear image tiling, as this is the only type of layout which allows reading the image back into a buffer. Zink had been assuming that if the format has any (e.g., non-linear) support for the image’s usage, linear is also fine. This is not the case, however, so now there are more checks which enforce a series of hoops to be jumped through when it’s necessary to do readback from images which have no support at all for linear tiling.

Memory Allocation

This is more or less the same as the issue that existed with image creation: zink tried to allocate memory from the wrong bucket (usually HOST_VISIBLE), and the Vulkan driver couldn’t handle that, so everything blew up.

Now this is handled by using device memory for almost all allocations, and everything works well enough.
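For reference, the usual Vulkan pattern for picking a memory type looks roughly like the sketch below. This is a generic example rather than zink’s allocator; the caller decides which property flags (e.g. DEVICE_LOCAL vs. HOST_VISIBLE) it actually needs for a given allocation.

#include <stdint.h>
#include <vulkan/vulkan.h>

/* Return the first memory type that is allowed by the resource's
 * memoryTypeBits and has all of the requested property flags,
 * or UINT32_MAX if nothing matches. */
static uint32_t
find_memory_type(VkPhysicalDevice pdev, uint32_t memory_type_bits,
                 VkMemoryPropertyFlags required)
{
   VkPhysicalDeviceMemoryProperties mem_props;
   vkGetPhysicalDeviceMemoryProperties(pdev, &mem_props);

   for (uint32_t i = 0; i < mem_props.memoryTypeCount; i++) {
      if ((memory_type_bits & (1u << i)) &&
          (mem_props.memoryTypes[i].propertyFlags & required) == required)
         return i;
   }
   return UINT32_MAX;
}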

Closing Thoughts

Nvidia GPUs should work pretty well from today on in my branch; I’m at a ~95% pass rate in piglit tests as of my last run, putting it solidly in second place on the Zink Preferred GPU List behind ANV, where I’m getting upwards of 97% of tests passing.

This didn’t make it into yesterday’s post, but everyone’s favorite benchmark also runs on zink+nvidia now:

nvidia-heaven.png

Here’s the caveat for all of the above: at present, zink on NV is unusably slow. The primary reason for this is that every frame that gets displayed has to be copied multiple times:

  • first, a GPU copy to a staging image that has HOST_VISIBLE (CPU-readable) memory
  • second, a CPU copy to another image which will then be used for displaying the frame

The first step in the process not only introduces more GPU work, it also forces an explicit fence, meaning that this is effectively right back where zink from master/release versions is: forcing a complete stop of all work prior to each frame being finished.

The second step is also pretty bad, as it delays queuing any further work by the driver until the entire frame has once again been copied.

There’s not really any way to improve the current situation, but that’s only temporary. In the “soon” future, we’ll be landing WSI support, which will greatly improve things for all the drivers that zink supports, but probably mostly this one.

February 03, 2021

So some months have passed since our last update, when we announced that v3dv became Vulkan 1.0 conformant. The main reason for not publishing as many posts is that we saw the 1.0 checkpoint as a good moment to hold off on adding big new features, and to focus on improving the codebase (refactoring, clean-ups, etc.) and the already existing features. For the latter we did a lot of work on performance. That alone would deserve a specific blog post, so in this one I will summarize the other stuff we did.

New features

Even if we didn’t focus on adding new features, we were still able to add some:

  • The following optional 1.0 features were enabled: logicOp, alphaToOne, independentBlend, drawIndirectFirstInstance, and shaderStorageImageExtendedFormats.
  • Added support for timestamp queries.
  • Added implementation for VK_KHR_maintenance1, VK_EXT_private_data, and VK_KHR_display extensions
  • Added support for Wayland WSI.

Here I would like to highlight that we started to get feature contributions from outside the initial core of developers that created the driver. VK_KHR_display was submitted by Steven Houston, and Wayland WSI support was submitted by Ella-0. Thanks a lot, really appreciated! We hope this will begin a trend of having more things implemented by the rpi/mesa community as a whole.

Bugfixing and vulkan tools

Even though the driver is now conformant, we were still testing it with several demos and applications, and providing fixes. As an example, we got Sascha Willems’ oit (Order Independent Transparency) demo working:

Sascha Willems’ oit demo on the rpi4

Among those applications that we were testing, we can highlight renderdoc and gfxreconstruct. The former is a frame-capture based graphics debugger and the latter is a tool that allows capturing and replaying several frames. Both tools are heavily used when debugging and testing Vulkan applications. We verified that they work on the rpi4 (fixing some bugs along the way), and also started to use them to help/guide the performance work we are doing.

Fosdem 2021

If you are interested in an overview of the development of the driver during the last year, we are going to present “Overview of the Open Source Vulkan Driver for Raspberry Pi 4” at FOSDEM this weekend (presentation details here).

Previous updates

Just in case you missed any of the updates on the Vulkan driver so far:

Vulkan raspberry pi first triangle
Vulkan update now with added source code
v3dv status update 2020-07-01
V3DV Vulkan driver update: VkQuake1-3 now working
v3dv status update 2020-07-31
v3dv status update 2020-09-07
Vulkan update: we’re conformant!

What Is Even Happening Anymore

Thanks once again to my generous sponsors at Valve, as well as a patch from a GLX guru with the very accurate commit log “I hate everything.”, I put in a little time today and came up with this:

nvidia.png

So now zink+nvidia is a thing.

Also: snow. Why is that a thing and can it stop being a thing for the rest of the week so I can stop shoveling?

January 28, 2021

(This post was first published with Collabora on Nov 19, 2020.) (Fixed a broken link on Jan 28, 2021.)

Wayland (the protocol and architecture) is still lacking proper consideration for color management. Wayland also lacks support for high dynamic range (HDR) imagery, which has been around in the movie and broadcasting industries for a while now (e.g. Netflix HDR UI).

While there are well established tools and workflows for how to do color management on X11, even X11 has not gained support for HDR. There were plans for it (Alex Goins' DeepColor Visuals), but as far as I know nothing really materialized from them. Right now, the only way to watch HDR content on an HDR monitor in Linux is to use the DRM KMS API directly, in other words, not use any window system, which means not using any desktop environment. Kodi is one of the very few applications that can do this at all.

This is a story about starting the efforts to fix the situation on Wayland.

History and People

Color management for Wayland has been talked about on and off for many years by dozens of people. To me it was obvious from the start that color management architecture on Wayland must be fundamentally different from X11. I thought the display server must be part of the color management stack instead of an untrusted, unknown entity that must be bypassed and overridden by applications that fight each other for who gets to configure the display. This opinion was wildly controversial and it took a long time to get my point across, but over the years some color management experts started to open up to new ideas and other people joined in the opinion as well.  Whether these new ideas are actually better than the ways of old remains to be seen, though. I think the promise of getting everything and more to work better is far too great to not try it out.

The discussions started several times over the years, but they always dried out mostly without any tangible progress. Color management is a wide, deep and difficult topic, and the required skills, knowledge, interest, and available time did not come together until fairly recently. People did write draft protocol extensions, but I would claim that it was not really until Sebastian Wick started building on top of them that things started moving forward. But one person cannot push such a huge effort alone even for the simple reason that there must be at least one reviewer before anything can be merged upstream. I was very lucky that since summer 2020 I have been able to work on Wayland color management and HDR for improving ChromeOS, letting me support Sebastian's efforts on a daily basis. Vitaly Prosyak joined the effort this year as well, researching how to combine the two seemingly different worlds of ICC and HDR, and how tone-mapping could be implemented.

I must also note the past efforts of Harish Krupo, who submitted a major Weston merge request, but unfortunately at the time reviewers in Weston upstream were not very available. Even before that, there were experiments by Ville Syrjälä. All these are now mostly superseded by the on-going work.

Currently the active people around the topic are me (Collabora), Vitaly Prosyak (AMD), and Naveen Kumar (Intel). Sebastian Wick (unaffiliated) is still around as well. None of us is a color management or HDR expert by trade, so we are all learning things as we go.

Design

The foundation for the color management protocol is ICC profile files for describing both output and content color spaces. The aim is ICCv4, while also allowing ICCv2, as these are well known and well supported in general. Adding iccMAX support or anything else will be possible any time in the future.

As color management is all about color spaces and gamuts, and high dynamic range (HDR) is also very much about color spaces and gamuts plus extended luminance range, Sebastian and I decided that the Wayland color management extension should cater for both from the beginning. Combining traditional color management and HDR is a fairly new thing as far as I know, and I'm not sure we have much prior art to base upon, so this is an interesting research journey as well. There is a lot of prior art on HDR and color management separately, but they tend to have fundamental differences that make the combination non-obvious.

To help us keep focused and to explain to the community what we actually intend with Wayland color management and HDR support, I wrote the section "Wayland Color Management and HDR Design Goals" in color.rst (draft). I very much recommend reading it so that you get a picture of what we (or I, at least) want to aim for.

Elle Stone explains in their article how color management should work on X11. As I wanted to avoid repeating the massive email threads that were had on the wayland-devel mailing list, I wrote the section "Color Pipeline Overview" in color.rst (draft) more or less as a response to her article, trying to explain in what ways Wayland will be different from X11. I think that understanding that section is paramount before anyone makes any comment on our efforts with the Wayland protocol extension.

HDR brings even more reasons to put color space conversions in the display server than just the idea that all applications should be color managed if not explicitly then implicitly.  Most of the desktop applications (well, literally all right now) are using Standard Dynamic Range (SDR).  SDR is a fuzzy concept referring to all traditional, non-HDR image content.  Therefore, your desktop is usually 100% SDR. You run your fancy new HDR monitor in SDR mode, which means it looks just like any old monitor with nothing fancy.  What if you want to watch a HDR video? The monitor won't display HDR in SDR mode.  If you simply switch the monitor to HDR mode, you will be blinded by all the over-bright SDR applications.  Switching monitor modes may also cause flicker and take a bit of time. That would be a pretty bad user experience, right?

A solution is to run your monitor in HDR mode all the time, and have the window system compositor convert all SDR application windows appropriately to the HDR luminance, so that they look normal in spite of the HDR mode. There will always be applications that will never support HDR at all, so the compositor doing the conversion is practically the only way.

For the protocol, we are currently exploring the use of relative luminance.  The reason is that people look at monitors in wildly varying viewing environments, under standard office lighting for example. The environment and personal preferences affect what monitor brightness you want. Also monitors themselves can be wildly different in their capabilities. Most prior art on HDR uses absolute luminance, but absolute luminance has the problem that it assumes a specific viewing environment, usually a dark room, similar to a movie theatre.  If a display server would show a movie with the absolute luminance it was mastered for, in most cases it would be far too dark to see. Whether using relative luminance at the protocol level turns out to be a good idea or not, we shall see.

Development

The Wayland color management and HDR protocol extension proposal is known as wayland/wayland-protocols!14 (MR14). Because it is a very long running merge request (the bar for landing a new protocol into wayland-protocols is high) and there are several people working on it, we started using sub-merge-requests to modify the proposal. You can find the sub-MRs in Sebastian's fork. If you have a change to propose, that is how to do it.

Obviously using sub-MRs also splits the review discussions into multiple places, but in this case I think it is a good thing, because the discussion threads in Gitlab are already massive.

There are several big and small open questions we haven't had the time to tackle yet even among the active group; questions that I feel we should have some tentative answers to before asking for wider community comments. There is also no set schedule, so don't hold your breath. This work is likely to take months still before there is a complete tentative protocol, and probably years until these features are available in your favourite Wayland desktop environments.

If you are an expert on the topics of color management or HDR displays and content, you are warmly welcome to join the development.

If you are an interested developer or an end user looking to try out things, sorry, there is nothing really for you yet.

January 26, 2021

Overhead Migration

The goal in this post is to migrate a truckload of code I wrote to handle sampler updating out of zink and into Gallium, thereby creating several days’ worth of rebase work for myself but also removing a costly codepath from the driver thread.

The first step in getting sampler creation to work right in zink is getting Gallium to create samplers with the correct filters in accordance with Chapter 42 of the Vulkan Spec:

VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT specifies that if VK_FORMAT_FEATURE_SAMPLED_IMAGE_BIT is also set, an image view can be used with a sampler that has either of magFilter or minFilter set to VK_FILTER_LINEAR, or mipmapMode set to VK_SAMPLER_MIPMAP_MODE_LINEAR. If VK_FORMAT_FEATURE_BLIT_SRC_BIT is also set, an image can be used as the srcImage to vkCmdBlitImage2KHR and vkCmdBlitImage with a filter of VK_FILTER_LINEAR. This bit must only be exposed for formats that also support the VK_FORMAT_FEATURE_SAMPLED_IMAGE_BIT or VK_FORMAT_FEATURE_BLIT_SRC_BIT.

If the format being queried is a depth/stencil format, this bit only specifies that the depth aspect (not the stencil aspect) of an image of this format supports linear filtering, and that linear filtering of the depth aspect is supported whether depth compare is enabled in the sampler or not. If this bit is not present, linear filtering with depth compare disabled is unsupported and linear filtering with depth compare enabled is supported, but may compute the filtered value in an implementation-dependent manner which differs from the normal rules of linear filtering. The resulting value must be in the range [0,1] and should be proportional to, or a weighted average of, the number of comparison passes or failures.

Here’s the (primary) function that I’ll be modifying to get everything working:

void
st_convert_sampler(const struct st_context *st,
                   const struct gl_texture_object *texobj,
                   const struct gl_sampler_object *msamp,
                   float tex_unit_lod_bias,
                   struct pipe_sampler_state *sampler)
{
   memset(sampler, 0, sizeof(*sampler));
   sampler->wrap_s = gl_wrap_xlate(msamp->Attrib.WrapS);
   sampler->wrap_t = gl_wrap_xlate(msamp->Attrib.WrapT);
   sampler->wrap_r = gl_wrap_xlate(msamp->Attrib.WrapR);

   if (texobj->_IsIntegerFormat && st->ctx->Const.ForceIntegerTexNearest) {
      sampler->min_img_filter = gl_filter_to_img_filter(GL_NEAREST);
      sampler->mag_img_filter = gl_filter_to_img_filter(GL_NEAREST);
   } else {
      sampler->min_img_filter = gl_filter_to_img_filter(msamp->Attrib.MinFilter);
      sampler->mag_img_filter = gl_filter_to_img_filter(msamp->Attrib.MagFilter);
   }
   sampler->min_mip_filter = gl_filter_to_mip_filter(msamp->Attrib.MinFilter);

   if (texobj->Target != GL_TEXTURE_RECTANGLE_ARB)
      sampler->normalized_coords = 1;

   sampler->lod_bias = msamp->Attrib.LodBias + tex_unit_lod_bias;
   /* Reduce the number of states by allowing only the values that AMD GCN
    * can represent. Apps use lod_bias for smooth transitions to bigger mipmap
    * levels.
    */
   sampler->lod_bias = CLAMP(sampler->lod_bias, -16, 16);
   sampler->lod_bias = roundf(sampler->lod_bias * 256) / 256;

   sampler->min_lod = MAX2(msamp->Attrib.MinLod, 0.0f);
   sampler->max_lod = msamp->Attrib.MaxLod;
   if (sampler->max_lod < sampler->min_lod) {
      /* The GL spec doesn't seem to specify what to do in this case.
       * Swap the values.
       */
      float tmp = sampler->max_lod;
      sampler->max_lod = sampler->min_lod;
      sampler->min_lod = tmp;
      assert(sampler->min_lod <= sampler->max_lod);
   }

   /* Check that only wrap modes using the border color have the first bit
    * set.
    */
   STATIC_ASSERT(PIPE_TEX_WRAP_CLAMP & 0x1);
   STATIC_ASSERT(PIPE_TEX_WRAP_CLAMP_TO_BORDER & 0x1);
   STATIC_ASSERT(PIPE_TEX_WRAP_MIRROR_CLAMP & 0x1);
   STATIC_ASSERT(PIPE_TEX_WRAP_MIRROR_CLAMP_TO_BORDER & 0x1);
   STATIC_ASSERT(((PIPE_TEX_WRAP_REPEAT |
                   PIPE_TEX_WRAP_CLAMP_TO_EDGE |
                   PIPE_TEX_WRAP_MIRROR_REPEAT |
                   PIPE_TEX_WRAP_MIRROR_CLAMP_TO_EDGE) & 0x1) == 0);

   /* For non-black borders... */
   if (/* This is true if wrap modes are using the border color: */
       (sampler->wrap_s | sampler->wrap_t | sampler->wrap_r) & 0x1 &&
       (msamp->Attrib.BorderColor.ui[0] ||
        msamp->Attrib.BorderColor.ui[1] ||
        msamp->Attrib.BorderColor.ui[2] ||
        msamp->Attrib.BorderColor.ui[3])) {
      const GLboolean is_integer = texobj->_IsIntegerFormat;
      GLenum texBaseFormat = _mesa_base_tex_image(texobj)->_BaseFormat;

      if (texobj->Attrib.StencilSampling)
         texBaseFormat = GL_STENCIL_INDEX;

      if (st->apply_texture_swizzle_to_border_color) {
         const struct st_texture_object *stobj = st_texture_object_const(texobj);
         /* XXX: clean that up to not use the sampler view at all */
         const struct st_sampler_view *sv = st_texture_get_current_sampler_view(st, stobj);

         if (sv) {
            struct pipe_sampler_view *view = sv->view;
            union pipe_color_union tmp;
            const unsigned char swz[4] =
            {
               view->swizzle_r,
               view->swizzle_g,
               view->swizzle_b,
               view->swizzle_a,
            };

            st_translate_color(&msamp->Attrib.BorderColor, &tmp,
                               texBaseFormat, is_integer);

            util_format_apply_color_swizzle(&sampler->border_color,
                                            &tmp, swz, is_integer);
         } else {
            st_translate_color(&msamp->Attrib.BorderColor,
                               &sampler->border_color,
                               texBaseFormat, is_integer);
         }
      } else {
         st_translate_color(&msamp->Attrib.BorderColor,
                            &sampler->border_color,
                            texBaseFormat, is_integer);
      }
   }

   sampler->max_anisotropy = (msamp->Attrib.MaxAnisotropy == 1.0 ?
                              0 : (GLuint) msamp->Attrib.MaxAnisotropy);

   /* If sampling a depth texture and using shadow comparison */
   if (msamp->Attrib.CompareMode == GL_COMPARE_R_TO_TEXTURE) {
      GLenum texBaseFormat = _mesa_base_tex_image(texobj)->_BaseFormat;

      if (texBaseFormat == GL_DEPTH_COMPONENT ||
          (texBaseFormat == GL_DEPTH_STENCIL && !texobj->Attrib.StencilSampling)) {
         sampler->compare_mode = PIPE_TEX_COMPARE_R_TO_TEXTURE;
         sampler->compare_func = st_compare_func_to_pipe(msamp->Attrib.CompareFunc);
      }
   }

   /* Only set the seamless cube map texture parameter because the per-context
    * enable should be ignored and treated as disabled when using texture
    * handles, as specified by ARB_bindless_texture.
    */
   sampler->seamless_cube_map = msamp->Attrib.CubeMapSeamless;
}

texobj here is the texture being sampled, msamp is the GL sampler object, and sampler is the template for the driver-backed sampler object that will be created with the pipe_context::create_sampler_state hook. The first half of the function deals with setting up filtering and wrap modes. The second half is mostly for border color pre-swizzling (i.e., what the Vulkan spec claims is the way that drivers should be handling border colors).

First: LINEAR Availability

First, if a driver doesn’t provide the format feature for linear filtering, linear filtering can’t be used.

I added a struct pipe_screen hook for this:

/**
 * Check if the given pipe_format and resource is supported for linear filtering
 * as a sampler view.
 * \param format The format to check.
 * \param pres The resource to check.
 */
bool (*is_linear_filtering_supported)( struct pipe_screen *,
                                       enum pipe_format format,
                                       struct pipe_resource *pres );

This gets called in st_convert_sampler(), which is the path that all user-managed samplers go through:

void
st_convert_sampler(const struct st_context *st,
                   const struct gl_texture_object *texobj,
                   const struct gl_sampler_object *msamp,
                   float tex_unit_lod_bias,
                   struct pipe_sampler_state *sampler)
{
   const struct st_texture_object *stobj = NULL;
   const struct st_sampler_view *sv = NULL;

   memset(sampler, 0, sizeof(*sampler));
   sampler->wrap_s = gl_wrap_xlate(msamp->Attrib.WrapS);
   sampler->wrap_t = gl_wrap_xlate(msamp->Attrib.WrapT);
   sampler->wrap_r = gl_wrap_xlate(msamp->Attrib.WrapR);

   bool is_linear_filtering_supported = true;

   if (st->pipe->screen->is_linear_filtering_supported) {
      enum pipe_format fmt = PIPE_FORMAT_NONE;
      stobj = st_texture_object_const(texobj);
      if (stobj->surface_based)
         fmt = stobj->surface_format;
      else {
         sv = st_texture_get_current_sampler_view(st, stobj);
         if (sv)
            fmt = sv->view->format;
         else
            fmt = stobj->pt->format;
      }
      assert(fmt != PIPE_FORMAT_NONE);
      is_linear_filtering_supported =
         st->pipe->screen->is_linear_filtering_supported(st->pipe->screen, fmt, stobj->pt);
   }

   if (!is_linear_filtering_supported ||
       (texobj->_IsIntegerFormat && st->ctx->Const.ForceIntegerTexNearest)) {
      sampler->min_img_filter = gl_filter_to_img_filter(GL_NEAREST);
      sampler->mag_img_filter = gl_filter_to_img_filter(GL_NEAREST);
   } else {
      sampler->min_img_filter = gl_filter_to_img_filter(msamp->Attrib.MinFilter);
      sampler->mag_img_filter = gl_filter_to_img_filter(msamp->Attrib.MagFilter);
   }

   if (is_linear_filtering_supported)
      sampler->min_mip_filter = gl_filter_to_mip_filter(msamp->Attrib.MinFilter);
   else
      sampler->min_mip_filter = gl_filter_to_img_filter(GL_NEAREST);

The code assumes by default that linear filtering is available for all formats, and uses the new is_linear_filtering_supported method, when the driver provides one, to override that assumption. The filtering modes are then set based on the (Vulkan) driver’s capabilities.
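For a Vulkan-backed driver the hook itself can be close to a one-liner. Here is a minimal sketch, not the actual zink implementation, using two hypothetical helpers (screen_to_vk_physical_device() and pipe_format_to_vk_format()) to stand in for the driver’s own plumbing:

#include <stdbool.h>
#include <vulkan/vulkan.h>
#include "pipe/p_screen.h"

/* hypothetical helpers standing in for the driver's internals */
VkPhysicalDevice screen_to_vk_physical_device(struct pipe_screen *pscreen);
VkFormat pipe_format_to_vk_format(enum pipe_format format);

static bool
example_is_linear_filtering_supported(struct pipe_screen *pscreen,
                                      enum pipe_format format,
                                      struct pipe_resource *pres)
{
   VkFormatProperties props;

   vkGetPhysicalDeviceFormatProperties(screen_to_vk_physical_device(pscreen),
                                       pipe_format_to_vk_format(format),
                                       &props);
   /* Assumes optimal tiling; a real driver would check the tiling that the
    * resource actually uses. */
   return props.optimalTilingFeatures &
          VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT;
}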

Easy.

Second: LINEAR Depth Filtering

The spec allows linear filtering unconditionally for formats containing a depth aspect so long as depth compare is enabled:

If this bit is not present, linear filtering with depth compare disabled is unsupported and linear filtering with depth compare enabled is supported

For this, I added PIPE_CAP_LINEAR_DEPTH_FILTERING and some interaction with the pipe_screen::is_linear_filtering_supported hook:

...
   bool is_linear_filtering_supported = true;
   bool has_depth = false;

   if (st->pipe->screen->is_linear_filtering_supported) {
      enum pipe_format fmt = PIPE_FORMAT_NONE;
      stobj = st_texture_object_const(texobj);
      if (stobj->surface_based)
         fmt = stobj->surface_format;
      else {
         sv = st_texture_get_current_sampler_view(st, stobj);
         if (sv)
            fmt = sv->view->format;
         else
            fmt = stobj->pt->format;
      }
      assert(fmt != PIPE_FORMAT_NONE);
      is_linear_filtering_supported =
         st->pipe->screen->is_linear_filtering_supported(st->pipe->screen, fmt, stobj->pt);
      if (st->linear_depth_filtering_semantics)
         has_depth = util_format_has_depth(util_format_description(fmt));
   }

   /* PIPE_CAP_LINEAR_DEPTH_FILTERING */
   if (has_depth &&
       !is_linear_filtering_supported) {
      /* this conditional has the same result as the one after it,
       * but its complexity makes splitting it more readable
       */
      sampler->min_img_filter = gl_filter_to_img_filter(GL_NEAREST);
      sampler->mag_img_filter = gl_filter_to_img_filter(GL_NEAREST);
   } else if ((!is_linear_filtering_supported && !has_depth) ||
       (texobj->_IsIntegerFormat && st->ctx->Const.ForceIntegerTexNearest)) {
      sampler->min_img_filter = gl_filter_to_img_filter(GL_NEAREST);
      sampler->mag_img_filter = gl_filter_to_img_filter(GL_NEAREST);
   } else {
      sampler->min_img_filter = gl_filter_to_img_filter(msamp->Attrib.MinFilter);
      sampler->mag_img_filter = gl_filter_to_img_filter(msamp->Attrib.MagFilter);
   }

   if (is_linear_filtering_supported || has_depth)
      sampler->min_mip_filter = gl_filter_to_mip_filter(msamp->Attrib.MinFilter);
   else
      sampler->min_mip_filter = gl_filter_to_img_filter(GL_NEAREST);
...
   /* PIPE_CAP_LINEAR_DEPTH_FILTERING */
   if (sampler->compare_mode == PIPE_TEX_COMPARE_NONE &&
       has_depth && !is_linear_filtering_supported &&
       (sampler->mag_img_filter == PIPE_TEX_FILTER_LINEAR ||
        sampler->min_img_filter == PIPE_TEX_FILTER_LINEAR ||
        sampler->min_mip_filter == PIPE_TEX_FILTER_LINEAR)) {
      sampler->compare_mode = PIPE_TEX_COMPARE_R_TO_TEXTURE;
      sampler->compare_func = PIPE_FUNC_ALWAYS;
   }

If the format has a depth component, allow linear filtering and, if necessary, enable depth compare.

Nothing too complex here either.

Third: Handle Int-based Border Colors

Now it’s time to start getting grimy. Vulkan’s custom border color extension is, at best, functional. One component of the spec for this is that border colors can be specified as either integer or float values, and this is actually significant, so ideally this should just be passed through using the same value the user specified.

I made PIPE_CAP_NEED_BORDER_COLOR_TYPE for this. When specified, the border_color_is_integer member that I added to struct pipe_sampler_state will be treated as a disambiguating value for sampler states, and thus zink can use it to set the right type of border color values. It affects this glorious micro-optimization in the CSO state cache:

void
cso_single_sampler(struct cso_context *ctx, enum pipe_shader_type shader_stage,
                   unsigned idx, const struct pipe_sampler_state *templ)
{
   if (templ) {
      unsigned key_size = ctx->needs_sampler_border_color_type ?
                             sizeof(struct pipe_sampler_state) :
                             offsetof(struct pipe_sampler_state, border_color_is_integer);
      unsigned hash_key = cso_construct_key((void*)templ, key_size);
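On the zink side, the new field can then be used to pick between the integer and float variants of VK_EXT_custom_border_color when filling in the sampler create info. A hedged sketch (not the actual zink code, just the general idea):

#include <stdbool.h>
#include <stddef.h>
#include <vulkan/vulkan.h>

static void
set_custom_border_color(VkSamplerCreateInfo *sci,
                        VkSamplerCustomBorderColorCreateInfoEXT *cbci,
                        const VkClearColorValue *color,
                        bool is_integer)
{
   cbci->sType = VK_STRUCTURE_TYPE_SAMPLER_CUSTOM_BORDER_COLOR_CREATE_INFO_EXT;
   cbci->pNext = NULL;
   /* VK_FORMAT_UNDEFINED here requires the customBorderColorWithoutFormat
    * feature to be enabled. */
   cbci->format = VK_FORMAT_UNDEFINED;
   cbci->customBorderColor = *color;

   /* border_color_is_integer decides which enum variant the driver sees. */
   sci->borderColor = is_integer ? VK_BORDER_COLOR_INT_CUSTOM_EXT
                                 : VK_BORDER_COLOR_FLOAT_CUSTOM_EXT;
   sci->pNext = cbci;
}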

Finally: GL_CLAMP

Get ready to roll around in some mud.

I added PIPE_CAP_EMULATE_GL_CLAMP_PLZ (named by Kayden, not up for discussion) to handle this awfulness. Functionally, it’s comprised of 3 components:

  • handling in Mesa core to flag shader and sampler state updates any time a sampler sets or unsets GL_CLAMP (or GL_MIRROR_CLAMP)
  • handling in Gallium to force the sampler to either CLAMP_TO_BORDER or CLAMP_TO_EDGE depending on linear filtering availability for a given resource
  • shader rewrites in Gallium to run nir_lower_tex with bitfield info set for the GL_CLAMP samplers (henceforth gl_clamplers) that need to be rewritten

Since I’m already going deep into this function, let’s go once more into st_convert_sampler():

...
   /* PIPE_CAP_LINEAR_DEPTH_FILTERING */
   if (has_depth &&
       !is_linear_filtering_supported &&
       (!st->emulate_gl_clamp || (
       sampler->wrap_s != PIPE_TEX_WRAP_CLAMP &&
       sampler->wrap_s != PIPE_TEX_WRAP_MIRROR_CLAMP &&
       sampler->wrap_t != PIPE_TEX_WRAP_CLAMP &&
       sampler->wrap_t != PIPE_TEX_WRAP_MIRROR_CLAMP &&
       sampler->wrap_r != PIPE_TEX_WRAP_CLAMP &&
       sampler->wrap_r != PIPE_TEX_WRAP_MIRROR_CLAMP))) {
      /* this conditional has the same result as the one after it,
       * but its complexity makes splitting it more readable
       */
      sampler->min_img_filter = gl_filter_to_img_filter(GL_NEAREST);
      sampler->mag_img_filter = gl_filter_to_img_filter(GL_NEAREST);
   } else if ((!is_linear_filtering_supported && !has_depth) ||
       (texobj->_IsIntegerFormat && st->ctx->Const.ForceIntegerTexNearest)) {
      sampler->min_img_filter = gl_filter_to_img_filter(GL_NEAREST);
      sampler->mag_img_filter = gl_filter_to_img_filter(GL_NEAREST);
   } else {
      sampler->min_img_filter = min_img_filter;
      sampler->mag_img_filter = gl_filter_to_img_filter(msamp->Attrib.MagFilter);
   }
...
   if (st->emulate_gl_clamp) {
      bool clamp_to_border = (is_linear_filtering_supported || has_depth) &&
                             min_img_filter != PIPE_TEX_FILTER_NEAREST;
      if (sampler->wrap_s == PIPE_TEX_WRAP_CLAMP)
         sampler->wrap_s = clamp_to_border ? PIPE_TEX_WRAP_CLAMP_TO_BORDER :
                                             PIPE_TEX_WRAP_CLAMP_TO_EDGE;
      else if (sampler->wrap_s == PIPE_TEX_WRAP_MIRROR_CLAMP)
         sampler->wrap_s = clamp_to_border ? PIPE_TEX_WRAP_MIRROR_CLAMP_TO_BORDER :
                                             PIPE_TEX_WRAP_MIRROR_CLAMP_TO_EDGE;

      if (sampler->wrap_t == PIPE_TEX_WRAP_CLAMP)
         sampler->wrap_t = clamp_to_border ? PIPE_TEX_WRAP_CLAMP_TO_BORDER :
                                             PIPE_TEX_WRAP_CLAMP_TO_EDGE;
      else if (sampler->wrap_t == PIPE_TEX_WRAP_MIRROR_CLAMP)
         sampler->wrap_t = clamp_to_border ? PIPE_TEX_WRAP_MIRROR_CLAMP_TO_BORDER :
                                             PIPE_TEX_WRAP_MIRROR_CLAMP_TO_EDGE;

      if (sampler->wrap_r == PIPE_TEX_WRAP_CLAMP)
         sampler->wrap_r = clamp_to_border ? PIPE_TEX_WRAP_CLAMP_TO_BORDER :
                                             PIPE_TEX_WRAP_CLAMP_TO_EDGE;
      else if (sampler->wrap_r == PIPE_TEX_WRAP_MIRROR_CLAMP)
         sampler->wrap_r = clamp_to_border ? PIPE_TEX_WRAP_MIRROR_CLAMP_TO_BORDER :
                                             PIPE_TEX_WRAP_MIRROR_CLAMP_TO_EDGE;
   }
...

The depth component codepath is a little trickier, so I’ve split it off for readability even though it could be collapsed into the conditional that follows. In short, this is only allowing linear filtering for unsupported depth formats when GL_CLAMP is also used.

With the filtering and wrap modes set, the next step here is to adjust the wrap modes based on whether linear filtering is available and the min filter mode. If linear is available and the min filter is linear, GL_CLAMP becomes CLAMP_TO_BORDER, otherwise it’s CLAMP_TO_EDGE. In conjunction with the NIR pass, this ends up replicating the expected behavior.

And to get that info to the NIR pass, more awfulness is required:

static inline GLboolean
is_wrap_gl_clamp(GLint param)
{
   return param == GL_CLAMP || param == GL_MIRROR_CLAMP_EXT;
}

static void
update_gl_clamplers(struct st_context *st, struct gl_program *prog, uint32_t *gl_clamplers)
{
   if (!st->emulate_gl_clamp)
      return;

   gl_clamplers[0] = gl_clamplers[1] = gl_clamplers[2] = 0;
   GLbitfield samplers_used = prog->SamplersUsed;
   unsigned unit;
   /* same as st_atom_sampler.c */
   for (unit = 0; samplers_used; unit++, samplers_used >>= 1) {
      unsigned tex_unit = prog->SamplerUnits[unit];
      if (samplers_used & 1 &&
          (st->ctx->Texture.Unit[tex_unit]._Current->Target != GL_TEXTURE_BUFFER ||
           st->texture_buffer_sampler)) {
         const struct gl_texture_object *texobj;
         struct gl_context *ctx = st->ctx;
         const struct gl_sampler_object *msamp;

         texobj = ctx->Texture.Unit[tex_unit]._Current;
         assert(texobj);

         msamp = _mesa_get_samplerobj(ctx, tex_unit);
         if (is_wrap_gl_clamp(msamp->Attrib.WrapS))
            gl_clamplers[0] |= BITFIELD64_BIT(unit);
         if (is_wrap_gl_clamp(msamp->Attrib.WrapT))
            gl_clamplers[1] |= BITFIELD64_BIT(unit);
         if (is_wrap_gl_clamp(msamp->Attrib.WrapR))
            gl_clamplers[2] |= BITFIELD64_BIT(unit);
      }
   }
}

This function iterates over all the samplers used by a given shader (struct gl_program), checking the wrap modes for GL_CLAMP and then updating the bitfields which correspond to struct nir_lower_tex_options::saturate_{s,t,r} when one is found. Each Gallium shader key is updated to use the values for comparisons, though I’ve helpfully reduced the key size used for comparisons for drivers which don’t set the pipe cap as well as those which do but have yet to see a gl_clampler.
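To connect the dots, here is a hedged sketch of how those bitfields could be handed to nir_lower_tex when the shader gets (re)compiled. The function and the surrounding plumbing are hypothetical, but saturate_{s,t,r} are the real nir_lower_tex_options fields mentioned above.

#include <stdint.h>
#include "compiler/nir/nir.h"

static void
lower_gl_clamplers(nir_shader *nir, const uint32_t gl_clamplers[3])
{
   nir_lower_tex_options opts = {0};

   /* Bitmasks of samplers whose s/t/r coordinates should be clamped to
    * [0,1] in the shader, matching the wrap modes that st_convert_sampler
    * rewrote to CLAMP_TO_BORDER / CLAMP_TO_EDGE. */
   opts.saturate_s = gl_clamplers[0];
   opts.saturate_t = gl_clamplers[1];
   opts.saturate_r = gl_clamplers[2];

   if (opts.saturate_s || opts.saturate_t || opts.saturate_r)
      nir_lower_tex(nir, &opts);
}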

The Result

By setting all these pipe caps and adding a trivial function, zink no longer needs to internally create and track sampler variants based on the above factors in order to support various sampler modes. Additionally, all Gallium-based drivers which emulate GL_CLAMP (there’s several) can switch over to this and delete a bunch of code.

Hooray.

January 22, 2021
Passion Led Us Here

How our for-profit company became a nonprofit, to better tackle the digital divide.

Originally posted on the Endless OS Foundation blog.

An 8-year journey to a nonprofit

On the 1st of April 2020, our for-profit Endless Mobile officially became a nonprofit as the Endless OS Foundation. Our launch as a nonprofit just as the global pandemic took hold was, predictably, hardly noticed, but for us the timing was incredible: as the world collectively asked “What can we do to help others in need?”, we framed our mission statement and launched our .org with the same very important question in mind. Endless always had a social impact mission at its heart, and the challenges related to students, families, and communities falling further into the digital divide during COVID-19 brought new urgency and purpose to our team’s decision to officially step in the social welfare space.

On April 1st 2020, our for-profit Endless Mobile officially became a nonprofit as the Endless OS Foundation, focused on the #DigitalDivide.

Our updated status was a long time coming: we began our transformation to a nonprofit organization in late 2019 with the realization that the true charter and passions of our team would be greatly accelerated without the constraints of for-profit goals, investors and sales strategies standing in the way of our mission of digital access and equity for all. 

But for 8 years we made a go of it commercially, headquartered in Silicon Valley and framing ourselves as a tech startup with access to the venture capital and partnerships on our doorstep. We believed that a successful commercial channel would be the most efficient way to scale the impact of bringing computer devices and access to communities in need. We still believe this – we’ve just learned through our experience that we don’t have the funding to enter the computer and OS marketplace head-on. With the social impact goal first, and the hope of any revenue a secondary goal, we have had many successes in those 8 years bridging the digital divide throughout the world, from Brazil, to Kenya, and the USA. We’ve learned a huge amount which will go on to inform our strategy as a nonprofit.

Endless always had a social impact mission at its heart. COVID-19 brought new urgency and purpose to our team’s decision to officially step in the social welfare space.

Our unique perspective

One thing we learned as a for-profit is that the OS and technology we’ve built has some unique properties which are hugely impactful as a working solution to digital equity barriers. And our experience deploying in the field around the world for 8 years has left us uniquely informed via many iterations and incremental improvements.

Endless OS designer in discussion with prospective user

With this knowledge in-hand, we’ve been refining our strategy throughout 2020 and now starting to focus on what it really means to become an effective nonprofit and make that impact. In many ways it is liberating to abandon the goals and constraints of being a for-profit entity, and in other ways it’s been a challenging journey for me and the team to adjust our way of thinking and let these for-profit notions and models go. Previously we exclusively built and sold a product that defined our success; and any impact we achieved was a secondary consequence of that success and seen through that lens. Now our success is defined purely in terms of social impact, and through our actions, those positive impacts can be made with or without our “product”. That means that we may develop and introduce technology to solve a problem, but it is equally as valid to find another organization’s existing offering and design a way to increase that positive impact and scale.

We develop technology to solve access equity issues, but it’s equally as valid to find another organization’s offering and partner in a way that increases their positive impact.

The analogy to Free and Open Source Software is very strong – while Endless has always used and contributed to a wide variety of FOSS projects, we’ve also had a tension where we’ve been trying to hold some pieces back and capture value – such as our own application or content ecosystem, our own hardware platform – necessarily making us competitors to other organisations even though they were hoping to achieve the same things as us. As a nonprofit we can let these ideas go and just pick the best partners and technologies to help the people we’re trying to reach.

School kids writing on paper

Digital equity … 4 barriers we need to overcome

In future, our decisions around which projects to build or engage with will revolve around 4 barriers to digital equity, and how our Endless OS, Endless projects, or our partners’ offerings can help to solve them. We define these 4 equity barriers as: barriers to devices, barriers to connectivity, barriers to literacy in terms of your ability to use the technology, and barriers to engagement in terms of whether using the system is rewarding and worthwhile.

We define the 4 digital equity barriers we exist to impact as:
1. barriers to devices
2. barriers to connectivity
3. barriers to literacy
4. barriers to engagement

It doesn’t matter who makes the solutions that break these barriers; what matters is how we assist in enabling people to use technology to gain access to the education and opportunities these barriers block. Our goal therefore is to simply ensure that solutions exist – building them ourselves and with partners such as the FOSS community and other nonprofits – proving them with real-world deployments, and sharing our results as widely as possible to allow for better adoption globally.

If we define our goal purely in terms of whether people are using Endless OS, we are effectively restricting the reach and scale of our solutions to the audience we can reach directly with Endless OS downloads, installs and propagation. Conversely, partnerships that scale impact are a win-win-win for us, our partners, and the communities we all serve. 

Engineering impact

Our Endless engineering roots and capabilities feed our unique ability to build and deploy all of our solutions, and the practical experience of deploying them gives us evidence and credibility as we advocate for their use. Either activity would be weaker without the other.

Our engineering roots and capabilities feed our unique ability to build and deploy digital divide solutions.

Our partners in various engineering communities will have already seen our change in approach. Particularly, with GNOME we are working hard to invest in upstream and reconcile the long-standing differences between our experience and GNOME. If successful, many more people can benefit from our work than just users of Endless OS. We’re working with Learning Equality on Kolibri to build a better app experience for Linux desktop users and bring content publishers into our ecosystem for the first time, and we’ve also taken our very own Hack, the immersive and fun destination for kids learning to code, released it for non-Endless systems on Flathub, and made it fully open-source.

Planning tasks with sticky notes on a whiteboard

What’s next for our OS?

What then is in store for the future of Endless OS, the place where we have invested so much time and planning through years of iterations? For the immediate future, we need the capacity to deploy everything we’ve built – all at once, to our partners. We built an OS that we feel is very unique and valuable, containing a number of world-firsts: first production OS shipped with OSTree, first Flatpak-only desktop, built-in support for updating OS and apps from USBs, while still providing a great deal of reliability and convenience for deployments in offline and educational-safe environments with great apps and content loaded on every system.

However, we need to find a way to deliver this Linux-based experience in a more efficient way, and we’d love to talk if you have ideas about how we can do this, perhaps as partners. Can the idea of “Endless OS” evolve to become a spec that is provided by different platforms in the future, maybe remixes of Debian, Fedora, openSUSE or Ubuntu? 

Build, Validate, Advocate

Beyond the OS, the Endless OS Foundation has identified multiple programs to help underserved communities, and in each case we are adopting our “build, validate, advocate” strategy. This approach underpins all of our projects: can we build the technology (or assist in the making), will a community in-need validate it by adoption, and can we inspire others by telling the story and advocating for its wider use?

We are adopting a “build, validate, advocate” strategy.
1. build the technology (or assist in the making)
2. validate by community adoption
3. advocate for its wider use

As examples, we have just launched the Endless Key (link) as an offline solution for students during the COVID-19 at-home distance learning challenges. This project is also establishing a first-ever partnership of well-known online educational brands to reach an underserved offline audience with valuable learning resources. We are developing a pay-as-you-go platform and new partnerships that will allow families to own laptops via micro-payments that are built directly into the operating system, even if they cannot qualify for standard retail financing. And during the pandemic, we’ve partnered with Teach For America to focus on very practical digital equity needs in the USA’s urban and rural communities.

One part of the world-wide digital divide solution

We are one solution provider for the complex matrix of issues known collectively as the #DigitalDivide, and these issues will not disappear after the pandemic. Digital equity was an issue long before COVID-19, and we are not so naive to think it can be solved by any single institution, or by the time the pandemic recedes. It will take time and a coalition of partnerships to win. We are in for the long-haul and we are always looking for partners, especially now as we are finding our feet in the nonprofit world. We’d love to hear from you, so please feel free to reach out to me – I’m ramcq on IRC, RocketChat, Twitter, LinkedIn or rob@endlessos.org.