For my first encoder I started in C++ and wrote a straightforward but feature complete encoder. It supports exhaustive partition scanning or it can estimate a single partition to evaluate which is around 12x faster. It's also pretty generic, so you can call the endpoint optimizer with any number of pixels (not just 16). About the only thing good you can say about this encoder is that it's high quality and easy to follow. Its performance is dreadful (even worse then DirectXTex, believe it or not) unless you force it to evaluate only a single partition per mode in which case it's ~2.6x faster than DirectXTex.
I then wrote a new, simpler C++ encoder, then quickly ported this to C. For this simpler encoder, I focused on just the modes that mattered most to performance vs. quality. The endpoint optimizer was less generic and more focused, and I removed some less useful features in perceptual mode. I then added in additional modes to this encoder to bring its quality up vs. Intel's ispc_texcomp. After this, I took the C encoder and got it building with iscp, then profiled and optimized this over a period of a week or so.
1. Ridiculously slow C++ encoder, evaluate all partitions:
46286.7 secs, 51.56 Luma PSNR
2. Ridiculously slow C++ encoder, evaluate the single best partition only:
4615.5 secs, 51.35 Luma PSNR
3. Faster C encoder:
458.7 secs, 50.77 Luma PSNR
4. Vectorized encoder - C encoder ported to ispc then further optimized and enhanced:
76.4 secs, 51.16 Luma PSNR
Notice each time I got roughly a 6-10x performance increase at similar quality. The iscp version is currently 6x faster than the C version. To get it there I had to examine the code generated for all the inner loops and tweak the iscp code for decent code generation. Without doing this it was only around 2x faster.
I used the previous encoder as a reference design during each step of this optimization process, which was invaluable for checking for regressions.
Some things about BC7 that are commonly believed but aren't really true, and other things I've learned along the way (along with some light ranting but I mean well):
- "BC7 has this massive search space - it takes forever to encode!" Actually, with a properly written encoder it's not that big of a deal. You can very quickly estimate which partition(s) to examine (which is an extremely vectorizable operation), you can drop half the modes for alpha textures, mode 7 is useless with opaque textures, the 3 subset modes can be dropped unless you want absolute max quality, and many fast endpoint optimization methods from BC1 are usable with BC7. BC1-style cluster fit isn't a good match for BC7 (because it has 3 and 4 bit indices), but you can easily approximate it.
For a fast opaque-only encoder, my statistics show that supporting just modes 1 and 6, or 0/1/6, results in reasonably high quality without noticeable block artifacts.
- CPU encoders are too slow, so let's use the GPU!
Actually, a properly vectorized CPU encoder can be extremely fast. There is nothing special about BC7 that requires compute shaders if you do it correctly. Using gradient descent or calling rand() in your encoder's inner loop are red flags that you are going off the rails. The search space is massive but most of it is useless.
You need the right algorithms, and strong heuristics matter in GPU texture compression. Write a BC1 encoder that can compete against squish first, then tackle BC7.
- Graphics dev BC7 folklore: "3 subset modes are useless!"
No, they aren't. iscp_texcomp and my encoder make heavy use of these modes. The perf reduction for checking the 3 subset modes is minimal but the PSNR gain from using them is high:
Across my 31 texture corpus, all modes (but 7) gets 46.72 dB, disabling the two 3 subset modes I get 45.96 dB. This is a significant difference. Some timings: 487 secs (3 subsets enabled) vs. 476 secs (3 subsets disabled). In another test at lower quality: 215.9 secs (3 subsets enabled) vs. 199.5 secs (disabled).
Before you write a BC1 or BC7 encoder, you should probably write a simple JPEG compressor first to learn all about chroma subsampling, etc.