From: Nick Terrell <terrelln@meta.com>
Backport upstream commit c7269ad [0] to improve zstd decoding speed.
Updating the kernel to zstd v1.5.5 earlier in this patch series
regressed zstd decoding speed. This turned out to be because gcc was not
unrolling the inner loops of the Huffman decoder, which are executed a
constant number of times [1]. This really hurts performance, as we expect
these loops to be completely branch-free. This commit fixes the issue by
unrolling the loops manually [2].
This commit also fixes a minor issue: a variable shift is now masked
with 0x3F. The shift amount was guaranteed to be less than 64, but gcc
couldn't prove that, and emitted suboptimal code.
Finally, the upstream commit added a build macro
`HUF_DISABLE_FAST_DECODE` which is not used in the kernel, but is
retained to keep a clean import from upstream.
This commit was generated from upstream signed tag v1.5.5-kernel [3] by:
export ZSTD=/path/to/repo/zstd/
export LINUX=/path/to/repo/linux/
cd "$ZSTD/contrib/linux-kernel"
git checkout v1.5.5-kernel
make import LINUX="$LINUX"
I ran my benchmark & test suite before and after this commit to measure
the overall decompression speed benefit. It benchmarks zstd at several
compression levels, measuring the total time it takes to read data from
the compressed filesystem.
Component, Level, Read time delta
Btrfs , 1, -7.0%
Btrfs , 3, -3.9%
Btrfs , 5, -4.7%
Btrfs , 7, -5.5%
Btrfs , 9, -2.4%
Squashfs , 1, -9.1%
Link: https://github.com/facebook/zstd/commit/c7269add7eaf028ed828d9af41e732cf01993aad
Link: https://gist.github.com/terrelln/2e14ff1fb197102a08d7823d8044978d
Link: https://gist.github.com/terrelln/a70bde22a2abc800691fb65c21eabc2a
Link: https://github.com/facebook/zstd/tree/v1.5.5-kernel
Signed-off-by: Nick Terrell <terrelln@fb.com>
---
lib/zstd/decompress/huf_decompress.c | 171 ++++++++++++++++-----------
1 file changed, 105 insertions(+), 66 deletions(-)
diff --git a/lib/zstd/decompress/huf_decompress.c b/lib/zstd/decompress/huf_decompress.c
index d172e35fbd9a..db670d71fdab 100644
--- a/lib/zstd/decompress/huf_decompress.c
+++ b/lib/zstd/decompress/huf_decompress.c
@@ -35,6 +35,12 @@
* Macros
****************************************************************/
+#ifdef HUF_DISABLE_FAST_DECODE
+# define HUF_ENABLE_FAST_DECODE 0
+#else
+# define HUF_ENABLE_FAST_DECODE 1
+#endif
+
/* These two optional macros force the use one way or another of the two
* Huffman decompression implementations. You can't force in both directions
* at the same time.
@@ -289,6 +295,24 @@ static size_t HUF_initRemainingDStream(BIT_DStream_t* bit, HUF_DecompressFastArg
return 0;
}
+/* Calls X(N) for each stream 0, 1, 2, 3. */
+#define HUF_4X_FOR_EACH_STREAM(X) \
+ { \
+ X(0) \
+ X(1) \
+ X(2) \
+ X(3) \
+ }
+
+/* Calls X(N, var) for each stream 0, 1, 2, 3. */
+#define HUF_4X_FOR_EACH_STREAM_WITH_VAR(X, var) \
+ { \
+ X(0, (var)) \
+ X(1, (var)) \
+ X(2, (var)) \
+ X(3, (var)) \
+ }
+
#ifndef HUF_FORCE_DECOMPRESS_X2
@@ -702,7 +726,6 @@ void HUF_decompress4X1_usingDTable_internal_fast_c_loop(HUF_DecompressFastArgs*
for (;;) {
BYTE* olimit;
int stream;
- int symbol;
/* Assert loop preconditions */
#ifndef NDEBUG
@@ -749,27 +772,42 @@ void HUF_decompress4X1_usingDTable_internal_fast_c_loop(HUF_DecompressFastArgs*
}
#endif
+#define HUF_4X1_DECODE_SYMBOL(_stream, _symbol) \
+ { \
+ int const index = (int)(bits[(_stream)] >> 53); \
+ int const entry = (int)dtable[index]; \
+ bits[(_stream)] <<= (entry & 0x3F); \
+ op[(_stream)][(_symbol)] = (BYTE)((entry >> 8) & 0xFF); \
+ }
+
+#define HUF_4X1_RELOAD_STREAM(_stream) \
+ { \
+ int const ctz = ZSTD_countTrailingZeros64(bits[(_stream)]); \
+ int const nbBits = ctz & 7; \
+ int const nbBytes = ctz >> 3; \
+ op[(_stream)] += 5; \
+ ip[(_stream)] -= nbBytes; \
+ bits[(_stream)] = MEM_read64(ip[(_stream)]) | 1; \
+ bits[(_stream)] <<= nbBits; \
+ }
+
+ /* Manually unroll the loop because compilers don't consistently
+ * unroll the inner loops, which destroys performance.
+ */
do {
/* Decode 5 symbols in each of the 4 streams */
- for (symbol = 0; symbol < 5; ++symbol) {
- for (stream = 0; stream < 4; ++stream) {
- int const index = (int)(bits[stream] >> 53);
- int const entry = (int)dtable[index];
- bits[stream] <<= (entry & 63);
- op[stream][symbol] = (BYTE)((entry >> 8) & 0xFF);
- }
- }
- /* Reload the bitstreams */
- for (stream = 0; stream < 4; ++stream) {
- int const ctz = ZSTD_countTrailingZeros64(bits[stream]);
- int const nbBits = ctz & 7;
- int const nbBytes = ctz >> 3;
- op[stream] += 5;
- ip[stream] -= nbBytes;
- bits[stream] = MEM_read64(ip[stream]) | 1;
- bits[stream] <<= nbBits;
- }
+ HUF_4X_FOR_EACH_STREAM_WITH_VAR(HUF_4X1_DECODE_SYMBOL, 0)
+ HUF_4X_FOR_EACH_STREAM_WITH_VAR(HUF_4X1_DECODE_SYMBOL, 1)
+ HUF_4X_FOR_EACH_STREAM_WITH_VAR(HUF_4X1_DECODE_SYMBOL, 2)
+ HUF_4X_FOR_EACH_STREAM_WITH_VAR(HUF_4X1_DECODE_SYMBOL, 3)
+ HUF_4X_FOR_EACH_STREAM_WITH_VAR(HUF_4X1_DECODE_SYMBOL, 4)
+
+ /* Reload each of the 4 bitstreams */
+ HUF_4X_FOR_EACH_STREAM(HUF_4X1_RELOAD_STREAM)
} while (op[3] < olimit);
+
+#undef HUF_4X1_DECODE_SYMBOL
+#undef HUF_4X1_RELOAD_STREAM
}
_out:
@@ -865,7 +903,7 @@ static size_t HUF_decompress4X1_usingDTable_internal(void* dst, size_t dstSize,
}
#endif
- if (!(flags & HUF_flags_disableFast)) {
+ if (HUF_ENABLE_FAST_DECODE && !(flags & HUF_flags_disableFast)) {
size_t const ret = HUF_decompress4X1_usingDTable_internal_fast(dst, dstSize, cSrc, cSrcSize, DTable, loopFn);
if (ret != 0)
return ret;
@@ -1487,7 +1525,6 @@ void HUF_decompress4X2_usingDTable_internal_fast_c_loop(HUF_DecompressFastArgs*
for (;;) {
BYTE* olimit;
int stream;
- int symbol;
/* Assert loop preconditions */
#ifndef NDEBUG
@@ -1544,54 +1581,56 @@ void HUF_decompress4X2_usingDTable_internal_fast_c_loop(HUF_DecompressFastArgs*
}
#endif
+#define HUF_4X2_DECODE_SYMBOL(_stream, _decode3) \
+ if ((_decode3) || (_stream) != 3) { \
+ int const index = (int)(bits[(_stream)] >> 53); \
+ HUF_DEltX2 const entry = dtable[index]; \
+ MEM_write16(op[(_stream)], entry.sequence); \
+ bits[(_stream)] <<= (entry.nbBits) & 0x3F; \
+ op[(_stream)] += (entry.length); \
+ }
+
+#define HUF_4X2_RELOAD_STREAM(_stream) \
+ { \
+ HUF_4X2_DECODE_SYMBOL(3, 1) \
+ { \
+ int const ctz = ZSTD_countTrailingZeros64(bits[(_stream)]); \
+ int const nbBits = ctz & 7; \
+ int const nbBytes = ctz >> 3; \
+ ip[(_stream)] -= nbBytes; \
+ bits[(_stream)] = MEM_read64(ip[(_stream)]) | 1; \
+ bits[(_stream)] <<= nbBits; \
+ } \
+ }
+
+ /* Manually unroll the loop because compilers don't consistently
+ * unroll the inner loops, which destroys performance.
+ */
do {
- /* Do 5 table lookups for each of the first 3 streams */
- for (symbol = 0; symbol < 5; ++symbol) {
- for (stream = 0; stream < 3; ++stream) {
- int const index = (int)(bits[stream] >> 53);
- HUF_DEltX2 const entry = dtable[index];
- MEM_write16(op[stream], entry.sequence);
- bits[stream] <<= (entry.nbBits);
- op[stream] += (entry.length);
- }
- }
- /* Do 1 table lookup from the final stream */
- {
- int const index = (int)(bits[3] >> 53);
- HUF_DEltX2 const entry = dtable[index];
- MEM_write16(op[3], entry.sequence);
- bits[3] <<= (entry.nbBits);
- op[3] += (entry.length);
- }
- /* Do 4 table lookups from the final stream & reload bitstreams */
- for (stream = 0; stream < 4; ++stream) {
- /* Do a table lookup from the final stream.
- * This is interleaved with the reloading to reduce register
- * pressure. This shouldn't be necessary, but compilers can
- * struggle with codegen with high register pressure.
- */
- {
- int const index = (int)(bits[3] >> 53);
- HUF_DEltX2 const entry = dtable[index];
- MEM_write16(op[3], entry.sequence);
- bits[3] <<= (entry.nbBits);
- op[3] += (entry.length);
- }
- /* Reload the bistreams. The final bitstream must be reloaded
- * after the 5th symbol was decoded.
- */
- {
- int const ctz = ZSTD_countTrailingZeros64(bits[stream]);
- int const nbBits = ctz & 7;
- int const nbBytes = ctz >> 3;
- ip[stream] -= nbBytes;
- bits[stream] = MEM_read64(ip[stream]) | 1;
- bits[stream] <<= nbBits;
- }
- }
+ /* Decode 5 symbols from each of the first 3 streams.
+ * The final stream will be decoded during the reload phase
+ * to reduce register pressure.
+ */
+ HUF_4X_FOR_EACH_STREAM_WITH_VAR(HUF_4X2_DECODE_SYMBOL, 0)
+ HUF_4X_FOR_EACH_STREAM_WITH_VAR(HUF_4X2_DECODE_SYMBOL, 0)
+ HUF_4X_FOR_EACH_STREAM_WITH_VAR(HUF_4X2_DECODE_SYMBOL, 0)
+ HUF_4X_FOR_EACH_STREAM_WITH_VAR(HUF_4X2_DECODE_SYMBOL, 0)
+ HUF_4X_FOR_EACH_STREAM_WITH_VAR(HUF_4X2_DECODE_SYMBOL, 0)
+
+ /* Decode one symbol from the final stream */
+ HUF_4X2_DECODE_SYMBOL(3, 1)
+
+ /* Decode 4 symbols from the final stream & reload bitstreams.
+ * The final stream is reloaded last, meaning that all 5 symbols
+ * are decoded from the final stream before it is reloaded.
+ */
+ HUF_4X_FOR_EACH_STREAM(HUF_4X2_RELOAD_STREAM)
} while (op[3] < olimit);
}
+#undef HUF_4X2_DECODE_SYMBOL
+#undef HUF_4X2_RELOAD_STREAM
+
_out:
/* Save the final values of each of the state variables back to args. */
@@ -1676,7 +1715,7 @@ static size_t HUF_decompress4X2_usingDTable_internal(void* dst, size_t dstSize,
}
#endif
- if (!(flags & HUF_flags_disableFast)) {
+ if (HUF_ENABLE_FAST_DECODE && !(flags & HUF_flags_disableFast)) {
size_t const ret = HUF_decompress4X2_usingDTable_internal_fast(dst, dstSize, cSrc, cSrcSize, DTable, loopFn);
if (ret != 0)
return ret;
--
2.42.1
On Mon, 20 Nov 2023 at 16:52, Nick Terrell <nickrterrell@gmail.com> wrote:
>
> +/* Calls X(N) for each stream 0, 1, 2, 3. */
> +#define HUF_4X_FOR_EACH_STREAM(X) \
> + { \
> + X(0) \
> + X(1) \
> + X(2) \
> + X(3) \
> + }
> +
> +/* Calls X(N, var) for each stream 0, 1, 2, 3. */
> +#define HUF_4X_FOR_EACH_STREAM_WITH_VAR(X, var) \
> + { \
> + X(0, (var)) \
> + X(1, (var)) \
> + X(2, (var)) \
> + X(3, (var)) \
> + }
> +
What shitty compilers do you need to be compatible with?
Because at least for Linux, the above is one single #define:
#define FOUR(X,y...) do { \
X(0,##y); X(1,##y); X(2,##y); X(3,##y); \
} while (0)
and it does the right thing for any number of arguments, ie
FOUR(fn)
FOUR(fn1,a)
FOUR(fn2,a,b)
expands to
do { fn(0); fn(1); fn(2); fn(3); } while (0)
do { fn1(0,a); fn1(1,a); fn1(2,a); fn1(3,a); } while (0)
do { fn2(0,a,b); fn2(1,a,b); fn2(2,a,b); fn2(3,a,b); } while (0)
so unless you need to support some completely garbage compiler
upstream, please just do the single #define.
Linus
> On Nov 21, 2023, at 12:23 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> On Mon, 20 Nov 2023 at 16:52, Nick Terrell <nickrterrell@gmail.com> wrote:
>>
>> [ macro definitions snipped ]
>
> What shitty compilers do you need to be compatible with?
>
> Because at least for Linux, the above is one single #define:
>
> #define FOUR(X,y...) do { \
> X(0,##y); X(1,##y); X(2,##y); X(3,##y); \
> } while (0)
>
> and it does the right thing for any number of arguments, ie
>
> FOUR(fn)
> FOUR(fn1,a)
> FOUR(fn2,a,b)
>
> expands to
>
> do { fn(0); fn(1); fn(2); fn(3); } while (0)
> do { fn1(0,a); fn1(1,a); fn1(2,a); fn1(3,a); } while (0)
> do { fn2(0,a,b); fn2(1,a,b); fn2(2,a,b); fn2(3,a,b); } while (0)
>
> so unless you need to support some completely garbage compiler
> upstream, please just do the single #define.
Upstream zstd maintains strict C89 compatibility. We do use compiler
extensions when available, but only when we can detect the feature and
provide a correct fallback when it isn’t available, e.g. __builtin_ctz().
Empty variadic macros fail our C89 compatibility CI, as well as our
Visual Studio CI [0].
We’ve discussed dropping C89 support. The consensus was that
if we were starting the project today we’d stick to C11, but that it is
worth keeping C89 support for Zstd. We’ve seen people compile
zstd in some wacky environments.
W.r.t. do { } while (0), our older Visual Studio CI jobs failed on the
do { } while (0) macros, because the compiler complained about constant
false branches. However, it looks like our current Visual Studio CI
jobs accept do { } while (0) [1]. So I’ve opened an upstream issue to
modernize all of our macros [2].
Best,
Nick Terrell
[0] https://github.com/terrelln/zstd/pull/3
[1] https://github.com/terrelln/zstd/pull/4
[2] https://github.com/facebook/zstd/issues/3830
> Linus
On Tue, 21 Nov 2023 at 11:59, Nick Terrell <terrelln@meta.com> wrote:
>
> W.r.t. do { } while (0), our older Visual Studios CI jobs failed on the
> do { } while (0) macros, because it complained about constant false
> branches.
Wow. That is some truly incompetent compiler people.
I mean, really. As in "Why would you ever use that kind of garbage"
incompetence.
Honestly, any coding rule that includes "don't use the do-while-zero
construct" is actively broken shit.
Please just fix your upstream rules. Because they are incredible garbage.
Linus
> On Nov 21, 2023, at 3:12 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> On Tue, 21 Nov 2023 at 11:59, Nick Terrell <terrelln@meta.com> wrote:
>>
>> W.r.t. do { } while (0), our older Visual Studios CI jobs failed on the
>> do { } while (0) macros, because it complained about constant false
>> branches.
>
> Wow. That is some truly incompetent compiler people.
>
> I mean, really. As in "Why would you ever use that kind of garbage"
> incompetence.
>
> Honestly, any coding rule that includes "don't use the do-while-zero
> construct" is actively broken shit.
>
> Please just fix your upstream rules. Because they are incredible garbage.
Yeah, that’s the plan. Visual Studio fixed that compiler bug in VS2015 [0],
so we should be safe to migrate to safer macros. I’m going through and
doing that now, and will backport the change to the kernel on top of this series.
[0] https://stackoverflow.com/a/36658783
> Linus
On Tue, 21 Nov 2023 at 12:35, Nick Terrell <terrelln@meta.com> wrote:
> >
> > Honestly, any coding rule that includes "don't use the do-while-zero
> > construct" is actively broken shit.
> >
> > Please just fix your upstream rules. Because they are incredible garbage.
>
> Yeah, that’s the plan. Visual Studios fixed that compiler bug in VS2015 [0],
> so we should be safe to migrate to safer macros.
I don't even use MSVS, but a minute of googling shows that you should
never have done that silly "avoid sane C", and you should always just
have done
#pragma warning (disable: 4127)
for MSVC.
Honestly, the fact that the result was instead to disable that
standard - and required - construct in the project makes me worry
about the whole zstd thing. WTF?
The do-while-zero construct is _so_ important that there are (sane)
projects that literally *require* the use of it. See for example MISRA
code safety rules.
The kernel rules aren't quite that strict, but yes, do-while-zero is
very much "you should *absolutely* do this" along with all the usual
"make sure you have parentheses around macro arguments" rules.
We had some RFC patches for this area:
https://lore.kernel.org/all/20230511152951.1970870-1-mathieu.desnoyers@efficios.com/
And on that note, when I googled for the solution to the MSVC brain
damage, I was distressed by how many hits I saw where people thought
the do-while-zero pattern was some "legacy pattern".
That just shows that there are lots of incompetent people who simply do
not understand why it's actually *required* for reliable parsing of
macros. This is not some "historical stylistic" issue, it's literally
a correctness issue for generic macro usage.
Linus
> On Nov 21, 2023, at 3:54 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Tue, 21 Nov 2023 at 12:35, Nick Terrell <terrelln@meta.com> wrote:
>>>
>>> Honestly, any coding rule that includes "don't use the do-while-zero
>>> construct" is actively broken shit.
>>>
>>> Please just fix your upstream rules. Because they are incredible garbage.
>>
>> Yeah, that’s the plan. Visual Studios fixed that compiler bug in VS2015 [0],
>> so we should be safe to migrate to safer macros.
>
> I don't even use MSVS, but a minute of googling shows that you should
> never have done that silly "avoid sane C", and you should always just
> have done
>
> #pragma warning (disable: 4127)
>
> for MSVC.
>
> Honestly, the fact that the result was instead to disable that
> standard - and required - construct in the project makes me worry
> about the whole zstd thing. WTF?
Admittedly, our coding guidelines are overly conservative. And here
we are updating our macros to use the do { } while (0) construct
in this PR [0].
However, we are also very conservative in our testing. We have
extensive coverage-guided fuzz testing running continuously, covering
the safety of (de)compressing untrusted data, round-trip correctness,
and more.
We take security & correctness very seriously. If you have any
questions I’d be happy to answer them, and I should document our
testing process publicly in one place so we can reference it.
If you have any further suggestions I’m very open to them, and
I am grateful for the time you’re taking to improve zstd.
[0] https://github.com/facebook/zstd/pull/3831
> The do-while-zero construct is _so_ important that there are (sane)
> projects that literally *require* the use of it. See for example MISRA
> code safety rules.
>
> The kernel rules aren't quite that strict, but yes, do-while-zero is
> very much "you should *absolutely* do this" along with all the usual
> "make sure you have parentheses around macro arguments" rules.
>
> We had some RFC patches for this area:
>
> https://lore.kernel.org/all/20230511152951.1970870-1-mathieu.desnoyers@efficios.com/
Agreed.
> And on that note, when I googled for the solution to the MSVC brain
> damage, I was distressed by how many hits I saw where people thought
> the do-while-zero pattern was some "legacy pattern".
>
> That just shows that there are lots of incompetent people who simply
> do not understand why it's actually *required* for reliable parsing of
> macros. This is not some "historical stylistic" issue, it's literally
> a correctness issue for generic macro usage.
>
> Linus