Update networking code that computes the CRC32C of packets to just call
crc32c() without unnecessary abstraction layers. The result is faster
and simpler code.

Patches 1-7 add skb_crc32c() and remove the overly-abstracted and
inefficient __skb_checksum(). Patches 8-10 replace
skb_copy_and_hash_datagram_iter() with
skb_copy_and_crc32c_datagram_iter(), eliminating the unnecessary use of
the crypto layer. This unblocks the conversion of nvme-tcp to call
crc32c() directly instead of using the crypto layer, which patch 9 does.

I'm proposing that this series be taken through net-next for 6.16, but
patch 9 could use an ack from the NVME maintainers.

Eric Biggers (10):
  net: introduce CONFIG_NET_CRC32C
  net: add skb_crc32c()
  net: use skb_crc32c() in skb_crc32c_csum_help()
  RDMA/siw: use skb_crc32c() instead of __skb_checksum()
  sctp: use skb_crc32c() instead of __skb_checksum()
  net: fold __skb_checksum() into skb_checksum()
  lib/crc32: remove unused support for CRC32C combination
  net: add skb_copy_and_crc32c_datagram_iter()
  nvme-tcp: use crc32c() and skb_copy_and_crc32c_datagram_iter()
  net: remove skb_copy_and_hash_datagram_iter()

 drivers/infiniband/sw/siw/Kconfig |   1 +
 drivers/infiniband/sw/siw/siw.h   |  22 +----
 drivers/nvme/host/Kconfig         |   4 +-
 drivers/nvme/host/tcp.c           | 118 +++++++++-----------------
 include/linux/crc32.h             |  23 ------
 include/linux/skbuff.h            |  16 +---
 include/net/checksum.h            |  12 ---
 include/net/sctp/checksum.h       |  29 +------
 lib/crc32.c                       |   6 --
 lib/tests/crc_kunit.c             |   6 --
 net/Kconfig                       |   4 +
 net/core/datagram.c               |  34 ++++----
 net/core/dev.c                    |  10 +--
 net/core/skbuff.c                 | 132 ++++++++++++++++++------------
 net/netfilter/Kconfig             |   4 +-
 net/netfilter/ipvs/Kconfig        |   2 +-
 net/openvswitch/Kconfig           |   2 +-
 net/sched/Kconfig                 |   2 +-
 net/sctp/Kconfig                  |   2 +-
 net/sctp/offload.c                |   1 -
 20 files changed, 156 insertions(+), 274 deletions(-)

base-commit: 0b28182c73a3d013bcabbb890dc1070a8388f55a
--
2.49.0
On Sat, 10 May 2025 17:41:00 -0700 Eric Biggers wrote:
> I'm proposing that this series be taken through net-next for 6.16, but
> patch 9 could use an ack from the NVME maintainers.

And patch 4 from RDMA. Please repost once those have been collected, and
with the benchmarks you shared on Andrew's request. From networking PoV
this looks good.
--
pw-bot: defer
On Tue, May 13, 2025 at 02:40:18PM -0700, Jakub Kicinski wrote:
> On Sat, 10 May 2025 17:41:00 -0700 Eric Biggers wrote:
> > I'm proposing that this series be taken through net-next for 6.16, but
> > patch 9 could use an ack from the NVME maintainers.
>
> And patch 4 from RDMA. Please repost once those were collected and with
> the benchmarks you shared on Andrew's request. From networking PoV this
> looks good.
> --
> pw-bot: defer

Thanks. linux-nvme and linux-rdma are both Cc'ed on this patchset.

- Eric
On Sat, May 10, 2025 at 05:41:00PM -0700, Eric Biggers wrote:
> Update networking code that computes the CRC32C of packets to just call
> crc32c() without unnecessary abstraction layers. The result is faster
> and simpler code.

Hi Eric

Do you have some benchmarks for these changes?

	Andrew
On Sun, May 11, 2025 at 06:30:25PM +0200, Andrew Lunn wrote:
> On Sat, May 10, 2025 at 05:41:00PM -0700, Eric Biggers wrote:
> > Update networking code that computes the CRC32C of packets to just call
> > crc32c() without unnecessary abstraction layers. The result is faster
> > and simpler code.
>
> Hi Eric
>
> Do you have some benchmarks for these changes?
>
> Andrew
Do you want benchmarks that show that removing the indirect calls makes things
faster? I think that should be fairly self-evident by now after dealing with
retpoline for years, but I can provide more details if you need them.
Removing the inefficient use of crc32c_combine() makes a massive difference on
fragmented sk_buffs, since crc32c_combine() is so slow (much slower than the CRC
calculation itself). However, reverting the workaround commit 4c2f24549644
("sctp: linearize early if it's not GSO") is beyond the scope of this patchset,
so for now the sctp stack doesn't actually call skb_crc32c() on fragmented
sk_buffs. I can provide microbenchmarks of skb_crc32c() on a fragmented sk_buff
directly though, if you don't think it's clear already.
Of course, please also keep in mind the -118 line diffstat. Even if it wasn't
faster we should just do it this way anyway.
- Eric
On Sun, May 11, 2025 at 10:29:29AM -0700, Eric Biggers wrote:
> On Sun, May 11, 2025 at 06:30:25PM +0200, Andrew Lunn wrote:
> > On Sat, May 10, 2025 at 05:41:00PM -0700, Eric Biggers wrote:
> > > Update networking code that computes the CRC32C of packets to just call
> > > crc32c() without unnecessary abstraction layers. The result is faster
> > > and simpler code.
> >
> > Hi Eric
> >
> > Do you have some benchmarks for these changes?
> >
> > Andrew
>
> Do you want benchmarks that show that removing the indirect calls makes things
> faster? I think that should be fairly self-evident by now after dealing with
> retpoline for years, but I can provide more details if you need them.

I was thinking more like iperf before/after? Show the CPU load has gone
down without the bandwidth also going down.

Eric Dumazet has a T-Shirt with a commit message on the back which
increased network performance by X%. At the moment, there is nothing
T-Shirt quotable here.

	Andrew
On Sun, 11 May 2025 at 23:22, Andrew Lunn <andrew@lunn.ch> wrote:
>
> On Sun, May 11, 2025 at 10:29:29AM -0700, Eric Biggers wrote:
> > On Sun, May 11, 2025 at 06:30:25PM +0200, Andrew Lunn wrote:
> > > On Sat, May 10, 2025 at 05:41:00PM -0700, Eric Biggers wrote:
> > > > Update networking code that computes the CRC32C of packets to just call
> > > > crc32c() without unnecessary abstraction layers. The result is faster
> > > > and simpler code.
> > >
> > > Hi Eric
> > >
> > > Do you have some benchmarks for these changes?
> > >
> > > Andrew
> >
> > Do you want benchmarks that show that removing the indirect calls makes things
> > faster? I think that should be fairly self-evident by now after dealing with
> > retpoline for years, but I can provide more details if you need them.
>
> I was think more like iperf before/after? Show the CPU load has gone
> down without the bandwidth also going down.
>
> Eric Dumazet has a T-Shirt with a commit message on the back which
> increased network performance by X%. At the moment, there is nothing
> T-Shirt quotable here.
>

I think that removing layers of redundant code to ultimately call the
same core CRC-32 implementation is a rather obvious win, especially
when indirect calls are involved. The diffstat speaks for itself, so
maybe you can print that on a T-shirt.
On Sun, May 11, 2025 at 11:45:14PM +0200, Ard Biesheuvel wrote:
> On Sun, 11 May 2025 at 23:22, Andrew Lunn <andrew@lunn.ch> wrote:
> >
> > On Sun, May 11, 2025 at 10:29:29AM -0700, Eric Biggers wrote:
> > > On Sun, May 11, 2025 at 06:30:25PM +0200, Andrew Lunn wrote:
> > > > On Sat, May 10, 2025 at 05:41:00PM -0700, Eric Biggers wrote:
> > > > > Update networking code that computes the CRC32C of packets to just call
> > > > > crc32c() without unnecessary abstraction layers. The result is faster
> > > > > and simpler code.
> > > >
> > > > Hi Eric
> > > >
> > > > Do you have some benchmarks for these changes?
> > > >
> > > > Andrew
> > >
> > > Do you want benchmarks that show that removing the indirect calls makes things
> > > faster? I think that should be fairly self-evident by now after dealing with
> > > retpoline for years, but I can provide more details if you need them.
> >
> > I was think more like iperf before/after? Show the CPU load has gone
> > down without the bandwidth also going down.
> >
> > Eric Dumazet has a T-Shirt with a commit message on the back which
> > increased network performance by X%. At the moment, there is nothing
> > T-Shirt quotable here.
> >
>
> I think that removing layers of redundant code to ultimately call the
> same core CRC-32 implementation is a rather obvious win, especially
> when indirect calls are involved. The diffstat speaks for itself, so
> maybe you can print that on a T-shirt.
Agreed with Ard. I did try doing some SCTP benchmarks with iperf3 earlier, but
they were very noisy and the CRC32C checksumming seemed to be lost in the noise.
There probably are some tricks to running reliable networking benchmarks; I'm
not a networking developer. Regardless, this series is a clear win for the
CRC32C code, both from a simplicity and performance perspective. It also fixes
the kconfig dependency issues. That should be good enough, IMO.
In case it's helpful, here are some microbenchmarks of __skb_checksum (old) vs
skb_crc32c (new):
Linear sk_buffs

Length in bytes    __skb_checksum cycles    skb_crc32c cycles
===============    =====================    =================
             64                       43                   18
           1420                      204                  161
          16384                     1735                 1642

Nonlinear sk_buffs (even split between head and one fragment)

Length in bytes    __skb_checksum cycles    skb_crc32c cycles
===============    =====================    =================
             64                      579                   22
           1420                     1506                  194
          16384                     4365                 1682
So 1420-byte linear buffers (roughly the most common case) are 21% faster, but
other cases range from 5% to 2500% faster. This was on an AMD Zen 5 processor,
where the kernel defaults to using IBRS instead of retpoline; I understand that
an even larger improvement may be seen when retpoline is enabled.
But again this is just the CRC32C checksumming performance. I'm not claiming
measurable improvements to overall SCTP (or NVME-TLS) latency or throughput,
though it's possible that there are.
- Eric
On Sun, 11 May 2025 16:07:50 -0700
Eric Biggers <ebiggers@kernel.org> wrote:

> On Sun, May 11, 2025 at 11:45:14PM +0200, Ard Biesheuvel wrote:
> > On Sun, 11 May 2025 at 23:22, Andrew Lunn <andrew@lunn.ch> wrote:
> > >
> > > On Sun, May 11, 2025 at 10:29:29AM -0700, Eric Biggers wrote:
> > > > On Sun, May 11, 2025 at 06:30:25PM +0200, Andrew Lunn wrote:
> > > > > On Sat, May 10, 2025 at 05:41:00PM -0700, Eric Biggers wrote:
> > > > > > Update networking code that computes the CRC32C of packets to just call
> > > > > > crc32c() without unnecessary abstraction layers. The result is faster
> > > > > > and simpler code.
> > > > >
> > > > > Hi Eric
> > > > >
> > > > > Do you have some benchmarks for these changes?
> > > > >
> > > > > Andrew
> > > >
> > > > Do you want benchmarks that show that removing the indirect calls makes things
> > > > faster? I think that should be fairly self-evident by now after dealing with
> > > > retpoline for years, but I can provide more details if you need them.
> > >
> > > I was think more like iperf before/after? Show the CPU load has gone
> > > down without the bandwidth also going down.
> > >
> > > Eric Dumazet has a T-Shirt with a commit message on the back which
> > > increased network performance by X%. At the moment, there is nothing
> > > T-Shirt quotable here.
> > >
> >
> > I think that removing layers of redundant code to ultimately call the
> > same core CRC-32 implementation is a rather obvious win, especially
> > when indirect calls are involved. The diffstat speaks for itself, so
> > maybe you can print that on a T-shirt.
>
> Agreed with Ard. I did try doing some SCTP benchmarks with iperf3 earlier, but
> they were very noisy and the CRC32C checksumming seemed to be lost in the noise.
> There probably are some tricks to running reliable networking benchmarks; I'm
> not a networking developer. Regardless, this series is a clear win for the
> CRC32C code, both from a simplicity and performance perspective. It also fixes
> the kconfig dependency issues. That should be good enough, IMO.
>
> In case it's helpful, here are some microbenchmarks of __skb_checksum (old) vs
> skb_crc32c (new):
>
> Linear sk_buffs
>
> Length in bytes    __skb_checksum cycles    skb_crc32c cycles
> ===============    =====================    =================
>              64                       43                   18
>            1420                      204                  161
>           16384                     1735                 1642
>
> Nonlinear sk_buffs (even split between head and one fragment)
>
> Length in bytes    __skb_checksum cycles    skb_crc32c cycles
> ===============    =====================    =================
>              64                      579                   22
>            1420                     1506                  194
>           16384                     4365                 1682
>
> So 1420-byte linear buffers (roughly the most common case) is 21% faster,

1420 bytes is unlikely to be the most common case - at least for some users.
SCTP is message oriented so the checksum is over a 'user message'.
A not-uncommon use is carrying mobile network messages (eg SMS) over the IP
network (instead of TDM links).
In that case the maximum data chunk size (what is being checksummed) is limited
to not much over 256 bytes - and a lot of data chunks will be smaller.
The actual difficulty is getting multiple data chunks into a single ethernet
packet without adding significant delays.

But the changes definitely improve things.

	David

> but other cases range from 5% to 2500% faster. This was on an AMD Zen 5 processor,
> where the kernel defaults to using IBRS instead of retpoline; I understand that
> an even larger improvement may be seen when retpoline is enabled.
>
> But again this is just the CRC32C checksumming performance. I'm not claiming
> measurable improvements to overall SCTP (or NVME-TLS) latency or throughput,
> though it's possible that there are.
>
> - Eric
On Thu, May 15, 2025 at 08:21:36PM +0100, David Laight wrote:
> On Sun, 11 May 2025 16:07:50 -0700
> Eric Biggers <ebiggers@kernel.org> wrote:
>
> > On Sun, May 11, 2025 at 11:45:14PM +0200, Ard Biesheuvel wrote:
> > > On Sun, 11 May 2025 at 23:22, Andrew Lunn <andrew@lunn.ch> wrote:
> > > >
> > > > On Sun, May 11, 2025 at 10:29:29AM -0700, Eric Biggers wrote:
> > > > > On Sun, May 11, 2025 at 06:30:25PM +0200, Andrew Lunn wrote:
> > > > > > On Sat, May 10, 2025 at 05:41:00PM -0700, Eric Biggers wrote:
> > > > > > > Update networking code that computes the CRC32C of packets to just call
> > > > > > > crc32c() without unnecessary abstraction layers. The result is faster
> > > > > > > and simpler code.
> > > > > >
> > > > > > Hi Eric
> > > > > >
> > > > > > Do you have some benchmarks for these changes?
> > > > > >
> > > > > > Andrew
> > > > >
> > > > > Do you want benchmarks that show that removing the indirect calls makes things
> > > > > faster? I think that should be fairly self-evident by now after dealing with
> > > > > retpoline for years, but I can provide more details if you need them.
> > > >
> > > > I was think more like iperf before/after? Show the CPU load has gone
> > > > down without the bandwidth also going down.
> > > >
> > > > Eric Dumazet has a T-Shirt with a commit message on the back which
> > > > increased network performance by X%. At the moment, there is nothing
> > > > T-Shirt quotable here.
> > > >
> > >
> > > I think that removing layers of redundant code to ultimately call the
> > > same core CRC-32 implementation is a rather obvious win, especially
> > > when indirect calls are involved. The diffstat speaks for itself, so
> > > maybe you can print that on a T-shirt.
> >
> > Agreed with Ard. I did try doing some SCTP benchmarks with iperf3 earlier, but
> > they were very noisy and the CRC32C checksumming seemed to be lost in the noise.
> > There probably are some tricks to running reliable networking benchmarks; I'm
> > not a networking developer. Regardless, this series is a clear win for the
> > CRC32C code, both from a simplicity and performance perspective. It also fixes
> > the kconfig dependency issues. That should be good enough, IMO.
> >
> > In case it's helpful, here are some microbenchmarks of __skb_checksum (old) vs
> > skb_crc32c (new):
> >
> > Linear sk_buffs
> >
> > Length in bytes    __skb_checksum cycles    skb_crc32c cycles
> > ===============    =====================    =================
> >              64                       43                   18
> >            1420                      204                  161
> >           16384                     1735                 1642
> >
> > Nonlinear sk_buffs (even split between head and one fragment)
> >
> > Length in bytes    __skb_checksum cycles    skb_crc32c cycles
> > ===============    =====================    =================
> >              64                      579                   22
> >            1420                     1506                  194
> >           16384                     4365                 1682
> >
> > So 1420-byte linear buffers (roughly the most common case) is 21% faster,
>
> 1420 bytes is unlikely to be the most common case - at least for some users.
> SCTP is message oriented so the checksum is over a 'user message'.
> A not-uncommon use is carrying mobile network messages (eg SMS) over the IP
> network (instead of TDM links).
> In that case the maximum data chunk size (what is being checksummed) is limited
> to not much over 256 bytes - and a lot of data chunks will be smaller.
> The actual difficulty is getting multiple data chunks into a single ethernet
> packet without adding significant delays.
>
> But the changes definitely improve things.

Interesting. Of course, the data I gave shows that the proportional
performance increase is even greater on short packets than long ones.
I'll include those tables when I resend the patchset and add a row for
256 bytes too.

- Eric