net: add skb_crc32c()

Add skb_crc32c(), which calculates the CRC32C of a sk_buff. It will
replace __skb_checksum(), which unnecessarily supports arbitrary
checksums. Compared to __skb_checksum(), skb_crc32c():

- Uses the correct type for CRC32C values (u32, not __wsum).

- Does not require the caller to provide a skb_checksum_ops struct.

- Is faster because it does not use indirect calls and does not use
  the very slow crc32c_combine().

According to commit 2817a336d4d5 ("net: skb_checksum: allow custom
update/combine for walking skb") which added __skb_checksum(), the
original motivation for the abstraction layer was to avoid code
duplication for CRC32C and other checksums in the future. However:

- No additional checksums showed up after CRC32C. __skb_checksum()
  is only used with the "regular" net checksum and CRC32C.

- Indirect calls are expensive. Commit 2544af0344ba ("net: avoid
  indirect calls in L4 checksum calculation") worked around this
  using the INDIRECT_CALL_1 macro. But that only avoided the indirect
  call for the net checksum, and at the cost of an extra branch.

- The checksums use different types (__wsum and u32), causing casts
  to be needed.

- It made the checksums of fragments be combined (rather than
  chained) for both checksums, despite this being highly
  counterproductive for CRC32C due to how slow crc32c_combine() is.
  This can clearly be seen in commit 4c2f24549644 ("sctp: linearize
  early if it's not GSO") which tried to work around this performance
  bug. With a dedicated function for each checksum, we can instead
  just use the proper strategy for each checksum.
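
The difference between chaining and combining can be sketched in userspace C. This is a minimal illustration, not kernel code: crc32c_update() is a hypothetical bitwise stand-in for the kernel's crc32c(), and the message and split point are arbitrary. Chaining feeds the running CRC of one piece into the next, so the result matches a single pass over the contiguous data and no combine step is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78).
 * This sketch leaves the seed and any final inversion to the caller.
 */
static uint32_t crc32c_update(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Chained: the running CRC of the head seeds the CRC of the fragment. */
static uint32_t crc32c_chained(const uint8_t *head, size_t head_len,
			       const uint8_t *frag, size_t frag_len)
{
	uint32_t crc = crc32c_update(~0u, head, head_len);

	return ~crc32c_update(crc, frag, frag_len);
}

/* One pass over the same bytes laid out contiguously, for comparison. */
static uint32_t crc32c_one_pass(const uint8_t *data, size_t len)
{
	return ~crc32c_update(~0u, data, len);
}

/* Returns 1 if chaining over a split buffer matches the one-pass result. */
static int chained_matches_one_pass(void)
{
	static const uint8_t msg[] = "123456789";
	size_t len = sizeof(msg) - 1, split = 4;

	assert(split <= len);
	return crc32c_chained(msg, split, msg + split, len - split) ==
	       crc32c_one_pass(msg, len);
}
```

Calling chained_matches_one_pass() from a test program should return 1. A combine-based approach would instead compute each piece's CRC independently and merge the two values afterwards, which is the step described above as slow for CRC32C.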

As shown by the following tables, the new function skb_crc32c() is
faster than __skb_checksum(), with the improvement varying greatly from
5% to 2500% depending on the case. The largest improvements come from
fragmented packets, mainly due to eliminating the inefficient
crc32c_combine(). But linear packets are improved too, especially
shorter ones, mainly due to eliminating indirect calls. These
benchmarks were done on AMD Zen 5. On that CPU, Linux uses IBRS instead
of retpoline; an even greater improvement might be seen with retpoline:

Linear sk_buffs

    Length in bytes    __skb_checksum cycles    skb_crc32c cycles
    ===============    =====================    =================
                 64                       43                   18
                256                       94                   77
               1420                      204                  161
              16384                     1735                 1642

Nonlinear sk_buffs (even split between head and one fragment)

    Length in bytes    __skb_checksum cycles    skb_crc32c cycles
    ===============    =====================    =================
                 64                      579                   22
                256                      829                   77
               1420                     1506                  194
              16384                     4365                 1682
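
The 5% to 2500% range can be reproduced from the cycle counts in the tables above. The sketch below assumes that "improvement" means (old cycles / new cycles - 1) * 100, which is a definition the commit message does not spell out; improvement_percent() and print_extremes() are hypothetical helper names.

```c
#include <assert.h>
#include <stdio.h>

/* Percentage improvement, assuming the definition (old / new - 1) * 100. */
static double improvement_percent(double old_cycles, double new_cycles)
{
	assert(new_cycles > 0.0);
	return (old_cycles / new_cycles - 1.0) * 100.0;
}

/* Print the smallest and largest improvement found in the tables. */
static void print_extremes(void)
{
	/* Linear, 16384 bytes: 1735 vs 1642 cycles. */
	printf("linear 16384 bytes: %.1f%%\n", improvement_percent(1735, 1642));
	/* Nonlinear, 64 bytes: 579 vs 22 cycles. */
	printf("nonlinear 64 bytes: %.1f%%\n", improvement_percent(579, 22));
}
```

Under that assumption, the long linear case comes out between 5% and 6%, and the short nonlinear case slightly above 2500%.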

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-3-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Authored by Eric Biggers and committed by Jakub Kicinski 1 year ago (a5bd029c 55d22ee0)

2 changed files, +74


include/linux/skbuff.h

···
 			  __wsum csum, const struct skb_checksum_ops *ops);
 __wsum skb_checksum(const struct sk_buff *skb, int offset, int len,
 		    __wsum csum);
+u32 skb_crc32c(const struct sk_buff *skb, int offset, int len, u32 crc);

 static inline void * __must_check
 __skb_header_pointer(const struct sk_buff *skb, int offset, int len,

+73

net/core/skbuff.c

···
 #include <linux/mpls.h>
 #include <linux/kcov.h>
 #include <linux/iov_iter.h>
+#include <linux/crc32.h>

 #include <net/protocol.h>
 #include <net/dst.h>
···
 	return csum;
 }
 EXPORT_SYMBOL(skb_copy_and_csum_bits);
+
+#ifdef CONFIG_NET_CRC32C
+u32 skb_crc32c(const struct sk_buff *skb, int offset, int len, u32 crc)
+{
+	int start = skb_headlen(skb);
+	int i, copy = start - offset;
+	struct sk_buff *frag_iter;
+
+	if (copy > 0) {
+		copy = min(copy, len);
+		crc = crc32c(crc, skb->data + offset, copy);
+		len -= copy;
+		if (len == 0)
+			return crc;
+		offset += copy;
+	}
+
+	if (WARN_ON_ONCE(!skb_frags_readable(skb)))
+		return 0;
+
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+		int end;
+		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+
+		WARN_ON(start > offset + len);
+
+		end = start + skb_frag_size(frag);
+		copy = end - offset;
+		if (copy > 0) {
+			u32 p_off, p_len, copied;
+			struct page *p;
+			u8 *vaddr;
+
+			copy = min(copy, len);
+			skb_frag_foreach_page(frag,
+					      skb_frag_off(frag) + offset - start,
+					      copy, p, p_off, p_len, copied) {
+				vaddr = kmap_atomic(p);
+				crc = crc32c(crc, vaddr + p_off, p_len);
+				kunmap_atomic(vaddr);
+			}
+			len -= copy;
+			if (len == 0)
+				return crc;
+			offset += copy;
+		}
+		start = end;
+	}
+
+	skb_walk_frags(skb, frag_iter) {
+		int end;
+
+		WARN_ON(start > offset + len);
+
+		end = start + frag_iter->len;
+		copy = end - offset;
+		if (copy > 0) {
+			copy = min(copy, len);
+			crc = skb_crc32c(frag_iter, offset - start, copy, crc);
+			len -= copy;
+			if (len == 0)
+				return crc;
+			offset += copy;
+		}
+		start = end;
+	}
+	BUG_ON(len);
+
+	return crc;
+}
+EXPORT_SYMBOL(skb_crc32c);
+#endif /* CONFIG_NET_CRC32C */

 __sum16 __skb_checksum_complete_head(struct sk_buff *skb, int len)
 {
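
The start/end/copy bookkeeping in skb_crc32c() can be exercised outside the kernel with a simplified model. The sketch below is an assumption-laden illustration rather than the kernel implementation: struct model_buf, struct model_frag, model_crc32c() and crc32c_update() are hypothetical names, fragments are plain memory instead of pages, and the frag_list recursion and the WARN/BUG checks are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bitwise CRC32C; seed and final inversion are up to the caller. */
static uint32_t crc32c_update(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Hypothetical stand-in for an sk_buff: a linear head plus flat fragments. */
struct model_frag {
	const uint8_t *data;
	int size;
};

struct model_buf {
	const uint8_t *head;
	int headlen;
	const struct model_frag *frags;
	int nr_frags;
};

/* CRC of logical bytes [offset, offset + len), chained across all pieces. */
static uint32_t model_crc32c(const struct model_buf *b, int offset, int len,
			     uint32_t crc)
{
	int start = b->headlen;
	int copy = start - offset;

	assert(offset >= 0 && len >= 0);

	/* Part of the range that lies in the linear head. */
	if (copy > 0) {
		if (copy > len)
			copy = len;
		crc = crc32c_update(crc, b->head + offset, (size_t)copy);
		len -= copy;
		if (len == 0)
			return crc;
		offset += copy;
	}

	/* Fragment i covers logical bytes [start, end). */
	for (int i = 0; i < b->nr_frags; i++) {
		int end = start + b->frags[i].size;

		copy = end - offset;
		if (copy > 0) {
			if (copy > len)
				copy = len;
			crc = crc32c_update(crc,
					    b->frags[i].data + (offset - start),
					    (size_t)copy);
			len -= copy;
			if (len == 0)
				return crc;
			offset += copy;
		}
		start = end;
	}
	return crc; /* len > 0 here means the range ran past the buffer */
}

/* Returns 1 if the walk over head + 2 fragments matches a flat buffer. */
static int model_walk_matches_flat(void)
{
	static const uint8_t flat[] = "0123456789abcdefghij";
	static const uint8_t head[] = "012345";
	static const uint8_t f0[] = "6789abc";
	static const uint8_t f1[] = "defghij";
	const struct model_frag frags[] = { { f0, 7 }, { f1, 7 } };
	const struct model_buf b = { head, 6, frags, 2 };
	int offset = 3, len = 14; /* spans the head and both fragments */

	return model_crc32c(&b, offset, len, ~0u) ==
	       crc32c_update(~0u, flat + offset, (size_t)len);
}
```

Because the running CRC is passed from piece to piece, the model needs no combine step. The same property lets the real function recurse into frag_list entries with the current crc value.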
