Address XDP frags having negative tailroom

[PATCH bpf 0/6] Address XDP frags having negative tailroom

Posted by Larysa Zaremba 4 days, 13 hours ago

Many ethernet drivers report xdp Rx queue frag size as being the same as
DMA write size. However, the only user of this field, namely
bpf_xdp_frags_increase_tail(), clearly expects a truesize.

Such difference leads to unspecific memory corruption issues under certain
circumstances, e.g. in ixgbevf maximum DMA write size is 3 KB, so when
running xskxceiver's XDP_ADJUST_TAIL_GROW_MULTI_BUFF, 6K packet fully uses
all DMA-writable space in 2 buffers. This would be fine, if only
rxq->frag_size was properly set to 4K, but value of 3K results in a
negative tailroom, because there is a non-zero page offset.

We could return -EINVAL and be done with it in such case, but due to
tailroom being stored as an unsigned int, it is reported to be somewhere
near UINT_MAX, resulting in a tail being grown, even if the requested
offset is too much (it is around 2K in the abovementioned test). This later
leads to all kinds of unspecific calltraces.

[ 7340.337579] xskxceiver[1440]: segfault at 1da718 ip 00007f4161aeac9d sp 00007f41615a6a00 error 6
[ 7340.338040] xskxceiver[1441]: segfault at 7f410000000b ip 00000000004042b5 sp 00007f415bffecf0 error 4
[ 7340.338179]  in libc.so.6[61c9d,7f4161aaf000+160000]
[ 7340.339230]  in xskxceiver[42b5,400000+69000]
[ 7340.340300]  likely on CPU 6 (core 0, socket 6)
[ 7340.340302] Code: ff ff 01 e9 f4 fe ff ff 0f 1f 44 00 00 4c 39 f0 74 73 31 c0 ba 01 00 00 00 f0 0f b1 17 0f 85 ba 00 00 00 49 8b 87 88 00 00 00 <4c> 89 70 08 eb cc 0f 1f 44 00 00 48 8d bd f0 fe ff ff 89 85 ec fe
[ 7340.340888]  likely on CPU 3 (core 0, socket 3)
[ 7340.345088] Code: 00 00 00 ba 00 00 00 00 be 00 00 00 00 89 c7 e8 31 ca ff ff 89 45 ec 8b 45 ec 85 c0 78 07 b8 00 00 00 00 eb 46 e8 0b c8 ff ff <8b> 00 83 f8 69 74 24 e8 ff c7 ff ff 8b 00 83 f8 0b 74 18 e8 f3 c7
[ 7340.404334] Oops: general protection fault, probably for non-canonical address 0x6d255010bdffc: 0000 [#1] SMP NOPTI
[ 7340.405972] CPU: 7 UID: 0 PID: 1439 Comm: xskxceiver Not tainted 6.19.0-rc1+ #21 PREEMPT(lazy)
[ 7340.408006] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-5.fc42 04/01/2014
[ 7340.409716] RIP: 0010:lookup_swap_cgroup_id+0x44/0x80
[ 7340.410455] Code: 83 f8 1c 73 39 48 ba ff ff ff ff ff ff ff 03 48 8b 04 c5 20 55 fa bd 48 21 d1 48 89 ca 83 e1 01 48 d1 ea c1 e1 04 48 8d 04 90 <8b> 00 48 83 c4 10 d3 e8 c3 cc cc cc cc 31 c0 e9 98 b7 dd 00 48 89
[ 7340.412787] RSP: 0018:ffffcc5c04f7f6d0 EFLAGS: 00010202
[ 7340.413494] RAX: 0006d255010bdffc RBX: ffff891f477895a8 RCX: 0000000000000010
[ 7340.414431] RDX: 0001c17e3fffffff RSI: 00fa070000000000 RDI: 000382fc7fffffff
[ 7340.415354] RBP: 00fa070000000000 R08: ffffcc5c04f7f8f8 R09: ffffcc5c04f7f7d0
[ 7340.416283] R10: ffff891f4c1a7000 R11: ffffcc5c04f7f9c8 R12: ffffcc5c04f7f7d0
[ 7340.417218] R13: 03ffffffffffffff R14: 00fa06fffffffe00 R15: ffff891f47789500
[ 7340.418229] FS:  0000000000000000(0000) GS:ffff891ffdfaa000(0000) knlGS:0000000000000000
[ 7340.419489] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7340.420286] CR2: 00007f415bfffd58 CR3: 0000000103f03002 CR4: 0000000000772ef0
[ 7340.421237] PKRU: 55555554
[ 7340.421623] Call Trace:
[ 7340.421987]  <TASK>
[ 7340.422309]  ? softleaf_from_pte+0x77/0xa0
[ 7340.422855]  swap_pte_batch+0xa7/0x290
[ 7340.423363]  zap_nonpresent_ptes.constprop.0.isra.0+0xd1/0x270
[ 7340.424102]  zap_pte_range+0x281/0x580
[ 7340.424607]  zap_pmd_range.isra.0+0xc9/0x240
[ 7340.425177]  unmap_page_range+0x24d/0x420
[ 7340.425714]  unmap_vmas+0xa1/0x180
[ 7340.426185]  exit_mmap+0xe1/0x3b0
[ 7340.426644]  __mmput+0x41/0x150
[ 7340.427098]  exit_mm+0xb1/0x110
[ 7340.427539]  do_exit+0x1b2/0x460
[ 7340.427992]  do_group_exit+0x2d/0xc0
[ 7340.428477]  get_signal+0x79d/0x7e0
[ 7340.428957]  arch_do_signal_or_restart+0x34/0x100
[ 7340.429571]  exit_to_user_mode_loop+0x8e/0x4c0
[ 7340.430159]  do_syscall_64+0x188/0x6b0
[ 7340.430672]  ? __do_sys_clone3+0xd9/0x120
[ 7340.431212]  ? switch_fpu_return+0x4e/0xd0
[ 7340.431761]  ? arch_exit_to_user_mode_prepare.isra.0+0xa1/0xc0
[ 7340.432498]  ? do_syscall_64+0xbb/0x6b0
[ 7340.433015]  ? __handle_mm_fault+0x445/0x690
[ 7340.433582]  ? count_memcg_events+0xd6/0x210
[ 7340.434151]  ? handle_mm_fault+0x212/0x340
[ 7340.434697]  ? do_user_addr_fault+0x2b4/0x7b0
[ 7340.435271]  ? clear_bhb_loop+0x30/0x80
[ 7340.435788]  ? clear_bhb_loop+0x30/0x80
[ 7340.436299]  ? clear_bhb_loop+0x30/0x80
[ 7340.436812]  ? clear_bhb_loop+0x30/0x80
[ 7340.437323]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 7340.437973] RIP: 0033:0x7f4161b14169
[ 7340.438468] Code: Unable to access opcode bytes at 0x7f4161b1413f.
[ 7340.439242] RSP: 002b:00007ffc6ebfa770 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[ 7340.440173] RAX: fffffffffffffe00 RBX: 00000000000005a1 RCX: 00007f4161b14169
[ 7340.441061] RDX: 00000000000005a1 RSI: 0000000000000109 RDI: 00007f415bfff990
[ 7340.441943] RBP: 00007ffc6ebfa7a0 R08: 0000000000000000 R09: 00000000ffffffff
[ 7340.442824] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 7340.443707] R13: 0000000000000000 R14: 00007f415bfff990 R15: 00007f415bfff6c0
[ 7340.444586]  </TASK>
[ 7340.444922] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit libnvdimm kvm_intel vfat fat kvm snd_pcm irqbypass rapl iTCO_wdt snd_timer intel_pmc_bxt iTCO_vendor_support snd ixgbevf virtio_net soundcore i2c_i801 pcspkr libeth_xdp net_failover i2c_smbus lpc_ich failover libeth virtio_balloon joydev 9p fuse loop zram lz4hc_compress lz4_compress 9pnet_virtio 9pnet netfs ghash_clmulni_intel serio_raw qemu_fw_cfg
[ 7340.449650] ---[ end trace 0000000000000000 ]---

The issue can be fixed in all in-tree drivers, but we cannot just trust OOT
drivers to not do this. Therefore, make tailroom a signed int and produce a
warning when it is negative to prevent such mistakes in the future.

The issue can also be easily reproduced with ice driver, by applying
the following diff to xskxceiver and enjoying a kernel panic in xdpdrv mode:

diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
index 5af28f359cfd..042d587fa7ef 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
@@ -2541,8 +2541,8 @@ int testapp_adjust_tail_grow_mb(struct test_spec *test)
 {
        test->mtu = MAX_ETH_JUMBO_SIZE;
        /* Grow by (frag_size - last_frag_Size) - 1 to stay inside the last fragment */
-       return testapp_adjust_tail(test, (XSK_UMEM__MAX_FRAME_SIZE / 2) - 1,
-                                  XSK_UMEM__LARGE_FRAME_SIZE * 2);
+       return testapp_adjust_tail(test, XSK_UMEM__MAX_FRAME_SIZE * 100,
+                                  6912);
 }

 int testapp_tx_queue_consumer(struct test_spec *test)

If we print out the values involved in the tailroom calculation:

tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);

4294967040 = 3456 - 3456 - 256

There may be also such issue on ZC path particulatly in i40e, but it would need
to be addressed separately, as within my understanding it seems like
neither frame_len nor chunk_size are appropriate values, instead
we should calculate frag_size the following way (like frame_sz in ZC buffers):

frag_size = xsk_pool_get_chunk_size(pool) - pool->headroom.

I personally reproduced and verified the issue in ice only,
aside from WiP ixgbevf implementation.

Larysa Zaremba (6):
  xdp: produce a warning when calculated tailroom is negative
  ice: fix rxq info registering in mbuf packets
  ice: change XDP RxQ frag_size from DMA write length to truesize
  i40e: use truesize as XDP RxQ info frag_size
  idpf: use truesize as XDP RxQ info frag_size
  net: enetc: use truesize as XDP RxQ info frag_size

 drivers/net/ethernet/freescale/enetc/enetc.c |  2 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c  |  5 ++--
 drivers/net/ethernet/intel/ice/ice_base.c    | 26 ++++++--------------
 drivers/net/ethernet/intel/ice/ice_xsk.c     |  3 +++
 drivers/net/ethernet/intel/idpf/xdp.c        |  2 +-
 net/core/filter.c                            |  3 ++-
 6 files changed, 17 insertions(+), 24 deletions(-)

-- 
2.52.0

Re: [PATCH bpf 0/6] Address XDP frags having negative tailroom

Posted by Jakub Kicinski 2 days, 22 hours ago

On Tue,  3 Feb 2026 11:53:28 +0100 Larysa Zaremba wrote:
> Many ethernet drivers report xdp Rx queue frag size as being the same as
> DMA write size. However, the only user of this field, namely
> bpf_xdp_frags_increase_tail(), clearly expects a truesize.

truesize may not be the most fortunate term.
truesize is how much memory we charge to the skb (for memory management)
which may relate to frag size in strange ways for strange HW :S
I don't have a great term, but I have a feeling truesize already
confuses driver developers.

Note two our xdp.py tests covers tail adjust, I wonder if it would have
caught this for you. Looking at Intel's results the XDP tests don't run
due to missing BTF. Could you work with Adrian to address that?

Re: [PATCH bpf 0/6] Address XDP frags having negative tailroom

Posted by Martin KaFai Lau 3 days, 1 hour ago

On 2/3/26 2:53 AM, Larysa Zaremba wrote:
> Larysa Zaremba (6):
>    xdp: produce a warning when calculated tailroom is negative
>    ice: fix rxq info registering in mbuf packets
>    ice: change XDP RxQ frag_size from DMA write length to truesize
>    i40e: use truesize as XDP RxQ info frag_size
>    idpf: use truesize as XDP RxQ info frag_size
>    net: enetc: use truesize as XDP RxQ info frag_size
> 
>   drivers/net/ethernet/freescale/enetc/enetc.c |  2 +-
>   drivers/net/ethernet/intel/i40e/i40e_main.c  |  5 ++--
>   drivers/net/ethernet/intel/ice/ice_base.c    | 26 ++++++--------------
>   drivers/net/ethernet/intel/ice/ice_xsk.c     |  3 +++
>   drivers/net/ethernet/intel/idpf/xdp.c        |  2 +-
>   net/core/filter.c                            |  3 ++-

Jakub, can you help to take it to the net tree? Thanks.

Re: [PATCH bpf 0/6] Address XDP frags having negative tailroom

Posted by Jakub Kicinski 2 days, 22 hours ago

On Wed, 4 Feb 2026 14:57:13 -0800 Martin KaFai Lau wrote:
> On 2/3/26 2:53 AM, Larysa Zaremba wrote:
> > Larysa Zaremba (6):
> >    xdp: produce a warning when calculated tailroom is negative
> >    ice: fix rxq info registering in mbuf packets
> >    ice: change XDP RxQ frag_size from DMA write length to truesize
> >    i40e: use truesize as XDP RxQ info frag_size
> >    idpf: use truesize as XDP RxQ info frag_size
> >    net: enetc: use truesize as XDP RxQ info frag_size
>
> Jakub, can you help to take it to the net tree? Thanks.

Will do. I think I also have some questions, first, tho :)

[PATCH bpf 1/6] xdp: produce a warning when calculated tailroom is negative

Posted by Larysa Zaremba 4 days, 13 hours ago

Many ethernet drivers report xdp Rx queue frag size as being the same as
DMA write size. However, the only user of this field, namely
bpf_xdp_frags_increase_tail(), clearly expects a truesize.

Such difference leads to unspecific memory corruption issues under certain
circumstances, e.g. in ixgbevf maximum DMA write size is 3 KB, so when
running xskxceiver's XDP_ADJUST_TAIL_GROW_MULTI_BUFF, 6K packet fully uses
all DMA-writable space in 2 buffers. This would be fine, if only
rxq->frag_size was properly set to 4K, but value of 3K results in a
negative tailroom, because there is a non-zero page offset.

We could return -EINVAL and be done with it in such case, but due to
tailroom being stored as an unsigned int, it is reported to be somewhere
near UINT_MAX, resulting in a tail being grown, even if the requested
offset is too much (it is around 2K in the abovementioned test). This later
leads to all kinds of unspecific calltraces.

[ 7340.337579] xskxceiver[1440]: segfault at 1da718 ip 00007f4161aeac9d sp 00007f41615a6a00 error 6
[ 7340.338040] xskxceiver[1441]: segfault at 7f410000000b ip 00000000004042b5 sp 00007f415bffecf0 error 4
[ 7340.338179]  in libc.so.6[61c9d,7f4161aaf000+160000]
[ 7340.339230]  in xskxceiver[42b5,400000+69000]
[ 7340.340300]  likely on CPU 6 (core 0, socket 6)
[ 7340.340302] Code: ff ff 01 e9 f4 fe ff ff 0f 1f 44 00 00 4c 39 f0 74 73 31 c0 ba 01 00 00 00 f0 0f b1 17 0f 85 ba 00 00 00 49 8b 87 88 00 00 00 <4c> 89 70 08 eb cc 0f 1f 44 00 00 48 8d bd f0 fe ff ff 89 85 ec fe
[ 7340.340888]  likely on CPU 3 (core 0, socket 3)
[ 7340.345088] Code: 00 00 00 ba 00 00 00 00 be 00 00 00 00 89 c7 e8 31 ca ff ff 89 45 ec 8b 45 ec 85 c0 78 07 b8 00 00 00 00 eb 46 e8 0b c8 ff ff <8b> 00 83 f8 69 74 24 e8 ff c7 ff ff 8b 00 83 f8 0b 74 18 e8 f3 c7
[ 7340.404334] Oops: general protection fault, probably for non-canonical address 0x6d255010bdffc: 0000 [#1] SMP NOPTI
[ 7340.405972] CPU: 7 UID: 0 PID: 1439 Comm: xskxceiver Not tainted 6.19.0-rc1+ #21 PREEMPT(lazy)
[ 7340.408006] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-5.fc42 04/01/2014
[ 7340.409716] RIP: 0010:lookup_swap_cgroup_id+0x44/0x80
[ 7340.410455] Code: 83 f8 1c 73 39 48 ba ff ff ff ff ff ff ff 03 48 8b 04 c5 20 55 fa bd 48 21 d1 48 89 ca 83 e1 01 48 d1 ea c1 e1 04 48 8d 04 90 <8b> 00 48 83 c4 10 d3 e8 c3 cc cc cc cc 31 c0 e9 98 b7 dd 00 48 89
[ 7340.412787] RSP: 0018:ffffcc5c04f7f6d0 EFLAGS: 00010202
[ 7340.413494] RAX: 0006d255010bdffc RBX: ffff891f477895a8 RCX: 0000000000000010
[ 7340.414431] RDX: 0001c17e3fffffff RSI: 00fa070000000000 RDI: 000382fc7fffffff
[ 7340.415354] RBP: 00fa070000000000 R08: ffffcc5c04f7f8f8 R09: ffffcc5c04f7f7d0
[ 7340.416283] R10: ffff891f4c1a7000 R11: ffffcc5c04f7f9c8 R12: ffffcc5c04f7f7d0
[ 7340.417218] R13: 03ffffffffffffff R14: 00fa06fffffffe00 R15: ffff891f47789500
[ 7340.418229] FS:  0000000000000000(0000) GS:ffff891ffdfaa000(0000) knlGS:0000000000000000
[ 7340.419489] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7340.420286] CR2: 00007f415bfffd58 CR3: 0000000103f03002 CR4: 0000000000772ef0
[ 7340.421237] PKRU: 55555554
[ 7340.421623] Call Trace:
[ 7340.421987]  <TASK>
[ 7340.422309]  ? softleaf_from_pte+0x77/0xa0
[ 7340.422855]  swap_pte_batch+0xa7/0x290
[ 7340.423363]  zap_nonpresent_ptes.constprop.0.isra.0+0xd1/0x270
[ 7340.424102]  zap_pte_range+0x281/0x580
[ 7340.424607]  zap_pmd_range.isra.0+0xc9/0x240
[ 7340.425177]  unmap_page_range+0x24d/0x420
[ 7340.425714]  unmap_vmas+0xa1/0x180
[ 7340.426185]  exit_mmap+0xe1/0x3b0
[ 7340.426644]  __mmput+0x41/0x150
[ 7340.427098]  exit_mm+0xb1/0x110
[ 7340.427539]  do_exit+0x1b2/0x460
[ 7340.427992]  do_group_exit+0x2d/0xc0
[ 7340.428477]  get_signal+0x79d/0x7e0
[ 7340.428957]  arch_do_signal_or_restart+0x34/0x100
[ 7340.429571]  exit_to_user_mode_loop+0x8e/0x4c0
[ 7340.430159]  do_syscall_64+0x188/0x6b0
[ 7340.430672]  ? __do_sys_clone3+0xd9/0x120
[ 7340.431212]  ? switch_fpu_return+0x4e/0xd0
[ 7340.431761]  ? arch_exit_to_user_mode_prepare.isra.0+0xa1/0xc0
[ 7340.432498]  ? do_syscall_64+0xbb/0x6b0
[ 7340.433015]  ? __handle_mm_fault+0x445/0x690
[ 7340.433582]  ? count_memcg_events+0xd6/0x210
[ 7340.434151]  ? handle_mm_fault+0x212/0x340
[ 7340.434697]  ? do_user_addr_fault+0x2b4/0x7b0
[ 7340.435271]  ? clear_bhb_loop+0x30/0x80
[ 7340.435788]  ? clear_bhb_loop+0x30/0x80
[ 7340.436299]  ? clear_bhb_loop+0x30/0x80
[ 7340.436812]  ? clear_bhb_loop+0x30/0x80
[ 7340.437323]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 7340.437973] RIP: 0033:0x7f4161b14169
[ 7340.438468] Code: Unable to access opcode bytes at 0x7f4161b1413f.
[ 7340.439242] RSP: 002b:00007ffc6ebfa770 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[ 7340.440173] RAX: fffffffffffffe00 RBX: 00000000000005a1 RCX: 00007f4161b14169
[ 7340.441061] RDX: 00000000000005a1 RSI: 0000000000000109 RDI: 00007f415bfff990
[ 7340.441943] RBP: 00007ffc6ebfa7a0 R08: 0000000000000000 R09: 00000000ffffffff
[ 7340.442824] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 7340.443707] R13: 0000000000000000 R14: 00007f415bfff990 R15: 00007f415bfff6c0
[ 7340.444586]  </TASK>
[ 7340.444922] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit libnvdimm kvm_intel vfat fat kvm snd_pcm irqbypass rapl iTCO_wdt snd_timer intel_pmc_bxt iTCO_vendor_support snd ixgbevf virtio_net soundcore i2c_i801 pcspkr libeth_xdp net_failover i2c_smbus lpc_ich failover libeth virtio_balloon joydev 9p fuse loop zram lz4hc_compress lz4_compress 9pnet_virtio 9pnet netfs ghash_clmulni_intel serio_raw qemu_fw_cfg
[ 7340.449650] ---[ end trace 0000000000000000 ]---

The issue can be fixed in all in-tree drivers, but we cannot just trust OOT
drivers to not do this. Therefore, make tailroom a signed int and produce a
warning when it is negative to prevent such mistakes in the future.

Fixes: bf25146a5595 ("bpf: add frags support to the bpf_xdp_adjust_tail() API")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 net/core/filter.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 616e0520a0bb..9715d957e3c5 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4149,12 +4149,13 @@ static int bpf_xdp_frags_increase_tail(struct xdp_buff *xdp, int offset)
 	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
 	skb_frag_t *frag = &sinfo->frags[sinfo->nr_frags - 1];
 	struct xdp_rxq_info *rxq = xdp->rxq;
-	unsigned int tailroom;
+	int tailroom;
 
 	if (!rxq->frag_size || rxq->frag_size > xdp->frame_sz)
 		return -EOPNOTSUPP;
 
 	tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
+	WARN_ON_ONCE(tailroom < 0);
 	if (unlikely(offset > tailroom))
 		return -EINVAL;
 
-- 
2.52.0

Re: [PATCH bpf 1/6] xdp: produce a warning when calculated tailroom is negative

Posted by Martin KaFai Lau 3 days, 1 hour ago

On 2/3/26 2:53 AM, Larysa Zaremba wrote:
> The issue can be fixed in all in-tree drivers, but we cannot just trust OOT
> drivers to not do this. Therefore, make tailroom a signed int and produce a
> warning when it is negative to prevent such mistakes in the future.
> 
> Fixes: bf25146a5595 ("bpf: add frags support to the bpf_xdp_adjust_tail() API")
> Reviewed-by: Aleksandr Loktionov<aleksandr.loktionov@intel.com>
> Signed-off-by: Larysa Zaremba<larysa.zaremba@intel.com>
> ---
>   net/core/filter.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 616e0520a0bb..9715d957e3c5 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4149,12 +4149,13 @@ static int bpf_xdp_frags_increase_tail(struct xdp_buff *xdp, int offset)
>   	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
>   	skb_frag_t *frag = &sinfo->frags[sinfo->nr_frags - 1];
>   	struct xdp_rxq_info *rxq = xdp->rxq;
> -	unsigned int tailroom;
> +	int tailroom;
>   
>   	if (!rxq->frag_size || rxq->frag_size > xdp->frame_sz)
>   		return -EOPNOTSUPP;
>   
>   	tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
> +	WARN_ON_ONCE(tailroom < 0);
>   	if (unlikely(offset > tailroom))
>   		return -EINVAL;

Acked-by: Martin KaFai Lau <martin.lau@kernel.org>

Re: [PATCH bpf 1/6] xdp: produce a warning when calculated tailroom is negative

Posted by Toke Høiland-Jørgensen 4 days, 11 hours ago

Larysa Zaremba <larysa.zaremba@intel.com> writes:

> Many ethernet drivers report xdp Rx queue frag size as being the same as
> DMA write size. However, the only user of this field, namely
> bpf_xdp_frags_increase_tail(), clearly expects a truesize.
>
> Such difference leads to unspecific memory corruption issues under certain
> circumstances, e.g. in ixgbevf maximum DMA write size is 3 KB, so when
> running xskxceiver's XDP_ADJUST_TAIL_GROW_MULTI_BUFF, 6K packet fully uses
> all DMA-writable space in 2 buffers. This would be fine, if only
> rxq->frag_size was properly set to 4K, but value of 3K results in a
> negative tailroom, because there is a non-zero page offset.
>
> We could return -EINVAL and be done with it in such case, but due to
> tailroom being stored as an unsigned int, it is reported to be somewhere
> near UINT_MAX, resulting in a tail being grown, even if the requested
> offset is too much (it is around 2K in the abovementioned test). This later
> leads to all kinds of unspecific calltraces.
>
> [ 7340.337579] xskxceiver[1440]: segfault at 1da718 ip 00007f4161aeac9d sp 00007f41615a6a00 error 6
> [ 7340.338040] xskxceiver[1441]: segfault at 7f410000000b ip 00000000004042b5 sp 00007f415bffecf0 error 4
> [ 7340.338179]  in libc.so.6[61c9d,7f4161aaf000+160000]
> [ 7340.339230]  in xskxceiver[42b5,400000+69000]
> [ 7340.340300]  likely on CPU 6 (core 0, socket 6)
> [ 7340.340302] Code: ff ff 01 e9 f4 fe ff ff 0f 1f 44 00 00 4c 39 f0 74 73 31 c0 ba 01 00 00 00 f0 0f b1 17 0f 85 ba 00 00 00 49 8b 87 88 00 00 00 <4c> 89 70 08 eb cc 0f 1f 44 00 00 48 8d bd f0 fe ff ff 89 85 ec fe
> [ 7340.340888]  likely on CPU 3 (core 0, socket 3)
> [ 7340.345088] Code: 00 00 00 ba 00 00 00 00 be 00 00 00 00 89 c7 e8 31 ca ff ff 89 45 ec 8b 45 ec 85 c0 78 07 b8 00 00 00 00 eb 46 e8 0b c8 ff ff <8b> 00 83 f8 69 74 24 e8 ff c7 ff ff 8b 00 83 f8 0b 74 18 e8 f3 c7
> [ 7340.404334] Oops: general protection fault, probably for non-canonical address 0x6d255010bdffc: 0000 [#1] SMP NOPTI
> [ 7340.405972] CPU: 7 UID: 0 PID: 1439 Comm: xskxceiver Not tainted 6.19.0-rc1+ #21 PREEMPT(lazy)
> [ 7340.408006] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-5.fc42 04/01/2014
> [ 7340.409716] RIP: 0010:lookup_swap_cgroup_id+0x44/0x80
> [ 7340.410455] Code: 83 f8 1c 73 39 48 ba ff ff ff ff ff ff ff 03 48 8b 04 c5 20 55 fa bd 48 21 d1 48 89 ca 83 e1 01 48 d1 ea c1 e1 04 48 8d 04 90 <8b> 00 48 83 c4 10 d3 e8 c3 cc cc cc cc 31 c0 e9 98 b7 dd 00 48 89
> [ 7340.412787] RSP: 0018:ffffcc5c04f7f6d0 EFLAGS: 00010202
> [ 7340.413494] RAX: 0006d255010bdffc RBX: ffff891f477895a8 RCX: 0000000000000010
> [ 7340.414431] RDX: 0001c17e3fffffff RSI: 00fa070000000000 RDI: 000382fc7fffffff
> [ 7340.415354] RBP: 00fa070000000000 R08: ffffcc5c04f7f8f8 R09: ffffcc5c04f7f7d0
> [ 7340.416283] R10: ffff891f4c1a7000 R11: ffffcc5c04f7f9c8 R12: ffffcc5c04f7f7d0
> [ 7340.417218] R13: 03ffffffffffffff R14: 00fa06fffffffe00 R15: ffff891f47789500
> [ 7340.418229] FS:  0000000000000000(0000) GS:ffff891ffdfaa000(0000) knlGS:0000000000000000
> [ 7340.419489] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 7340.420286] CR2: 00007f415bfffd58 CR3: 0000000103f03002 CR4: 0000000000772ef0
> [ 7340.421237] PKRU: 55555554
> [ 7340.421623] Call Trace:
> [ 7340.421987]  <TASK>
> [ 7340.422309]  ? softleaf_from_pte+0x77/0xa0
> [ 7340.422855]  swap_pte_batch+0xa7/0x290
> [ 7340.423363]  zap_nonpresent_ptes.constprop.0.isra.0+0xd1/0x270
> [ 7340.424102]  zap_pte_range+0x281/0x580
> [ 7340.424607]  zap_pmd_range.isra.0+0xc9/0x240
> [ 7340.425177]  unmap_page_range+0x24d/0x420
> [ 7340.425714]  unmap_vmas+0xa1/0x180
> [ 7340.426185]  exit_mmap+0xe1/0x3b0
> [ 7340.426644]  __mmput+0x41/0x150
> [ 7340.427098]  exit_mm+0xb1/0x110
> [ 7340.427539]  do_exit+0x1b2/0x460
> [ 7340.427992]  do_group_exit+0x2d/0xc0
> [ 7340.428477]  get_signal+0x79d/0x7e0
> [ 7340.428957]  arch_do_signal_or_restart+0x34/0x100
> [ 7340.429571]  exit_to_user_mode_loop+0x8e/0x4c0
> [ 7340.430159]  do_syscall_64+0x188/0x6b0
> [ 7340.430672]  ? __do_sys_clone3+0xd9/0x120
> [ 7340.431212]  ? switch_fpu_return+0x4e/0xd0
> [ 7340.431761]  ? arch_exit_to_user_mode_prepare.isra.0+0xa1/0xc0
> [ 7340.432498]  ? do_syscall_64+0xbb/0x6b0
> [ 7340.433015]  ? __handle_mm_fault+0x445/0x690
> [ 7340.433582]  ? count_memcg_events+0xd6/0x210
> [ 7340.434151]  ? handle_mm_fault+0x212/0x340
> [ 7340.434697]  ? do_user_addr_fault+0x2b4/0x7b0
> [ 7340.435271]  ? clear_bhb_loop+0x30/0x80
> [ 7340.435788]  ? clear_bhb_loop+0x30/0x80
> [ 7340.436299]  ? clear_bhb_loop+0x30/0x80
> [ 7340.436812]  ? clear_bhb_loop+0x30/0x80
> [ 7340.437323]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 7340.437973] RIP: 0033:0x7f4161b14169
> [ 7340.438468] Code: Unable to access opcode bytes at 0x7f4161b1413f.
> [ 7340.439242] RSP: 002b:00007ffc6ebfa770 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
> [ 7340.440173] RAX: fffffffffffffe00 RBX: 00000000000005a1 RCX: 00007f4161b14169
> [ 7340.441061] RDX: 00000000000005a1 RSI: 0000000000000109 RDI: 00007f415bfff990
> [ 7340.441943] RBP: 00007ffc6ebfa7a0 R08: 0000000000000000 R09: 00000000ffffffff
> [ 7340.442824] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> [ 7340.443707] R13: 0000000000000000 R14: 00007f415bfff990 R15: 00007f415bfff6c0
> [ 7340.444586]  </TASK>
> [ 7340.444922] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit libnvdimm kvm_intel vfat fat kvm snd_pcm irqbypass rapl iTCO_wdt snd_timer intel_pmc_bxt iTCO_vendor_support snd ixgbevf virtio_net soundcore i2c_i801 pcspkr libeth_xdp net_failover i2c_smbus lpc_ich failover libeth virtio_balloon joydev 9p fuse loop zram lz4hc_compress lz4_compress 9pnet_virtio 9pnet netfs ghash_clmulni_intel serio_raw qemu_fw_cfg
> [ 7340.449650] ---[ end trace 0000000000000000 ]---
>
> The issue can be fixed in all in-tree drivers, but we cannot just trust OOT
> drivers to not do this. Therefore, make tailroom a signed int and produce a
> warning when it is negative to prevent such mistakes in the future.
>
> Fixes: bf25146a5595 ("bpf: add frags support to the bpf_xdp_adjust_tail() API")
> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> ---
>  net/core/filter.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 616e0520a0bb..9715d957e3c5 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4149,12 +4149,13 @@ static int bpf_xdp_frags_increase_tail(struct xdp_buff *xdp, int offset)
>  	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
>  	skb_frag_t *frag = &sinfo->frags[sinfo->nr_frags - 1];
>  	struct xdp_rxq_info *rxq = xdp->rxq;
> -	unsigned int tailroom;
> +	int tailroom;
>  
>  	if (!rxq->frag_size || rxq->frag_size > xdp->frame_sz)
>  		return -EOPNOTSUPP;
>  
>  	tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
> +	WARN_ON_ONCE(tailroom < 0);
>  	if (unlikely(offset > tailroom))
>  		return -EINVAL;
>  

Why can't we do both? I.e., WARN_ON_ONCE() *and* return -EINVAL?

-Toke

Re: [PATCH bpf 1/6] xdp: produce a warning when calculated tailroom is negative

Posted by Larysa Zaremba 4 days, 11 hours ago

On Tue, Feb 03, 2026 at 01:26:21PM +0100, Toke Høiland-Jørgensen wrote:
> Larysa Zaremba <larysa.zaremba@intel.com> writes:
> 
> > Many ethernet drivers report xdp Rx queue frag size as being the same as
> > DMA write size. However, the only user of this field, namely
> > bpf_xdp_frags_increase_tail(), clearly expects a truesize.
> >
> > Such difference leads to unspecific memory corruption issues under certain
> > circumstances, e.g. in ixgbevf maximum DMA write size is 3 KB, so when
> > running xskxceiver's XDP_ADJUST_TAIL_GROW_MULTI_BUFF, 6K packet fully uses
> > all DMA-writable space in 2 buffers. This would be fine, if only
> > rxq->frag_size was properly set to 4K, but value of 3K results in a
> > negative tailroom, because there is a non-zero page offset.
> >
> > We could return -EINVAL and be done with it in such case, but due to
> > tailroom being stored as an unsigned int, it is reported to be somewhere
> > near UINT_MAX, resulting in a tail being grown, even if the requested
> > offset is too much (it is around 2K in the abovementioned test). This later
> > leads to all kinds of unspecific calltraces.
> >
> > [ 7340.337579] xskxceiver[1440]: segfault at 1da718 ip 00007f4161aeac9d sp 00007f41615a6a00 error 6
> > [ 7340.338040] xskxceiver[1441]: segfault at 7f410000000b ip 00000000004042b5 sp 00007f415bffecf0 error 4
> > [ 7340.338179]  in libc.so.6[61c9d,7f4161aaf000+160000]
> > [ 7340.339230]  in xskxceiver[42b5,400000+69000]
> > [ 7340.340300]  likely on CPU 6 (core 0, socket 6)
> > [ 7340.340302] Code: ff ff 01 e9 f4 fe ff ff 0f 1f 44 00 00 4c 39 f0 74 73 31 c0 ba 01 00 00 00 f0 0f b1 17 0f 85 ba 00 00 00 49 8b 87 88 00 00 00 <4c> 89 70 08 eb cc 0f 1f 44 00 00 48 8d bd f0 fe ff ff 89 85 ec fe
> > [ 7340.340888]  likely on CPU 3 (core 0, socket 3)
> > [ 7340.345088] Code: 00 00 00 ba 00 00 00 00 be 00 00 00 00 89 c7 e8 31 ca ff ff 89 45 ec 8b 45 ec 85 c0 78 07 b8 00 00 00 00 eb 46 e8 0b c8 ff ff <8b> 00 83 f8 69 74 24 e8 ff c7 ff ff 8b 00 83 f8 0b 74 18 e8 f3 c7
> > [ 7340.404334] Oops: general protection fault, probably for non-canonical address 0x6d255010bdffc: 0000 [#1] SMP NOPTI
> > [ 7340.405972] CPU: 7 UID: 0 PID: 1439 Comm: xskxceiver Not tainted 6.19.0-rc1+ #21 PREEMPT(lazy)
> > [ 7340.408006] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-5.fc42 04/01/2014
> > [ 7340.409716] RIP: 0010:lookup_swap_cgroup_id+0x44/0x80
> > [ 7340.410455] Code: 83 f8 1c 73 39 48 ba ff ff ff ff ff ff ff 03 48 8b 04 c5 20 55 fa bd 48 21 d1 48 89 ca 83 e1 01 48 d1 ea c1 e1 04 48 8d 04 90 <8b> 00 48 83 c4 10 d3 e8 c3 cc cc cc cc 31 c0 e9 98 b7 dd 00 48 89
> > [ 7340.412787] RSP: 0018:ffffcc5c04f7f6d0 EFLAGS: 00010202
> > [ 7340.413494] RAX: 0006d255010bdffc RBX: ffff891f477895a8 RCX: 0000000000000010
> > [ 7340.414431] RDX: 0001c17e3fffffff RSI: 00fa070000000000 RDI: 000382fc7fffffff
> > [ 7340.415354] RBP: 00fa070000000000 R08: ffffcc5c04f7f8f8 R09: ffffcc5c04f7f7d0
> > [ 7340.416283] R10: ffff891f4c1a7000 R11: ffffcc5c04f7f9c8 R12: ffffcc5c04f7f7d0
> > [ 7340.417218] R13: 03ffffffffffffff R14: 00fa06fffffffe00 R15: ffff891f47789500
> > [ 7340.418229] FS:  0000000000000000(0000) GS:ffff891ffdfaa000(0000) knlGS:0000000000000000
> > [ 7340.419489] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 7340.420286] CR2: 00007f415bfffd58 CR3: 0000000103f03002 CR4: 0000000000772ef0
> > [ 7340.421237] PKRU: 55555554
> > [ 7340.421623] Call Trace:
> > [ 7340.421987]  <TASK>
> > [ 7340.422309]  ? softleaf_from_pte+0x77/0xa0
> > [ 7340.422855]  swap_pte_batch+0xa7/0x290
> > [ 7340.423363]  zap_nonpresent_ptes.constprop.0.isra.0+0xd1/0x270
> > [ 7340.424102]  zap_pte_range+0x281/0x580
> > [ 7340.424607]  zap_pmd_range.isra.0+0xc9/0x240
> > [ 7340.425177]  unmap_page_range+0x24d/0x420
> > [ 7340.425714]  unmap_vmas+0xa1/0x180
> > [ 7340.426185]  exit_mmap+0xe1/0x3b0
> > [ 7340.426644]  __mmput+0x41/0x150
> > [ 7340.427098]  exit_mm+0xb1/0x110
> > [ 7340.427539]  do_exit+0x1b2/0x460
> > [ 7340.427992]  do_group_exit+0x2d/0xc0
> > [ 7340.428477]  get_signal+0x79d/0x7e0
> > [ 7340.428957]  arch_do_signal_or_restart+0x34/0x100
> > [ 7340.429571]  exit_to_user_mode_loop+0x8e/0x4c0
> > [ 7340.430159]  do_syscall_64+0x188/0x6b0
> > [ 7340.430672]  ? __do_sys_clone3+0xd9/0x120
> > [ 7340.431212]  ? switch_fpu_return+0x4e/0xd0
> > [ 7340.431761]  ? arch_exit_to_user_mode_prepare.isra.0+0xa1/0xc0
> > [ 7340.432498]  ? do_syscall_64+0xbb/0x6b0
> > [ 7340.433015]  ? __handle_mm_fault+0x445/0x690
> > [ 7340.433582]  ? count_memcg_events+0xd6/0x210
> > [ 7340.434151]  ? handle_mm_fault+0x212/0x340
> > [ 7340.434697]  ? do_user_addr_fault+0x2b4/0x7b0
> > [ 7340.435271]  ? clear_bhb_loop+0x30/0x80
> > [ 7340.435788]  ? clear_bhb_loop+0x30/0x80
> > [ 7340.436299]  ? clear_bhb_loop+0x30/0x80
> > [ 7340.436812]  ? clear_bhb_loop+0x30/0x80
> > [ 7340.437323]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [ 7340.437973] RIP: 0033:0x7f4161b14169
> > [ 7340.438468] Code: Unable to access opcode bytes at 0x7f4161b1413f.
> > [ 7340.439242] RSP: 002b:00007ffc6ebfa770 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
> > [ 7340.440173] RAX: fffffffffffffe00 RBX: 00000000000005a1 RCX: 00007f4161b14169
> > [ 7340.441061] RDX: 00000000000005a1 RSI: 0000000000000109 RDI: 00007f415bfff990
> > [ 7340.441943] RBP: 00007ffc6ebfa7a0 R08: 0000000000000000 R09: 00000000ffffffff
> > [ 7340.442824] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> > [ 7340.443707] R13: 0000000000000000 R14: 00007f415bfff990 R15: 00007f415bfff6c0
> > [ 7340.444586]  </TASK>
> > [ 7340.444922] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit libnvdimm kvm_intel vfat fat kvm snd_pcm irqbypass rapl iTCO_wdt snd_timer intel_pmc_bxt iTCO_vendor_support snd ixgbevf virtio_net soundcore i2c_i801 pcspkr libeth_xdp net_failover i2c_smbus lpc_ich failover libeth virtio_balloon joydev 9p fuse loop zram lz4hc_compress lz4_compress 9pnet_virtio 9pnet netfs ghash_clmulni_intel serio_raw qemu_fw_cfg
> > [ 7340.449650] ---[ end trace 0000000000000000 ]---
> >
> > The issue can be fixed in all in-tree drivers, but we cannot just trust OOT
> > drivers to not do this. Therefore, make tailroom a signed int and produce a
> > warning when it is negative to prevent such mistakes in the future.
> >
> > Fixes: bf25146a5595 ("bpf: add frags support to the bpf_xdp_adjust_tail() API")
> > Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > ---
> >  net/core/filter.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 616e0520a0bb..9715d957e3c5 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -4149,12 +4149,13 @@ static int bpf_xdp_frags_increase_tail(struct xdp_buff *xdp, int offset)
> >  	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
> >  	skb_frag_t *frag = &sinfo->frags[sinfo->nr_frags - 1];
> >  	struct xdp_rxq_info *rxq = xdp->rxq;
> > -	unsigned int tailroom;
> > +	int tailroom;
> >  
> >  	if (!rxq->frag_size || rxq->frag_size > xdp->frame_sz)
> >  		return -EOPNOTSUPP;
> >  
> >  	tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
> > +	WARN_ON_ONCE(tailroom < 0);
> >  	if (unlikely(offset > tailroom))
> >  		return -EINVAL;
> >  
> 
> Why can't we do both? I.e., WARN_ON_ONCE() *and* return -EINVAL?
> 
> -Toke
> 

It would be redundant, offset is always >= 0 here, so with tailroom now being a 
signed int, offset is always bigger and -EINVAL is returned.

Re: [PATCH bpf 1/6] xdp: produce a warning when calculated tailroom is negative

Posted by Toke Høiland-Jørgensen 4 days, 11 hours ago

Larysa Zaremba <larysa.zaremba@intel.com> writes:

> On Tue, Feb 03, 2026 at 01:26:21PM +0100, Toke Høiland-Jørgensen wrote:
>> Larysa Zaremba <larysa.zaremba@intel.com> writes:
>> 
>> > Many ethernet drivers report xdp Rx queue frag size as being the same as
>> > DMA write size. However, the only user of this field, namely
>> > bpf_xdp_frags_increase_tail(), clearly expects a truesize.
>> >
>> > Such difference leads to unspecific memory corruption issues under certain
>> > circumstances, e.g. in ixgbevf maximum DMA write size is 3 KB, so when
>> > running xskxceiver's XDP_ADJUST_TAIL_GROW_MULTI_BUFF, 6K packet fully uses
>> > all DMA-writable space in 2 buffers. This would be fine, if only
>> > rxq->frag_size was properly set to 4K, but value of 3K results in a
>> > negative tailroom, because there is a non-zero page offset.
>> >
>> > We could return -EINVAL and be done with it in such case, but due to
>> > tailroom being stored as an unsigned int, it is reported to be somewhere
>> > near UINT_MAX, resulting in a tail being grown, even if the requested
>> > offset is too much (it is around 2K in the abovementioned test). This later
>> > leads to all kinds of unspecific calltraces.
>> >
>> > [ 7340.337579] xskxceiver[1440]: segfault at 1da718 ip 00007f4161aeac9d sp 00007f41615a6a00 error 6
>> > [ 7340.338040] xskxceiver[1441]: segfault at 7f410000000b ip 00000000004042b5 sp 00007f415bffecf0 error 4
>> > [ 7340.338179]  in libc.so.6[61c9d,7f4161aaf000+160000]
>> > [ 7340.339230]  in xskxceiver[42b5,400000+69000]
>> > [ 7340.340300]  likely on CPU 6 (core 0, socket 6)
>> > [ 7340.340302] Code: ff ff 01 e9 f4 fe ff ff 0f 1f 44 00 00 4c 39 f0 74 73 31 c0 ba 01 00 00 00 f0 0f b1 17 0f 85 ba 00 00 00 49 8b 87 88 00 00 00 <4c> 89 70 08 eb cc 0f 1f 44 00 00 48 8d bd f0 fe ff ff 89 85 ec fe
>> > [ 7340.340888]  likely on CPU 3 (core 0, socket 3)
>> > [ 7340.345088] Code: 00 00 00 ba 00 00 00 00 be 00 00 00 00 89 c7 e8 31 ca ff ff 89 45 ec 8b 45 ec 85 c0 78 07 b8 00 00 00 00 eb 46 e8 0b c8 ff ff <8b> 00 83 f8 69 74 24 e8 ff c7 ff ff 8b 00 83 f8 0b 74 18 e8 f3 c7
>> > [ 7340.404334] Oops: general protection fault, probably for non-canonical address 0x6d255010bdffc: 0000 [#1] SMP NOPTI
>> > [ 7340.405972] CPU: 7 UID: 0 PID: 1439 Comm: xskxceiver Not tainted 6.19.0-rc1+ #21 PREEMPT(lazy)
>> > [ 7340.408006] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-5.fc42 04/01/2014
>> > [ 7340.409716] RIP: 0010:lookup_swap_cgroup_id+0x44/0x80
>> > [ 7340.410455] Code: 83 f8 1c 73 39 48 ba ff ff ff ff ff ff ff 03 48 8b 04 c5 20 55 fa bd 48 21 d1 48 89 ca 83 e1 01 48 d1 ea c1 e1 04 48 8d 04 90 <8b> 00 48 83 c4 10 d3 e8 c3 cc cc cc cc 31 c0 e9 98 b7 dd 00 48 89
>> > [ 7340.412787] RSP: 0018:ffffcc5c04f7f6d0 EFLAGS: 00010202
>> > [ 7340.413494] RAX: 0006d255010bdffc RBX: ffff891f477895a8 RCX: 0000000000000010
>> > [ 7340.414431] RDX: 0001c17e3fffffff RSI: 00fa070000000000 RDI: 000382fc7fffffff
>> > [ 7340.415354] RBP: 00fa070000000000 R08: ffffcc5c04f7f8f8 R09: ffffcc5c04f7f7d0
>> > [ 7340.416283] R10: ffff891f4c1a7000 R11: ffffcc5c04f7f9c8 R12: ffffcc5c04f7f7d0
>> > [ 7340.417218] R13: 03ffffffffffffff R14: 00fa06fffffffe00 R15: ffff891f47789500
>> > [ 7340.418229] FS:  0000000000000000(0000) GS:ffff891ffdfaa000(0000) knlGS:0000000000000000
>> > [ 7340.419489] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > [ 7340.420286] CR2: 00007f415bfffd58 CR3: 0000000103f03002 CR4: 0000000000772ef0
>> > [ 7340.421237] PKRU: 55555554
>> > [ 7340.421623] Call Trace:
>> > [ 7340.421987]  <TASK>
>> > [ 7340.422309]  ? softleaf_from_pte+0x77/0xa0
>> > [ 7340.422855]  swap_pte_batch+0xa7/0x290
>> > [ 7340.423363]  zap_nonpresent_ptes.constprop.0.isra.0+0xd1/0x270
>> > [ 7340.424102]  zap_pte_range+0x281/0x580
>> > [ 7340.424607]  zap_pmd_range.isra.0+0xc9/0x240
>> > [ 7340.425177]  unmap_page_range+0x24d/0x420
>> > [ 7340.425714]  unmap_vmas+0xa1/0x180
>> > [ 7340.426185]  exit_mmap+0xe1/0x3b0
>> > [ 7340.426644]  __mmput+0x41/0x150
>> > [ 7340.427098]  exit_mm+0xb1/0x110
>> > [ 7340.427539]  do_exit+0x1b2/0x460
>> > [ 7340.427992]  do_group_exit+0x2d/0xc0
>> > [ 7340.428477]  get_signal+0x79d/0x7e0
>> > [ 7340.428957]  arch_do_signal_or_restart+0x34/0x100
>> > [ 7340.429571]  exit_to_user_mode_loop+0x8e/0x4c0
>> > [ 7340.430159]  do_syscall_64+0x188/0x6b0
>> > [ 7340.430672]  ? __do_sys_clone3+0xd9/0x120
>> > [ 7340.431212]  ? switch_fpu_return+0x4e/0xd0
>> > [ 7340.431761]  ? arch_exit_to_user_mode_prepare.isra.0+0xa1/0xc0
>> > [ 7340.432498]  ? do_syscall_64+0xbb/0x6b0
>> > [ 7340.433015]  ? __handle_mm_fault+0x445/0x690
>> > [ 7340.433582]  ? count_memcg_events+0xd6/0x210
>> > [ 7340.434151]  ? handle_mm_fault+0x212/0x340
>> > [ 7340.434697]  ? do_user_addr_fault+0x2b4/0x7b0
>> > [ 7340.435271]  ? clear_bhb_loop+0x30/0x80
>> > [ 7340.435788]  ? clear_bhb_loop+0x30/0x80
>> > [ 7340.436299]  ? clear_bhb_loop+0x30/0x80
>> > [ 7340.436812]  ? clear_bhb_loop+0x30/0x80
>> > [ 7340.437323]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> > [ 7340.437973] RIP: 0033:0x7f4161b14169
>> > [ 7340.438468] Code: Unable to access opcode bytes at 0x7f4161b1413f.
>> > [ 7340.439242] RSP: 002b:00007ffc6ebfa770 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
>> > [ 7340.440173] RAX: fffffffffffffe00 RBX: 00000000000005a1 RCX: 00007f4161b14169
>> > [ 7340.441061] RDX: 00000000000005a1 RSI: 0000000000000109 RDI: 00007f415bfff990
>> > [ 7340.441943] RBP: 00007ffc6ebfa7a0 R08: 0000000000000000 R09: 00000000ffffffff
>> > [ 7340.442824] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
>> > [ 7340.443707] R13: 0000000000000000 R14: 00007f415bfff990 R15: 00007f415bfff6c0
>> > [ 7340.444586]  </TASK>
>> > [ 7340.444922] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit libnvdimm kvm_intel vfat fat kvm snd_pcm irqbypass rapl iTCO_wdt snd_timer intel_pmc_bxt iTCO_vendor_support snd ixgbevf virtio_net soundcore i2c_i801 pcspkr libeth_xdp net_failover i2c_smbus lpc_ich failover libeth virtio_balloon joydev 9p fuse loop zram lz4hc_compress lz4_compress 9pnet_virtio 9pnet netfs ghash_clmulni_intel serio_raw qemu_fw_cfg
>> > [ 7340.449650] ---[ end trace 0000000000000000 ]---
>> >
>> > The issue can be fixed in all in-tree drivers, but we cannot just trust OOT
>> > drivers to not do this. Therefore, make tailroom a signed int and produce a
>> > warning when it is negative to prevent such mistakes in the future.
>> >
>> > Fixes: bf25146a5595 ("bpf: add frags support to the bpf_xdp_adjust_tail() API")
>> > Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
>> > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
>> > ---
>> >  net/core/filter.c | 3 ++-
>> >  1 file changed, 2 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/net/core/filter.c b/net/core/filter.c
>> > index 616e0520a0bb..9715d957e3c5 100644
>> > --- a/net/core/filter.c
>> > +++ b/net/core/filter.c
>> > @@ -4149,12 +4149,13 @@ static int bpf_xdp_frags_increase_tail(struct xdp_buff *xdp, int offset)
>> >  	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
>> >  	skb_frag_t *frag = &sinfo->frags[sinfo->nr_frags - 1];
>> >  	struct xdp_rxq_info *rxq = xdp->rxq;
>> > -	unsigned int tailroom;
>> > +	int tailroom;
>> >  
>> >  	if (!rxq->frag_size || rxq->frag_size > xdp->frame_sz)
>> >  		return -EOPNOTSUPP;
>> >  
>> >  	tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
>> > +	WARN_ON_ONCE(tailroom < 0);
>> >  	if (unlikely(offset > tailroom))
>> >  		return -EINVAL;
>> >  
>> 
>> Why can't we do both? I.e., WARN_ON_ONCE() *and* return -EINVAL?
>> 
>> -Toke
>> 
>
> It would be redundant, offset is always >= 0 here, so with tailroom now being a 
> signed int, offset is always bigger and -EINVAL is returned.

Oh, I see. OK, may be worth calling out; I read this paragraph in your
commit message to mean "we don't bother returning EINVAL in this case,
we just warn":

> > We could return -EINVAL and be done with it in such case, but due to
> > tailroom being stored as an unsigned int, it is reported to be somewhere
> > near UINT_MAX, resulting in a tail being grown, even if the requested
> > offset is too much (it is around 2K in the abovementioned test). This later
> > leads to all kinds of unspecific calltraces.

-Toke

Re: [PATCH bpf 1/6] xdp: produce a warning when calculated tailroom is negative

Posted by Larysa Zaremba 4 days, 11 hours ago

On Tue, Feb 03, 2026 at 01:38:47PM +0100, Toke Høiland-Jørgensen wrote:
> Larysa Zaremba <larysa.zaremba@intel.com> writes:
> 
> > On Tue, Feb 03, 2026 at 01:26:21PM +0100, Toke Høiland-Jørgensen wrote:
> >> Larysa Zaremba <larysa.zaremba@intel.com> writes:
> >> 
> >> > Many ethernet drivers report xdp Rx queue frag size as being the same as
> >> > DMA write size. However, the only user of this field, namely
> >> > bpf_xdp_frags_increase_tail(), clearly expects a truesize.
> >> >
> >> > Such difference leads to unspecific memory corruption issues under certain
> >> > circumstances, e.g. in ixgbevf maximum DMA write size is 3 KB, so when
> >> > running xskxceiver's XDP_ADJUST_TAIL_GROW_MULTI_BUFF, 6K packet fully uses
> >> > all DMA-writable space in 2 buffers. This would be fine, if only
> >> > rxq->frag_size was properly set to 4K, but value of 3K results in a
> >> > negative tailroom, because there is a non-zero page offset.
> >> >
> >> > We could return -EINVAL and be done with it in such case, but due to
> >> > tailroom being stored as an unsigned int, it is reported to be somewhere
> >> > near UINT_MAX, resulting in a tail being grown, even if the requested
> >> > offset is too much (it is around 2K in the abovementioned test). This later
> >> > leads to all kinds of unspecific calltraces.
> >> >
> >> > [ 7340.337579] xskxceiver[1440]: segfault at 1da718 ip 00007f4161aeac9d sp 00007f41615a6a00 error 6
> >> > [ 7340.338040] xskxceiver[1441]: segfault at 7f410000000b ip 00000000004042b5 sp 00007f415bffecf0 error 4
> >> > [ 7340.338179]  in libc.so.6[61c9d,7f4161aaf000+160000]
> >> > [ 7340.339230]  in xskxceiver[42b5,400000+69000]
> >> > [ 7340.340300]  likely on CPU 6 (core 0, socket 6)
> >> > [ 7340.340302] Code: ff ff 01 e9 f4 fe ff ff 0f 1f 44 00 00 4c 39 f0 74 73 31 c0 ba 01 00 00 00 f0 0f b1 17 0f 85 ba 00 00 00 49 8b 87 88 00 00 00 <4c> 89 70 08 eb cc 0f 1f 44 00 00 48 8d bd f0 fe ff ff 89 85 ec fe
> >> > [ 7340.340888]  likely on CPU 3 (core 0, socket 3)
> >> > [ 7340.345088] Code: 00 00 00 ba 00 00 00 00 be 00 00 00 00 89 c7 e8 31 ca ff ff 89 45 ec 8b 45 ec 85 c0 78 07 b8 00 00 00 00 eb 46 e8 0b c8 ff ff <8b> 00 83 f8 69 74 24 e8 ff c7 ff ff 8b 00 83 f8 0b 74 18 e8 f3 c7
> >> > [ 7340.404334] Oops: general protection fault, probably for non-canonical address 0x6d255010bdffc: 0000 [#1] SMP NOPTI
> >> > [ 7340.405972] CPU: 7 UID: 0 PID: 1439 Comm: xskxceiver Not tainted 6.19.0-rc1+ #21 PREEMPT(lazy)
> >> > [ 7340.408006] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-5.fc42 04/01/2014
> >> > [ 7340.409716] RIP: 0010:lookup_swap_cgroup_id+0x44/0x80
> >> > [ 7340.410455] Code: 83 f8 1c 73 39 48 ba ff ff ff ff ff ff ff 03 48 8b 04 c5 20 55 fa bd 48 21 d1 48 89 ca 83 e1 01 48 d1 ea c1 e1 04 48 8d 04 90 <8b> 00 48 83 c4 10 d3 e8 c3 cc cc cc cc 31 c0 e9 98 b7 dd 00 48 89
> >> > [ 7340.412787] RSP: 0018:ffffcc5c04f7f6d0 EFLAGS: 00010202
> >> > [ 7340.413494] RAX: 0006d255010bdffc RBX: ffff891f477895a8 RCX: 0000000000000010
> >> > [ 7340.414431] RDX: 0001c17e3fffffff RSI: 00fa070000000000 RDI: 000382fc7fffffff
> >> > [ 7340.415354] RBP: 00fa070000000000 R08: ffffcc5c04f7f8f8 R09: ffffcc5c04f7f7d0
> >> > [ 7340.416283] R10: ffff891f4c1a7000 R11: ffffcc5c04f7f9c8 R12: ffffcc5c04f7f7d0
> >> > [ 7340.417218] R13: 03ffffffffffffff R14: 00fa06fffffffe00 R15: ffff891f47789500
> >> > [ 7340.418229] FS:  0000000000000000(0000) GS:ffff891ffdfaa000(0000) knlGS:0000000000000000
> >> > [ 7340.419489] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> > [ 7340.420286] CR2: 00007f415bfffd58 CR3: 0000000103f03002 CR4: 0000000000772ef0
> >> > [ 7340.421237] PKRU: 55555554
> >> > [ 7340.421623] Call Trace:
> >> > [ 7340.421987]  <TASK>
> >> > [ 7340.422309]  ? softleaf_from_pte+0x77/0xa0
> >> > [ 7340.422855]  swap_pte_batch+0xa7/0x290
> >> > [ 7340.423363]  zap_nonpresent_ptes.constprop.0.isra.0+0xd1/0x270
> >> > [ 7340.424102]  zap_pte_range+0x281/0x580
> >> > [ 7340.424607]  zap_pmd_range.isra.0+0xc9/0x240
> >> > [ 7340.425177]  unmap_page_range+0x24d/0x420
> >> > [ 7340.425714]  unmap_vmas+0xa1/0x180
> >> > [ 7340.426185]  exit_mmap+0xe1/0x3b0
> >> > [ 7340.426644]  __mmput+0x41/0x150
> >> > [ 7340.427098]  exit_mm+0xb1/0x110
> >> > [ 7340.427539]  do_exit+0x1b2/0x460
> >> > [ 7340.427992]  do_group_exit+0x2d/0xc0
> >> > [ 7340.428477]  get_signal+0x79d/0x7e0
> >> > [ 7340.428957]  arch_do_signal_or_restart+0x34/0x100
> >> > [ 7340.429571]  exit_to_user_mode_loop+0x8e/0x4c0
> >> > [ 7340.430159]  do_syscall_64+0x188/0x6b0
> >> > [ 7340.430672]  ? __do_sys_clone3+0xd9/0x120
> >> > [ 7340.431212]  ? switch_fpu_return+0x4e/0xd0
> >> > [ 7340.431761]  ? arch_exit_to_user_mode_prepare.isra.0+0xa1/0xc0
> >> > [ 7340.432498]  ? do_syscall_64+0xbb/0x6b0
> >> > [ 7340.433015]  ? __handle_mm_fault+0x445/0x690
> >> > [ 7340.433582]  ? count_memcg_events+0xd6/0x210
> >> > [ 7340.434151]  ? handle_mm_fault+0x212/0x340
> >> > [ 7340.434697]  ? do_user_addr_fault+0x2b4/0x7b0
> >> > [ 7340.435271]  ? clear_bhb_loop+0x30/0x80
> >> > [ 7340.435788]  ? clear_bhb_loop+0x30/0x80
> >> > [ 7340.436299]  ? clear_bhb_loop+0x30/0x80
> >> > [ 7340.436812]  ? clear_bhb_loop+0x30/0x80
> >> > [ 7340.437323]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >> > [ 7340.437973] RIP: 0033:0x7f4161b14169
> >> > [ 7340.438468] Code: Unable to access opcode bytes at 0x7f4161b1413f.
> >> > [ 7340.439242] RSP: 002b:00007ffc6ebfa770 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
> >> > [ 7340.440173] RAX: fffffffffffffe00 RBX: 00000000000005a1 RCX: 00007f4161b14169
> >> > [ 7340.441061] RDX: 00000000000005a1 RSI: 0000000000000109 RDI: 00007f415bfff990
> >> > [ 7340.441943] RBP: 00007ffc6ebfa7a0 R08: 0000000000000000 R09: 00000000ffffffff
> >> > [ 7340.442824] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> >> > [ 7340.443707] R13: 0000000000000000 R14: 00007f415bfff990 R15: 00007f415bfff6c0
> >> > [ 7340.444586]  </TASK>
> >> > [ 7340.444922] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit libnvdimm kvm_intel vfat fat kvm snd_pcm irqbypass rapl iTCO_wdt snd_timer intel_pmc_bxt iTCO_vendor_support snd ixgbevf virtio_net soundcore i2c_i801 pcspkr libeth_xdp net_failover i2c_smbus lpc_ich failover libeth virtio_balloon joydev 9p fuse loop zram lz4hc_compress lz4_compress 9pnet_virtio 9pnet netfs ghash_clmulni_intel serio_raw qemu_fw_cfg
> >> > [ 7340.449650] ---[ end trace 0000000000000000 ]---
> >> >
> >> > The issue can be fixed in all in-tree drivers, but we cannot just trust OOT
> >> > drivers to not do this. Therefore, make tailroom a signed int and produce a
> >> > warning when it is negative to prevent such mistakes in the future.
> >> >
> >> > Fixes: bf25146a5595 ("bpf: add frags support to the bpf_xdp_adjust_tail() API")
> >> > Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> >> > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> >> > ---
> >> >  net/core/filter.c | 3 ++-
> >> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/net/core/filter.c b/net/core/filter.c
> >> > index 616e0520a0bb..9715d957e3c5 100644
> >> > --- a/net/core/filter.c
> >> > +++ b/net/core/filter.c
> >> > @@ -4149,12 +4149,13 @@ static int bpf_xdp_frags_increase_tail(struct xdp_buff *xdp, int offset)
> >> >  	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
> >> >  	skb_frag_t *frag = &sinfo->frags[sinfo->nr_frags - 1];
> >> >  	struct xdp_rxq_info *rxq = xdp->rxq;
> >> > -	unsigned int tailroom;
> >> > +	int tailroom;
> >> >  
> >> >  	if (!rxq->frag_size || rxq->frag_size > xdp->frame_sz)
> >> >  		return -EOPNOTSUPP;
> >> >  
> >> >  	tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
> >> > +	WARN_ON_ONCE(tailroom < 0);
> >> >  	if (unlikely(offset > tailroom))
> >> >  		return -EINVAL;
> >> >  
> >> 
> >> Why can't we do both? I.e., WARN_ON_ONCE() *and* return -EINVAL?
> >> 
> >> -Toke
> >> 
> >
> > It would be redundant, offset is always >= 0 here, so with tailroom now being a 
> > signed int, offset is always bigger and -EINVAL is returned.
> 
> Oh, I see. OK, may be worth calling out; I read this paragraph in your
> commit message to mean "we don't bother returning EINVAL in this case,
> we just warn":

Worth changing 'could' to 'are supposed to', if there will be v2.

> 
> > > We could return -EINVAL and be done with it in such case, but due to
> > > tailroom being stored as an unsigned int, it is reported to be somewhere
> > > near UINT_MAX, resulting in a tail being grown, even if the requested
> > > offset is too much (it is around 2K in the abovementioned test). This later
> > > leads to all kinds of unspecific calltraces.
> 
> -Toke
>

Re: [PATCH bpf 1/6] xdp: produce a warning when calculated tailroom is negative

Posted by Toke Høiland-Jørgensen 4 days, 10 hours ago

Larysa Zaremba <larysa.zaremba@intel.com> writes:

> On Tue, Feb 03, 2026 at 01:38:47PM +0100, Toke Høiland-Jørgensen wrote:
>> Larysa Zaremba <larysa.zaremba@intel.com> writes:
>> 
>> > On Tue, Feb 03, 2026 at 01:26:21PM +0100, Toke Høiland-Jørgensen wrote:
>> >> Larysa Zaremba <larysa.zaremba@intel.com> writes:
>> >> 
>> >> > Many ethernet drivers report xdp Rx queue frag size as being the same as
>> >> > DMA write size. However, the only user of this field, namely
>> >> > bpf_xdp_frags_increase_tail(), clearly expects a truesize.
>> >> >
>> >> > Such difference leads to unspecific memory corruption issues under certain
>> >> > circumstances, e.g. in ixgbevf maximum DMA write size is 3 KB, so when
>> >> > running xskxceiver's XDP_ADJUST_TAIL_GROW_MULTI_BUFF, 6K packet fully uses
>> >> > all DMA-writable space in 2 buffers. This would be fine, if only
>> >> > rxq->frag_size was properly set to 4K, but value of 3K results in a
>> >> > negative tailroom, because there is a non-zero page offset.
>> >> >
>> >> > We could return -EINVAL and be done with it in such case, but due to
>> >> > tailroom being stored as an unsigned int, it is reported to be somewhere
>> >> > near UINT_MAX, resulting in a tail being grown, even if the requested
>> >> > offset is too much (it is around 2K in the abovementioned test). This later
>> >> > leads to all kinds of unspecific calltraces.
>> >> >
>> >> > [ 7340.337579] xskxceiver[1440]: segfault at 1da718 ip 00007f4161aeac9d sp 00007f41615a6a00 error 6
>> >> > [ 7340.338040] xskxceiver[1441]: segfault at 7f410000000b ip 00000000004042b5 sp 00007f415bffecf0 error 4
>> >> > [ 7340.338179]  in libc.so.6[61c9d,7f4161aaf000+160000]
>> >> > [ 7340.339230]  in xskxceiver[42b5,400000+69000]
>> >> > [ 7340.340300]  likely on CPU 6 (core 0, socket 6)
>> >> > [ 7340.340302] Code: ff ff 01 e9 f4 fe ff ff 0f 1f 44 00 00 4c 39 f0 74 73 31 c0 ba 01 00 00 00 f0 0f b1 17 0f 85 ba 00 00 00 49 8b 87 88 00 00 00 <4c> 89 70 08 eb cc 0f 1f 44 00 00 48 8d bd f0 fe ff ff 89 85 ec fe
>> >> > [ 7340.340888]  likely on CPU 3 (core 0, socket 3)
>> >> > [ 7340.345088] Code: 00 00 00 ba 00 00 00 00 be 00 00 00 00 89 c7 e8 31 ca ff ff 89 45 ec 8b 45 ec 85 c0 78 07 b8 00 00 00 00 eb 46 e8 0b c8 ff ff <8b> 00 83 f8 69 74 24 e8 ff c7 ff ff 8b 00 83 f8 0b 74 18 e8 f3 c7
>> >> > [ 7340.404334] Oops: general protection fault, probably for non-canonical address 0x6d255010bdffc: 0000 [#1] SMP NOPTI
>> >> > [ 7340.405972] CPU: 7 UID: 0 PID: 1439 Comm: xskxceiver Not tainted 6.19.0-rc1+ #21 PREEMPT(lazy)
>> >> > [ 7340.408006] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-5.fc42 04/01/2014
>> >> > [ 7340.409716] RIP: 0010:lookup_swap_cgroup_id+0x44/0x80
>> >> > [ 7340.410455] Code: 83 f8 1c 73 39 48 ba ff ff ff ff ff ff ff 03 48 8b 04 c5 20 55 fa bd 48 21 d1 48 89 ca 83 e1 01 48 d1 ea c1 e1 04 48 8d 04 90 <8b> 00 48 83 c4 10 d3 e8 c3 cc cc cc cc 31 c0 e9 98 b7 dd 00 48 89
>> >> > [ 7340.412787] RSP: 0018:ffffcc5c04f7f6d0 EFLAGS: 00010202
>> >> > [ 7340.413494] RAX: 0006d255010bdffc RBX: ffff891f477895a8 RCX: 0000000000000010
>> >> > [ 7340.414431] RDX: 0001c17e3fffffff RSI: 00fa070000000000 RDI: 000382fc7fffffff
>> >> > [ 7340.415354] RBP: 00fa070000000000 R08: ffffcc5c04f7f8f8 R09: ffffcc5c04f7f7d0
>> >> > [ 7340.416283] R10: ffff891f4c1a7000 R11: ffffcc5c04f7f9c8 R12: ffffcc5c04f7f7d0
>> >> > [ 7340.417218] R13: 03ffffffffffffff R14: 00fa06fffffffe00 R15: ffff891f47789500
>> >> > [ 7340.418229] FS:  0000000000000000(0000) GS:ffff891ffdfaa000(0000) knlGS:0000000000000000
>> >> > [ 7340.419489] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> >> > [ 7340.420286] CR2: 00007f415bfffd58 CR3: 0000000103f03002 CR4: 0000000000772ef0
>> >> > [ 7340.421237] PKRU: 55555554
>> >> > [ 7340.421623] Call Trace:
>> >> > [ 7340.421987]  <TASK>
>> >> > [ 7340.422309]  ? softleaf_from_pte+0x77/0xa0
>> >> > [ 7340.422855]  swap_pte_batch+0xa7/0x290
>> >> > [ 7340.423363]  zap_nonpresent_ptes.constprop.0.isra.0+0xd1/0x270
>> >> > [ 7340.424102]  zap_pte_range+0x281/0x580
>> >> > [ 7340.424607]  zap_pmd_range.isra.0+0xc9/0x240
>> >> > [ 7340.425177]  unmap_page_range+0x24d/0x420
>> >> > [ 7340.425714]  unmap_vmas+0xa1/0x180
>> >> > [ 7340.426185]  exit_mmap+0xe1/0x3b0
>> >> > [ 7340.426644]  __mmput+0x41/0x150
>> >> > [ 7340.427098]  exit_mm+0xb1/0x110
>> >> > [ 7340.427539]  do_exit+0x1b2/0x460
>> >> > [ 7340.427992]  do_group_exit+0x2d/0xc0
>> >> > [ 7340.428477]  get_signal+0x79d/0x7e0
>> >> > [ 7340.428957]  arch_do_signal_or_restart+0x34/0x100
>> >> > [ 7340.429571]  exit_to_user_mode_loop+0x8e/0x4c0
>> >> > [ 7340.430159]  do_syscall_64+0x188/0x6b0
>> >> > [ 7340.430672]  ? __do_sys_clone3+0xd9/0x120
>> >> > [ 7340.431212]  ? switch_fpu_return+0x4e/0xd0
>> >> > [ 7340.431761]  ? arch_exit_to_user_mode_prepare.isra.0+0xa1/0xc0
>> >> > [ 7340.432498]  ? do_syscall_64+0xbb/0x6b0
>> >> > [ 7340.433015]  ? __handle_mm_fault+0x445/0x690
>> >> > [ 7340.433582]  ? count_memcg_events+0xd6/0x210
>> >> > [ 7340.434151]  ? handle_mm_fault+0x212/0x340
>> >> > [ 7340.434697]  ? do_user_addr_fault+0x2b4/0x7b0
>> >> > [ 7340.435271]  ? clear_bhb_loop+0x30/0x80
>> >> > [ 7340.435788]  ? clear_bhb_loop+0x30/0x80
>> >> > [ 7340.436299]  ? clear_bhb_loop+0x30/0x80
>> >> > [ 7340.436812]  ? clear_bhb_loop+0x30/0x80
>> >> > [ 7340.437323]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> >> > [ 7340.437973] RIP: 0033:0x7f4161b14169
>> >> > [ 7340.438468] Code: Unable to access opcode bytes at 0x7f4161b1413f.
>> >> > [ 7340.439242] RSP: 002b:00007ffc6ebfa770 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
>> >> > [ 7340.440173] RAX: fffffffffffffe00 RBX: 00000000000005a1 RCX: 00007f4161b14169
>> >> > [ 7340.441061] RDX: 00000000000005a1 RSI: 0000000000000109 RDI: 00007f415bfff990
>> >> > [ 7340.441943] RBP: 00007ffc6ebfa7a0 R08: 0000000000000000 R09: 00000000ffffffff
>> >> > [ 7340.442824] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
>> >> > [ 7340.443707] R13: 0000000000000000 R14: 00007f415bfff990 R15: 00007f415bfff6c0
>> >> > [ 7340.444586]  </TASK>
>> >> > [ 7340.444922] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit libnvdimm kvm_intel vfat fat kvm snd_pcm irqbypass rapl iTCO_wdt snd_timer intel_pmc_bxt iTCO_vendor_support snd ixgbevf virtio_net soundcore i2c_i801 pcspkr libeth_xdp net_failover i2c_smbus lpc_ich failover libeth virtio_balloon joydev 9p fuse loop zram lz4hc_compress lz4_compress 9pnet_virtio 9pnet netfs ghash_clmulni_intel serio_raw qemu_fw_cfg
>> >> > [ 7340.449650] ---[ end trace 0000000000000000 ]---
>> >> >
>> >> > The issue can be fixed in all in-tree drivers, but we cannot just trust OOT
>> >> > drivers to not do this. Therefore, make tailroom a signed int and produce a
>> >> > warning when it is negative to prevent such mistakes in the future.
>> >> >
>> >> > Fixes: bf25146a5595 ("bpf: add frags support to the bpf_xdp_adjust_tail() API")
>> >> > Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
>> >> > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
>> >> > ---
>> >> >  net/core/filter.c | 3 ++-
>> >> >  1 file changed, 2 insertions(+), 1 deletion(-)
>> >> >
>> >> > diff --git a/net/core/filter.c b/net/core/filter.c
>> >> > index 616e0520a0bb..9715d957e3c5 100644
>> >> > --- a/net/core/filter.c
>> >> > +++ b/net/core/filter.c
>> >> > @@ -4149,12 +4149,13 @@ static int bpf_xdp_frags_increase_tail(struct xdp_buff *xdp, int offset)
>> >> >  	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
>> >> >  	skb_frag_t *frag = &sinfo->frags[sinfo->nr_frags - 1];
>> >> >  	struct xdp_rxq_info *rxq = xdp->rxq;
>> >> > -	unsigned int tailroom;
>> >> > +	int tailroom;
>> >> >  
>> >> >  	if (!rxq->frag_size || rxq->frag_size > xdp->frame_sz)
>> >> >  		return -EOPNOTSUPP;
>> >> >  
>> >> >  	tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
>> >> > +	WARN_ON_ONCE(tailroom < 0);
>> >> >  	if (unlikely(offset > tailroom))
>> >> >  		return -EINVAL;
>> >> >  
>> >> 
>> >> Why can't we do both? I.e., WARN_ON_ONCE() *and* return -EINVAL?
>> >> 
>> >> -Toke
>> >> 
>> >
>> > It would be redundant, offset is always >= 0 here, so with tailroom now being a 
>> > signed int, offset is always bigger and -EINVAL is returned.
>> 
>> Oh, I see. OK, may be worth calling out; I read this paragraph in your
>> commit message to mean "we don't bother returning EINVAL in this case,
>> we just warn":
>
> Worth changing 'could' to 'are supposed to', if there will be v2.

Yeah, that would be better, thanks!

Other than that,

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>

[PATCH bpf 2/6] ice: fix rxq info registering in mbuf packets

Posted by Larysa Zaremba 4 days, 13 hours ago

XDP RxQ info contains frag_size, which depends on the MTU. This makes the
old way of registering RxQ info before calculating new buffer sizes
invalid. Currently, it leads to frag_size being outdated, making it
sometimes impossible to grow tailroom in a mbuf packet. E.g. fragments are
actually 3K+, but frag size is still as if MTU was 1500.

Always register new XDP RxQ info after reconfiguring memory pools.

Fixes: 2fba7dc5157b ("ice: Add support for XDP multi-buffer on Rx side")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_base.c | 26 ++++++-----------------
 drivers/net/ethernet/intel/ice/ice_xsk.c  |  3 +++
 2 files changed, 10 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
index eadb1e3d12b3..511d803cf0a4 100644
--- a/drivers/net/ethernet/intel/ice/ice_base.c
+++ b/drivers/net/ethernet/intel/ice/ice_base.c
@@ -663,23 +663,12 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
 	int err;
 
 	if (ring->vsi->type == ICE_VSI_PF || ring->vsi->type == ICE_VSI_SF) {
-		if (!xdp_rxq_info_is_reg(&ring->xdp_rxq)) {
-			err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
-						 ring->q_index,
-						 ring->q_vector->napi.napi_id,
-						 ring->rx_buf_len);
-			if (err)
-				return err;
-		}
-
 		ice_rx_xsk_pool(ring);
 		err = ice_realloc_rx_xdp_bufs(ring, ring->xsk_pool);
 		if (err)
 			return err;
 
 		if (ring->xsk_pool) {
-			xdp_rxq_info_unreg(&ring->xdp_rxq);
-
 			rx_buf_len =
 				xsk_pool_get_rx_frame_size(ring->xsk_pool);
 			err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
@@ -702,14 +691,13 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
 			if (err)
 				return err;
 
-			if (!xdp_rxq_info_is_reg(&ring->xdp_rxq)) {
-				err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
-							 ring->q_index,
-							 ring->q_vector->napi.napi_id,
-							 ring->rx_buf_len);
-				if (err)
-					goto err_destroy_fq;
-			}
+			err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
+						 ring->q_index,
+						 ring->q_vector->napi.napi_id,
+						 ring->rx_buf_len);
+			if (err)
+				goto err_destroy_fq;
+
 			xdp_rxq_info_attach_page_pool(&ring->xdp_rxq,
 						      ring->pp);
 		}
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 989ff1fd9110..102631398af3 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -900,6 +900,9 @@ void ice_xsk_clean_rx_ring(struct ice_rx_ring *rx_ring)
 	u16 ntc = rx_ring->next_to_clean;
 	u16 ntu = rx_ring->next_to_use;
 
+	if (xdp_rxq_info_is_reg(&rx_ring->xdp_rxq))
+		xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
+
 	while (ntc != ntu) {
 		struct xdp_buff *xdp = *ice_xdp_buf(rx_ring, ntc);
 
-- 
2.52.0

[PATCH bpf 3/6] ice: change XDP RxQ frag_size from DMA write length to truesize

Posted by Larysa Zaremba 4 days, 13 hours ago

The only user of frag_size field in XDP RxQ info is
bpf_xdp_frags_increase_tail(). It clearly expects truesize instead of DMA
write size. Different assumptions in ice driver configuration lead to
negative tailroom.

Before changing the tailroom in the abovementioned function to a signed
int, this allows to trigger kernel panic, when using
XDP_ADJUST_TAIL_GROW_MULTI_BUFF xskxceiver test and changing packet size to
6912 and the requisted offset to a huge value, e.g.
XSK_UMEM__MAX_FRAME_SIZE * 100.

Due to other quirks of the ZC configuration in ice, issue is not observed
in ZC mode.

Use fill queue buffer truesize instead of DMA write size in XDP RxQ info.

Fixes: 2fba7dc5157b ("ice: Add support for XDP multi-buffer on Rx side")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_base.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
index 511d803cf0a4..6a15b66e340e 100644
--- a/drivers/net/ethernet/intel/ice/ice_base.c
+++ b/drivers/net/ethernet/intel/ice/ice_base.c
@@ -694,7 +694,7 @@ static int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
 			err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
 						 ring->q_index,
 						 ring->q_vector->napi.napi_id,
-						 ring->rx_buf_len);
+						 ring->truesize);
 			if (err)
 				goto err_destroy_fq;
 
-- 
2.52.0

[PATCH bpf 4/6] i40e: use truesize as XDP RxQ info frag_size

Posted by Larysa Zaremba 4 days, 13 hours ago

The only user of frag_size field in XDP RxQ info is
bpf_xdp_frags_increase_tail(). It clearly expects truesize instead of DMA
write size. Different assumptions in i40e driver configuration lead to
negative tailroom.

Set frag_size to the same value as frame_sz.

Fixes: a045d2f2d03d ("i40e: set xdp_rxq_info::frag_size")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 0b1cc0481027..72a33aa3e1f3 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3561,6 +3561,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	struct i40e_vsi *vsi = ring->vsi;
 	u32 chain_len = vsi->back->hw.func_caps.rx_buf_chain_len;
 	u16 pf_q = vsi->base_queue + ring->queue_index;
+	u32 truesize = i40e_rx_pg_size(ring) / 2;
 	struct i40e_hw *hw = &vsi->back->hw;
 	struct i40e_hmc_obj_rxq rx_ctx;
 	int err = 0;
@@ -3581,7 +3582,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 		err = __xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
 					 ring->queue_index,
 					 ring->q_vector->napi.napi_id,
-					 ring->rx_buf_len);
+					 truesize);
 		if (err)
 			return err;
 	}
@@ -3614,7 +3615,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	}
 
 skip:
-	xdp_init_buff(&ring->xdp, i40e_rx_pg_size(ring) / 2, &ring->xdp_rxq);
+	xdp_init_buff(&ring->xdp, truesize, &ring->xdp_rxq);
 
 	rx_ctx.dbuff = DIV_ROUND_UP(ring->rx_buf_len,
 				    BIT_ULL(I40E_RXQ_CTX_DBUFF_SHIFT));
-- 
2.52.0

[PATCH bpf 5/6] idpf: use truesize as XDP RxQ info frag_size

Posted by Larysa Zaremba 4 days, 13 hours ago

The only user of frag_size field in XDP RxQ info is
bpf_xdp_frags_increase_tail(). It clearly expects truesize instead of DMA
write size. Different assumptions in idpf driver configuration lead to
negative tailroom.

Set frag_size to the same value as frame_sz.

Fixes: ac8a861f632e ("idpf: prepare structures to support XDP")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 drivers/net/ethernet/intel/idpf/xdp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/idpf/xdp.c b/drivers/net/ethernet/intel/idpf/xdp.c
index 958d16f87424..4966376a9785 100644
--- a/drivers/net/ethernet/intel/idpf/xdp.c
+++ b/drivers/net/ethernet/intel/idpf/xdp.c
@@ -50,7 +50,7 @@ static int __idpf_xdp_rxq_info_init(struct idpf_rx_queue *rxq, void *arg)
 
 	err = __xdp_rxq_info_reg(&rxq->xdp_rxq, vport->netdev, rxq->idx,
 				 rxq->q_vector->napi.napi_id,
-				 rxq->rx_buf_size);
+				 rxq->truesize);
 	if (err)
 		return err;
 
-- 
2.52.0

[PATCH bpf 6/6] net: enetc: use truesize as XDP RxQ info frag_size

Posted by Larysa Zaremba 4 days, 13 hours ago

The only user of frag_size field in XDP RxQ info is
bpf_xdp_frags_increase_tail(). It clearly expects truesize instead of DMA
write size. Different assumptions in enetc driver configuration lead to
negative tailroom.

Set frag_size to the same value as frame_sz.

Fixes: 2768b2e2f7d2 ("net: enetc: register XDP RX queues with frag_size")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
---
 drivers/net/ethernet/freescale/enetc/enetc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/enetc/enetc.c b/drivers/net/ethernet/freescale/enetc/enetc.c
index 53b26cece16a..389331571d9e 100644
--- a/drivers/net/ethernet/freescale/enetc/enetc.c
+++ b/drivers/net/ethernet/freescale/enetc/enetc.c
@@ -3465,7 +3465,7 @@ static int enetc_int_vector_init(struct enetc_ndev_priv *priv, int i,
 	priv->rx_ring[i] = bdr;
 
 	err = __xdp_rxq_info_reg(&bdr->xdp.rxq, priv->ndev, i, 0,
-				 ENETC_RXB_DMA_SIZE_XDP);
+				 ENETC_RXB_TRUESIZE);
 	if (err)
 		goto free_vector;
 
-- 
2.52.0