On a 640-CPU system running virtio-net VMs with the vhost-net driver and
multiqueue (64) tap devices, testing has shown contention on the zone lock
of the page allocator.
A 'perf record -F99 -g sleep 5' of the CPUs where the vhost worker threads run shows:
# perf report -i perf.data.vhost --stdio --sort overhead --no-children | head -22
...
#
100.00%
        |
        |--9.47%--queued_spin_lock_slowpath
        |          |
        |           --9.37%--_raw_spin_lock_irqsave
        |                     |
        |                     |--5.00%--__rmqueue_pcplist
        |                     |          get_page_from_freelist
        |                     |          __alloc_pages_noprof
        |                     |          |
        |                     |          |--3.34%--napi_alloc_skb
#
That is, for Rx packets:
- ksoftirqd threads, pinned 1:1 to CPUs, do the SKB allocation.
- vhost-net threads, which float across CPUs, do the SKB free.
One method to avoid this contention is to free SKB allocations on the same
CPU as they were allocated on. This allows freed pages to be placed on the
per-cpu page (PCP) lists so that any new allocations can be taken directly
from the PCP list rather than having to request new pages from the page
allocator (and taking the zone lock).
Fortunately, previous work has provided all the infrastructure to do this
via the skb_attempt_defer_free() call, which this change uses in place of
consume_skb() in tun_do_read().
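For context, a simplified sketch of what that helper does (condensed from
net/core/skbuff.c; bookkeeping and error handling elided):

/* Simplified: free locally if we are on the allocating CPU (or it is
 * offline); otherwise hand the skb back to the CPU that allocated it.
 */
void skb_attempt_defer_free(struct sk_buff *skb)
{
        int cpu = skb->alloc_cpu;       /* recorded at allocation time */

        if (cpu == raw_smp_processor_id() || !cpu_online(cpu)) {
                kfree_skb_napi_cache(skb);
                return;
        }
        /* Queue the skb on the allocating CPU's defer list and, if needed,
         * IPI that CPU to flush it. Pages then return to that CPU's PCP
         * lists, so its next allocations avoid the zone lock.
         */
}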
Testing was done with a 6.12-based kernel and with the patch ported forward.
Server: dual-socket AMD SP5 - 2x AMD SP5 9845 (Turin), running 2 VMs
Load generator: iPerf2 x 1200 clients, MSS=400
Before:
Maximum traffic rate: 55Gbps
After:
Maximum traffic rate: 110Gbps
---
drivers/net/tun.c | 2 +-
net/core/skbuff.c | 2 ++
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 8192740357a0..388f3ffc6657 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2185,7 +2185,7 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
                if (unlikely(ret < 0))
                        kfree_skb(skb);
                else
-                       consume_skb(skb);
+                       skb_attempt_defer_free(skb);
        }

        return ret;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6be01454f262..89217c43c639 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -7201,6 +7201,7 @@ nodefer: kfree_skb_napi_cache(skb);
        DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));
        DEBUG_NET_WARN_ON_ONCE(skb->destructor);
        DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb));
+       DEBUG_NET_WARN_ON_ONCE(skb_shared(skb));

        sdn = per_cpu_ptr(net_hotdata.skb_defer_nodes, cpu) + numa_node_id();

@@ -7221,6 +7222,7 @@ nodefer: kfree_skb_napi_cache(skb);
        if (unlikely(kick))
                kick_defer_list_purge(cpu);
 }
+EXPORT_SYMBOL(skb_attempt_defer_free);

 static void skb_splice_csum_page(struct sk_buff *skb, struct page *page,
                                  size_t offset, size_t len)
--
2.34.1
On Thu, Nov 6, 2025 at 11:51 PM Nick Hudson <nhudson@akamai.com> wrote:
>
> On a 640 CPU system running virtio-net VMs with the vhost-net driver, and
> multiqueue (64) tap devices testing has shown contention on the zone lock
> of the page allocator.
>
> A 'perf record -F99 -g sleep 5' of the CPUs where the vhost worker threads run shows
>
> # perf report -i perf.data.vhost --stdio --sort overhead --no-children | head -22
> ...
> #
> 100.00%
>         |
>         |--9.47%--queued_spin_lock_slowpath
>         |          |
>         |           --9.37%--_raw_spin_lock_irqsave
>         |                     |
>         |                     |--5.00%--__rmqueue_pcplist
>         |                     |          get_page_from_freelist
>         |                     |          __alloc_pages_noprof
>         |                     |          |
>         |                     |          |--3.34%--napi_alloc_skb
> #
>
> That is, for Rx packets
> - ksoftirqd threads pinned 1:1 to CPUs do SKB allocation.
> - vhost-net threads float across CPUs do SKB free.
>
> One method to avoid this contention is to free SKB allocations on the same
> CPU as they were allocated on. This allows freed pages to be placed on the
> per-cpu page (PCP) lists so that any new allocations can be taken directly
> from the PCP list rather than having to request new pages from the page
> allocator (and taking the zone lock).
>
> Fortunately, previous work has provided all the infrastructure to do this
> via the skb_attempt_defer_free call which this change uses instead of
> consume_skb in tun_do_read.
>
> Testing done with a 6.12 based kernel and the patch ported forward.
>
> Server is Dual Socket AMD SP5 - 2x AMD SP5 9845 (Turin) with 2 VMs
> Load generator: iPerf2 x 1200 clients MSS=400
>
> Before:
> Maximum traffic rate: 55Gbps
>
> After:
> Maximum traffic rate 110Gbps
> ---
> drivers/net/tun.c | 2 +-
> net/core/skbuff.c | 2 ++
> 2 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 8192740357a0..388f3ffc6657 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -2185,7 +2185,7 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
>                 if (unlikely(ret < 0))
>                         kfree_skb(skb);
>                 else
> -                       consume_skb(skb);
> +                       skb_attempt_defer_free(skb);
>         }
>
>         return ret;
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 6be01454f262..89217c43c639 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -7201,6 +7201,7 @@ nodefer: kfree_skb_napi_cache(skb);
>         DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));
>         DEBUG_NET_WARN_ON_ONCE(skb->destructor);
>         DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb));
> +       DEBUG_NET_WARN_ON_ONCE(skb_shared(skb));

I may miss something but it looks there's no guarantee that the packet
sent to TAP is not shared.

>
>         sdn = per_cpu_ptr(net_hotdata.skb_defer_nodes, cpu) + numa_node_id();
>
> @@ -7221,6 +7222,7 @@ nodefer: kfree_skb_napi_cache(skb);
>         if (unlikely(kick))
>                 kick_defer_list_purge(cpu);
> }
> +EXPORT_SYMBOL(skb_attempt_defer_free);
>
> static void skb_splice_csum_page(struct sk_buff *skb, struct page *page,
>                                  size_t offset, size_t len)
> --
> 2.34.1
>

Thanks
> On 7 Nov 2025, at 02:21, Jason Wang <jasowang@redhat.com> wrote:
>
>
> On Thu, Nov 6, 2025 at 11:51 PM Nick Hudson <nhudson@akamai.com> wrote:
>>
>> On a 640 CPU system running virtio-net VMs with the vhost-net driver, and
>> multiqueue (64) tap devices testing has shown contention on the zone lock
>> of the page allocator.
>>
>> A 'perf record -F99 -g sleep 5' of the CPUs where the vhost worker threads run shows
>>
>> # perf report -i perf.data.vhost --stdio --sort overhead --no-children | head -22
>> ...
>> #
>> 100.00%
>> |
>> |--9.47%--queued_spin_lock_slowpath
>> | |
>> | --9.37%--_raw_spin_lock_irqsave
>> | |
>> | |--5.00%--__rmqueue_pcplist
>> | | get_page_from_freelist
>> | | __alloc_pages_noprof
>> | | |
>> | | |--3.34%--napi_alloc_skb
>> #
>>
>> That is, for Rx packets
>> - ksoftirqd threads pinned 1:1 to CPUs do SKB allocation.
>> - vhost-net threads float across CPUs do SKB free.
>>
>> One method to avoid this contention is to free SKB allocations on the same
>> CPU as they were allocated on. This allows freed pages to be placed on the
>> per-cpu page (PCP) lists so that any new allocations can be taken directly
>> from the PCP list rather than having to request new pages from the page
>> allocator (and taking the zone lock).
>>
>> Fortunately, previous work has provided all the infrastructure to do this
>> via the skb_attempt_defer_free call which this change uses instead of
>> consume_skb in tun_do_read.
>>
>> Testing done with a 6.12 based kernel and the patch ported forward.
>>
>> Server is Dual Socket AMD SP5 - 2x AMD SP5 9845 (Turin) with 2 VMs
>> Load generator: iPerf2 x 1200 clients MSS=400
>>
>> Before:
>> Maximum traffic rate: 55Gbps
>>
>> After:
>> Maximum traffic rate 110Gbps
>> ---
>> drivers/net/tun.c | 2 +-
>> net/core/skbuff.c | 2 ++
>> 2 files changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index 8192740357a0..388f3ffc6657 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -2185,7 +2185,7 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
>> if (unlikely(ret < 0))
>> kfree_skb(skb);
>> else
>> - consume_skb(skb);
>> + skb_attempt_defer_free(skb);
>> }
>>
>> return ret;
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 6be01454f262..89217c43c639 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -7201,6 +7201,7 @@ nodefer: kfree_skb_napi_cache(skb);
>> DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));
>> DEBUG_NET_WARN_ON_ONCE(skb->destructor);
>> DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb));
>> + DEBUG_NET_WARN_ON_ONCE(skb_shared(skb));
>
> I may miss something but it looks there's no guarantee that the packet
> sent to TAP is not shared.
Yes, I did wonder.
How about something like:
/**
 * consume_skb_attempt_defer - free an skbuff
 * @skb: buffer to free
 *
 * Drop a ref to the buffer and attempt to defer free it if the usage count
 * has hit zero.
 */
void consume_skb_attempt_defer(struct sk_buff *skb)
{
        if (!skb_unref(skb))
                return;

        trace_consume_skb(skb, __builtin_return_address(0));

        skb_attempt_defer_free(skb);
}
EXPORT_SYMBOL(consume_skb_attempt_defer);
and an inline version for the !CONFIG_TRACEPOINTS case, as sketched below.
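A minimal sketch of that inline variant, assuming it would sit in a header
next to the other skb helpers (illustrative only, not part of the posted
patch):

#ifndef CONFIG_TRACEPOINTS
/* Illustrative: with tracepoints compiled out there is nothing to trace,
 * so the helper reduces to unref-then-defer and can live inline.
 */
static inline void consume_skb_attempt_defer(struct sk_buff *skb)
{
        if (skb_unref(skb))
                skb_attempt_defer_free(skb);
}
#endif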
On Fri, Nov 7, 2025 at 12:41 AM Hudson, Nick <nhudson@akamai.com> wrote:
>
>
>
> > On 7 Nov 2025, at 02:21, Jason Wang <jasowang@redhat.com> wrote:
> >
> >
> > On Thu, Nov 6, 2025 at 11:51 PM Nick Hudson <nhudson@akamai.com> wrote:
> >>
> >> On a 640 CPU system running virtio-net VMs with the vhost-net driver, and
> >> multiqueue (64) tap devices testing has shown contention on the zone lock
> >> of the page allocator.
> >>
> >> A 'perf record -F99 -g sleep 5' of the CPUs where the vhost worker threads run shows
> >>
> >> # perf report -i perf.data.vhost --stdio --sort overhead --no-children | head -22
> >> ...
> >> #
> >> 100.00%
> >> |
> >> |--9.47%--queued_spin_lock_slowpath
> >> | |
> >> | --9.37%--_raw_spin_lock_irqsave
> >> | |
> >> | |--5.00%--__rmqueue_pcplist
> >> | | get_page_from_freelist
> >> | | __alloc_pages_noprof
> >> | | |
> >> | | |--3.34%--napi_alloc_skb
> >> #
> >>
> >> That is, for Rx packets
> >> - ksoftirqd threads pinned 1:1 to CPUs do SKB allocation.
> >> - vhost-net threads float across CPUs do SKB free.
> >>
> >> One method to avoid this contention is to free SKB allocations on the same
> >> CPU as they were allocated on. This allows freed pages to be placed on the
> >> per-cpu page (PCP) lists so that any new allocations can be taken directly
> >> from the PCP list rather than having to request new pages from the page
> >> allocator (and taking the zone lock).
> >>
> >> Fortunately, previous work has provided all the infrastructure to do this
> >> via the skb_attempt_defer_free call which this change uses instead of
> >> consume_skb in tun_do_read.
> >>
> >> Testing done with a 6.12 based kernel and the patch ported forward.
> >>
> >> Server is Dual Socket AMD SP5 - 2x AMD SP5 9845 (Turin) with 2 VMs
> >> Load generator: iPerf2 x 1200 clients MSS=400
> >>
> >> Before:
> >> Maximum traffic rate: 55Gbps
> >>
> >> After:
> >> Maximum traffic rate 110Gbps
> >> ---
> >> drivers/net/tun.c | 2 +-
> >> net/core/skbuff.c | 2 ++
> >> 2 files changed, 3 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >> index 8192740357a0..388f3ffc6657 100644
> >> --- a/drivers/net/tun.c
> >> +++ b/drivers/net/tun.c
> >> @@ -2185,7 +2185,7 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
> >> if (unlikely(ret < 0))
> >> kfree_skb(skb);
> >> else
> >> - consume_skb(skb);
> >> + skb_attempt_defer_free(skb);
> >> }
> >>
> >> return ret;
> >> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >> index 6be01454f262..89217c43c639 100644
> >> --- a/net/core/skbuff.c
> >> +++ b/net/core/skbuff.c
> >> @@ -7201,6 +7201,7 @@ nodefer: kfree_skb_napi_cache(skb);
> >> DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));
> >> DEBUG_NET_WARN_ON_ONCE(skb->destructor);
> >> DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb));
> >> + DEBUG_NET_WARN_ON_ONCE(skb_shared(skb));
> >
> > I may miss something but it looks there's no guarantee that the packet
> > sent to TAP is not shared.
>
> Yes, I did wonder.
>
> How about something like
>
> /**
> * consume_skb_attempt_defer - free an skbuff
> * @skb: buffer to free
> *
> * Drop a ref to the buffer and attempt to defer free it if the usage count
> * has hit zero.
> */
> void consume_skb_attempt_defer(struct sk_buff *skb)
> {
> if (!skb_unref(skb))
> return;
>
> trace_consume_skb(skb, __builtin_return_address(0));
>
> skb_attempt_defer_free(skb);
> }
> EXPORT_SYMBOL(consume_skb_attempt_defer);
>
> and an inline version for the !CONFIG_TRACEPOINTS case
I will take care of the changes. Have you seen my recent series?
https://lore.kernel.org/netdev/20251106202935.1776179-1-edumazet@google.com/T/#m94e853a732f3cf1bb6a8f613f7d9d6f150f87f6f
I think you are missing a few points....
> On 7 Nov 2025, at 09:11, Eric Dumazet <edumazet@google.com> wrote:
>
>
> On Fri, Nov 7, 2025 at 12:41 AM Hudson, Nick <nhudson@akamai.com> wrote:
>>
>>
>>
>>> On 7 Nov 2025, at 02:21, Jason Wang <jasowang@redhat.com> wrote:
>>>
>>>
>>> On Thu, Nov 6, 2025 at 11:51 PM Nick Hudson <nhudson@akamai.com> wrote:
>>>>
>>>> On a 640 CPU system running virtio-net VMs with the vhost-net driver, and
>>>> multiqueue (64) tap devices testing has shown contention on the zone lock
>>>> of the page allocator.
>>>>
>>>> A 'perf record -F99 -g sleep 5' of the CPUs where the vhost worker threads run shows
>>>>
>>>> # perf report -i perf.data.vhost --stdio --sort overhead --no-children | head -22
>>>> ...
>>>> #
>>>> 100.00%
>>>> |
>>>> |--9.47%--queued_spin_lock_slowpath
>>>> | |
>>>> | --9.37%--_raw_spin_lock_irqsave
>>>> | |
>>>> | |--5.00%--__rmqueue_pcplist
>>>> | | get_page_from_freelist
>>>> | | __alloc_pages_noprof
>>>> | | |
>>>> | | |--3.34%--napi_alloc_skb
>>>> #
>>>>
>>>> That is, for Rx packets
>>>> - ksoftirqd threads pinned 1:1 to CPUs do SKB allocation.
>>>> - vhost-net threads float across CPUs do SKB free.
>>>>
>>>> One method to avoid this contention is to free SKB allocations on the same
>>>> CPU as they were allocated on. This allows freed pages to be placed on the
>>>> per-cpu page (PCP) lists so that any new allocations can be taken directly
>>>> from the PCP list rather than having to request new pages from the page
>>>> allocator (and taking the zone lock).
>>>>
>>>> Fortunately, previous work has provided all the infrastructure to do this
>>>> via the skb_attempt_defer_free call which this change uses instead of
>>>> consume_skb in tun_do_read.
>>>>
>>>> Testing done with a 6.12 based kernel and the patch ported forward.
>>>>
>>>> Server is Dual Socket AMD SP5 - 2x AMD SP5 9845 (Turin) with 2 VMs
>>>> Load generator: iPerf2 x 1200 clients MSS=400
>>>>
>>>> Before:
>>>> Maximum traffic rate: 55Gbps
>>>>
>>>> After:
>>>> Maximum traffic rate 110Gbps
>>>> ---
>>>> drivers/net/tun.c | 2 +-
>>>> net/core/skbuff.c | 2 ++
>>>> 2 files changed, 3 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>> index 8192740357a0..388f3ffc6657 100644
>>>> --- a/drivers/net/tun.c
>>>> +++ b/drivers/net/tun.c
>>>> @@ -2185,7 +2185,7 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
>>>> if (unlikely(ret < 0))
>>>> kfree_skb(skb);
>>>> else
>>>> - consume_skb(skb);
>>>> + skb_attempt_defer_free(skb);
>>>> }
>>>>
>>>> return ret;
>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>> index 6be01454f262..89217c43c639 100644
>>>> --- a/net/core/skbuff.c
>>>> +++ b/net/core/skbuff.c
>>>> @@ -7201,6 +7201,7 @@ nodefer: kfree_skb_napi_cache(skb);
>>>> DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));
>>>> DEBUG_NET_WARN_ON_ONCE(skb->destructor);
>>>> DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb));
>>>> + DEBUG_NET_WARN_ON_ONCE(skb_shared(skb));
>>>
>>> I may miss something but it looks there's no guarantee that the packet
>>> sent to TAP is not shared.
>>
>> Yes, I did wonder.
>>
>> How about something like
>>
>> /**
>> * consume_skb_attempt_defer - free an skbuff
>> * @skb: buffer to free
>> *
>> * Drop a ref to the buffer and attempt to defer free it if the usage count
>> * has hit zero.
>> */
>> void consume_skb_attempt_defer(struct sk_buff *skb)
>> {
>> if (!skb_unref(skb))
>> return;
>>
>> trace_consume_skb(skb, __builtin_return_address(0));
>>
>> skb_attempt_defer_free(skb);
>> }
>> EXPORT_SYMBOL(consume_skb_attempt_defer);
>>
>> and an inline version for the !CONFIG_TRACEPOINTS case
>
> I will take care of the changes, have you seen my recent series ?
Great, thanks. I did see your series and will evaluate the improvement in our test setup.
>
>
> I think you are missing a few points….
Sure, still learning.
On Fri, Nov 7, 2025 at 1:16 AM Hudson, Nick <nhudson@akamai.com> wrote:
>
>
>
> > On 7 Nov 2025, at 09:11, Eric Dumazet <edumazet@google.com> wrote:
> >
> >
> > On Fri, Nov 7, 2025 at 12:41 AM Hudson, Nick <nhudson@akamai.com> wrote:
> >>
> >>
> >>
> >>> On 7 Nov 2025, at 02:21, Jason Wang <jasowang@redhat.com> wrote:
> >>>
> >>>
> >>> On Thu, Nov 6, 2025 at 11:51 PM Nick Hudson <nhudson@akamai.com> wrote:
> >>>>
> >>>> On a 640 CPU system running virtio-net VMs with the vhost-net driver, and
> >>>> multiqueue (64) tap devices testing has shown contention on the zone lock
> >>>> of the page allocator.
> >>>>
> >>>> A 'perf record -F99 -g sleep 5' of the CPUs where the vhost worker threads run shows
> >>>>
> >>>> # perf report -i perf.data.vhost --stdio --sort overhead --no-children | head -22
> >>>> ...
> >>>> #
> >>>> 100.00%
> >>>> |
> >>>> |--9.47%--queued_spin_lock_slowpath
> >>>> | |
> >>>> | --9.37%--_raw_spin_lock_irqsave
> >>>> | |
> >>>> | |--5.00%--__rmqueue_pcplist
> >>>> | | get_page_from_freelist
> >>>> | | __alloc_pages_noprof
> >>>> | | |
> >>>> | | |--3.34%--napi_alloc_skb
> >>>> #
> >>>>
> >>>> That is, for Rx packets
> >>>> - ksoftirqd threads pinned 1:1 to CPUs do SKB allocation.
> >>>> - vhost-net threads float across CPUs do SKB free.
> >>>>
> >>>> One method to avoid this contention is to free SKB allocations on the same
> >>>> CPU as they were allocated on. This allows freed pages to be placed on the
> >>>> per-cpu page (PCP) lists so that any new allocations can be taken directly
> >>>> from the PCP list rather than having to request new pages from the page
> >>>> allocator (and taking the zone lock).
> >>>>
> >>>> Fortunately, previous work has provided all the infrastructure to do this
> >>>> via the skb_attempt_defer_free call which this change uses instead of
> >>>> consume_skb in tun_do_read.
> >>>>
> >>>> Testing done with a 6.12 based kernel and the patch ported forward.
> >>>>
> >>>> Server is Dual Socket AMD SP5 - 2x AMD SP5 9845 (Turin) with 2 VMs
> >>>> Load generator: iPerf2 x 1200 clients MSS=400
> >>>>
> >>>> Before:
> >>>> Maximum traffic rate: 55Gbps
> >>>>
> >>>> After:
> >>>> Maximum traffic rate 110Gbps
> >>>> ---
> >>>> drivers/net/tun.c | 2 +-
> >>>> net/core/skbuff.c | 2 ++
> >>>> 2 files changed, 3 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>> index 8192740357a0..388f3ffc6657 100644
> >>>> --- a/drivers/net/tun.c
> >>>> +++ b/drivers/net/tun.c
> >>>> @@ -2185,7 +2185,7 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
> >>>> if (unlikely(ret < 0))
> >>>> kfree_skb(skb);
> >>>> else
> >>>> - consume_skb(skb);
> >>>> + skb_attempt_defer_free(skb);
> >>>> }
> >>>>
> >>>> return ret;
> >>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >>>> index 6be01454f262..89217c43c639 100644
> >>>> --- a/net/core/skbuff.c
> >>>> +++ b/net/core/skbuff.c
> >>>> @@ -7201,6 +7201,7 @@ nodefer: kfree_skb_napi_cache(skb);
> >>>> DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));
> >>>> DEBUG_NET_WARN_ON_ONCE(skb->destructor);
> >>>> DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb));
> >>>> + DEBUG_NET_WARN_ON_ONCE(skb_shared(skb));
> >>>
> >>> I may miss something but it looks there's no guarantee that the packet
> >>> sent to TAP is not shared.
> >>
> >> Yes, I did wonder.
> >>
> >> How about something like
> >>
> >> /**
> >> * consume_skb_attempt_defer - free an skbuff
> >> * @skb: buffer to free
> >> *
> >> * Drop a ref to the buffer and attempt to defer free it if the usage count
> >> * has hit zero.
> >> */
> >> void consume_skb_attempt_defer(struct sk_buff *skb)
> >> {
> >> if (!skb_unref(skb))
> >> return;
> >>
> >> trace_consume_skb(skb, __builtin_return_address(0));
> >>
> >> skb_attempt_defer_free(skb);
> >> }
> >> EXPORT_SYMBOL(consume_skb_attempt_defer);
> >>
> >> and an inline version for the !CONFIG_TRACEPOINTS case
> >
> > I will take care of the changes, have you seen my recent series ?
>
> Great, thanks. I did see your series and will evaluate the improvement in our test setup.
>
> >
> >
> > I think you are missing a few points….
>
> Sure, still learning.
Sure!
Make sure to add CONFIG_DEBUG_NET=y to your dev .config.
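For context, this matters because DEBUG_NET_WARN_ON_ONCE() compiles to a
real warning only under that option (approximately, from
include/net/net_debug.h):

/* Without CONFIG_DEBUG_NET the check compiles away, so the
 * skb_shared() assertion discussed above would never fire.
 */
#if defined(CONFIG_DEBUG_NET)
#define DEBUG_NET_WARN_ON_ONCE(cond) ((void)WARN_ON_ONCE(cond))
#else
#define DEBUG_NET_WARN_ON_ONCE(cond) BUILD_BUG_ON_INVALID(cond)
#endif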
> On Nov 7, 2025, at 4:19 AM, Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Nov 7, 2025 at 1:16 AM Hudson, Nick <nhudson@akamai.com> wrote:
>>
>>
>>
>>> On 7 Nov 2025, at 09:11, Eric Dumazet <edumazet@google.com> wrote:
>>>
>>>
>>> On Fri, Nov 7, 2025 at 12:41 AM Hudson, Nick <nhudson@akamai.com> wrote:
>>>>
>>>>
>>>>
>>>>> On 7 Nov 2025, at 02:21, Jason Wang <jasowang@redhat.com> wrote:
>>>>>
>>>>>
>>>>> On Thu, Nov 6, 2025 at 11:51 PM Nick Hudson <nhudson@akamai.com> wrote:
>>>>>>
>>>>>> On a 640 CPU system running virtio-net VMs with the vhost-net driver, and
>>>>>> multiqueue (64) tap devices testing has shown contention on the zone lock
>>>>>> of the page allocator.
>>>>>>
>>>>>> A 'perf record -F99 -g sleep 5' of the CPUs where the vhost worker threads run shows
>>>>>>
>>>>>> # perf report -i perf.data.vhost --stdio --sort overhead --no-children | head -22
>>>>>> ...
>>>>>> #
>>>>>> 100.00%
>>>>>> |
>>>>>> |--9.47%--queued_spin_lock_slowpath
>>>>>> | |
>>>>>> | --9.37%--_raw_spin_lock_irqsave
>>>>>> | |
>>>>>> | |--5.00%--__rmqueue_pcplist
>>>>>> | | get_page_from_freelist
>>>>>> | | __alloc_pages_noprof
>>>>>> | | |
>>>>>> | | |--3.34%--napi_alloc_skb
>>>>>> #
>>>>>>
>>>>>> That is, for Rx packets
>>>>>> - ksoftirqd threads pinned 1:1 to CPUs do SKB allocation.
>>>>>> - vhost-net threads float across CPUs do SKB free.
>>>>>>
>>>>>> One method to avoid this contention is to free SKB allocations on the same
>>>>>> CPU as they were allocated on. This allows freed pages to be placed on the
>>>>>> per-cpu page (PCP) lists so that any new allocations can be taken directly
>>>>>> from the PCP list rather than having to request new pages from the page
>>>>>> allocator (and taking the zone lock).
>>>>>>
>>>>>> Fortunately, previous work has provided all the infrastructure to do this
>>>>>> via the skb_attempt_defer_free call which this change uses instead of
>>>>>> consume_skb in tun_do_read.
>>>>>>
>>>>>> Testing done with a 6.12 based kernel and the patch ported forward.
>>>>>>
>>>>>> Server is Dual Socket AMD SP5 - 2x AMD SP5 9845 (Turin) with 2 VMs
>>>>>> Load generator: iPerf2 x 1200 clients MSS=400
>>>>>>
>>>>>> Before:
>>>>>> Maximum traffic rate: 55Gbps
>>>>>>
>>>>>> After:
>>>>>> Maximum traffic rate 110Gbps
>>>>>> ---
>>>>>> drivers/net/tun.c | 2 +-
>>>>>> net/core/skbuff.c | 2 ++
>>>>>> 2 files changed, 3 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>> index 8192740357a0..388f3ffc6657 100644
>>>>>> --- a/drivers/net/tun.c
>>>>>> +++ b/drivers/net/tun.c
>>>>>> @@ -2185,7 +2185,7 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
>>>>>> if (unlikely(ret < 0))
>>>>>> kfree_skb(skb);
>>>>>> else
>>>>>> - consume_skb(skb);
>>>>>> + skb_attempt_defer_free(skb);
>>>>>> }
>>>>>>
>>>>>> return ret;
>>>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>>>> index 6be01454f262..89217c43c639 100644
>>>>>> --- a/net/core/skbuff.c
>>>>>> +++ b/net/core/skbuff.c
>>>>>> @@ -7201,6 +7201,7 @@ nodefer: kfree_skb_napi_cache(skb);
>>>>>> DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));
>>>>>> DEBUG_NET_WARN_ON_ONCE(skb->destructor);
>>>>>> DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb));
>>>>>> + DEBUG_NET_WARN_ON_ONCE(skb_shared(skb));
>>>>>
>>>>> I may miss something but it looks there's no guarantee that the packet
>>>>> sent to TAP is not shared.
>>>>
>>>> Yes, I did wonder.
>>>>
>>>> How about something like
>>>>
>>>> /**
>>>> * consume_skb_attempt_defer - free an skbuff
>>>> * @skb: buffer to free
>>>> *
>>>> * Drop a ref to the buffer and attempt to defer free it if the usage count
>>>> * has hit zero.
>>>> */
>>>> void consume_skb_attempt_defer(struct sk_buff *skb)
>>>> {
>>>> if (!skb_unref(skb))
>>>> return;
>>>>
>>>> trace_consume_skb(skb, __builtin_return_address(0));
>>>>
>>>> skb_attempt_defer_free(skb);
>>>> }
>>>> EXPORT_SYMBOL(consume_skb_attempt_defer);
>>>>
>>>> and an inline version for the !CONFIG_TRACEPOINTS case
>>>
>>> I will take care of the changes, have you seen my recent series ?
>>
>> Great, thanks. I did see your series and will evaluate the improvement in our test setup.
>>
>>>
>>>
>>> I think you are missing a few points….
>>
>> Sure, still learning.
>
> Sure !
>
> Make sure to add in your dev .config : CONFIG_DEBUG_NET=y
>
Hey Nick,
Thanks for sending this out, and funny enough, I had almost this
exact same series of thoughts back in May, but ended up getting
sucked into a rabbit hole the size of Texas and never circled
back to finish up the series.
Check out my series here:
https://patchwork.kernel.org/project/netdevbpf/patch/20250506145530.2877229-5-jon@nutanix.com/
I was also monkeying around with defer free in this exact spot,
but it too got lost in the rabbit hole, so I’m glad I stumbled
upon this again tonight.
Let me dust this baby off and send a v2 on top of Eric’s
napi_consume_skb() series, as the combination of the two
of them should net out positively for you.
Jon
> On Nov 19, 2025, at 8:49 PM, Jon Kohler <jonmkohler@icloud.com> wrote:
>
>
>> On Nov 7, 2025, at 4:19 AM, Eric Dumazet <edumazet@google.com> wrote:
>>
>> On Fri, Nov 7, 2025 at 1:16 AM Hudson, Nick <nhudson@akamai.com> wrote:
>>>
>>>
>>>
>>>> On 7 Nov 2025, at 09:11, Eric Dumazet <edumazet@google.com> wrote:
>>>>
>>>>
>>>> On Fri, Nov 7, 2025 at 12:41 AM Hudson, Nick <nhudson@akamai.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> On 7 Nov 2025, at 02:21, Jason Wang <jasowang@redhat.com> wrote:
>>>>>>
>>>>>>
>>>>>> On Thu, Nov 6, 2025 at 11:51 PM Nick Hudson <nhudson@akamai.com> wrote:
>>>>>>>
>>>>>>> On a 640 CPU system running virtio-net VMs with the vhost-net driver, and
>>>>>>> multiqueue (64) tap devices testing has shown contention on the zone lock
>>>>>>> of the page allocator.
>>>>>>>
>>>>>>> A 'perf record -F99 -g sleep 5' of the CPUs where the vhost worker threads run shows
>>>>>>>
>>>>>>> # perf report -i perf.data.vhost --stdio --sort overhead --no-children | head -22
>>>>>>> ...
>>>>>>> #
>>>>>>> 100.00%
>>>>>>> |
>>>>>>> |--9.47%--queued_spin_lock_slowpath
>>>>>>> | |
>>>>>>> | --9.37%--_raw_spin_lock_irqsave
>>>>>>> | |
>>>>>>> | |--5.00%--__rmqueue_pcplist
>>>>>>> | | get_page_from_freelist
>>>>>>> | | __alloc_pages_noprof
>>>>>>> | | |
>>>>>>> | | |--3.34%--napi_alloc_skb
>>>>>>> #
>>>>>>>
>>>>>>> That is, for Rx packets
>>>>>>> - ksoftirqd threads pinned 1:1 to CPUs do SKB allocation.
>>>>>>> - vhost-net threads float across CPUs do SKB free.
>>>>>>>
>>>>>>> One method to avoid this contention is to free SKB allocations on the same
>>>>>>> CPU as they were allocated on. This allows freed pages to be placed on the
>>>>>>> per-cpu page (PCP) lists so that any new allocations can be taken directly
>>>>>>> from the PCP list rather than having to request new pages from the page
>>>>>>> allocator (and taking the zone lock).
>>>>>>>
>>>>>>> Fortunately, previous work has provided all the infrastructure to do this
>>>>>>> via the skb_attempt_defer_free call which this change uses instead of
>>>>>>> consume_skb in tun_do_read.
>>>>>>>
>>>>>>> Testing done with a 6.12 based kernel and the patch ported forward.
>>>>>>>
>>>>>>> Server is Dual Socket AMD SP5 - 2x AMD SP5 9845 (Turin) with 2 VMs
>>>>>>> Load generator: iPerf2 x 1200 clients MSS=400
>>>>>>>
>>>>>>> Before:
>>>>>>> Maximum traffic rate: 55Gbps
>>>>>>>
>>>>>>> After:
>>>>>>> Maximum traffic rate 110Gbps
>>>>>>> ---
>>>>>>> drivers/net/tun.c | 2 +-
>>>>>>> net/core/skbuff.c | 2 ++
>>>>>>> 2 files changed, 3 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>>> index 8192740357a0..388f3ffc6657 100644
>>>>>>> --- a/drivers/net/tun.c
>>>>>>> +++ b/drivers/net/tun.c
>>>>>>> @@ -2185,7 +2185,7 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
>>>>>>> if (unlikely(ret < 0))
>>>>>>> kfree_skb(skb);
>>>>>>> else
>>>>>>> - consume_skb(skb);
>>>>>>> + skb_attempt_defer_free(skb);
>>>>>>> }
>>>>>>>
>>>>>>> return ret;
>>>>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>>>>> index 6be01454f262..89217c43c639 100644
>>>>>>> --- a/net/core/skbuff.c
>>>>>>> +++ b/net/core/skbuff.c
>>>>>>> @@ -7201,6 +7201,7 @@ nodefer: kfree_skb_napi_cache(skb);
>>>>>>> DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));
>>>>>>> DEBUG_NET_WARN_ON_ONCE(skb->destructor);
>>>>>>> DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb));
>>>>>>> + DEBUG_NET_WARN_ON_ONCE(skb_shared(skb));
>>>>>>
>>>>>> I may miss something but it looks there's no guarantee that the packet
>>>>>> sent to TAP is not shared.
>>>>>
>>>>> Yes, I did wonder.
>>>>>
>>>>> How about something like
>>>>>
>>>>> /**
>>>>> * consume_skb_attempt_defer - free an skbuff
>>>>> * @skb: buffer to free
>>>>> *
>>>>> * Drop a ref to the buffer and attempt to defer free it if the usage count
>>>>> * has hit zero.
>>>>> */
>>>>> void consume_skb_attempt_defer(struct sk_buff *skb)
>>>>> {
>>>>> if (!skb_unref(skb))
>>>>> return;
>>>>>
>>>>> trace_consume_skb(skb, __builtin_return_address(0));
>>>>>
>>>>> skb_attempt_defer_free(skb);
>>>>> }
>>>>> EXPORT_SYMBOL(consume_skb_attempt_defer);
>>>>>
>>>>> and an inline version for the !CONFIG_TRACEPOINTS case
>>>>
>>>> I will take care of the changes, have you seen my recent series ?
>>>
>>> Great, thanks. I did see your series and will evaluate the improvement in our test setup.
>>>
>>>>
>>>>
>>>> I think you are missing a few points….
>>>
>>> Sure, still learning.
>>
>> Sure !
>>
>> Make sure to add in your dev .config : CONFIG_DEBUG_NET=y
>>
>
> Hey Nick,
> Thanks for sending this out, and funny enough, I had almost this
> exact same series of thoughts back in May, but ended up getting
> sucked into a rabbit hole the size of Texas and never circled
> back to finish up the series.
>
> Check out my series here:
> https://patchwork.kernel.org/project/netdevbpf/patch/20250506145530.2877229-5-jon@nutanix.com/
>
> I was also monkeying around with defer free in this exact spot,
> but it too got lost in the rabbit hole, so I’m glad I stumbled
> upon this again tonight.
>
> Let me dust this baby off and send a v2 on top of Eric’s
> napi_consume_skb() series, as the combination of the two
> of them should net out positively for you
>
> Jon
>
Bah, epic fail, I sent that from my iCloud account. Back again
with my work account. I’ll go give it a whirl tonight and see
what trouble I can get into.
> On Nov 19, 2025, at 9:00 PM, Jon Kohler <jon@nutanix.com> wrote:
>
>
>
>> On Nov 19, 2025, at 8:49 PM, Jon Kohler <jonmkohler@icloud.com> wrote:
>>
>>
>>> On Nov 7, 2025, at 4:19 AM, Eric Dumazet <edumazet@google.com> wrote:
>>>
>>> On Fri, Nov 7, 2025 at 1:16 AM Hudson, Nick <nhudson@akamai.com> wrote:
>>>>
>>>>
>>>>
>>>>> On 7 Nov 2025, at 09:11, Eric Dumazet <edumazet@google.com> wrote:
>>>>>
>>>>>
>>>>> On Fri, Nov 7, 2025 at 12:41 AM Hudson, Nick <nhudson@akamai.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On 7 Nov 2025, at 02:21, Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Nov 6, 2025 at 11:51 PM Nick Hudson <nhudson@akamai.com> wrote:
>>>>>>>>
>>>>>>>> On a 640 CPU system running virtio-net VMs with the vhost-net driver, and
>>>>>>>> multiqueue (64) tap devices testing has shown contention on the zone lock
>>>>>>>> of the page allocator.
>>>>>>>>
>>>>>>>> A 'perf record -F99 -g sleep 5' of the CPUs where the vhost worker threads run shows
>>>>>>>>
>>>>>>>> # perf report -i perf.data.vhost --stdio --sort overhead --no-children | head -22
>>>>>>>> ...
>>>>>>>> #
>>>>>>>> 100.00%
>>>>>>>> |
>>>>>>>> |--9.47%--queued_spin_lock_slowpath
>>>>>>>> | |
>>>>>>>> | --9.37%--_raw_spin_lock_irqsave
>>>>>>>> | |
>>>>>>>> | |--5.00%--__rmqueue_pcplist
>>>>>>>> | | get_page_from_freelist
>>>>>>>> | | __alloc_pages_noprof
>>>>>>>> | | |
>>>>>>>> | | |--3.34%--napi_alloc_skb
>>>>>>>> #
>>>>>>>>
>>>>>>>> That is, for Rx packets
>>>>>>>> - ksoftirqd threads pinned 1:1 to CPUs do SKB allocation.
>>>>>>>> - vhost-net threads float across CPUs do SKB free.
>>>>>>>>
>>>>>>>> One method to avoid this contention is to free SKB allocations on the same
>>>>>>>> CPU as they were allocated on. This allows freed pages to be placed on the
>>>>>>>> per-cpu page (PCP) lists so that any new allocations can be taken directly
>>>>>>>> from the PCP list rather than having to request new pages from the page
>>>>>>>> allocator (and taking the zone lock).
>>>>>>>>
>>>>>>>> Fortunately, previous work has provided all the infrastructure to do this
>>>>>>>> via the skb_attempt_defer_free call which this change uses instead of
>>>>>>>> consume_skb in tun_do_read.
>>>>>>>>
>>>>>>>> Testing done with a 6.12 based kernel and the patch ported forward.
>>>>>>>>
>>>>>>>> Server is Dual Socket AMD SP5 - 2x AMD SP5 9845 (Turin) with 2 VMs
>>>>>>>> Load generator: iPerf2 x 1200 clients MSS=400
>>>>>>>>
>>>>>>>> Before:
>>>>>>>> Maximum traffic rate: 55Gbps
>>>>>>>>
>>>>>>>> After:
>>>>>>>> Maximum traffic rate 110Gbps
>>>>>>>> ---
>>>>>>>> drivers/net/tun.c | 2 +-
>>>>>>>> net/core/skbuff.c | 2 ++
>>>>>>>> 2 files changed, 3 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>>>> index 8192740357a0..388f3ffc6657 100644
>>>>>>>> --- a/drivers/net/tun.c
>>>>>>>> +++ b/drivers/net/tun.c
>>>>>>>> @@ -2185,7 +2185,7 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
>>>>>>>> if (unlikely(ret < 0))
>>>>>>>> kfree_skb(skb);
>>>>>>>> else
>>>>>>>> - consume_skb(skb);
>>>>>>>> + skb_attempt_defer_free(skb);
>>>>>>>> }
>>>>>>>>
>>>>>>>> return ret;
>>>>>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>>>>>> index 6be01454f262..89217c43c639 100644
>>>>>>>> --- a/net/core/skbuff.c
>>>>>>>> +++ b/net/core/skbuff.c
>>>>>>>> @@ -7201,6 +7201,7 @@ nodefer: kfree_skb_napi_cache(skb);
>>>>>>>> DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));
>>>>>>>> DEBUG_NET_WARN_ON_ONCE(skb->destructor);
>>>>>>>> DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb));
>>>>>>>> + DEBUG_NET_WARN_ON_ONCE(skb_shared(skb));
>>>>>>>
>>>>>>> I may miss something but it looks there's no guarantee that the packet
>>>>>>> sent to TAP is not shared.
>>>>>>
>>>>>> Yes, I did wonder.
>>>>>>
>>>>>> How about something like
>>>>>>
>>>>>> /**
>>>>>> * consume_skb_attempt_defer - free an skbuff
>>>>>> * @skb: buffer to free
>>>>>> *
>>>>>> * Drop a ref to the buffer and attempt to defer free it if the usage count
>>>>>> * has hit zero.
>>>>>> */
>>>>>> void consume_skb_attempt_defer(struct sk_buff *skb)
>>>>>> {
>>>>>> if (!skb_unref(skb))
>>>>>> return;
>>>>>>
>>>>>> trace_consume_skb(skb, __builtin_return_address(0));
>>>>>>
>>>>>> skb_attempt_defer_free(skb);
>>>>>> }
>>>>>> EXPORT_SYMBOL(consume_skb_attempt_defer);
>>>>>>
>>>>>> and an inline version for the !CONFIG_TRACEPOINTS case
>>>>>
>>>>> I will take care of the changes, have you seen my recent series ?
>>>>
>>>> Great, thanks. I did see your series and will evaluate the improvement in our test setup.
>>>>
>>>>>
>>>>>
>>>>> I think you are missing a few points….
>>>>
>>>> Sure, still learning.
>>>
>>> Sure !
>>>
>>> Make sure to add in your dev .config : CONFIG_DEBUG_NET=y
>>>
>>
>> Hey Nick,
>> Thanks for sending this out, and funny enough, I had almost this
>> exact same series of thoughts back in May, but ended up getting
>> sucked into a rabbit hole the size of Texas and never circled
>> back to finish up the series.
>>
>> Check out my series here:
>> https://patchwork.kernel.org/project/netdevbpf/patch/20250506145530.2877229-5-jon@nutanix.com/
>>
>> I was also monkeying around with defer free in this exact spot,
>> but it too got lost in the rabbit hole, so I’m glad I stumbled
>> upon this again tonight.
>>
>> Let me dust this baby off and send a v2 on top of Eric’s
>> napi_consume_skb() series, as the combination of the two
>> of them should net out positively for you
>>
>> Jon
>>
Did some testing on this; it does work well. The only downside is that
when testing a very heavy UDP TX workload, the TX vhost thread
gets IPI’d heavily to process the deferred list. I’m going to try to
see if tactically calling skb_defer_free_flush immediately before
napi_skb_cache_get_bulk in my patch set helps resolve that. Will check
that out tomorrow and report back.
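A rough sketch of that idea (illustrative only - the helper name is made
up, and it assumes skb_defer_free_flush(), currently static in
net/core/dev.c, were made callable from vhost):

/* Illustrative: drain this CPU's deferred-free list right before
 * bulk-allocating, so deferred skbs refill the local NAPI cache
 * rather than arriving via IPIs while the vhost thread is busy.
 */
static u32 vhost_net_get_skb_batch(void **skbs, u32 want)
{
        u32 got;

        local_bh_disable();
        skb_defer_free_flush();         /* assumed exported/callable */
        got = napi_skb_cache_get_bulk(skbs, want);
        local_bh_enable();

        return got;
}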
> On Nov 20, 2025, at 1:11 AM, Jon Kohler <jon@nutanix.com> wrote:
>
>
>
>> On Nov 19, 2025, at 9:00 PM, Jon Kohler <jon@nutanix.com> wrote:
>>
>>
>>
>>> On Nov 19, 2025, at 8:49 PM, Jon Kohler <jonmkohler@icloud.com> wrote:
>>>
>>>
>>>> On Nov 7, 2025, at 4:19 AM, Eric Dumazet <edumazet@google.com> wrote:
>>>>
>>>> On Fri, Nov 7, 2025 at 1:16 AM Hudson, Nick <nhudson@akamai.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> On 7 Nov 2025, at 09:11, Eric Dumazet <edumazet@google.com> wrote:
>>>>>>
>>>>>>
>>>>>> On Fri, Nov 7, 2025 at 12:41 AM Hudson, Nick <nhudson@akamai.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On 7 Nov 2025, at 02:21, Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Nov 6, 2025 at 11:51 PM Nick Hudson <nhudson@akamai.com> wrote:
>>>>>>>>>
>>>>>>>>> On a 640 CPU system running virtio-net VMs with the vhost-net driver, and
>>>>>>>>> multiqueue (64) tap devices testing has shown contention on the zone lock
>>>>>>>>> of the page allocator.
>>>>>>>>>
>>>>>>>>> A 'perf record -F99 -g sleep 5' of the CPUs where the vhost worker threads run shows
>>>>>>>>>
>>>>>>>>> # perf report -i perf.data.vhost --stdio --sort overhead --no-children | head -22
>>>>>>>>> ...
>>>>>>>>> #
>>>>>>>>> 100.00%
>>>>>>>>> |
>>>>>>>>> |--9.47%--queued_spin_lock_slowpath
>>>>>>>>> | |
>>>>>>>>> | --9.37%--_raw_spin_lock_irqsave
>>>>>>>>> | |
>>>>>>>>> | |--5.00%--__rmqueue_pcplist
>>>>>>>>> | | get_page_from_freelist
>>>>>>>>> | | __alloc_pages_noprof
>>>>>>>>> | | |
>>>>>>>>> | | |--3.34%--napi_alloc_skb
>>>>>>>>> #
>>>>>>>>>
>>>>>>>>> That is, for Rx packets
>>>>>>>>> - ksoftirqd threads pinned 1:1 to CPUs do SKB allocation.
>>>>>>>>> - vhost-net threads float across CPUs do SKB free.
>>>>>>>>>
>>>>>>>>> One method to avoid this contention is to free SKB allocations on the same
>>>>>>>>> CPU as they were allocated on. This allows freed pages to be placed on the
>>>>>>>>> per-cpu page (PCP) lists so that any new allocations can be taken directly
>>>>>>>>> from the PCP list rather than having to request new pages from the page
>>>>>>>>> allocator (and taking the zone lock).
>>>>>>>>>
>>>>>>>>> Fortunately, previous work has provided all the infrastructure to do this
>>>>>>>>> via the skb_attempt_defer_free call which this change uses instead of
>>>>>>>>> consume_skb in tun_do_read.
>>>>>>>>>
>>>>>>>>> Testing done with a 6.12 based kernel and the patch ported forward.
>>>>>>>>>
>>>>>>>>> Server is Dual Socket AMD SP5 - 2x AMD SP5 9845 (Turin) with 2 VMs
>>>>>>>>> Load generator: iPerf2 x 1200 clients MSS=400
>>>>>>>>>
>>>>>>>>> Before:
>>>>>>>>> Maximum traffic rate: 55Gbps
>>>>>>>>>
>>>>>>>>> After:
>>>>>>>>> Maximum traffic rate 110Gbps
>>>>>>>>> ---
>>>>>>>>> drivers/net/tun.c | 2 +-
>>>>>>>>> net/core/skbuff.c | 2 ++
>>>>>>>>> 2 files changed, 3 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>>>>> index 8192740357a0..388f3ffc6657 100644
>>>>>>>>> --- a/drivers/net/tun.c
>>>>>>>>> +++ b/drivers/net/tun.c
>>>>>>>>> @@ -2185,7 +2185,7 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
>>>>>>>>> if (unlikely(ret < 0))
>>>>>>>>> kfree_skb(skb);
>>>>>>>>> else
>>>>>>>>> - consume_skb(skb);
>>>>>>>>> + skb_attempt_defer_free(skb);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> return ret;
>>>>>>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>>>>>>> index 6be01454f262..89217c43c639 100644
>>>>>>>>> --- a/net/core/skbuff.c
>>>>>>>>> +++ b/net/core/skbuff.c
>>>>>>>>> @@ -7201,6 +7201,7 @@ nodefer: kfree_skb_napi_cache(skb);
>>>>>>>>> DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));
>>>>>>>>> DEBUG_NET_WARN_ON_ONCE(skb->destructor);
>>>>>>>>> DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb));
>>>>>>>>> + DEBUG_NET_WARN_ON_ONCE(skb_shared(skb));
>>>>>>>>
>>>>>>>> I may miss something but it looks there's no guarantee that the packet
>>>>>>>> sent to TAP is not shared.
>>>>>>>
>>>>>>> Yes, I did wonder.
>>>>>>>
>>>>>>> How about something like
>>>>>>>
>>>>>>> /**
>>>>>>> * consume_skb_attempt_defer - free an skbuff
>>>>>>> * @skb: buffer to free
>>>>>>> *
>>>>>>> * Drop a ref to the buffer and attempt to defer free it if the usage count
>>>>>>> * has hit zero.
>>>>>>> */
>>>>>>> void consume_skb_attempt_defer(struct sk_buff *skb)
>>>>>>> {
>>>>>>> if (!skb_unref(skb))
>>>>>>> return;
>>>>>>>
>>>>>>> trace_consume_skb(skb, __builtin_return_address(0));
>>>>>>>
>>>>>>> skb_attempt_defer_free(skb);
>>>>>>> }
>>>>>>> EXPORT_SYMBOL(consume_skb_attempt_defer);
>>>>>>>
>>>>>>> and an inline version for the !CONFIG_TRACEPOINTS case
>>>>>>
>>>>>> I will take care of the changes, have you seen my recent series ?
>>>>>
>>>>> Great, thanks. I did see your series and will evaluate the improvement in our test setup.
>>>>>
>>>>>>
>>>>>>
>>>>>> I think you are missing a few points….
>>>>>
>>>>> Sure, still learning.
>>>>
>>>> Sure !
>>>>
>>>> Make sure to add in your dev .config : CONFIG_DEBUG_NET=y
>>>>
>>>
>>> Hey Nick,
>>> Thanks for sending this out, and funny enough, I had almost this
>>> exact same series of thoughts back in May, but ended up getting
>>> sucked into a rabbit hole the size of Texas and never circled
>>> back to finish up the series.
>>>
>>> Check out my series here:
>>> https://patchwork.kernel.org/project/netdevbpf/patch/20250506145530.2877229-5-jon@nutanix.com/
>>>
>>> I was also monkeying around with defer free in this exact spot,
>>> but it too got lost in the rabbit hole, so I’m glad I stumbled
>>> upon this again tonight.
>>>
>>> Let me dust this baby off and send a v2 on top of Eric’s
>>> napi_consume_skb() series, as the combination of the two
>>> of them should net out positively for you
>>>
>>> Jon
>>>
>
> Did some testing on this, it does work well. The only downside is that
> when testing a very heavy UDP TX workload, the TX vhost thread
> gets IPI’d heavily to process the deferred list. I’m going to try to
> see if tactically calling skb_defer_free_flush immediately before
> napi_skb_cache_get_bulk in my patch set helps resolve that. Will check
> that out tomorrow and report back.
Hey Nick - I’ve posted a v2 of my series; I'd appreciate your eyes
if you’ve got time to give it a poke and see how it helps your use case.
Would love to see how it fares in your high-scale test.
https://patchwork.kernel.org/project/netdevbpf/cover/20251125200041.1565663-1-jon@nutanix.com/
Thanks,
Jon