vhost_get_user and vhost_put_user leverage __get_user and __put_user,
respectively, which were both added in 2016 by commit 6b1e6cc7855b
("vhost: new device IOTLB API"). In a heavy UDP transmit workload on a
vhost-net backed tap device, these functions showed up as ~11.6% of
samples in a flamegraph of the underlying vhost worker thread.

Quoting Linus from [1]:
  Anyway, every single __get_user() call I looked at looked like
  historical garbage. [...] End result: I get the feeling that we
  should just do a global search-and-replace of the __get_user/
  __put_user users, replace them with plain get_user/put_user instead,
  and then fix up any fallout (eg the coco code).

Switch to plain get_user/put_user in vhost, which results in a slight
throughput speedup (roughly 4% in the test below). get_user now accounts
for ~8.4% of samples in the flamegraph.

Basic iperf3 test on an Intel 5416S CPU with an Ubuntu 25.10 guest:
TX: taskset -c 2 iperf3 -c <rx_ip> -t 60 -p 5200 -b 0 -u -i 5
RX: taskset -c 2 iperf3 -s -p 5200 -D
Before: 6.08 Gbits/sec
After: 6.32 Gbits/sec

As to what drives the speedup, Sean's patch [2] explains:
  Use the normal, checked versions for get_user() and put_user() instead of
  the double-underscore versions that omit range checks, as the checked
  versions are actually measurably faster on modern CPUs (12%+ on Intel,
  25%+ on AMD).

  The performance hit on the unchecked versions is almost entirely due to
  the added LFENCE on CPUs where LFENCE is serializing (which is effectively
  all modern CPUs), which was added by commit 304ec1b05031 ("x86/uaccess:
  Use __uaccess_begin_nospec() and uaccess_try_nospec"). The small
  optimizations done by commit b19b74bc99b1 ("x86/mm: Rework address range
  check in get_user() and put_user()") likely shave a few cycles off, but
  the bulk of the extra latency comes from the LFENCE.

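
To make the quoted explanation a bit more concrete, here is a rough user-space model of the branchless range handling that, as far as I understand it, lets the checked x86-64 helpers avoid a speculation barrier: an address with the top bit set is forced to all ones, so the later access faults instead of reaching kernel memory. The name sketch_mask_user_address is hypothetical; the real code is assembly in the x86 uaccess helpers and differs between kernel versions.

```c
#include <stdint.h>

/*
 * Hypothetical user-space model, not kernel code.
 *
 * If bit 63 of the address is set (the kernel half of the x86-64
 * address space), OR in an all-ones mask so the result is an address
 * that can never be a valid user pointer. There is no conditional
 * branch on the address, so there is nothing to mispredict.
 */
static inline uint64_t sketch_mask_user_address(uint64_t addr)
{
	/* addr >> 63 is 0 or 1; 0 - 1 wraps to all ones in unsigned math. */
	uint64_t mask = (uint64_t)0 - (addr >> 63);

	return addr | mask;
}
```

A user-half address passes through unchanged, while any address with the top bit set collapses to UINT64_MAX. The unchecked double-underscore helpers skip this step and, per the quoted text, pay for an LFENCE instead.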
[1] https://lore.kernel.org/all/CAHk-=wiJiDSPZJTV7z3Q-u4DfLgQTNWqUqqrwSBHp0+Dh016FA@mail.gmail.com/
[2] https://lore.kernel.org/all/20251106210206.221558-1-seanjc@google.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
---
drivers/vhost/vhost.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 8570fdf2e14a..ffbd0a9a7a03 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1442,13 +1442,13 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
 ({ \
 	int ret; \
 	if (!vq->iotlb) { \
-		ret = __put_user(x, ptr); \
+		ret = put_user(x, ptr); \
 	} else { \
 		__typeof__(ptr) to = \
 			(__typeof__(ptr)) __vhost_get_user(vq, ptr, \
 					  sizeof(*ptr), VHOST_ADDR_USED); \
 		if (to != NULL) \
-			ret = __put_user(x, to); \
+			ret = put_user(x, to); \
 		else \
 			ret = -EFAULT; \
 	} \
@@ -1487,14 +1487,14 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
 ({ \
 	int ret; \
 	if (!vq->iotlb) { \
-		ret = __get_user(x, ptr); \
+		ret = get_user(x, ptr); \
 	} else { \
 		__typeof__(ptr) from = \
 			(__typeof__(ptr)) __vhost_get_user(vq, ptr, \
 							   sizeof(*ptr), \
 							   type); \
 		if (from != NULL) \
-			ret = __get_user(x, from); \
+			ret = get_user(x, from); \
 		else \
 			ret = -EFAULT; \
 	} \
--
2.43.0
On Wed, 12 Nov 2025 17:55:28 -0700
Jon Kohler <jon@nutanix.com> wrote:
> vhost_get_user and vhost_put_user leverage __get_user and __put_user,
> respectively, which were both added in 2016 by commit 6b1e6cc7855b
> ("vhost: new device IOTLB API"). In a heavy UDP transmit workload on a
> vhost-net backed tap device, these functions showed up as ~11.6% of
> samples in a flamegraph of the underlying vhost worker thread.
>
> Quoting Linus from [1]:
> Anyway, every single __get_user() call I looked at looked like
> historical garbage. [...] End result: I get the feeling that we
> should just do a global search-and-replace of the __get_user/
> __put_user users, replace them with plain get_user/put_user instead,
> and then fix up any fallout (eg the coco code).
>
> Switch to plain get_user/put_user in vhost, which results in a slight
> throughput speedup. get_user now about ~8.4% of samples in flamegraph.
>
> Basic iperf3 test on a Intel 5416S CPU with Ubuntu 25.10 guest:
> TX: taskset -c 2 iperf3 -c <rx_ip> -t 60 -p 5200 -b 0 -u -i 5
> RX: taskset -c 2 iperf3 -s -p 5200 -D
> Before: 6.08 Gbits/sec
> After: 6.32 Gbits/sec
>
> As to what drives the speedup, Sean's patch [2] explains:
> Use the normal, checked versions for get_user() and put_user() instead of
> the double-underscore versions that omit range checks, as the checked
> versions are actually measurably faster on modern CPUs (12%+ on Intel,
> 25%+ on AMD).
Is there an associated access_ok() that can also be removed?
David
> On Nov 14, 2025, at 1:54 PM, David Laight <david.laight.linux@gmail.com> wrote:
>
> Is there an associated access_ok() that can also be removed?
>
> David
Hey David - IIUC, the access_ok() for non-iotlb setups is done at
initial setup time, not per event; see vhost_vring_set_addr, and
for the vhost-net side see vhost_net_set_backend ->
vhost_vq_access_ok.

Will lean on MST/Jason to help sanity check my understanding.

In the iotlb case, that's handled differently (Jason can speak to
that side), but I don't think there is anything we'd remove there?
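
As an illustration of the "check once at setup, not per event" pattern described above, here is a minimal user-space sketch. It is only loosely modeled on the idea; every name in it (sketch_ring, sketch_ring_set_addr, sketch_ring_slot, SKETCH_USER_MAX) is hypothetical and none of it is taken from the vhost code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed upper bound of the user address range; a placeholder value. */
#define SKETCH_USER_MAX ((uint64_t)1 << 47)

struct sketch_ring {
	uint64_t base;	/* user-space address of the ring */
	uint64_t len;	/* ring size in bytes */
	bool ready;	/* set once the range has been validated */
};

/* Overflow-safe test that [addr, addr + len) lies below SKETCH_USER_MAX. */
static bool sketch_range_ok(uint64_t addr, uint64_t len)
{
	return len <= SKETCH_USER_MAX && addr <= SKETCH_USER_MAX - len;
}

/* Setup path: the address range check runs once, when the ring is set. */
static int sketch_ring_set_addr(struct sketch_ring *r, uint64_t base,
				uint64_t len)
{
	r->ready = false;
	if (len == 0 || !sketch_range_ok(base, len))
		return -EFAULT;
	r->base = base;
	r->len = len;
	r->ready = true;
	return 0;
}

/* Per-event path: only an offset check within the already validated ring. */
static int sketch_ring_slot(const struct sketch_ring *r, uint64_t off,
			    uint64_t size, uint64_t *addr)
{
	if (!r->ready || size > r->len || off > r->len - size)
		return -EFAULT;
	*addr = r->base + off;
	return 0;
}
```

The hot path never repeats the address-space range check, which is why there is no per-event access_ok() to drop in this model. It says nothing about whether the memory is still mapped when the access actually happens.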
On Fri, Nov 14, 2025 at 07:30:32PM +0000, Jon Kohler wrote:
>
>
> > On Nov 14, 2025, at 1:54 PM, David Laight <david.laight.linux@gmail.com> wrote:
> >
> > Is there an associated access_ok() that can also be removed?
> >
> > David
>
> Hey David - IIUC, the access_ok() for non-iotlb setups is done at
> initial setup time, not per event, see vhost_vring_set_addr and
> for the vhost net side see vhost_net_set_backend ->
> vhost_vq_access_ok.
>
> Will lean on MST/Jason to help sanity check my understanding.
Right.
> In the iotlb case, that’s handled differently (Jason can speak to
> that side), but I dont think there is something we’d remove there?
On Fri, 14 Nov 2025 19:30:32 +0000
Jon Kohler <jon@nutanix.com> wrote:
> > On Nov 14, 2025, at 1:54 PM, David Laight <david.laight.linux@gmail.com> wrote:
> >
> > Is there an associated access_ok() that can also be removed?
> >
> > David
>
> Hey David - IIUC, the access_ok() for non-iotlb setups is done at
> initial setup time, not per event, see vhost_vring_set_addr and
> for the vhost net side see vhost_net_set_backend ->
> vhost_vq_access_ok.
This is a long way away from the actual access...
The early 'sanity check' might be worth keeping, but the code has to
allow for the user access faulting (the application might unmap it).
But, in some sense, that early check is optimising for the user passing
in an invalid buffer, so it is not actually worthwhile.
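
The point that an access can fault long after the buffer looked valid can be seen from user space with a small experiment. This is only an analogy using pipe() and write(), not vhost code, and the function name sketch_unmap_then_write is made up for the example: a buffer is accepted by the kernel once, then unmapped, and the same pointer is passed again.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 0 if the first write succeeds and the write issued after
 * munmap() fails with EFAULT; returns -1 in every other case.
 */
static int sketch_unmap_then_write(void)
{
	long page = sysconf(_SC_PAGESIZE);
	int fds[2];
	int ret = -1;
	char *buf;

	if (page <= 0 || pipe(fds) != 0)
		return -1;

	buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		goto out;

	buf[0] = 'x';
	/* The buffer is mapped, so the kernel's copy from user memory works. */
	if (write(fds[1], buf, 1) != 1) {
		munmap(buf, (size_t)page);
		goto out;
	}

	/* The application unmaps the buffer after it was known to be good. */
	if (munmap(buf, (size_t)page) != 0)
		goto out;

	/* Same pointer, but the copy now faults and write() reports EFAULT. */
	if (write(fds[1], buf, 1) == -1 && errno == EFAULT)
		ret = 0;
out:
	close(fds[0]);
	close(fds[1]);
	return ret;
}
```

The address is in the user range both times, so a range check alone cannot tell the two writes apart. Only the fault handling at the time of the access catches the second one.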
> Will lean on MST/Jason to help sanity check my understanding.
>
> In the iotlb case, that’s handled differently (Jason can speak to
> that side), but I dont think there is something we’d remove there?
Isn't the application side much the same?
(But I don't know what the code is doing...)
David
On Thu, Nov 13, 2025 at 8:14 AM Jon Kohler <jon@nutanix.com> wrote:
>
> vhost_get_user and vhost_put_user leverage __get_user and __put_user,
> respectively, which were both added in 2016 by commit 6b1e6cc7855b
> ("vhost: new device IOTLB API").
__get_user/__put_user were used here even before this commit.
> In a heavy UDP transmit workload on a
> vhost-net backed tap device, these functions showed up as ~11.6% of
> samples in a flamegraph of the underlying vhost worker thread.
>
> Quoting Linus from [1]:
> Anyway, every single __get_user() call I looked at looked like
> historical garbage. [...] End result: I get the feeling that we
> should just do a global search-and-replace of the __get_user/
> __put_user users, replace them with plain get_user/put_user instead,
> and then fix up any fallout (eg the coco code).
>
> Switch to plain get_user/put_user in vhost, which results in a slight
> throughput speedup. get_user now about ~8.4% of samples in flamegraph.
>
> Basic iperf3 test on a Intel 5416S CPU with Ubuntu 25.10 guest:
> TX: taskset -c 2 iperf3 -c <rx_ip> -t 60 -p 5200 -b 0 -u -i 5
> RX: taskset -c 2 iperf3 -s -p 5200 -D
> Before: 6.08 Gbits/sec
> After: 6.32 Gbits/sec
I wonder if we need to test on archs like ARM.
Thanks