Use the normal, checked versions for get_user() and put_user() instead of
the double-underscore versions that omit range checks, as the checked
versions are actually measurably faster on modern CPUs (12%+ on Intel,
25%+ on AMD).
The performance hit on the unchecked versions is almost entirely due to
the added LFENCE on CPUs where LFENCE is serializing (which is effectively
all modern CPUs), which was added by commit 304ec1b05031 ("x86/uaccess:
Use __uaccess_begin_nospec() and uaccess_try_nospec"). The small
optimizations done by commit b19b74bc99b1 ("x86/mm: Rework address range
check in get_user() and put_user()") likely shave a few cycles off, but
the bulk of the extra latency comes from the LFENCE.
Don't bother trying to open-code an equivalent for performance reasons, as
the loss of inlining (e.g. see commit ea6f043fc984 ("x86: Make __get_user()
generate an out-of-line call")) is largely a non-factor (ignoring setups
where RET is something entirely different).
As measured across tens of millions of calls of guest PTE reads in
FNAME(walk_addr_generic):
              __get_user()  get_user()  open-coded  open-coded, no LFENCE
 Intel (EMR)      75.1         67.6        75.3             65.5
 AMD (Turin)      68.1         51.1        67.5             49.3
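
For readers not steeped in the uaccess APIs, here is a minimal sketch of the
two main patterns being compared above. It is not part of the patch, the
helper names are made up, and it assumes nothing beyond the generic
<linux/uaccess.h> interface:

  #include <linux/types.h>
  #include <linux/uaccess.h>

  /*
   * Checked variant: get_user() performs the address check itself, so no
   * LFENCE is needed on the fast path.
   */
  static int read_guest_pte_checked(u64 __user *ptep_user, u64 *pte)
  {
  	return get_user(*pte, ptep_user);
  }

  /*
   * Unchecked variant: relies on the caller having validated the address
   * beforehand (e.g. access_ok() when the memslot was installed), and pays
   * for the LFENCE emitted by __uaccess_begin_nospec() on every access.
   */
  static int read_guest_pte_unchecked(u64 __user *ptep_user, u64 *pte)
  {
  	return __get_user(*pte, ptep_user);
  }

The "open-coded" columns correspond to hand-rolling the same access with and
without the speculation barrier; per the numbers, only dropping the LFENCE
wins anything, and that is not an option on its own.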
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Closes: https://lore.kernel.org/all/CAHk-=wimh_3jM9Xe8Zx0rpuf8CPDu6DkRCGb44azk0Sz5yqSnw@mail.gmail.com
Cc: Borislav Petkov <bp@alien8.de>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/hyperv.c | 2 +-
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 38595ecb990d..de92292eb1f5 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -1568,7 +1568,7 @@ static int kvm_hv_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data, bool host)
* only, there can be valuable data in the rest which needs
* to be preserved e.g. on migration.
*/
- if (__put_user(0, (u32 __user *)addr))
+ if (put_user(0, (u32 __user *)addr))
return 1;
hv_vcpu->hv_vapic = data;
kvm_vcpu_mark_page_dirty(vcpu, gfn);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index ed762bb4b007..901cd2bd40b8 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -402,7 +402,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
goto error;
ptep_user = (pt_element_t __user *)((void *)host_addr + offset);
- if (unlikely(__get_user(pte, ptep_user)))
+ if (unlikely(get_user(pte, ptep_user)))
goto error;
walker->ptep_user[walker->level - 1] = ptep_user;
base-commit: a996dd2a5e1ec54dcf7d7b93915ea3f97e14e68a
--
2.51.2.1041.gc1ab5b90ca-goog
On Thu, 06 Nov 2025 13:02:06 -0800, Sean Christopherson wrote:
> Use the normal, checked versions for get_user() and put_user() instead of
> the double-underscore versions that omit range checks, as the checked
> versions are actually measurably faster on modern CPUs (12%+ on Intel,
> 25%+ on AMD).
>
> The performance hit on the unchecked versions is almost entirely due to
> the added LFENCE on CPUs where LFENCE is serializing (which is effectively
> all modern CPUs), which was added by commit 304ec1b05031 ("x86/uaccess:
> Use __uaccess_begin_nospec() and uaccess_try_nospec"). The small
> optimizations done by commit b19b74bc99b1 ("x86/mm: Rework address range
> check in get_user() and put_user()") likely shave a few cycles off, but
> the bulk of the extra latency comes from the LFENCE.
>
> [...]
Applied to kvm-x86 misc, with a call out in the changelog that the Hyper-V
path isn't performance sensitive, and that the motivation is consistency and
the purging of __{get,put}_user().
[1/1] KVM: x86: Use "checked" versions of get_user() and put_user()
https://github.com/kvm-x86/linux/commit/d1bc00483759
--
https://github.com/kvm-x86/linux/tree/next
Sean Christopherson <seanjc@google.com> writes:
> Use the normal, checked versions for get_user() and put_user() instead of
> the double-underscore versions that omit range checks, as the checked
> versions are actually measurably faster on modern CPUs (12%+ on Intel,
> 25%+ on AMD).
>
> The performance hit on the unchecked versions is almost entirely due to
> the added LFENCE on CPUs where LFENCE is serializing (which is effectively
> all modern CPUs), which was added by commit 304ec1b05031 ("x86/uaccess:
> Use __uaccess_begin_nospec() and uaccess_try_nospec"). The small
> optimizations done by commit b19b74bc99b1 ("x86/mm: Rework address range
> check in get_user() and put_user()") likely shave a few cycles off, but
> the bulk of the extra latency comes from the LFENCE.
>
> Don't bother trying to open-code an equivalent for performance reasons, as
> the loss of inlining (e.g. see commit ea6f043fc984 ("x86: Make __get_user()
> generate an out-of-line call")) is largely a non-factor (ignoring setups
> where RET is something entirely different).
>
> As measured across tens of millions of calls of guest PTE reads in
> FNAME(walk_addr_generic):
>
>               __get_user()  get_user()  open-coded  open-coded, no LFENCE
>  Intel (EMR)      75.1         67.6        75.3             65.5
>  AMD (Turin)      68.1         51.1        67.5             49.3
>
> Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
> Closes: https://lore.kernel.org/all/CAHk-=wimh_3jM9Xe8Zx0rpuf8CPDu6DkRCGb44azk0Sz5yqSnw@mail.gmail.com
> Cc: Borislav Petkov <bp@alien8.de>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kvm/hyperv.c | 2 +-
> arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index 38595ecb990d..de92292eb1f5 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -1568,7 +1568,7 @@ static int kvm_hv_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data, bool host)
> * only, there can be valuable data in the rest which needs
> * to be preserved e.g. on migration.
> */
> - if (__put_user(0, (u32 __user *)addr))
Did some history digging on this one, apparently it appeared with
commit 8b0cedff040b652f3d36b1368778667581b0c140
Author: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Date: Sun May 15 23:22:04 2011 +0800
KVM: use __copy_to_user/__clear_user to write guest page
and the justification was:
Simply use __copy_to_user/__clear_user to write guest page since we have
already verified the user address when the memslot is set
Unlike FNAME(walk_addr_generic), I don't believe kvm_hv_set_msr() is
actually performance critical: normally behaving guests/userspaces
should never be writing extensively to HV_X64_MSR_VP_ASSIST_PAGE.
I.e. we can probably ignore the performance aspect of this change
completely.
> + if (put_user(0, (u32 __user *)addr))
> return 1;
> hv_vcpu->hv_vapic = data;
> kvm_vcpu_mark_page_dirty(vcpu, gfn);
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index ed762bb4b007..901cd2bd40b8 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -402,7 +402,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
> goto error;
>
> ptep_user = (pt_element_t __user *)((void *)host_addr + offset);
> - if (unlikely(__get_user(pte, ptep_user)))
> + if (unlikely(get_user(pte, ptep_user)))
> goto error;
> walker->ptep_user[walker->level - 1] = ptep_user;
>
>
> base-commit: a996dd2a5e1ec54dcf7d7b93915ea3f97e14e68a
--
Vitaly
On Thu, 6 Nov 2025 at 13:02, Sean Christopherson <seanjc@google.com> wrote:
>
> Use the normal, checked versions for get_user() and put_user() instead of
> the double-underscore versions that omit range checks, as the checked
> versions are actually measurably faster on modern CPUs (12%+ on Intel,
> 25%+ on AMD).
Thanks. I'm assuming I'll see this from the regular kvm pull at some point.
We have a number of other cases of this in x86 signal handling, and
those probably should also be just replaced with plain get_user()
calls.
The x86 FPU context handling in particular is disgusting, and doesn't
have access_ok() close to the actual accesses. The access_ok() is in
copy_fpstate_to_sigframe(), while the __get_user() calls are in a
different file entirely.
That's almost certainly also a pessimization, in *addition* to being
an unreadable mess with security implications if anybody ever gets
that code wrong. So I really think that should be fixed.
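To spell out the pattern being complained about, here is a made-up sketch
(not the actual FPU code; the struct and function names are invented):

  #include <linux/errno.h>
  #include <linux/uaccess.h>

  struct user_frame {			/* hypothetical layout */
  	unsigned long magic;
  	unsigned long state;
  };

  /*
   * The unchecked access: correctness silently depends on every caller
   * having done access_ok() first, possibly in a different file.
   */
  static int restore_frame(struct user_frame __user *uf, unsigned long *state)
  {
  	return __get_user(*state, &uf->state);
  }

  /* The range check lives here, far away from the access it protects. */
  static int handle_frame(struct user_frame __user *uf)
  {
  	unsigned long state;

  	if (!access_ok(uf, sizeof(*uf)))
  		return -EFAULT;

  	return restore_frame(uf, &state);
  }

With plain get_user() the check travels with the access, and per the numbers
in the patch it isn't even slower.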
The perf events core similarly has some odd checking. For a moment I
thought it used __get_user() as a way to do both user and kernel
frames, but no, it actually has an alias for access_ok(), except it
calls it "valid_user_frame()" and for some reason uses "__access_ok()"
which lacks the compiler "likely()" marking.
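For reference, paraphrased from the generic definition in
include/asm-generic/access_ok.h (the exact form may differ per architecture
and kernel version), access_ok() is essentially just __access_ok() plus the
branch hint, so an open-coded wrapper like valid_user_frame() only loses the
annotation:

  #define access_ok(addr, size)	likely(__access_ok(addr, size))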
Anyway, every single __get_user() call I looked at looked like
historical garbage.
Another example of complete horror: the PCI code uses
__get_user/__put_user in the /proc handling code.
Which didn't even make sense historically, when the actual data read
or written is then used with the pci_user_read/write_config_xyz()
functions.
I suspect it may go back to some *really* old code when the PCI writes
were also done as just raw inline asm, and while that has not been the
case for decades, the user accesses remained because they still
worked. That code predates not just git, but the BK tree too.
End result: I get the feeling that we should just do a global
search-and-replace of the __get_user/__put_user users, replace them
with plain get_user/put_user instead, and then fix up any fallout (eg
the coco code).
Because unlike the "start checking __{get,put}_user() addresses", such
a global search-and-replace could then be reverted one case at a time
as people notice "that was one of those horror-cases that actually
*wanted* to work with kernel addresses too".
Clearly it's much too late to do that for 6.18, but if somebody
reminds me during the 6.19 merge window, I think I'll do exactly that.
Or even better - some brave heroic soul that wants to deal with the
fallout do this in a branch for linux-next?
Linus