[PATCH RESEND] x86/entry: Don't write to CR3 when restoring to kernel CR3

Brendan Jackman posted 1 patch 2 years, 4 months ago
arch/x86/entry/calling.h | 13 ++++---------
1 file changed, 4 insertions(+), 9 deletions(-)
[PATCH RESEND] x86/entry: Don't write to CR3 when restoring to kernel CR3
Posted by Brendan Jackman 2 years, 4 months ago
From: Lai Jiangshan <laijs@linux.alibaba.com>

Skip resuming KERNEL pages since it is already KERNEL CR3

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Brendan Jackman <jackmanb@google.com>
---

While staring at paranoid_exit I was confused about why we had this CR3
write, avoiding it seems like a free optimisation. The original commit
21e94459110252 ("x86/mm: Optimize RESTORE_CR3") says "Most NMI/paranoid
exceptions will not in fact change pagetables" but I didn't't understand
what the "most" was referring to. I then discovered this patch on the
mailing list, Andy said[1] that it looks correct so maybe now is the
time to merge it?

Note there's another patch in [1] as well, the benefit of that one is
not obvious to me though.

We've tested an equivalent patch in our internal kernel.

[1] https://lore.kernel.org/lkml/20200526043507.51977-3-laijs@linux.alibaba.com/
-- >8 --
 arch/x86/entry/calling.h | 13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index f6907627172b..b2458685d56e 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -236,14 +236,13 @@ For 32-bit we have the following conventions - kernel is built with
 .macro RESTORE_CR3 scratch_reg:req save_reg:req
 	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
 
-	ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
-
 	/*
-	 * KERNEL pages can always resume with NOFLUSH as we do
-	 * explicit flushes.
+	 * Skip resuming KERNEL pages since it is already KERNEL CR3.
 	 */
 	bt	$PTI_USER_PGTABLE_BIT, \save_reg
-	jnc	.Lnoflush_\@
+	jnc	.Lend_\@
+
+	ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
 
 	/*
 	 * Check if there's a pending flush for the user ASID we're
@@ -261,10 +260,6 @@ For 32-bit we have the following conventions - kernel is built with
 	SET_NOFLUSH_BIT \save_reg
 
 .Lwrcr3_\@:
-	/*
-	 * The CR3 write could be avoided when not changing its value,
-	 * but would require a CR3 read *and* a scratch register.
-	 */
 	movq	\save_reg, %cr3
 .Lend_\@:
 .endm
-- 
2.41.0.694.ge786442a9b-goog
Re: [PATCH RESEND] x86/entry: Don't write to CR3 when restoring to kernel CR3
Posted by Andy Lutomirski 2 years, 2 months ago
On Thu, Aug 17, 2023, at 5:15 AM, Brendan Jackman wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
>
> Skip resuming KERNEL pages since it is already KERNEL CR3
>
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> Signed-off-by: Brendan Jackman <jackmanb@google.com>
> ---
>
> While staring at paranoid_exit I was confused about why we had this CR3
> write, avoiding it seems like a free optimisation. The original commit
> 21e94459110252 ("x86/mm: Optimize RESTORE_CR3") says "Most NMI/paranoid
> exceptions will not in fact change pagetables" but I didn't't understand
> what the "most" was referring to. I then discovered this patch on the
> mailing list, Andy said[1] that it looks correct so maybe now is the
> time to merge it?

I did?

Looking at the link, I think I was saying that the opposite patch (*always* flush) looked okay.

>
> Note there's another patch in [1] as well, the benefit of that one is
> not obvious to me though.
>
> We've tested an equivalent patch in our internal kernel.
>
> [1] https://lore.kernel.org/lkml/20200526043507.51977-3-laijs@linux.alibaba.com/
> -- >8 --
>  arch/x86/entry/calling.h | 13 ++++---------
>  1 file changed, 4 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index f6907627172b..b2458685d56e 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -236,14 +236,13 @@ For 32-bit we have the following conventions - 
> kernel is built with
>  .macro RESTORE_CR3 scratch_reg:req save_reg:req
>  	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
> 
> -	ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
> -
>  	/*
> -	 * KERNEL pages can always resume with NOFLUSH as we do
> -	 * explicit flushes.
> +	 * Skip resuming KERNEL pages since it is already KERNEL CR3.
>  	 */
>  	bt	$PTI_USER_PGTABLE_BIT, \save_reg
> -	jnc	.Lnoflush_\@
> +	jnc	.Lend_\@

I don't get it.  How do you know that the actual loaded CR3 is correct?

I'm willing to believe that there is some constraint in the way the kernel works such that every paranoid entry will, as part of its execution, switch CR3 to kernel *and leave it like that* *and that this will be the _same_ kernel CR3 that was saved*.  But I'm not really convinced that's true.  (Can we schedule in a paranoid entry?  Probably not.  What about the weird NMI paths?  What if we do something that switches to init mm?  Hmm -- doing that in a paranoid context is probably not a brilliant idea.)

Maybe it is true, and maybe a convincing argument could be made.  But that seems like a lot of thinking and fragility for an optimization that seems pretty minor.

--Andy
Re: [PATCH RESEND] x86/entry: Don't write to CR3 when restoring to kernel CR3
Posted by Thomas Gleixner 2 years, 2 months ago
On Mon, Sep 18 2023 at 20:28, Andy Lutomirski wrote:
> On Thu, Aug 17, 2023, at 5:15 AM, Brendan Jackman wrote:
>> From: Lai Jiangshan <laijs@linux.alibaba.com>
>>
>> Skip resuming KERNEL pages since it is already KERNEL CR3
>>
>> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
>> Signed-off-by: Brendan Jackman <jackmanb@google.com>
>> ---
>>
>> While staring at paranoid_exit I was confused about why we had this CR3
>> write, avoiding it seems like a free optimisation. The original commit
>> 21e94459110252 ("x86/mm: Optimize RESTORE_CR3") says "Most NMI/paranoid
>> exceptions will not in fact change pagetables" but I didn't't understand
>> what the "most" was referring to. I then discovered this patch on the
>> mailing list, Andy said[1] that it looks correct so maybe now is the
>> time to merge it?
>
> I did?
>
> Looking at the link, I think I was saying that the opposite patch
> (*always* flush) looked okay.

That always flush part was solely for the user CR3 restore path.

>> @@ -236,14 +236,13 @@ For 32-bit we have the following conventions - 
>> kernel is built with
>>  .macro RESTORE_CR3 scratch_reg:req save_reg:req
>>  	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
>> 
>> -	ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
>> -
>>  	/*
>> -	 * KERNEL pages can always resume with NOFLUSH as we do
>> -	 * explicit flushes.
>> +	 * Skip resuming KERNEL pages since it is already KERNEL CR3.
>>  	 */
>>  	bt	$PTI_USER_PGTABLE_BIT, \save_reg
>> -	jnc	.Lnoflush_\@
>> +	jnc	.Lend_\@
>
> I don't get it.  How do you know that the actual loaded CR3 is correct?
>
> I'm willing to believe that there is some constraint in the way the
> kernel works such that every paranoid entry will, as part of its
> execution, switch CR3 to kernel *and leave it like that* *and that
> this will be the _same_ kernel CR3 that was saved*.  But I'm not
> really convinced that's true.  (Can we schedule in a paranoid entry?
> Probably not.  What about the weird NMI paths?  What if we do
> something that switches to init mm?  Hmm -- doing that in a paranoid
> context is probably not a brilliant idea.)

You have to differentiate between entry from kernel and entry from user.

Entry from user switches to the task stack, while entry from kernel
always runs on IST.

Entry from user cannot have kernel CR3 obviously, while entry from
kernel can have either kernel CR3 or user CR3. Entry from user does not
use the paranoid entry/exit paths at all, so that's a non-issue.

IST prevents that the exception can schedule, which in turn guarantees
that CR3 stays the same until it returns. Unless some completely stupid
code path would trie to switch to a different mm from an exception which
can hit into the middle of an mm switch. Then the failed restore is
probably the least of our problems.

> Maybe it is true, and maybe a convincing argument could be made.  But
> that seems like a lot of thinking and fragility for an optimization
> that seems pretty minor.

I don't think its pretty minor. CR3 writes even with the noflush bit set
are not necessarily cheap.

While most exceptions which go through the paranoid path are not
hotpath, the one which matters is #NMI due to perf. So I think it's
worth to spare the redundant CR3 switch in that case.

Thanks,

        tglx
Re: [PATCH RESEND] x86/entry: Don't write to CR3 when restoring to kernel CR3
Posted by Thomas Gleixner 2 years, 2 months ago
On Thu, Aug 17 2023 at 12:15, Brendan Jackman wrote:

> From: Lai Jiangshan <laijs@linux.alibaba.com>
>
> Skip resuming KERNEL pages since it is already KERNEL CR3

This really want's some more explanation than this.

Also the subject line does not make much sense to me. Something like:

  x86/entry: Avoid redundant CR3 write on paranoid returns

> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> Signed-off-by: Brendan Jackman <jackmanb@google.com>
> ---
>
> While staring at paranoid_exit I was confused about why we had this CR3
> write, avoiding it seems like a free optimisation. The original commit
> 21e94459110252 ("x86/mm: Optimize RESTORE_CR3") says "Most NMI/paranoid
> exceptions will not in fact change pagetables" but I didn't't understand
> what the "most" was referring to. I then discovered this patch on the
> mailing list, Andy said[1] that it looks correct so maybe now is the
> time to merge it?
>
> Note there's another patch in [1] as well, the benefit of that one is
> not obvious to me though.

The point is that RESTORE_CR3 is only used for exceptions which enter
via paranoid_entry(): #DF, #MCE, #VC, #DB and the NMI low level return
code.

#NMI, #MCE, #VC, #DB use this code path only when the exception
interrupted kernel mode. #DF uses it unconditionally.

Entries from user space for #NMI, #MCE, #VC, #DB switch away from the
IST stack to the thread stack and handle the exception there. So they
won't return via a code path which uses RESTORE_CR3.

So your patch optimizes the CR3 handling vs. kernel entries. If the
exception hits code which has kernel CR3 loaded then restoring the
unchanged kernel CR3 is a pointless exercise. CR3 is guaranteed to be
unchanged as none of those exceptions can schedule when the exception
hit kernel mode, i.e. CPL=0.

Now the second change Lai did is to get rid of the conditional flush.

The reasoning is that the code paths where kernel mode (CPL=0) runs with
user CR3 are pretty small. They all look like this:

     SWITCH_TO_USER_CR3
     // Do the final preparation for return
     IRET 	// SYSRET, SYSEXIT

and therefore the probability that an affected exception hits between
the CR3 switch and the instruction which returns to user space is pretty
small.

The only case where an unconditional flush might matter is #NMI. #DF,
#MCE, #VC, #DB are slow path anyway.

As the probability is low Lai removed the conditional flush optimization
and turned it into an unconditional flush to make the code simpler.

It's debatable whether this matters or not, but as that optimization and
sits in the rarely executed code path it is not hurting either and
avoids the occasional pointless flush when perf(1) is used.

> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -236,14 +236,13 @@ For 32-bit we have the following conventions - kernel is built with
>  .macro RESTORE_CR3 scratch_reg:req save_reg:req
>  	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
>  
> -	ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
> -
>  	/*
> -	 * KERNEL pages can always resume with NOFLUSH as we do
> -	 * explicit flushes.
> +	 * Skip resuming KERNEL pages since it is already KERNEL CR3.
>  	 */

         /*
          * If CR3 contained the kernel page tables at the paranoid
          * exception entry, then there is nothing to restore as CR3
          * is not modified while handling the exception.
          */

Perhaps?

There are now stale comments vs. the restore mechanism and
paranoid_entry() in some places in entry_64.S. They really want to be
fixed up to avoid lots of head scratching later. E.g. this:

	/* Always restore stashed CR3 value (see paranoid_entry) */
	RESTORE_CR3 scratch_reg=%r15 save_reg=%r14

does not make sense now. Neither does the comment in paranoid_entry()
itself.

Also the macro name is misleading with this change. It does not
unconditionally restore CR3 as the name suggests.

PARANOID_RESTORE_USER_CR3 or something like that makes it clear what
this is about.

There is another detail. The slightly convoluted code flow in the user
restore path noflush handling.

	movq	\save_reg, \scratch_reg
	andq	$(0x7FF), \scratch_reg
	bt	\scratch_reg, THIS_CPU_user_pcid_flush_mask
	jnc	.Lnoflush_\@

	btr	\scratch_reg, THIS_CPU_user_pcid_flush_mask
	jmp	.Lwrcr3_\@

.Lnoflush_\@:
	SET_NOFLUSH_BIT \save_reg

The only reason why this code is so convoluted was to reuse the setting
of the NOFLUSH bit for the kernel CR3 write:

	bt	$PTI_USER_PGTABLE_BIT, \save_reg
	jnc	.Lnoflush_\@

which this patch eliminates.

So this logic can be simplified to

	movq	\save_reg, \scratch_reg
	andq	$(0x7FF), \scratch_reg
	btr	\scratch_reg, THIS_CPU_user_pcid_flush_mask
	jc	.Lwrcr3_\@

	SET_NOFLUSH_BIT \save_reg

Thanks,

        tglx
Re: [PATCH RESEND] x86/entry: Don't write to CR3 when restoring to kernel CR3
Posted by Ingo Molnar 2 years, 2 months ago
* Brendan Jackman <jackmanb@google.com> wrote:

> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> Skip resuming KERNEL pages since it is already KERNEL CR3
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> Signed-off-by: Brendan Jackman <jackmanb@google.com>
> ---
> 
> While staring at paranoid_exit I was confused about why we had this CR3
> write, avoiding it seems like a free optimisation. The original commit
> 21e94459110252 ("x86/mm: Optimize RESTORE_CR3") says "Most NMI/paranoid
> exceptions will not in fact change pagetables" but I didn't't understand
> what the "most" was referring to. I then discovered this patch on the
> mailing list, Andy said[1] that it looks correct so maybe now is the
> time to merge it?
> 
> Note there's another patch in [1] as well, the benefit of that one is
> not obvious to me though.
> 
> We've tested an equivalent patch in our internal kernel.
> 
> [1] https://lore.kernel.org/lkml/20200526043507.51977-3-laijs@linux.alibaba.com/
> -- >8 --
>  arch/x86/entry/calling.h | 13 ++++---------
>  1 file changed, 4 insertions(+), 9 deletions(-)

I concur that this is a good change, but it would be really nice to get an 
ack from Andy or Thomas as well.

Thanks,

	Ingo
Re: [PATCH RESEND] x86/entry: Don't write to CR3 when restoring to kernel CR3
Posted by Brendan Jackman 2 years, 3 months ago
Oops, just noticed I didn't put +Peter in the recipients.

On Thu, 17 Aug 2023 at 05:15, Brendan Jackman <jackmanb@google.com> wrote:
>
> From: Lai Jiangshan <laijs@linux.alibaba.com>
>
> Skip resuming KERNEL pages since it is already KERNEL CR3
>
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> Signed-off-by: Brendan Jackman <jackmanb@google.com>
> ---
>
> While staring at paranoid_exit I was confused about why we had this CR3
> write, avoiding it seems like a free optimisation. The original commit
> 21e94459110252 ("x86/mm: Optimize RESTORE_CR3") says "Most NMI/paranoid
> exceptions will not in fact change pagetables" but I didn't't understand
> what the "most" was referring to. I then discovered this patch on the
> mailing list, Andy said[1] that it looks correct so maybe now is the
> time to merge it?
>
> Note there's another patch in [1] as well, the benefit of that one is
> not obvious to me though.

Also expanding on this a bit: the main purpose of this code is to
switch back to the user address space after handling one of these
"paranoid" exceptions. When one of those exceptions interrupts kernel
mode, we didn't switch from user to kernel address space so the
restore isn't needed.

There's no other reason to change CR3 here; context switches don't
happen in these exceptions but even if they did we would restore CR3
from the returning context_switch path. In fact in that case doing it
in paranoid_exit would potentially use the wrong PCID if it had been
reallocated by choose_new_asid. (And on the other side of the coin if
the restore was needed, it would be needed under !KPTI too).

> We've tested an equivalent patch in our internal kernel.
>
> [1] https://lore.kernel.org/lkml/20200526043507.51977-3-laijs@linux.alibaba.com/
> -- >8 --
>  arch/x86/entry/calling.h | 13 ++++---------
>  1 file changed, 4 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index f6907627172b..b2458685d56e 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -236,14 +236,13 @@ For 32-bit we have the following conventions - kernel is built with
>  .macro RESTORE_CR3 scratch_reg:req save_reg:req
>         ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
>
> -       ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
> -
>         /*
> -        * KERNEL pages can always resume with NOFLUSH as we do
> -        * explicit flushes.
> +        * Skip resuming KERNEL pages since it is already KERNEL CR3.
>          */
>         bt      $PTI_USER_PGTABLE_BIT, \save_reg
> -       jnc     .Lnoflush_\@
> +       jnc     .Lend_\@
> +
> +       ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
>
>         /*
>          * Check if there's a pending flush for the user ASID we're
> @@ -261,10 +260,6 @@ For 32-bit we have the following conventions - kernel is built with
>         SET_NOFLUSH_BIT \save_reg
>
>  .Lwrcr3_\@:
> -       /*
> -        * The CR3 write could be avoided when not changing its value,
> -        * but would require a CR3 read *and* a scratch register.
> -        */
>         movq    \save_reg, %cr3
>  .Lend_\@:
>  .endm
> --
> 2.41.0.694.ge786442a9b-goog
>