[PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
Posted by Xin Li (Intel) 1 year, 4 months ago
From: Andrew Cooper <andrew.cooper3@citrix.com>

Per the discussion about FRED MSR writes with the WRMSRNS instruction [1],
use the alternatives mechanism to choose WRMSRNS when it's available,
otherwise fall back to WRMSR.

[1] https://lore.kernel.org/lkml/15f56e6a-6edd-43d0-8e83-bb6430096514@citrix.com/

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
---
 arch/x86/include/asm/msr.h | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index d642037f9ed5..3e402d717815 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -99,19 +99,6 @@ static __always_inline void __wrmsr(unsigned int msr, u32 low, u32 high)
 		     : : "c" (msr), "a"(low), "d" (high) : "memory");
 }
 
-/*
- * WRMSRNS behaves exactly like WRMSR with the only difference being
- * that it is not a serializing instruction by default.
- */
-static __always_inline void __wrmsrns(u32 msr, u32 low, u32 high)
-{
-	/* Instruction opcode for WRMSRNS; supported in binutils >= 2.40. */
-	asm volatile("1: .byte 0x0f,0x01,0xc6\n"
-		     "2:\n"
-		     _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
-		     : : "c" (msr), "a"(low), "d" (high));
-}
-
 #define native_rdmsr(msr, val1, val2)			\
 do {							\
 	u64 __val = __rdmsr((msr));			\
@@ -312,9 +299,22 @@ do {							\
 
 #endif	/* !CONFIG_PARAVIRT_XXL */
 
+/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */
+#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6)
+
+/* Non-serializing WRMSR, when available.  Falls back to a serializing WRMSR. */
 static __always_inline void wrmsrns(u32 msr, u64 val)
 {
-	__wrmsrns(msr, val, val >> 32);
+	/*
+	 * WRMSR is 2 bytes.  WRMSRNS is 3 bytes.  Pad WRMSR with a redundant
+	 * DS prefix to avoid a trailing NOP.
+	 */
+	asm volatile("1: "
+		     ALTERNATIVE("ds wrmsr",
+				 WRMSRNS, X86_FEATURE_WRMSRNS)
+		     "2:\n"
+		     _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
+		     : : "c" (msr), "a" ((u32)val), "d" ((u32)(val >> 32)));
 }
 
 /*
-- 
2.45.2
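For reference, both alternatives here are exactly three bytes, which is why
no NOP padding is needed (the DS prefix byte is 0x3e, WRMSR is 0f 30, and
WRMSRNS is 0f 01 c6 as defined above):

	3e 0f 30		ds wrmsr
	0f 01 c6		wrmsrns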
Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
Posted by Andrew Cooper 1 year, 4 months ago
On 07/08/2024 6:47 am, Xin Li (Intel) wrote:
> From: Andrew Cooper <andrew.cooper3@citrix.com>
>
> [...]
>
> +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */
> +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6)
> +
> +/* Non-serializing WRMSR, when available.  Falls back to a serializing WRMSR. */
>  static __always_inline void wrmsrns(u32 msr, u64 val)
>  {
> -	__wrmsrns(msr, val, val >> 32);
> +	/*
> +	 * WRMSR is 2 bytes.  WRMSRNS is 3 bytes.  Pad WRMSR with a redundant
> +	 * DS prefix to avoid a trailing NOP.
> +	 */
> +	asm volatile("1: "
> +		     ALTERNATIVE("ds wrmsr",

This isn't the version I presented, and there's no discussion of the
alteration.

The choice of CS over DS was deliberate, and came from Intel:

https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf

So unless Intel want to retract that whitepaper, and all the binutils
work with it, I'd suggest keeping it as CS like we use elsewhere, and as
explicitly instructed by Intel.

~Andrew
Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
Posted by Xin Li 1 year, 4 months ago
On 8/9/2024 4:07 PM, Andrew Cooper wrote:
> On 07/08/2024 6:47 am, Xin Li (Intel) wrote:
>> From: Andrew Cooper <andrew.cooper3@citrix.com>
>> +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */
>> +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6)
>> +
>> +/* Non-serializing WRMSR, when available.  Falls back to a serializing WRMSR. */
>>   static __always_inline void wrmsrns(u32 msr, u64 val)
>>   {
>> -	__wrmsrns(msr, val, val >> 32);
>> +	/*
>> +	 * WRMSR is 2 bytes.  WRMSRNS is 3 bytes.  Pad WRMSR with a redundant
>> +	 * DS prefix to avoid a trailing NOP.
>> +	 */
>> +	asm volatile("1: "
>> +		     ALTERNATIVE("ds wrmsr",
> 
> This isn't the version I presented, and there's no discussion of the
> alteration.

I'm trying to implement wrmsr() as

static __always_inline void wrmsr(u32 msr, u64 val)
{
	asm volatile("1: " ALTERNATIVE_2("wrmsr", WRMSRNS, X86_FEATURE_WRMSRNS,
					 "call asm_xen_write_msr", X86_FEATURE_XENPV)
		     "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
		     : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)),
		     "D" (msr), "S" (val));
}


As the CALL instruction is 5 bytes long, and we need NOP padding for both
WRMSR and WRMSRNS, what about not using a segment prefix at all?


> The choice of CS over DS was deliberate, and came from Intel:
> 
> https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf
> 
> So unless Intel want to retract that whitepaper, and all the binutils
> work with it, I'd suggest keeping it as CS like we use elsewhere, and as
> explicitly instructed by Intel.
Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
Posted by Borislav Petkov 1 year, 4 months ago
On August 16, 2024 8:52:51 PM GMT+03:00, Xin Li <xin@zytor.com> wrote:
>[...]
>
>I'm trying to implement wrmsr() as
>
>static __always_inline void wrmsr(u32 msr, u64 val)
>{
>	asm volatile("1: " ALTERNATIVE_2("wrmsr", WRMSRNS, X86_FEATURE_WRMSRNS,
>					 "call asm_xen_write_msr", X86_FEATURE_XENPV)
>		     "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
>		     : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)),
>		     "D" (msr), "S" (val));
>}
>
>
>As the CALL instruction is 5 bytes long, and we need NOP padding for both
>WRMSR and WRMSRNS, what about not using a segment prefix at all?

The alternatives macros can handle arbitrary insn sizes - no need for any padding.
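For background, the alternatives machinery pads the original instruction
site with NOPs at build time up to the length of the longest replacement,
then patches the chosen variant over the whole site at boot.  An
illustrative layout for a 5-byte, call-sized site (not actual kernel
output):

	built:		0f 30 90 90 90		wrmsr + 3 bytes of NOP padding
	patched:	e8 xx xx xx xx		call asm_xen_write_msr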


-- 
Sent from a small device: formatting sucks and brevity is inevitable.
Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
Posted by H. Peter Anvin 1 year, 4 months ago
On August 17, 2024 7:23:07 AM PDT, Borislav Petkov <bp@alien8.de> wrote:
>On August 16, 2024 8:52:51 PM GMT+03:00, Xin Li <xin@zytor.com> wrote:
>>[...]
>>
>>I'm trying to implement wrmsr() as
>>
>>static __always_inline void wrmsr(u32 msr, u64 val)
>>{
>>	asm volatile("1: " ALTERNATIVE_2("wrmsr", WRMSRNS, X86_FEATURE_WRMSRNS,
>>					 "call asm_xen_write_msr", X86_FEATURE_XENPV)
>>		     "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
>>		     : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)),
>>		     "D" (msr), "S" (val));
>>}
>>
>>
>>As the CALL instruction is 5 bytes long, and we need NOP padding for both
>>WRMSR and WRMSRNS, what about not using a segment prefix at all?
>
>The alternatives macros can handle arbitrary insn sizes - no need for any padding.
>
>

The padding avoids unnecessary NOPs.
Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
Posted by Borislav Petkov 1 year, 4 months ago
On Sat, Aug 17, 2024 at 12:22:22PM -0700, H. Peter Anvin wrote:
> The padding avoids unnecessary NOPs.

And a NOP vs some other prefix to have one less insn matters practically
how exactly here?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
Posted by H. Peter Anvin 1 year, 4 months ago
On 8/17/24 22:49, Borislav Petkov wrote:
> On Sat, Aug 17, 2024 at 12:22:22PM -0700, H. Peter Anvin wrote:
>> The padding avoids unnecessary NOPs.
> 
> And a NOP vs some other prefix to have one less insn matters practically
> how exactly here?

Pretty much nothing, but it is zero cost and possibly an insanely 
small gain.

	-hpa
Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
Posted by Borislav Petkov 1 year, 4 months ago
On Sat, Aug 17, 2024 at 05:23:07PM +0300, Borislav Petkov wrote:
> >static __always_inline void wrmsr(u32 msr, u64 val)
> >{
> >	asm volatile("1: " ALTERNATIVE_2("wrmsr", WRMSRNS, X86_FEATURE_WRMSRNS,
> >					 "call asm_xen_write_msr", X86_FEATURE_XENPV)
> >		     "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
> >		     : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)),
> >		     "D" (msr), "S" (val));
> >}
> >
> >
> >As the CALL instruction is 5 bytes long, and we need NOP padding for both
> >WRMSR and WRMSRNS, what about not using a segment prefix at all?
> 
> The alternatives macros can handle arbitrary insn sizes - no need for any padding.

What you cannot do is have a CALL insn in only one of the alternative
insn groups because the compiler is free to clobber callee regs and
those might be live where you place the alternative and thus will have
a nice lovely corruption.
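A minimal sketch of the hazard, with a hypothetical feature bit and helper
rather than real kernel symbols: the compiler may keep live values in
caller-saved registers across the asm, since nothing in the operand list
says otherwise, and a patched-in CALL reaching C code may clobber them per
the C ABI:

static __always_inline void example_write(u32 msr, u64 val)
{
	/*
	 * BROKEN: if X86_FEATURE_HYPOTHETICAL is set, the patched-in CALL
	 * lets example_helper() clobber caller-saved registers (%rdi,
	 * %rsi, %r8-%r11, ...), which the compiler assumes are preserved
	 * across this asm statement.
	 */
	asm volatile(ALTERNATIVE("wrmsr",
				 "call example_helper",
				 X86_FEATURE_HYPOTHETICAL)
		     : : "c" (msr), "a" ((u32)val), "d" ((u32)(val >> 32)));
}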

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
Posted by H. Peter Anvin 1 year, 4 months ago
On August 17, 2024 7:43:16 AM PDT, Borislav Petkov <bp@alien8.de> wrote:
>On Sat, Aug 17, 2024 at 05:23:07PM +0300, Borislav Petkov wrote:
>> >[...]
>> 
>> The alternatives macros can handle arbitrary insn sizes - no need for any padding.
>
>What you cannot do is have a CALL insn in only one of the alternative
>insn groups because the compiler is free to clobber callee regs and
>those might be live where you place the alternative and thus will have
>a nice lovely corruption.
>

Only if the call goes to a C function.
Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
Posted by Borislav Petkov 1 year, 4 months ago
On Sat, Aug 17, 2024 at 12:22:56PM -0700, H. Peter Anvin wrote:
> Only if the call goes to a C function.

Aha, you have a patch in the thread which calls a by-hand asm thing.

I haven't looked in detail but we have a THUNK macro for such stuff.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
Posted by Xin Li 1 year, 4 months ago
On 8/17/2024 7:43 AM, Borislav Petkov wrote:
> On Sat, Aug 17, 2024 at 05:23:07PM +0300, Borislav Petkov wrote:
>>> [...]
>>
>> The alternatives macros can handle arbitrary insn sizes - no need for any padding.
> 
> What you cannot do is have a CALL insn in only one of the alternative
> insn groups because the compiler is free to clobber callee regs and
> those might be live where you place the alternative and thus will have
> a nice lovely corruption.

Yeah, asm_xen_write_msr() is a wrapper function around xen_write_msr()
which saves/restores those regs.

My first draft patch called xen_write_msr() directly and worked fine.  But
hpa said the same thing as you did.

Thanks!
     Xin
Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
Posted by Andrew Cooper 1 year, 4 months ago
On 16/08/2024 6:52 pm, Xin Li wrote:
> On 8/9/2024 4:07 PM, Andrew Cooper wrote:
>> On 07/08/2024 6:47 am, Xin Li (Intel) wrote:
>>> [...]
>>> +    /*
>>> +     * WRMSR is 2 bytes.  WRMSRNS is 3 bytes.  Pad WRMSR with a
>>> redundant
>>> +     * DS prefix to avoid a trailing NOP.
>>> +     */
>>> +    asm volatile("1: "
>>> +             ALTERNATIVE("ds wrmsr",
>>
>> This isn't the version I presented, and there's no discussion of the
>> alteration.
>
> I'm trying to implement wrmsr() as
>
> static __always_inline void wrmsr(u32 msr, u64 val)
> {
>     asm volatile("1: " ALTERNATIVE_2("wrmsr", WRMSRNS,
> X86_FEATURE_WRMSRNS,
>                      "call asm_xen_write_msr", X86_FEATURE_XENPV)
>              "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
>              : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)),
>              "D" (msr), "S" (val));
> }
>
>
> As the CALL instruction is 5 bytes long, and we need NOP padding for both
> WRMSR and WRMSRNS, what about not using a segment prefix at all?

The prefix was a minor optimisation to avoid having a trailing nop at
the end.

When combined with a call, you need 3 prefixes on WRMSR and 2 prefixes
on WRMSRNS to make all options be 5 bytes long.
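Illustrative encodings, using the CS prefix byte (0x2e) per the earlier
discussion:

	2e 2e 2e 0f 30		cs cs cs wrmsr		(5 bytes)
	2e 2e 0f 01 c6		cs cs wrmsrns		(5 bytes)
	e8 xx xx xx xx		call asm_xen_write_msr	(5 bytes)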

That said, there's already a paravirt hook for this, and if you're
looking to work around the code gen mess for that, then doing it like
this by doubling up into rdi and rsi isn't great either.

My suggestion, not that I've had time to experiment, was to change
paravirt to use a non-C ABI and have asm_xen_write_msr() recombine
edx:eax into rsi.  That way the top level wrmsr() retains sensible
codegen for native even when paravirt is active.

~Andrew
Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
Posted by H. Peter Anvin 1 year, 4 months ago
On 8/16/24 11:40, Andrew Cooper wrote:
>>
>> As the CALL instruction is 5 bytes long, and we need NOP padding for both
>> WRMSR and WRMSRNS, what about not using a segment prefix at all?
> 

You can use up to 4 prefixes of any kind (which includes opcode prefixes 
before 0F) before most decoders start hurting, so we can pad it out to 5 
bytes by doing 3e 3e .. .. ..

> 
> My suggestion, not that I've had time to experiment, was to change
> paravirt to use a non-C ABI and have asm_xen_write_msr() recombine
> edx:eax into rsi.  That way the top level wrmsr() retains sensible
> codegen for native even when paravirt is active.
> 

I have attached what should be an "obvious" example... famous last words.

	-hpa
/*
 * Input in %rax, MSR number in %ecx
 * %rdx is clobbered, as the native stub is assumed to do
 *
 * Change xen_do_write_msr to return its error code instead
 * of passing a pointer	- this gives the extra benefit that
 * we can get the *actual invocation address* in the error
 * messages.
 *
 * ex_handler_msr() should be changed to get the MSR data
 * from %rax regardless of Xen vs native; alternatively the
 * Xen handler can set up %edx as well.
 *
 * Let the native pattern look like:
 *
 * 48 89 c2                mov    %rax,%rdx
 * 48 c1 ea 20             shr    $32,%rdx
 * 3e 0f 30                ds wrmsr		<--- trap point
 *
 * ... which can be replaced with ...
 * 48 89 c2                mov    %rax,%rdx
 * 48 c1 ea 20             shr    $32,%rdx
 * 0f 01 c6                wrmsrns		<--- trap point
 *
 * ... or ...
 * e8 xx xx xx xx          call asm_xen_write_msr
 * 74 02                   jz 1f
 * 3e 0f 0b                ds ud2                  <--- trap point
 *                      1:
 * ds ud2 can of course be replaced with any other 3-byte trapping
 * instruction.
 *
 * This also removes the need for Xen to maintain different safe and
 * unsafe MSR routines, as the difference is handled by the same
 * trap handler as is used natively.
 */
 SYM_FUNC_START(asm_xen_write_msr)
	FRAME_BEGIN
	push %rax		/* Save in case of error */
	push %rcx
	push %rsi
	push %rdi
	push %r8
	push %r9
	push %r10
	push %r11
	mov %rax,%rdi
	mov %ecx,%esi
	call xen_do_write_msr
	test %eax,%eax		/* ZF=0 on error */
	pop %r11
	pop %r10
	pop %r9
	pop %r8
	pop %rdi
	pop %rsi
	pop %rcx
#ifdef OPTIONAL
	mov 4(%rsp),%edx
#endif
	pop %rax
	FRAME_END
	RET
SYM_FUNC_END(asm_xen_write_msr)
Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
Posted by H. Peter Anvin 1 year, 4 months ago
On 8/16/24 14:26, H. Peter Anvin wrote:
> On 8/16/24 11:40, Andrew Cooper wrote:
>>>
>>> As the CALL instruction is 5 bytes long, and we need NOP padding for both
>>> WRMSR and WRMSRNS, what about not using a segment prefix at all?
>>
> 
> You can use up to 4 prefixes of any kind (which includes opcode prefixes 
> before 0F) before most decoders start hurting, so we can pad it out to 5 
> bytes by doing 3e 3e .. .. ..
> 
>>
>> My suggestion, not that I've had time to experiment, was to change
>> paravirt to use a non-C ABI and have asm_xen_write_msr() recombine
>> edx:eax into rsi.  That way the top level wrmsr() retains sensible
>> codegen for native even when paravirt is active.
>>
> 
> I have attached what should be an "obvious" example... famous last words.
> 

After looking at what xen_do_write_msr looks like, I realized we can do 
*much* better than that. This means using two alternative sequences, one 
for native/Xen and the other for native wrmsr/wrmsrns.

The call/jz sequence is *exactly* the same length as mov, shr, which 
means that xen_do_write_msr can simply return a flag if it wants the MSR 
access to be performed natively.

An important point is that in no code path can the error code be set to 
-EIO except by native MSR invocation, so the only possibilities are 
"done successfully" or "execute natively." [That being said, that would 
not be *that* hard to fix if needed.]

The new C prototype for xen_do_write_msr() becomes:

bool xen_do_write_msr(uint32_t msr, uint64_t val)

... with a true return meaning "execute natively."

(I changed the order of the arguments from my last version since it is 
more consistent with what is used everywhere else.)

RDMSR is a bit trickier. I think the best option there is to set up a 
new trap fixup handler type that amounts to "get the address from the 
stack, apply a specific offset, and invoke the fixup handler for that 
address":


	case EX_TYPE_UPLEVEL: {
		/* Let reg hold the unsigned number of machine
		 * words to pop off the stack before the return
		 * address, and imm the signed offset from the
		 * return address to the desired trap point.
		 */
		unsigned long *sp = (unsigned long *)regs->sp + reg;
		regs->ip = *sp++ + (int16_t)imm;
		regs->sp = (unsigned long)sp;
		goto again;	/* Loop back to the beginning */
	}

Again, "obviously correct" code attached.

	-hpa

NOTE:

I had to dig several levels down to uncover the actual situation, but 
pmu_msr_write() is actually a completely pointless function: all it does 
is shuffle some arguments, then calls pmu_msr_chk_emulated() and if it 
returns true AND the emulated flag is clear then does *exactly the same 
thing* that the calling code would have done if pmu_msr_write() itself 
had returned true... in other words, the whole function could be 
replaced by:

bool pmu_msr_write(uint32_t msr, uint64_t val)
{
	bool emulated;

	return pmu_msr_chk_emulated(msr, &val, false, &emulated) &&
	       emulated;
}

pmu_msr_read() does the equivalent stupidity; the corresponding fix(es) to 
pmu_msr_chk_emulated() I hope are obvious.


/*
 * Input in %rax, MSR number in %ecx
 *
 * %edx is set up to match %rax >> 32 like the native stub
 * is expected to do
 *
 * Change xen_do_write_msr to return a true value if
 * the MSR access should be executed natively (or vice versa,
 * if you prefer.)
 *
 * bool xen_do_write_msr(uint32_t msr, uint64_t value)
 *
 * Let the native pattern look like:
 *
 * 48 89 c2                mov    %rax,%rdx
 * 48 c1 ea 20             shr    $32,%rdx
 * 3e 0f 30                ds wrmsr		<--- trap point
 *
 * ... which can be replaced with ...
 * 48 89 c2                mov    %rax,%rdx
 * 48 c1 ea 20             shr    $32,%rdx
 * 0f 01 c6                wrmsrns		<--- trap point
 *
 * FOR XEN, replace the FIRST SEVEN BYTES with:
 * e8 xx xx xx xx          call asm_xen_write_msr
 * 74 03                   jz .+5
 *
 * If ZF=0 then this will fall down to the actual native WRMSR[NS]
 * instruction.
 *
 * This also removes the need for Xen to maintain different safe and
 * unsafe MSR routines, as the difference is handled by the same
 * trap handler as is used natively.
 */
 SYM_FUNC_START(asm_xen_write_msr)
	FRAME_BEGIN
	push %rax		/* Save in case of native fallback */
	push %rcx
	push %rsi
	push %rdi
	push %r8
	push %r9
	push %r10
	push %r11
	mov %ecx,%edi		/* MSR number */
	mov %rax,%rsi		/* MSR data */
	call xen_do_write_msr
	test %al,%al		/* ZF=1 means we are done */
	pop %r11
	pop %r10
	pop %r9
	pop %r8
	pop %rdi
	pop %rsi
	pop %rcx
	mov 4(%rsp),%edx	/* Set up %edx for native execution */
	pop %rax
	FRAME_END
	RET
SYM_FUNC_END(asm_xen_write_msr)

/*
 * RDMSR code. It isn't quite as clean; it requires a new trap handler
 * type:
 *
 *	case EX_TYPE_UPLEVEL: {
 *		/* Let reg hold the unsigned number of machine
 *		 * words to pop off the stack before the return
 *		 * address, and imm the signed offset from the
 *		 * return address to the desired trap point.
 *		 * /
 *		unsigned long *sp = (unsigned long *)regs->sp + reg;
 *		regs->ip = *sp++ + (int16_t)imm;
 *		regs->sp = (unsigned long)sp;
 *		goto again;	/* Loop back to the beginning * /
 *	}
 *
 * The prototype of the Xen C code:
 * struct { uint64_t val; bool native; } xen_do_read_msr(uint32_t msr)
 *
 * Native code:
 * 0f 32                   rdmsr	<--- trap point
 * 48 c1 e2 20             shl    $0x20,%rdx
 * 48 09 d0                or     %rdx,%rax
 *
 * Xen code (cs rex is just for padding)
 * 2e 40 e8 xx xx xx xx    cs rex call asm_xen_read_msr
*/

SYM_FUNC_START(asm_xen_read_msr)
	FRAME_BEGIN
	push %rcx
	push %rsi
	push %rdi
	push %r8
	push %r9
	push %r10
	push %r11
	mov %ecx,%edi		/* MSR number */
	call xen_do_read_msr
	test %dl,%dl		/* ZF=1 means we are done */
	pop %r11
	pop %r10
	pop %r9
	pop %r8
	pop %rdi
	pop %rsi
	pop %rcx
	jz 2f
1:
	rdmsr
	_ASM_EXTABLE_TYPE(1b, 2f, \
		EX_TYPE_UPLEVEL|EX_DATA_IMM(-7)|EX_DATA_REG(0))
	shl $32,%rdx
	or %rdx,%rax
	/*
	 * The top of the stack points directly at the return address;
	 * back up by 7 bytes from the return address.
	 */
2:
	FRAME_END
	RET
SYM_FUNC_END(asm_xen_read_msr)
Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
Posted by H. Peter Anvin 1 year, 4 months ago
On 8/16/24 15:59, H. Peter Anvin wrote:
> 
> RDMSR is a bit trickier. I think the best option there is to set up a 
> new trap fixup handler type that amounts to "get the address from 
> the stack, apply a specific offset, and invoke the fixup handler for 
> that address":
> 
> 
>      case EX_TYPE_UPLEVEL: {
>          /* Let reg hold the unsigned number of machine
>           * words to pop off the stack before the return
>           * address, and imm the signed offset from the
>           * return address to the desired trap point.
>           */
>          unsigned long *sp = (unsigned long *)regs->sp + reg;
>          regs->ip = *sp++ + (int16_t)imm;
>          regs->sp = (unsigned long)sp;
>          goto again;    /* Loop back to the beginning */
>      }
> 
> Again, "obviously correct" code attached.
> 

Here is an untested patch implemented the above functionality, tidied up 
and wrapped in a macro as _ASM_EXTABLE_FUNC_REWIND().

	-hpa
Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
Posted by Andrew Cooper 1 year, 4 months ago
On 16/08/2024 10:26 pm, H. Peter Anvin wrote:
> On 8/16/24 11:40, Andrew Cooper wrote:
>>>
>>> As the CALL instruction is 5 bytes long, and we need NOP padding for both
>>> WRMSR and WRMSRNS, what about not using a segment prefix at all?
>>
>
> You can use up to 4 prefixes of any kind (which includes opcode
> prefixes before 0F) before most decoders start hurting, so we can pad
>> it out to 5 bytes by doing 3e 3e .. .. ..
>
>>
>> My suggestion, not that I've had time to experiment, was to change
>> paravirt to use a non-C ABI and have asm_xen_write_msr() recombine
>> edx:eax into rsi.  That way the top level wrmsr() retains sensible
>> codegen for native even when paravirt is active.
>>
>
> I have attached what should be an "obvious" example... famous last words.

Ah, now I see what you mean about Xen's #GP semantics.

That's a neat way of doing it.  It means the faulting path will really
take 2 faults on Xen, but it's a faulting path anyway so speed is
already out of the window.

Do you mind teaching the #UD handler to deal with WRMSR like that?

I ask, because I can't think of anything nicer.

There are plenty of 3-byte instructions which #GP in PV guests (CPL3),
and LTR is my go-to for debugging purposes, as it's not emulated by Xen.

Anything here (and it can't be an actual WRMSR) will be slightly
confusing to read in an OOPS, especially #UD for what is logically a #GP.

But, a clear UD of some form in the disassembly is probably better than
a random other instruction unrelated to the operation.

~Andrew
Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
Posted by Andrew Cooper 1 year, 4 months ago
On 16/08/2024 11:27 pm, Andrew Cooper wrote:
> On 16/08/2024 10:26 pm, H. Peter Anvin wrote:
>> On 8/16/24 11:40, Andrew Cooper wrote:
>>>> As the CALL instruction is 5-byte long, and we need to pad nop for both
>>>> WRMSR and WRMSRNS, what about not using segment prefix at all?
>> You can use up to 4 prefixes of any kind (which includes opcode
>> prefixes before 0F) before most decoders start hurting, so we can pad
>> it out to 5 bytes by doing 3f 3f .. .. ..
>>
>>> My suggestion, not that I've had time to experiment, was to change
>>> paravirt to use a non-C ABI and have asm_xen_write_msr() recombine
>>> edx:eax into rsi.  That way the top level wrmsr() retains sensible
>>> codegen for native even when paravirt is active.
>>>
>> I have attached what should be an "obvious" example... famous last words.
> Ah, now I see what you mean about Xen's #GP semantics.
>
> That's a neat way of doing it.  It means the faulting path will really
> take 2 faults on Xen, but it's a faulting path anyway so speed is
> already out of the window.
>
> Do you mind about teaching the #UD handler to deal with WRMSR like that?
>
> I ask, because I can't think of anything nicer.
>
> There are plenty of 3-byte instructions which #GP in PV guests (CPL3),
> and LTR is my go-to for debugging purposes, as it's not emulated by Xen.
>
> Anything here (and it can't be an actual WRMSR) will be slightly
> confusing to read in an OOPS, especially #UD for what is logically a #GP.
>
> But, a clear UD of some form in the disassembly is probably better than
> a random other instruction unrelated to the operation.
>
> ~Andrew

Oh, P.S.

We can probably drop most of the register manipulation by making the new
xen_do_write_msr be no_caller_saved_registers.  As we're intentionally
not a C ABI to start with, we might as well not spill registers we don't
use either.

~Andrew
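A sketch of that shape, assuming a compiler that supports the attribute
(the function name follows the thread; the placement is illustrative):

__attribute__((no_caller_saved_registers))
bool xen_do_write_msr(uint32_t msr, uint64_t val);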
Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
Posted by H. Peter Anvin 1 year, 4 months ago
On August 16, 2024 3:34:53 PM PDT, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>On 16/08/2024 11:27 pm, Andrew Cooper wrote:
>> [...]
>
>Oh, P.S.
>
>We can probably drop most of the register manipulation by making the new
>xen_do_write_msr be no_caller_saved_registers.  As we're intentionally
>not a C ABI to start with, we might as well not spill registers we don't
>use either.
>
>~Andrew

If you are willing to require a new enough gcc, certainly.
Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
Posted by H. Peter Anvin 1 year, 4 months ago
On August 16, 2024 11:40:05 AM PDT, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>On 16/08/2024 6:52 pm, Xin Li wrote:
>> [...]
>> I'm trying to implement wrmsr() as
>>
>> static __always_inline void wrmsr(u32 msr, u64 val)
>> {
>>     asm volatile("1: " ALTERNATIVE_2("wrmsr", WRMSRNS,
>> X86_FEATURE_WRMSRNS,
>>                      "call asm_xen_write_msr", X86_FEATURE_XENPV)
>>              "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
>>              : : "c" (msr), "a" (val), "d" ((u32)(val >> 32)),
>>              "D" (msr), "S" (val));
>> }
>>
>>
>> As the CALL instruction is 5 bytes long, and we need NOP padding for both
>> WRMSR and WRMSRNS, what about not using a segment prefix at all?
>
>The prefix was a minor optimisation to avoid having a trailing nop at
>the end.
>
>When combined with a call, you need 3 prefixes on WRMSR and 2 prefixes
>on WRMSRNS to make all options be 5 bytes long.
>
>That said, there's already a paravirt hook for this, and if you're
>looking to work around the code gen mess for that, then doing it like
>this by doubling up into rdi and rsi isn't great either.
>
>My suggestion, not that I've had time to experiment, was to change
>paravirt to use a non-C ABI and have asm_xen_write_msr() recombine
>edx:eax into rsi.  That way the top level wrmsr() retains sensible
>codegen for native even when paravirt is active.
>
>~Andrew

Heh, that was my suggestion, except that I suggested defining it so rax passes the full value; generating edx is still necessary, but there is no real reason to have to recombine them. (One could even add that code to the assembly pattern, as the CALL instruction is longer.)

My biggest question is how the #GP on an invalid MSR access is handled with Xen. The rest is trivial.
Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
Posted by Andrew Cooper 1 year, 4 months ago
On 16/08/2024 8:18 pm, H. Peter Anvin wrote:
> On August 16, 2024 11:40:05 AM PDT, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> [...]
>> That said, there's already a paravirt hook for this, and if you're
>> looking to work around the code gen mess for that, then doing it like
>> this by doubling up into rdi and rsi isn't great either.
>>
>> My suggestion, not that I've had time to experiment, was to change
>> paravirt to use a non-C ABI and have asm_xen_write_msr() recombine
>> edx:eax into rsi.  That way the top level wrmsr() retains sensible
>> codegen for native even when paravirt is active.
>>
>> ~Andrew
> Heh, that was my suggestion, except that I suggested defining it so rax pass the full value; the generation of edx still is necessary but there is no real reason to have to recombine them. (One could even add that code to the assembly pattern as the CALL instruction is longer.)

Hmm yeah, having the full value in %rax is likely how the logic is going to
look before generating %edx.

> My biggest question is how the #GP on an invalid MSR access is handled with Xen. The rest is trivial.

For historical reasons it's a mess.

xen_do_write_msr() does several things.

* Turns FSBASE/GSBASE/GSKERN into their respective hypercalls (although
no error checking at all!)
* Discards modifications to the SYSCALL/SYSENTER MSRs

Otherwise, it uses the native accessor, taking the #GP path into Xen,
where the access is emulated (including a full decode).

Way back in the day, XenPV's paravirt wrmsr would swallow #GP and
pretend success, and the kernel then started depending on this behaviour in order
to boot.  But times have moved on, and even the normal accessors are
really safe+warn, and I'm sure all of this could be cleaned up.

For ages I've been wanting to make a single PV_PRIV_OP hypercall so we
can skip the x86 emulator, but I've never had time to do that either.

~Andrew
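A rough sketch of the dispatch shape described above, using the bool-return
convention from earlier in the thread (true means "execute natively"); the
case list paraphrases this description and the helper name is illustrative,
not the actual Xen code:

bool xen_do_write_msr(uint32_t msr, uint64_t val)
{
	switch (msr) {
	case MSR_FS_BASE:
	case MSR_GS_BASE:
	case MSR_KERNEL_GS_BASE:
		/* turned into segment-base hypercalls, no error checking */
		xen_set_segment_base_hypercall(msr, val);  /* illustrative */
		return false;		/* handled, skip the native WRMSR */
	case MSR_STAR:
	case MSR_LSTAR:
	case MSR_IA32_SYSENTER_EIP:
		return false;		/* modifications discarded */
	default:
		return true;		/* native WRMSR; #GP emulated by Xen */
	}
}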
Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
Posted by H. Peter Anvin 1 year, 4 months ago
On August 9, 2024 4:07:35 PM PDT, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>On 07/08/2024 6:47 am, Xin Li (Intel) wrote:
>> From: Andrew Cooper <andrew.cooper3@citrix.com>
>>
>> [...]
>> +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */
>> +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6)
>> +
>> +/* Non-serializing WRMSR, when available.  Falls back to a serializing WRMSR. */
>>  static __always_inline void wrmsrns(u32 msr, u64 val)
>>  {
>> -	__wrmsrns(msr, val, val >> 32);
>> +	/*
>> +	 * WRMSR is 2 bytes.  WRMSRNS is 3 bytes.  Pad WRMSR with a redundant
>> +	 * DS prefix to avoid a trailing NOP.
>> +	 */
>> +	asm volatile("1: "
>> +		     ALTERNATIVE("ds wrmsr",
>
>This isn't the version I presented, and there's no discussion of the
>alteration.
>
>The choice of CS over DS was deliberate, and came from Intel:
>
>https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf
>
>So unless Intel want to retract that whitepaper, and all the binutils
>work with it, I'd suggest keeping it as CS like we use elsewhere, and as
>explicitly instructed by Intel.
>
>~Andrew

I looked around the kernel, and I believe we are inconsistent. I see both 0x2e (CS) and 0x3e (DS) prefixes used for padding where open-coded.

We can't use cs in all cases, since you can't do a store to the code segment (always readonly), so we use 0x3e (DS) to patch out LOCK.

In the paper you cite, 0x2e is only mentioned as a "benign prefix" in a specific example, not as any kind of specific recommendation. It is particularly irrelevant when it comes to padding two instructions to the same length, as the paper deals with alignment.

If you want, I'm perfectly happy to go and ask if there is any general recommendation (except for direct conditional branch hints, of course.)
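The LOCK case in byte form, for illustration (lock is 0xf0, and
xadd %edx,(%rcx) encodes as 0f c1 11):

	f0 0f c1 11		lock xadd %edx,(%rcx)	(SMP)
	3e 0f c1 11		ds xadd %edx,(%rcx)	(patched for UP)

A CS override on a store would target the read-only code segment, hence DS.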
Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
Posted by Andrew Cooper 1 year, 4 months ago
On 10/08/2024 1:01 am, H. Peter Anvin wrote:
> On August 9, 2024 4:07:35 PM PDT, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> On 07/08/2024 6:47 am, Xin Li (Intel) wrote:
>>> From: Andrew Cooper <andrew.cooper3@citrix.com>
>>>
>>> [...]
>>> +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */
>>> +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6)
>>> +
>>> +/* Non-serializing WRMSR, when available.  Falls back to a serializing WRMSR. */
>>>  static __always_inline void wrmsrns(u32 msr, u64 val)
>>>  {
>>> -	__wrmsrns(msr, val, val >> 32);
>>> +	/*
>>> +	 * WRMSR is 2 bytes.  WRMSRNS is 3 bytes.  Pad WRMSR with a redundant
>>> +	 * DS prefix to avoid a trailing NOP.
>>> +	 */
>>> +	asm volatile("1: "
>>> +		     ALTERNATIVE("ds wrmsr",
>> This isn't the version I presented, and there's no discussion of the
>> alteration.
>>
>> The choice of CS over DS was deliberate, and came from Intel:
>>
>> https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf
>>
>> So unless Intel want to retract that whitepaper, and all the binutils
>> work with it, I'd suggest keeping it as CS like we use elsewhere, and as
>> explicitly instructed by Intel.
>>
>> ~Andrew
> I looked around the kernel, and I believe we are inconsistent. I see both 0x2e (CS) and 0x3e (DS) prefixes used for padding where open-coded.
>
> We can't use cs in all cases, since you can't do a store to the code segment (always readonly) so we use 0x3e (DS) to patch out LOCK.
>
> In the paper you cite, 0x2e is only mentioned as a "benign prefix" in a specific example, not as any kind of specific recommendation. It is particularly irrelevant when it comes to padding two instructions to the same length, as the paper deals with alignment.
>
> If you want, I'm perfectly happy to go and ask if there is any general recommendation (except for direct conditional branch hints, of course.)

It would be lovely if there could be a single coherent statement.

In addition to store semantics, off the top of my head:

* CS is the P4 hint-not-taken (presumably Jcc only), ignored now.

* DS is both the P4 hint-taken (presumably Jcc only), newly reintroduced
in Redwood Cove with tweaked semantics (definitely Jcc only), and the
CET notrack prefix (JMP/CALL *IND only).

Plus whatever else I've missed[1].

~Andrew

[1] I'm ignoring XuCode mode memory operand semantics as not relevant to
CPL0 software (where AIUI unprefixed is physical and DS prefixed is
virtual) but I'm noting it here to highlight that there's definitely
extra complexity beyond what is in SDM Vol2.
Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
Posted by Thomas Gleixner 1 year, 4 months ago
On Tue, Aug 06 2024 at 22:47, Xin Li wrote:
>  
> +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */
> +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6)
> +
> +/* Non-serializing WRMSR, when available.  Falls back to a serializing WRMSR. */
>  static __always_inline void wrmsrns(u32 msr, u64 val)
>  {
> -	__wrmsrns(msr, val, val >> 32);
> +	/*
> +	 * WRMSR is 2 bytes.  WRMSRNS is 3 bytes.  Pad WRMSR with a redundant
> +	 * DS prefix to avoid a trailing NOP.
> +	 */
> +	asm volatile("1: "
> +		     ALTERNATIVE("ds wrmsr",
> +				 WRMSRNS, X86_FEATURE_WRMSRNS)

Please get rid of this horrible line break. You have 100 characters line width.

Thanks,

        tglx
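That is, presumably something like:

	asm volatile("1: " ALTERNATIVE("ds wrmsr", WRMSRNS, X86_FEATURE_WRMSRNS)
		     "2:\n"
		     _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR)
		     : : "c" (msr), "a" ((u32)val), "d" ((u32)(val >> 32)));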