[PATCH] x86/boot: Fix load_system_tables() to be NMI/#MC-safe

Andrew Cooper posted 1 patch 3 years, 10 months ago
[PATCH] x86/boot: Fix load_system_tables() to be NMI/#MC-safe
Posted by Andrew Cooper 3 years, 10 months ago
During boot, load_system_tables() is used in reinit_bsp_stack() to switch the
virtual addresses used from their .data/.bss alias, to their directmap alias.

The structure assignment is implemented as a memset() to zero first, then a
copy-in of the new data.  This causes the NMI/#MC stack pointers to
transiently become 0, at a point where we may have an NMI watchdog running.

Rewrite the logic using a volatile tss pointer (equivalent to, but more
readable than, using ACCESS_ONCE() for all writes).

This does drop the zeroing side effect for holes in the structure, but the
backing memory for the TSS is fully zeroed anyway, and architecturally, they
are all reserved.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>

This wants backporting a fairly long way, technically to Xen 4.6.
---
 xen/arch/x86/cpu/common.c | 49 ++++++++++++++++++++++-------------------------
 1 file changed, 23 insertions(+), 26 deletions(-)

diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index 3e0d9cbe98..a78b796fe5 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -703,11 +703,12 @@ static cpumask_t cpu_initialized;
  */
 void load_system_tables(void)
 {
-	unsigned int cpu = smp_processor_id();
+	unsigned int i, cpu = smp_processor_id();
 	unsigned long stack_bottom = get_stack_bottom(),
 		stack_top = stack_bottom & ~(STACK_SIZE - 1);
 
-	struct tss64 *tss = &this_cpu(tss_page).tss;
+	/* The TSS may be live.  Dissuade any clever optimisations. */
+	volatile struct tss64 *tss = &this_cpu(tss_page).tss;
 	seg_desc_t *gdt =
 		this_cpu(gdt) - FIRST_RESERVED_GDT_ENTRY;
 
@@ -720,30 +721,26 @@ void load_system_tables(void)
 		.limit = (IDT_ENTRIES * sizeof(idt_entry_t)) - 1,
 	};
 
-	*tss = (struct tss64){
-		/* Main stack for interrupts/exceptions. */
-		.rsp0 = stack_bottom,
-
-		/* Ring 1 and 2 stacks poisoned. */
-		.rsp1 = 0x8600111111111111ul,
-		.rsp2 = 0x8600111111111111ul,
-
-		/*
-		 * MCE, NMI and Double Fault handlers get their own stacks.
-		 * All others poisoned.
-		 */
-		.ist = {
-			[IST_MCE - 1] = stack_top + IST_MCE * PAGE_SIZE,
-			[IST_DF  - 1] = stack_top + IST_DF  * PAGE_SIZE,
-			[IST_NMI - 1] = stack_top + IST_NMI * PAGE_SIZE,
-			[IST_DB  - 1] = stack_top + IST_DB  * PAGE_SIZE,
-
-			[IST_MAX ... ARRAY_SIZE(tss->ist) - 1] =
-				0x8600111111111111ul,
-		},
-
-		.bitmap = IOBMP_INVALID_OFFSET,
-	};
+	/*
+	 * Set up the TSS.  Warning - may be live, and the NMI/#MC stacks must
+	 * remain valid on every instruction boundary.  (Note: these are all
+	 * semantically ACCESS_ONCE() due to tss's volatile qualifier.)
+	 *
+	 * rsp0 refers to the primary stack.  #MC, #DF, NMI and #DB handlers
+	 * each get their own stacks.  No IO Bitmap.
+	 */
+	tss->rsp0 = stack_bottom;
+	tss->ist[IST_MCE - 1] = stack_top + IST_MCE * PAGE_SIZE;
+	tss->ist[IST_DF  - 1] = stack_top + IST_DF  * PAGE_SIZE;
+	tss->ist[IST_NMI - 1] = stack_top + IST_NMI * PAGE_SIZE;
+	tss->ist[IST_DB  - 1] = stack_top + IST_DB  * PAGE_SIZE;
+	tss->bitmap = IOBMP_INVALID_OFFSET;
+
+	/* All other stack pointers poisoned. */
+	for ( i = IST_MAX; i < ARRAY_SIZE(tss->ist); ++i )
+		tss->ist[i] = 0x8600111111111111ul;
+	tss->rsp1 = 0x8600111111111111ul;
+	tss->rsp2 = 0x8600111111111111ul;
 
 	BUILD_BUG_ON(sizeof(*tss) <= 0x67); /* Mandated by the architecture. */
 
-- 
2.11.0


Re: [PATCH] x86/boot: Fix load_system_tables() to be NMI/#MC-safe
Posted by Jan Beulich 3 years, 10 months ago
On 27.05.2020 15:06, Andrew Cooper wrote:
> @@ -720,30 +721,26 @@ void load_system_tables(void)
>  		.limit = (IDT_ENTRIES * sizeof(idt_entry_t)) - 1,
>  	};
>  
> -	*tss = (struct tss64){
> -		/* Main stack for interrupts/exceptions. */
> -		.rsp0 = stack_bottom,
> -
> -		/* Ring 1 and 2 stacks poisoned. */
> -		.rsp1 = 0x8600111111111111ul,
> -		.rsp2 = 0x8600111111111111ul,
> -
> -		/*
> -		 * MCE, NMI and Double Fault handlers get their own stacks.
> -		 * All others poisoned.
> -		 */
> -		.ist = {
> -			[IST_MCE - 1] = stack_top + IST_MCE * PAGE_SIZE,
> -			[IST_DF  - 1] = stack_top + IST_DF  * PAGE_SIZE,
> -			[IST_NMI - 1] = stack_top + IST_NMI * PAGE_SIZE,
> -			[IST_DB  - 1] = stack_top + IST_DB  * PAGE_SIZE,
> -
> -			[IST_MAX ... ARRAY_SIZE(tss->ist) - 1] =
> -				0x8600111111111111ul,
> -		},
> -
> -		.bitmap = IOBMP_INVALID_OFFSET,
> -	};
> +	/*
> +	 * Set up the TSS.  Warning - may be live, and the NMI/#MC stacks must
> +	 * remain valid on every instruction boundary.  (Note: these are all
> +	 * semantically ACCESS_ONCE() due to tss's volatile qualifier.)
> +	 *
> +	 * rsp0 refers to the primary stack.  #MC, #DF, NMI and #DB handlers
> +	 * each get their own stacks.  No IO Bitmap.
> +	 */
> +	tss->rsp0 = stack_bottom;
> +	tss->ist[IST_MCE - 1] = stack_top + IST_MCE * PAGE_SIZE;
> +	tss->ist[IST_DF  - 1] = stack_top + IST_DF  * PAGE_SIZE;
> +	tss->ist[IST_NMI - 1] = stack_top + IST_NMI * PAGE_SIZE;
> +	tss->ist[IST_DB  - 1] = stack_top + IST_DB  * PAGE_SIZE;
> +	tss->bitmap = IOBMP_INVALID_OFFSET;
> +
> +	/* All other stack pointers poisoned. */
> +	for ( i = IST_MAX; i < ARRAY_SIZE(tss->ist); ++i )
> +		tss->ist[i] = 0x8600111111111111ul;
> +	tss->rsp1 = 0x8600111111111111ul;
> +	tss->rsp2 = 0x8600111111111111ul;

ACCESS_ONCE() unfortunately only has one of the two needed effects:
It guarantees that each memory location gets accessed exactly once
(which I assume can also be had with just the volatile addition,
even without moving away from using an initializer), but it does
not guarantee single-insn accesses. I consider this in particular
relevant here because all of the 64-bit fields are misaligned. By
doing it like you do, we're setting us up to have to re-do this yet
again in a couple of years time (presumably using write_atomic()
instead then).

Nevertheless it is a clear improvement, so if you want to leave it
like this
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan

Re: [PATCH] x86/boot: Fix load_system_tables() to be NMI/#MC-safe
Posted by Andrew Cooper 3 years, 10 months ago
On 27/05/2020 14:19, Jan Beulich wrote:
> On 27.05.2020 15:06, Andrew Cooper wrote:
>> @@ -720,30 +721,26 @@ void load_system_tables(void)
>>  		.limit = (IDT_ENTRIES * sizeof(idt_entry_t)) - 1,
>>  	};
>>  
>> -	*tss = (struct tss64){
>> -		/* Main stack for interrupts/exceptions. */
>> -		.rsp0 = stack_bottom,
>> -
>> -		/* Ring 1 and 2 stacks poisoned. */
>> -		.rsp1 = 0x8600111111111111ul,
>> -		.rsp2 = 0x8600111111111111ul,
>> -
>> -		/*
>> -		 * MCE, NMI and Double Fault handlers get their own stacks.
>> -		 * All others poisoned.
>> -		 */
>> -		.ist = {
>> -			[IST_MCE - 1] = stack_top + IST_MCE * PAGE_SIZE,
>> -			[IST_DF  - 1] = stack_top + IST_DF  * PAGE_SIZE,
>> -			[IST_NMI - 1] = stack_top + IST_NMI * PAGE_SIZE,
>> -			[IST_DB  - 1] = stack_top + IST_DB  * PAGE_SIZE,
>> -
>> -			[IST_MAX ... ARRAY_SIZE(tss->ist) - 1] =
>> -				0x8600111111111111ul,
>> -		},
>> -
>> -		.bitmap = IOBMP_INVALID_OFFSET,
>> -	};
>> +	/*
>> +	 * Set up the TSS.  Warning - may be live, and the NMI/#MC stacks must
>> +	 * remain valid on every instruction boundary.  (Note: these are all
>> +	 * semantically ACCESS_ONCE() due to tss's volatile qualifier.)
>> +	 *
>> +	 * rsp0 refers to the primary stack.  #MC, #DF, NMI and #DB handlers
>> +	 * each get their own stacks.  No IO Bitmap.
>> +	 */
>> +	tss->rsp0 = stack_bottom;
>> +	tss->ist[IST_MCE - 1] = stack_top + IST_MCE * PAGE_SIZE;
>> +	tss->ist[IST_DF  - 1] = stack_top + IST_DF  * PAGE_SIZE;
>> +	tss->ist[IST_NMI - 1] = stack_top + IST_NMI * PAGE_SIZE;
>> +	tss->ist[IST_DB  - 1] = stack_top + IST_DB  * PAGE_SIZE;
>> +	tss->bitmap = IOBMP_INVALID_OFFSET;
>> +
>> +	/* All other stack pointers poisoned. */
>> +	for ( i = IST_MAX; i < ARRAY_SIZE(tss->ist); ++i )
>> +		tss->ist[i] = 0x8600111111111111ul;
>> +	tss->rsp1 = 0x8600111111111111ul;
>> +	tss->rsp2 = 0x8600111111111111ul;
> ACCESS_ONCE() unfortunately only has one of the two needed effects:
> It guarantees that each memory location gets accessed exactly once
> (which I assume can also be had with just the volatile addition,
> but without the moving away from using an initializer), but it does
> not guarantee single-insn accesses.

Linux's memory-barriers.txt disagrees, and specifically gives an example
with a misaligned int (vs two shorts) and the use of a volatile cast (by
way of {READ,WRITE}_ONCE()) to prevent load/store tearing, as the memory
location is of a size which fits in a single access.

I'm fairly sure we're safe here.

>  I consider this in particular
> relevant here because all of the 64-bit fields are misaligned. By
> doing it like you do, we're setting us up to have to re-do this yet
> again in a couple of years time (presumably using write_atomic()
> instead then).
>
> Nevertheless it is a clear improvement, so if you want to leave it
> like this
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks,

~Andrew