[PATCH 22/22] x86/mm: zero stack on stack switch or reset
Posted by Roger Pau Monne 1 month, 3 weeks ago
With the stack mapped on a per-CPU basis there's no risk of other CPUs being
able to read the stack contents, but vCPUs running on the current pCPU could
read stack rubble from operations of previous vCPUs.

The #DF stack is not zeroed because handling of #DF results in a panic.
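
For reference, the per-CPU stack is arranged roughly as described by the
block comment in current.h:

  7 - Primary stack (with a struct cpu_info at the top)
  6 - Primary stack
  5 - Primary Shadow Stack (read-only)
  4 - #DF IST stack
  3 - #DB IST stack
  2 - NMI IST stack
  1 - #MC IST stack
  0 - IST Shadow Stacks (4x 1k, read-only)

Pages 1-3 (the #MC, NMI and #DB IST stacks) are zeroed with clear_page(),
page 4 (#DF) is skipped as per above, and the primary stack is zeroed
from assembly right after the stack switch.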

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/include/asm/current.h | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h
index 75b9a341f814..02b4118b03ef 100644
--- a/xen/arch/x86/include/asm/current.h
+++ b/xen/arch/x86/include/asm/current.h
@@ -177,6 +177,14 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
 # define SHADOW_STACK_WORK ""
 #endif
 
+#define ZERO_STACK                                              \
+    "test %[stk_size], %[stk_size];"                            \
+    "jz .L_skip_zeroing.%=;"                                    \
+    "std;"                                                      \
+    "rep stosb;"                                                \
+    "cld;"                                                      \
+    ".L_skip_zeroing.%=:"
+
 #if __GNUC__ >= 9
 # define ssaj_has_attr_noreturn(fn) __builtin_has_attribute(fn, __noreturn__)
 #else
@@ -187,10 +195,24 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
 #define switch_stack_and_jump(fn, instr, constr)                        \
     ({                                                                  \
         unsigned int tmp;                                               \
+        bool zero_stack = current->domain->arch.asi;                    \
         BUILD_BUG_ON(!ssaj_has_attr_noreturn(fn));                      \
+        ASSERT(IS_ALIGNED((unsigned long)guest_cpu_user_regs() -        \
+                          PRIMARY_STACK_SIZE +                          \
+                          sizeof(struct cpu_info), PAGE_SIZE));         \
+        if ( zero_stack )                                               \
+        {                                                               \
+            unsigned long stack_top = get_stack_bottom() &              \
+                                      ~(STACK_SIZE - 1);                \
+                                                                        \
+            clear_page((void *)stack_top + IST_MCE * PAGE_SIZE);        \
+            clear_page((void *)stack_top + IST_NMI * PAGE_SIZE);        \
+            clear_page((void *)stack_top + IST_DB  * PAGE_SIZE);        \
+        }                                                               \
         __asm__ __volatile__ (                                          \
             SHADOW_STACK_WORK                                           \
             "mov %[stk], %%rsp;"                                        \
+            ZERO_STACK                                                  \
             CHECK_FOR_LIVEPATCH_WORK                                    \
             instr "[fun]"                                               \
             : [val] "=&r" (tmp),                                        \
@@ -201,7 +223,13 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
               ((PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8),               \
               [stack_mask] "i" (STACK_SIZE - 1),                        \
               _ASM_BUGFRAME_INFO(BUGFRAME_bug, __LINE__,                \
-                                 __FILE__, NULL)                        \
+                                 __FILE__, NULL),                       \
+              /* For stack zeroing. */                                  \
+              "D" ((void *)guest_cpu_user_regs() - 1),                  \
+              [stk_size] "c"                                            \
+              (zero_stack ? PRIMARY_STACK_SIZE - sizeof(struct cpu_info)\
+                          : 0),                                         \
+              "a" (0)                                                   \
             : "memory" );                                               \
         unreachable();                                                  \
     })
-- 
2.45.2


Re: [PATCH 22/22] x86/mm: zero stack on stack switch or reset
Posted by Jan Beulich 1 month ago
On 26.07.2024 17:22, Roger Pau Monne wrote:
> With the stack mapped on a per-CPU basis there's no risk of other CPUs being
> able to read the stack contents, but vCPUs running on the current pCPU could
> read stack rubble from operations of previous vCPUs.
> 
> The #DF stack is not zeroed because handling of #DF results in a panic.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
>  xen/arch/x86/include/asm/current.h | 30 +++++++++++++++++++++++++++++-
>  1 file changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h
> index 75b9a341f814..02b4118b03ef 100644
> --- a/xen/arch/x86/include/asm/current.h
> +++ b/xen/arch/x86/include/asm/current.h
> @@ -177,6 +177,14 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>  # define SHADOW_STACK_WORK ""
>  #endif
>  
> +#define ZERO_STACK                                              \
> +    "test %[stk_size], %[stk_size];"                            \
> +    "jz .L_skip_zeroing.%=;"                                    \
> +    "std;"                                                      \
> +    "rep stosb;"                                                \
> +    "cld;"                                                      \

Is ERMS actually helping with backwards stores? I didn't think so, and
hence REP STOSQ may be more efficient here?
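
I.e. something along these lines (an untested sketch, assuming the byte
count remains a multiple of 8):

#define ZERO_STACK                                              \
    "test %[stk_size], %[stk_size];"                            \
    "jz .L_skip_zeroing.%=;"                                    \
    "std;"                                                      \
    "rep stosq;"                                                \
    "cld;"                                                      \
    ".L_skip_zeroing.%=:"

with "D" pointing at the last qword, i.e.
(void *)guest_cpu_user_regs() - 8, and [stk_size] holding the count in
qwords rather than bytes.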

Jan

Re: [PATCH 22/22] x86/mm: zero stack on stack switch or reset
Posted by Andrew Cooper 1 month, 2 weeks ago
On 26/07/2024 4:22 pm, Roger Pau Monne wrote:
> With the stack mapped on a per-CPU basis there's no risk of other CPUs being
> able to read the stack contents, but vCPUs running on the current pCPU could
> read stack rubble from operations of previous vCPUs.
>
> The #DF stack is not zeroed because handling of #DF results in a panic.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
>  xen/arch/x86/include/asm/current.h | 30 +++++++++++++++++++++++++++++-
>  1 file changed, 29 insertions(+), 1 deletion(-)
>
> diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h
> index 75b9a341f814..02b4118b03ef 100644
> --- a/xen/arch/x86/include/asm/current.h
> +++ b/xen/arch/x86/include/asm/current.h
> @@ -177,6 +177,14 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>  # define SHADOW_STACK_WORK ""
>  #endif
>  
> +#define ZERO_STACK                                              \
> +    "test %[stk_size], %[stk_size];"                            \
> +    "jz .L_skip_zeroing.%=;"                                    \
> +    "std;"                                                      \
> +    "rep stosb;"                                                \
> +    "cld;"                                                      \
> +    ".L_skip_zeroing.%=:"
> +
>  #if __GNUC__ >= 9
>  # define ssaj_has_attr_noreturn(fn) __builtin_has_attribute(fn, __noreturn__)
>  #else
> @@ -187,10 +195,24 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>  #define switch_stack_and_jump(fn, instr, constr)                        \
>      ({                                                                  \
>          unsigned int tmp;                                               \
> +        bool zero_stack = current->domain->arch.asi;                    \
>          BUILD_BUG_ON(!ssaj_has_attr_noreturn(fn));                      \
> +        ASSERT(IS_ALIGNED((unsigned long)guest_cpu_user_regs() -        \
> +                          PRIMARY_STACK_SIZE +                          \
> +                          sizeof(struct cpu_info), PAGE_SIZE));         \
> +        if ( zero_stack )                                               \
> +        {                                                               \
> +            unsigned long stack_top = get_stack_bottom() &              \
> +                                      ~(STACK_SIZE - 1);                \
> +                                                                        \
> +            clear_page((void *)stack_top + IST_MCE * PAGE_SIZE);        \
> +            clear_page((void *)stack_top + IST_NMI * PAGE_SIZE);        \
> +            clear_page((void *)stack_top + IST_DB  * PAGE_SIZE);        \
> +        }                                                               \
>          __asm__ __volatile__ (                                          \
>              SHADOW_STACK_WORK                                           \
>              "mov %[stk], %%rsp;"                                        \
> +            ZERO_STACK                                                  \
>              CHECK_FOR_LIVEPATCH_WORK                                    \
>              instr "[fun]"                                               \
>              : [val] "=&r" (tmp),                                        \
> @@ -201,7 +223,13 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>                ((PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8),               \
>                [stack_mask] "i" (STACK_SIZE - 1),                        \
>                _ASM_BUGFRAME_INFO(BUGFRAME_bug, __LINE__,                \
> -                                 __FILE__, NULL)                        \
> +                                 __FILE__, NULL),                       \
> +              /* For stack zeroing. */                                  \
> +              "D" ((void *)guest_cpu_user_regs() - 1),                  \
> +              [stk_size] "c"                                            \
> +              (zero_stack ? PRIMARY_STACK_SIZE - sizeof(struct cpu_info)\
> +                          : 0),                                         \
> +              "a" (0)                                                   \
>              : "memory" );                                               \
>          unreachable();                                                  \
>      })

This looks very expensive.

For starters, switch_stack_and_jump() is used twice in a typical context
switch; once in the schedule tail, and again out of hvm_do_resume().

Furthermore, #MC happens never (to many many significant figures), #DB
happens never for HVM guests (but does happen for PV), and NMIs are
either ~never, or 2Hz, which is far less often than the 30ms default
timeslice.

So, the overwhelming majority of the time, those 3 calls to clear_page()
will be re-zeroing blocks of zeroes.

This can probably be avoided by making use of ist_exit (held in %r12) to
zero an IST stack only when leaving it.  This leaves the IRET frame
recoverable, but with e.g. RFDS you can do that regardless, and it's not
terribly sensitive.
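
As a rough sketch of what I mean (illustrative only - it glosses over
exactly where in restore_all_xen this would sit relative to the GPR
restore, and hence which registers are actually free to clobber):

        test    %r12, %r12              /* ist_exit? */
        jz      .L_skip_ist_zeroing
        mov     %rsp, %rdi
        and     $~(PAGE_SIZE - 1), %rdi /* base of the IST page in use */
        mov     %rsp, %rcx
        sub     %rdi, %rcx
        shr     $3, %rcx                /* qwords below the IRET frame */
        xor     %eax, %eax
        rep stosq                       /* forwards, hence ERMS-friendly */
.L_skip_ist_zeroing:

i.e. zero everything below the current %rsp on the IST page, leaving the
IRET frame itself intact.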


What about shadow stacks?  You're not zeroing those, and while they're
less sensitive than the data stack, there ought to be some reasoning
about them.

~Andrew

Re: [PATCH 22/22] x86/mm: zero stack on stack switch or reset
Posted by Roger Pau Monné 1 month, 2 weeks ago
On Mon, Jul 29, 2024 at 04:40:24PM +0100, Andrew Cooper wrote:
> On 26/07/2024 4:22 pm, Roger Pau Monne wrote:
> > With the stack mapped on a per-CPU basis there's no risk of other CPUs being
> > able to read the stack contents, but vCPUs running on the current pCPU could
> > read stack rubble from operations of previous vCPUs.
> >
> > The #DF stack is not zeroed because handling of #DF results in a panic.
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> >  xen/arch/x86/include/asm/current.h | 30 +++++++++++++++++++++++++++++-
> >  1 file changed, 29 insertions(+), 1 deletion(-)
> >
> > diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h
> > index 75b9a341f814..02b4118b03ef 100644
> > --- a/xen/arch/x86/include/asm/current.h
> > +++ b/xen/arch/x86/include/asm/current.h
> > @@ -177,6 +177,14 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
> >  # define SHADOW_STACK_WORK ""
> >  #endif
> >  
> > +#define ZERO_STACK                                              \
> > +    "test %[stk_size], %[stk_size];"                            \
> > +    "jz .L_skip_zeroing.%=;"                                    \
> > +    "std;"                                                      \
> > +    "rep stosb;"                                                \
> > +    "cld;"                                                      \
> > +    ".L_skip_zeroing.%=:"
> > +
> >  #if __GNUC__ >= 9
> >  # define ssaj_has_attr_noreturn(fn) __builtin_has_attribute(fn, __noreturn__)
> >  #else
> > @@ -187,10 +195,24 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
> >  #define switch_stack_and_jump(fn, instr, constr)                        \
> >      ({                                                                  \
> >          unsigned int tmp;                                               \
> > +        bool zero_stack = current->domain->arch.asi;                    \
> >          BUILD_BUG_ON(!ssaj_has_attr_noreturn(fn));                      \
> > +        ASSERT(IS_ALIGNED((unsigned long)guest_cpu_user_regs() -        \
> > +                          PRIMARY_STACK_SIZE +                          \
> > +                          sizeof(struct cpu_info), PAGE_SIZE));         \
> > +        if ( zero_stack )                                               \
> > +        {                                                               \
> > +            unsigned long stack_top = get_stack_bottom() &              \
> > +                                      ~(STACK_SIZE - 1);                \
> > +                                                                        \
> > +            clear_page((void *)stack_top + IST_MCE * PAGE_SIZE);        \
> > +            clear_page((void *)stack_top + IST_NMI * PAGE_SIZE);        \
> > +            clear_page((void *)stack_top + IST_DB  * PAGE_SIZE);        \
> > +        }                                                               \
> >          __asm__ __volatile__ (                                          \
> >              SHADOW_STACK_WORK                                           \
> >              "mov %[stk], %%rsp;"                                        \
> > +            ZERO_STACK                                                  \
> >              CHECK_FOR_LIVEPATCH_WORK                                    \
> >              instr "[fun]"                                               \
> >              : [val] "=&r" (tmp),                                        \
> > @@ -201,7 +223,13 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
> >                ((PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8),               \
> >                [stack_mask] "i" (STACK_SIZE - 1),                        \
> >                _ASM_BUGFRAME_INFO(BUGFRAME_bug, __LINE__,                \
> > -                                 __FILE__, NULL)                        \
> > +                                 __FILE__, NULL),                       \
> > +              /* For stack zeroing. */                                  \
> > +              "D" ((void *)guest_cpu_user_regs() - 1),                  \
> > +              [stk_size] "c"                                            \
> > +              (zero_stack ? PRIMARY_STACK_SIZE - sizeof(struct cpu_info)\
> > +                          : 0),                                         \
> > +              "a" (0)                                                   \
> >              : "memory" );                                               \
> >          unreachable();                                                  \
> >      })
> 
> This looks very expensive.
> 
> For starters, switch_stack_and_jump() is used twice in a typical context
> switch; once in the schedule tail, and again out of hvm_do_resume().

Right, it's the reset_stack_and_call_ind() at the end of
context_switch() and then the reset_stack_and_jump() in the HVM
context-switch tail handlers.

One option would be to only do the stack zeroing from the
reset_stack_and_call_ind() call in context_switch().

I've got no idea how expensive this is, so I might try to run some
benchmarks to get some figures.  I was planning on running two VMs
with 1 vCPU each, both pinned to the same pCPU.
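
(Something like the following, with made up domain names:

    xl vcpu-pin vm1 0 3
    xl vcpu-pin vm2 0 3

so that both vCPUs compete for the same pCPU and every deschedule goes
through the zeroing path.)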

> 
> Furthermore, #MC happen never (to many many significant figures), #DB
> happens never for HVM guests (but does happen for PV), and NMIs are
> either ~never, or 2Hz which is far less often than the 30ms default
> timeslice.
> 
> So, the overwhelming majority of the time, those 3 calls to clear_page()
> will be re-zeroing blocks of zeroes.
> 
> This can probably be avoided by making use of ist_exit (held in %r12) to
> only zero an IST stack when leaving it.  This leaves the IRET frame able
> to be recovered, but with e.g. RFDS, you can do that irrespective, and
> it's not terribly sensitive.

I could look into that.  TBH I was borderline on clearing the IST
stacks at all, as I wasn't convinced there could be anything sensitive
there, but equally I couldn't convince myself that there's nothing
sensitive there now, nor that nothing sensitive can end up there in
the future.

> What about shadow stacks?  You're not zeroing those, and while they're
> less sensitive than the data stack, there ought to be some reasoning
> about them.

I've assumed that shadow stacks only contain the expected return
addresses, and hence wouldn't be considered sensitive information, but
maybe I was too lax.

An attacker could get execution traces of the previous vCPU, and that
might be useful for some exploits?

Thanks, Roger.