.../admin-guide/kernel-parameters.txt | 4 + Documentation/filesystems/proc.rst | 1 + Documentation/x86/cet.rst | 145 ++++++ Documentation/x86/index.rst | 1 + arch/arm/kernel/signal.c | 2 +- arch/arm64/kernel/signal.c | 2 +- arch/arm64/kernel/signal32.c | 2 +- arch/sparc/kernel/signal32.c | 2 +- arch/sparc/kernel/signal_64.c | 2 +- arch/x86/Kconfig | 22 + arch/x86/Kconfig.assembler | 5 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/x86/ia32/ia32_signal.c | 25 +- arch/x86/include/asm/cet.h | 54 +++ arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/include/asm/disabled-features.h | 8 +- arch/x86/include/asm/fpu/api.h | 8 + arch/x86/include/asm/fpu/types.h | 23 +- arch/x86/include/asm/fpu/xstate.h | 6 +- arch/x86/include/asm/idtentry.h | 4 + arch/x86/include/asm/mman.h | 24 + arch/x86/include/asm/mmu_context.h | 2 + arch/x86/include/asm/msr-index.h | 20 + arch/x86/include/asm/page_types.h | 7 + arch/x86/include/asm/pgtable.h | 302 ++++++++++-- arch/x86/include/asm/pgtable_types.h | 48 +- arch/x86/include/asm/processor.h | 6 + arch/x86/include/asm/special_insns.h | 30 ++ arch/x86/include/asm/trap_pf.h | 2 + arch/x86/include/uapi/asm/mman.h | 8 +- arch/x86/include/uapi/asm/prctl.h | 10 + arch/x86/include/uapi/asm/processor-flags.h | 2 + arch/x86/kernel/Makefile | 1 + arch/x86/kernel/cpu/common.c | 20 + arch/x86/kernel/cpu/cpuid-deps.c | 1 + arch/x86/kernel/elf_feature_prctl.c | 72 +++ arch/x86/kernel/fpu/xstate.c | 167 ++++++- arch/x86/kernel/idt.c | 4 + arch/x86/kernel/process.c | 17 +- arch/x86/kernel/process_64.c | 2 + arch/x86/kernel/shstk.c | 446 ++++++++++++++++++ arch/x86/kernel/signal.c | 13 + arch/x86/kernel/signal_compat.c | 2 +- arch/x86/kernel/traps.c | 62 +++ arch/x86/mm/fault.c | 19 + arch/x86/mm/mmap.c | 48 ++ arch/x86/mm/pat/set_memory.c | 2 +- arch/x86/mm/pgtable.c | 25 + drivers/gpu/drm/i915/gvt/gtt.c | 2 +- fs/aio.c | 2 +- fs/proc/task_mmu.c | 3 + include/linux/mm.h | 19 +- include/linux/pgtable.h | 8 + include/linux/syscalls.h | 1 + include/uapi/asm-generic/siginfo.h | 3 +- include/uapi/asm-generic/unistd.h | 2 +- ipc/shm.c | 2 +- kernel/sys_ni.c | 1 + mm/gup.c | 16 +- mm/huge_memory.c | 27 +- mm/memory.c | 5 +- mm/migrate.c | 3 +- mm/mmap.c | 15 +- mm/mprotect.c | 9 +- mm/nommu.c | 4 +- mm/util.c | 2 +- tools/testing/selftests/x86/Makefile | 9 +- .../selftests/x86/test_map_shadow_stack.c | 75 +++ 69 files changed, 1797 insertions(+), 92 deletions(-) create mode 100644 Documentation/x86/cet.rst create mode 100644 arch/x86/include/asm/cet.h create mode 100644 arch/x86/include/asm/mman.h create mode 100644 arch/x86/kernel/elf_feature_prctl.c create mode 100644 arch/x86/kernel/shstk.c create mode 100644 tools/testing/selftests/x86/test_map_shadow_stack.c
Hi,
This is a slight reboot of the userspace CET series. I will be taking over the
series from Yu-cheng. Per some internal recommendations, I’ve reset the version
number and am calling it a new series. Hopefully, it doesn’t cause confusion.
The new plan is to upstream only userspace Shadow Stack support at this point.
IBT can follow later, but for now I’ll focus solely on the most in-demand and
widely available (with the feature on AMD CPUs now) part of CET.
I thought as part of this reset, it might be useful to more fully write-up the
design and summarize the history of the previous CET series. So this slightly
long cover letter does that. The "Updates" section has the changes, if anyone
doesn't want the history.
Why is Shadow Stack Wanted
==========================
The main use case for userspace shadow stack is providing protection against
return oriented programming attacks. Fedora and Ubuntu already have many/most
packages enabled for shadow stack. The main missing piece is Linux kernel
support and there seems to be a high amount of interest in the ecosystem for
getting this feature supported. Besides security, Google has also done some
work on using shadow stack to improve performance and reliability of tracing.
Userspace Shadow Stack Implementation
=====================================
Shadow stack works by maintaining a secondary (shadow) stack that cannot be
directly modified by applications. When executing a CALL instruction, the
processor pushes the return address to both the normal stack and to the special
permissioned shadow stack. Upon ret, the processor pops the shadow stack copy
and compares it to the normal stack copy. If the two differ, the processor
raises a control protection fault. This implementation supports shadow stack on
64 bit kernels only, with support for 32 bit only via IA32 emulation.
Shadow Stack Memory
-------------------
The majority of this series deals with changes for handling the special
shadow stack memory permissions. This memory is specified by the
Dirty+RO PTE bits. A tricky aspect of this is that this combination was
previously used to specify COW memory. So Linux needs to handle COW
differently when shadow stack is in use. The solution is to use a
software PTE bit to denote COW memory, and take care to clear the dirty
bit when setting the memory RO.
Setup and Upkeep of HW Registers
--------------------------------
Using userspace CET requires a CR4 bit set, and also the manipulation
of two xsave managed MSRs. The kernel needs to modify these registers
during various operations like clone and signal handling. These
operations may happen when the registers are restored to the CPU, or
saved in an xsave buffer. Since the recent AMX triggered FPU overhaul
removed direct access to the xsave buffer, this series adds an
interface to operate on the supervisor xstate.
New ABIs
--------
This series introduces some new ABIs. The primary one is the shadow
stack itself. Since it is readable and the shadow stack pointer is
exposed to user space, applications can easily read and process the
shadow stack. And in fact the tracing usages plan to do exactly that.
Most of the shadow stack contents are written by HW, but some of the
entries are added by the kernel. The main place for this is signals. As
part of handling the signal the kernel does some manual adjustment of
the shadow stack that userspace depends on.
In addition to the contents of the shadow stack there is also user
visible behavior around when new shadow stacks are created and set in
the shadow stack pointer (SSP) register. This is relatively
straightforward – shadow stacks are created when new stacks are created
(thread creation, fork, etc). It is more or less what is required to
keep apps working.
For situations when userspace creates a new stack (i.e. makecontext(),
fibers, etc), a new syscall is provided for creating shadow stack
memory. To make the shadow stack usable, it needs to have a restore
token written to the protected memory. So the syscall provides a way to
specificity this should be done by the kernel.
When a shadow stack violation happens (when the return address of stack
not matching return address in shadow stack), a segfault is generated
with a new si_code specific to CET violations.
Lastly, a new arch_prctl interface is created for controlling the
enablement of CET-like features. It is intended to also be used for
LAM. It operates on the feature status per-thread, so for process wide
enabling it is intended to be used early in things like dynamic
linker/loaders. However, it can be used later for per-thread enablement
of features like WRSS.
WRSS
----
WRSS is an instruction that can write to shadow stacks. The HW provides
a way to enable this instruction for userspace use. Since shadow
stack’s are created initially protected, enabling WRSS allows any apps
that want to do unusual things with their stacks to have a way to
weaken protection and make things more flexible. A new feature bit is
defined to control enabling/disabling of WRSS.
History
=======
The branding “CET” really consists of two features: “Shadow Stack” and
“Indirect Branch Tracking”. They both restrict previously allowed, but rarely
valid behaviors and require userspace to change to avoid these behaviors before
enabling the protection. These raw HW features need to be assembled into a
software solution across userspace and kernel in order to add security value.
The kernel part of this solution has evolved iteratively starting with a lengthy
RFC period.
Until now, the enabling effort was trying to support both Shadow Stack and IBT.
This history will focus on a few areas of the shadow stack development history
that I thought stood out.
Signals
-------
Originally signals placed the location of the shadow stack restore
token inside the saved state on the stack. This was problematic from a
past ABI promises perspective. So the restore location was instead just
assumed from the shadow stack pointer. This works because in normal
allowed cases of calling sigreturn, the shadow stack pointer should be
right at the restore token at that time. There is no alternate shadow
stack support. If an alt shadow stack is added later we would need to
find a place to store the regular shadow stack token location. Options
could be to push something on the alt shadow stack, or to keep
something on the kernel side. So the current design keeps things simple
while slightly kicking the can down the road if alt shadow stacks
become a thing later. Siglongjmp is handled in glibc, using the incssp
instruction to unwind the shadow stack over the token.
Shadow Stack Allocation
-----------------------
makecontext() implementations need a way to create new shadow stacks
with restore token’s such that they can be pivoted to from userspace.
The first interface to do this was an arch_prctl(). It created a shadow
stack with a restore token pre-setup, since the kernel has an
instruction that can write to user shadow stacks. However, this
interface was abandoned for being strange.
The next version created PROT_SHADOW_STACK. This interface had two
problems. One, it left no options but for userspace to create writable
memory, write a restore token, then mproctect() it PROT_SHADOW_STACK.
The writable window left the shadow stack exposed, weakening the
security. Second, it caused problems with the guard pages. Since the
memory was initially created writable it did not have a guard page, but
then was mprotected later to a type of memory that should have one.
This resulted in missing guard pages and confused rb_subtree_gap’s.
This version introduces a new syscall that behaves similarly to the
initial arch_prctl() interface in that it has the kernel write the
restore token.
Enabling Interface
------------------
For the entire history of the original CET series, the design was to
enable shadow stack automatically if the feature bit was detected in
the elf header. Then it was userspace’s responsibility to turn it off
via an arch_prctl() if it was not desired, and this was handled by the
glibc dynamic loader. Glibc’s standard behavior (when CET if configured
is to leave shadow stack enabled if the executable and all linked
libraries are marked with shadow stacks.
Many distros (Fedora and others) have binaries already marked with
shadow stack, waiting for kernel support. Unfortunately their glibc
binaries expect the original arch_prctl() interface for allocating
shadow stacks, as those changes were pushed ahead of kernel support.
The net result of it all is, when updating to a kernel with shadow
stack these binaries would suddenly get shadow stack enabled and expect
the arch_prctl() interface to be there. And so calls to makecontext()
will fail, resulting in visible breakages. This series deals with this
problem as described below in "Updates".
Updates
=======
These updates were mostly driven by public comments, but a lot of the design
elements are new. I would like some extra scrutiny on the updates.
New syscall for Shadow Stack Allocation
---------------------------------------
A new syscall is added for allocating shadow stacks to replace
PROT_SHADOW_STACK. Several options were considered, as described in the
“x86/cet/shstk: Introduce map_shadow_stack syscall”.
Xsave Managed Supervisor State Modifications
--------------------------------------------
The shadow stack feature requires the kernel to modify xsaves managed
state. On one of the last versions of Yu-cheng’s series Boris had
commented on the pattern it was using to do this not necessarily being
ideal. The pattern was to force a restore to the registers and always
do the modification there. Then Thomas did an overhaul of the fpu code,
part of which consisted of making raw access to the xsave buffer
private to the fpu code. So this series tries to expose access again,
and in a way that addresses Boris’ comments.
The method is to provide functions like wmsrl/rdmsrl, but that can
direct the operation to the correct location (registers or buffer),
while giving the proper notice to the fpu subsystem so things don’t get
clobbered or corrupted.
In the past a solution like this was discussed as part of the PASID
series, and Thomas was not in favor. In CET’s case there is a more
logic around the CET MSR’s than in PASID's, and wrapping this logic
minimizes near identical open coded logic needed to do this more
efficiently. In addition it resolves the above described problem of
having no access to the xsave buffer. So it is being put forward here
under the supposition that CET’s usage may lead to a different
conclusion, not to try to ignore past direction.
The user interrupt series has similar needs as CET, and will also use
this internal interface if it’s found acceptable.
Support for WRSS
----------------
Andy Lutomirski had asked if we change the shadow stack allocation API
such that userspace cannot create arbitrary shadow stacks, then we look
at exposing an interface to enable the WRSS instruction for userspace.
This way app’s that want to do unexpected things with shadow stacks
would still have the option to create shadow stacks with arbitrary
data.
Switch Enabling Interface
-------------------------
As described above there is a problem with userspace binaries waiting
to break as soon as the kernel supports CET. This needs to be prevented
by changing the interface such that the old binaries will not enable
shadow stack AND behave as if shadow stack is not enabled. They should
run normally without shadow stack protection. Creating a new feature
(SHSTK2) for shadow stack was explored. SHSTK would never be supported
by the kernel, and all the userspace build tools would be updated to
target SHSTK2 instead of SHSTK. So old SHSTK binaries would be cleanly
disabled.
But there are existing downsides to automatic elf header processing
based enabling. The elf header feature spec is not defined by the
kernel and there are proposals to expand it to describe additional
logic. A simpler interface where the kernel is simply told what to
enable, and leaves all the decision making to userspace, is more
flexible for userspace and simpler for the kernel. There also already
needs to be an ARCH_X86_FEATURE_ENABLE arch_prctl() for WRSS (and
likely LAM will use it too), so it avoids there being two ways to turn
on these types of features. The only tricky part for shadow stack, is
that it has to be enabled very early. Wherever the shadow stack is
enabled, the app cannot return from that point, otherwise there will be
a shadow stack violation. It turns out glibc can enable shadow stack
this early, so it works nicely. So not automatically enabling any
features in the elf header will cleanly disable all old binaries, which
expect the kernel to enable CET features automatically. Then after the
kernel changes are upstream, glibc can be updated to use the new
interface. This is the solution implemented in this series.
Expand Commit Logs
------------------
As part of spinning up on this series, I found some of the commit logs
did not describe the changes in enough detail for me understand their
purpose. I tried to expand the logs and comments, where I had to go
digging. Hopefully it’s useful.
Limit to only Intel Processors
------------------------------
Shadow stack is supported on some AMD processors, but this revision
(with expanded HW usage and xsaves changes) has only has been tested on
Intel ones. So this series has a patch to limit shadow stack support to
Intel processors. Ideally the patch would not even make it to mainline,
and should be dropped as soon as this testing is done. It's included
just in case.
Future Work
===========
Even though this is now exclusively a shadow stack series, there is still some
remaining shadow stack work to be done.
Ptrace
------
Early in the series, there was a patch to allow IA32_U_CET and
IA32_PL3_SSP to be set. This patch was dropped and planned as a follow
up to basic support, and it remains the plan. It will be needed for
in-progress gdb support.
CRIU Support
------------
In the past there was some speculation on the mailing list about
whether CRIU would need to be taught about CET. It turns out, it does.
The first issue hit is that CRIU calls sigreturn directly from its
“parasite code” that it injects into the dumper process. This violates
this shadow stack implementation’s protection that intends to prevent
attackers from doing this.
With so many packages already enabled with shadow stack, there is
probably desire to make it work seamlessly. But in the meantime if
distros want to support shadow stack and CRIU, users could manually
disabled shadow stack via “GLIBC_TUNABLES=glibc.cpu.x86_shstk=off” for
a process they will wants to dump. It’s not ideal.
I’d like to hear what people think about having shadow stack in the
kernel without this resolved. Nothing would change for any users until
they enable shadow stack in the kernel and update to a glibc configured
with CET. Should CRIU userspace be solved before kernel support?
Selftests
---------
There are some CET selftests being worked on and they are not included
here.
Thanks,
Rick
Rick Edgecombe (7):
x86/mm: Prevent VM_WRITE shadow stacks
x86/fpu: Add helpers for modifying supervisor xstate
x86/fpu: Add unsafe xsave buffer helpers
x86/cet/shstk: Introduce map_shadow_stack syscall
selftests/x86: Add map_shadow_stack syscall test
x86/cet/shstk: Support wrss for userspace
x86/cpufeatures: Limit shadow stack to Intel CPUs
Yu-cheng Yu (28):
Documentation/x86: Add CET description
x86/cet/shstk: Add Kconfig option for Shadow Stack
x86/cpufeatures: Add CET CPU feature flags for Control-flow
Enforcement Technology (CET)
x86/cpufeatures: Introduce CPU setup and option parsing for CET
x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
x86/cet: Add control-protection fault handler
x86/mm: Remove _PAGE_DIRTY from kernel RO pages
x86/mm: Move pmd_write(), pud_write() up in the file
x86/mm: Introduce _PAGE_COW
drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS
x86/mm: Update pte_modify for _PAGE_COW
x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for
transition from _PAGE_DIRTY to _PAGE_COW
mm: Move VM_UFFD_MINOR_BIT from 37 to 38
mm: Introduce VM_SHADOW_STACK for shadow stack memory
x86/mm: Check Shadow Stack page fault errors
x86/mm: Update maybe_mkwrite() for shadow stack
mm: Fixup places that call pte_mkwrite() directly
mm: Add guard pages around a shadow stack.
mm/mmap: Add shadow stack pages to memory accounting
mm: Update can_follow_write_pte() for shadow stack
mm/mprotect: Exclude shadow stack from preserve_write
mm: Re-introduce vm_flags to do_mmap()
x86/cet/shstk: Add user-mode shadow stack support
x86/process: Change copy_thread() argument 'arg' to 'stack_size'
x86/cet/shstk: Handle thread shadow stack
x86/cet/shstk: Introduce shadow stack token setup/verify routines
x86/cet/shstk: Handle signals for shadow stack
x86/cet/shstk: Add arch_prctl elf feature functions
.../admin-guide/kernel-parameters.txt | 4 +
Documentation/filesystems/proc.rst | 1 +
Documentation/x86/cet.rst | 145 ++++++
Documentation/x86/index.rst | 1 +
arch/arm/kernel/signal.c | 2 +-
arch/arm64/kernel/signal.c | 2 +-
arch/arm64/kernel/signal32.c | 2 +-
arch/sparc/kernel/signal32.c | 2 +-
arch/sparc/kernel/signal_64.c | 2 +-
arch/x86/Kconfig | 22 +
arch/x86/Kconfig.assembler | 5 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/ia32/ia32_signal.c | 25 +-
arch/x86/include/asm/cet.h | 54 +++
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/fpu/api.h | 8 +
arch/x86/include/asm/fpu/types.h | 23 +-
arch/x86/include/asm/fpu/xstate.h | 6 +-
arch/x86/include/asm/idtentry.h | 4 +
arch/x86/include/asm/mman.h | 24 +
arch/x86/include/asm/mmu_context.h | 2 +
arch/x86/include/asm/msr-index.h | 20 +
arch/x86/include/asm/page_types.h | 7 +
arch/x86/include/asm/pgtable.h | 302 ++++++++++--
arch/x86/include/asm/pgtable_types.h | 48 +-
arch/x86/include/asm/processor.h | 6 +
arch/x86/include/asm/special_insns.h | 30 ++
arch/x86/include/asm/trap_pf.h | 2 +
arch/x86/include/uapi/asm/mman.h | 8 +-
arch/x86/include/uapi/asm/prctl.h | 10 +
arch/x86/include/uapi/asm/processor-flags.h | 2 +
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/cpu/common.c | 20 +
arch/x86/kernel/cpu/cpuid-deps.c | 1 +
arch/x86/kernel/elf_feature_prctl.c | 72 +++
arch/x86/kernel/fpu/xstate.c | 167 ++++++-
arch/x86/kernel/idt.c | 4 +
arch/x86/kernel/process.c | 17 +-
arch/x86/kernel/process_64.c | 2 +
arch/x86/kernel/shstk.c | 446 ++++++++++++++++++
arch/x86/kernel/signal.c | 13 +
arch/x86/kernel/signal_compat.c | 2 +-
arch/x86/kernel/traps.c | 62 +++
arch/x86/mm/fault.c | 19 +
arch/x86/mm/mmap.c | 48 ++
arch/x86/mm/pat/set_memory.c | 2 +-
arch/x86/mm/pgtable.c | 25 +
drivers/gpu/drm/i915/gvt/gtt.c | 2 +-
fs/aio.c | 2 +-
fs/proc/task_mmu.c | 3 +
include/linux/mm.h | 19 +-
include/linux/pgtable.h | 8 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/siginfo.h | 3 +-
include/uapi/asm-generic/unistd.h | 2 +-
ipc/shm.c | 2 +-
kernel/sys_ni.c | 1 +
mm/gup.c | 16 +-
mm/huge_memory.c | 27 +-
mm/memory.c | 5 +-
mm/migrate.c | 3 +-
mm/mmap.c | 15 +-
mm/mprotect.c | 9 +-
mm/nommu.c | 4 +-
mm/util.c | 2 +-
tools/testing/selftests/x86/Makefile | 9 +-
.../selftests/x86/test_map_shadow_stack.c | 75 +++
69 files changed, 1797 insertions(+), 92 deletions(-)
create mode 100644 Documentation/x86/cet.rst
create mode 100644 arch/x86/include/asm/cet.h
create mode 100644 arch/x86/include/asm/mman.h
create mode 100644 arch/x86/kernel/elf_feature_prctl.c
create mode 100644 arch/x86/kernel/shstk.c
create mode 100644 tools/testing/selftests/x86/test_map_shadow_stack.c
base-commit: e783362eb54cd99b2cac8b3a9aeac942e6f6ac07
--
2.17.1
Rick,
On Sun, Jan 30 2022 at 13:18, Rick Edgecombe wrote:
> This is a slight reboot of the userspace CET series. I will be taking over the
> series from Yu-cheng. Per some internal recommendations, I’ve reset the version
> number and am calling it a new series. Hopefully, it doesn’t cause
> confusion.
That's fine as it seems to be a major change in course, so a reset to V1
is justified. Don't worry about confusion, we can easily confuse ourself
with minor things than that version reset :)
> The new plan is to upstream only userspace Shadow Stack support at this point.
> IBT can follow later, but for now I’ll focus solely on the most in-demand and
> widely available (with the feature on AMD CPUs now) part of CET.
We just have to keep in IBT mind so that we don't add roadblocks which
we regret some time later.
> I thought as part of this reset, it might be useful to more fully write-up the
> design and summarize the history of the previous CET series. So this slightly
> long cover letter does that. The "Updates" section has the changes, if anyone
> doesn't want the history.
Thanks for that lengthy writeup. It's appreciated. There is too much
confusion already so a coherent summary is helpful.
> Why is Shadow Stack Wanted
> ==========================
> The main use case for userspace shadow stack is providing protection against
> return oriented programming attacks. Fedora and Ubuntu already have many/most
> packages enabled for shadow stack.
Which is unfortunately part of the overall problem ...
> History
> =======
> The branding “CET” really consists of two features: “Shadow Stack” and
> “Indirect Branch Tracking”. They both restrict previously allowed, but rarely
> valid behaviors and require userspace to change to avoid these behaviors before
> enabling the protection. These raw HW features need to be assembled into a
> software solution across userspace and kernel in order to add security value.
> The kernel part of this solution has evolved iteratively starting with a lengthy
> RFC period.
>
> Until now, the enabling effort was trying to support both Shadow Stack and IBT.
> This history will focus on a few areas of the shadow stack development history
> that I thought stood out.
>
> Signals
> -------
> Originally signals placed the location of the shadow stack restore
> token inside the saved state on the stack. This was problematic from a
> past ABI promises perspective. So the restore location was instead just
> assumed from the shadow stack pointer. This works because in normal
> allowed cases of calling sigreturn, the shadow stack pointer should be
> right at the restore token at that time. There is no alternate shadow
> stack support. If an alt shadow stack is added later we would
> need to
So how is that going to work? altstack is not an esoteric corner case.
> Enabling Interface
> ------------------
> For the entire history of the original CET series, the design was to
> enable shadow stack automatically if the feature bit was detected in
> the elf header. Then it was userspace’s responsibility to turn it off
> via an arch_prctl() if it was not desired, and this was handled by the
> glibc dynamic loader. Glibc’s standard behavior (when CET if configured
> is to leave shadow stack enabled if the executable and all linked
> libraries are marked with shadow stacks.
>
> Many distros (Fedora and others) have binaries already marked with
> shadow stack, waiting for kernel support. Unfortunately their glibc
> binaries expect the original arch_prctl() interface for allocating
> shadow stacks, as those changes were pushed ahead of kernel support.
> The net result of it all is, when updating to a kernel with shadow
> stack these binaries would suddenly get shadow stack enabled and expect
> the arch_prctl() interface to be there. And so calls to makecontext()
> will fail, resulting in visible breakages. This series deals with this
> problem as described below in "Updates".
I'm really impressed by the well thought out coordination on the glibc and
distro side. Designed by committee never worked ...
> Updates
> =======
> These updates were mostly driven by public comments, but a lot of the design
> elements are new. I would like some extra scrutiny on the updates.
>
> New syscall for Shadow Stack Allocation
> ---------------------------------------
> A new syscall is added for allocating shadow stacks to replace
> PROT_SHADOW_STACK. Several options were considered, as described in the
> “x86/cet/shstk: Introduce map_shadow_stack syscall”.
>
> Xsave Managed Supervisor State Modifications
> --------------------------------------------
> The shadow stack feature requires the kernel to modify xsaves managed
> state. On one of the last versions of Yu-cheng’s series Boris had
> commented on the pattern it was using to do this not necessarily being
> ideal. The pattern was to force a restore to the registers and always
> do the modification there. Then Thomas did an overhaul of the fpu code,
> part of which consisted of making raw access to the xsave buffer
> private to the fpu code. So this series tries to expose access again,
> and in a way that addresses Boris’ comments.
>
> The method is to provide functions like wmsrl/rdmsrl, but that can
> direct the operation to the correct location (registers or buffer),
> while giving the proper notice to the fpu subsystem so things don’t get
> clobbered or corrupted.
>
> In the past a solution like this was discussed as part of the PASID
> series, and Thomas was not in favor. In CET’s case there is a more
> logic around the CET MSR’s than in PASID's, and wrapping this logic
> minimizes near identical open coded logic needed to do this more
> efficiently. In addition it resolves the above described problem of
> having no access to the xsave buffer. So it is being put forward here
> under the supposition that CET’s usage may lead to a different
> conclusion, not to try to ignore past direction.
>
> The user interrupt series has similar needs as CET, and will also use
> this internal interface if it’s found acceptable.
I'll have a look.
> Switch Enabling Interface
> -------------------------
> But there are existing downsides to automatic elf header processing
> based enabling. The elf header feature spec is not defined by the
> kernel and there are proposals to expand it to describe additional
> logic. A simpler interface where the kernel is simply told what to
> enable, and leaves all the decision making to userspace, is more
> flexible for userspace and simpler for the kernel. There also already
> needs to be an ARCH_X86_FEATURE_ENABLE arch_prctl() for WRSS (and
> likely LAM will use it too), so it avoids there being two ways to turn
> on these types of features. The only tricky part for shadow stack, is
> that it has to be enabled very early. Wherever the shadow stack is
> enabled, the app cannot return from that point, otherwise there will be
> a shadow stack violation. It turns out glibc can enable shadow stack
> this early, so it works nicely. So not automatically enabling any
> features in the elf header will cleanly disable all old binaries, which
> expect the kernel to enable CET features automatically. Then after the
> kernel changes are upstream, glibc can be updated to use the new
> interface. This is the solution implemented in this series.
Makes sense.
> Expand Commit Logs
> ------------------
> As part of spinning up on this series, I found some of the commit logs
> did not describe the changes in enough detail for me understand their
> purpose. I tried to expand the logs and comments, where I had to go
> digging. Hopefully it’s useful.
Proper changelogs are always appreciated.
> Limit to only Intel Processors
> ------------------------------
> Shadow stack is supported on some AMD processors, but this revision
> (with expanded HW usage and xsaves changes) has only has been tested on
> Intel ones. So this series has a patch to limit shadow stack support to
> Intel processors. Ideally the patch would not even make it to mainline,
> and should be dropped as soon as this testing is done. It's included
> just in case.
Ha. I can give you access to an AMD machine with CET SS supported :)
> Future Work
> ===========
> Even though this is now exclusively a shadow stack series, there is still some
> remaining shadow stack work to be done.
>
> Ptrace
> ------
> Early in the series, there was a patch to allow IA32_U_CET and
> IA32_PL3_SSP to be set. This patch was dropped and planned as a follow
> up to basic support, and it remains the plan. It will be needed for
> in-progress gdb support.
It's pretty much a prerequisite for enabling it, right?
> CRIU Support
> ------------
> In the past there was some speculation on the mailing list about
> whether CRIU would need to be taught about CET. It turns out, it does.
> The first issue hit is that CRIU calls sigreturn directly from its
> “parasite code” that it injects into the dumper process. This violates
> this shadow stack implementation’s protection that intends to prevent
> attackers from doing this.
>
> With so many packages already enabled with shadow stack, there is
> probably desire to make it work seamlessly. But in the meantime if
> distros want to support shadow stack and CRIU, users could manually
> disabled shadow stack via “GLIBC_TUNABLES=glibc.cpu.x86_shstk=off” for
> a process they will wants to dump. It’s not ideal.
>
> I’d like to hear what people think about having shadow stack in the
> kernel without this resolved. Nothing would change for any users until
> they enable shadow stack in the kernel and update to a glibc configured
> with CET. Should CRIU userspace be solved before kernel support?
Definitely yes. Making CRIU users add a glibc tunable is not really an
option. We can't break CRIU systems with a kernel upgrade.
Thanks,
tglx
Hi Thomas, Thanks for feedback on the plan. On Thu, 2022-02-03 at 22:07 +0100, Thomas Gleixner wrote: > > Until now, the enabling effort was trying to support both Shadow > > Stack and IBT. > > This history will focus on a few areas of the shadow stack > > development history > > that I thought stood out. > > > > Signals > > ------- > > Originally signals placed the location of the shadow stack > > restore > > token inside the saved state on the stack. This was > > problematic from a > > past ABI promises perspective. So the restore location was > > instead just > > assumed from the shadow stack pointer. This works because in > > normal > > allowed cases of calling sigreturn, the shadow stack pointer > > should be > > right at the restore token at that time. There is no > > alternate shadow > > stack support. If an alt shadow stack is added later we > > would > > need to > > So how is that going to work? altstack is not an esoteric corner > case. My understanding is that the main usages for the signal stack were handling stack overflows and corruption. Since the shadow stack only contains return addresses rather than large stack allocations, and is not generally writable or pivotable, I thought there was a good possibility an alt shadow stack would not end up being especially useful. Does it seem like reasonable guesswork? If it does seems likely, I'll give it some more thought than that hand wavy plan. > > > Limit to only Intel Processors > > ------------------------------ > > Shadow stack is supported on some AMD processors, but this > > revision > > (with expanded HW usage and xsaves changes) has only has > > been tested on > > Intel ones. So this series has a patch to limit shadow stack > > support to > > Intel processors. Ideally the patch would not even make it > > to mainline, > > and should be dropped as soon as this testing is done. It's > > included > > just in case. > > Ha. I can give you access to an AMD machine with CET SS supported :) Thanks for the offer. It sounds like John Allen can do this testing. > > > Future Work > > =========== > > Even though this is now exclusively a shadow stack series, there is > > still some > > remaining shadow stack work to be done. > > > > Ptrace > > ------ > > Early in the series, there was a patch to allow IA32_U_CET > > and > > IA32_PL3_SSP to be set. This patch was dropped and planned > > as a follow > > up to basic support, and it remains the plan. It will be > > needed for > > in-progress gdb support. > > It's pretty much a prerequisite for enabling it, right? Yes. > > > CRIU Support > > ------------ > > In the past there was some speculation on the mailing list > > about > > whether CRIU would need to be taught about CET. It turns > > out, it does. > > The first issue hit is that CRIU calls sigreturn directly > > from its > > “parasite code” that it injects into the dumper process. > > This violates > > this shadow stack implementation’s protection that intends > > to prevent > > attackers from doing this. > > > > With so many packages already enabled with shadow stack, > > there is > > probably desire to make it work seamlessly. But in the > > meantime if > > distros want to support shadow stack and CRIU, users could > > manually > > disabled shadow stack via > > “GLIBC_TUNABLES=glibc.cpu.x86_shstk=off” for > > a process they will wants to dump. It’s not ideal. > > > > I’d like to hear what people think about having shadow stack > > in the > > kernel without this resolved. Nothing would change for any > > users until > > they enable shadow stack in the kernel and update to a glibc > > configured > > with CET. Should CRIU userspace be solved before kernel > > support? > > Definitely yes. Making CRIU users add a glibc tunable is not really > an > option. We can't break CRIU systems with a kernel upgrade. Ok got it, thanks. Just to be clear though, existing distros/binaries out there will not have shadow stack enabled with just an updated kernel (due to the enabling changes). So the CRIU tools would only break after future glibc binaries enable CET, which users/distros would have to do specifically (glibc doesn't even enable cet by default). Since the purpose of this feature is to restrict previously allowed behaviors, and it’s apparently getting enabled by default in some distro's packages, I guess there is a decent chance that once a system is updated with a future glibc some app somewhere will break. I was under the impression that as long as there were no breakages under a current set of binaries (including glibc), this was not considered a kernel regression. Please correct me if this is wrong. I think there are other options if we want to make this softer. Of course none of that prevents known breakages from being fixed for normal reasons and I’ll look into that for CRIU.
On Thu, Feb 3, 2022, at 5:08 PM, Edgecombe, Rick P wrote: > Hi Thomas, >> > Signals >> > ------- >> > Originally signals placed the location of the shadow stack >> > restore >> > token inside the saved state on the stack. This was >> > problematic from a >> > past ABI promises perspective. What was the actual problem? >> > So the restore location was >> > instead just >> > assumed from the shadow stack pointer. This works because in >> > normal >> > allowed cases of calling sigreturn, the shadow stack pointer >> > should be >> > right at the restore token at that time. There is no >> > alternate shadow >> > stack support. If an alt shadow stack is added later we >> > would >> > need to >> >> So how is that going to work? altstack is not an esoteric corner >> case. > > My understanding is that the main usages for the signal stack were > handling stack overflows and corruption. Since the shadow stack only > contains return addresses rather than large stack allocations, and is > not generally writable or pivotable, I thought there was a good > possibility an alt shadow stack would not end up being especially > useful. Does it seem like reasonable guesswork? It's also used for things like DOSEMU that execute in a weird context and then trap back out to the outer program using a signal handler and an altstack. Also, imagine someone writing a SIGSEGV handler specifically intended to handle shadow stack overflow. The shadow stack can be pivoted using RSTORSSP.
On Thu, 2022-02-03 at 21:20 -0800, Andy Lutomirski wrote: > On Thu, Feb 3, 2022, at 5:08 PM, Edgecombe, Rick P wrote: > > Hi Thomas, > > > > Signals > > > > ------- > > > > Originally signals placed the location of the shadow > > > > stack > > > > restore > > > > token inside the saved state on the stack. This was > > > > problematic from a > > > > past ABI promises perspective. > > What was the actual problem? The meat of the discussion that I saw was this thread: https://lore.kernel.org/lkml/CALCETrVTeYfzO-XWh+VwTuKCyPyp-oOMGH=QR_msG9tPQ4xPmA@mail.gmail.com/ The problem was that the saved shadow stack pointer was placed in a location unexpected by userspace (at the end of the floating point state), which apparently can be relocated by userspace that would not be aware of the shadow stack state at the end. I think an earlier version was not extendable either. It does not seem to be a fully dead end, but I have to admit I didn’t fully internalize the limits imposed by the userspace expectations of the sigframe because the current solution seemed good. I’ll have to dig into it a little more because alt stack, CRIU and these expectations all seem intertwined. Here is a question for you guys though – is a general ucontext extension solution a nice-to-have on its own? I was thinking that it would be better to keep CET stuff out of the sigframe related structures if it can be avoided. One reason is that it keeps security related register values out of writable locations. I’m not thinking of any specific problem, but just a general idea of not opening that stuff up if it’s not needed. An example is in the IBT series, the wait-for-endbranch state was saved in a ucontext flag. Some usages may want to keep it from being cleared in a signal and the endbranch check skipped. So for shadow stack, just the general notion that this is not ideal state to open up. On where to keep the wait-for-endbranch state, PeterZ had suggested the possibility of having a per-mm hashmap of (userspace stack addresses)-> (kernel side saved state), limited to some sensible limit of items. This could be extendable to other state besides CET stuff. I was thinking to look in that direction if it’s needed for the alt shadow stack. But, is it swimming against the place where saved state is "supposed" to be? There is some optimum of compatibility (more apps able to opt- in) and security. I guess it's probably not good to have the kernel bend over backwards trying to get both. > > > > So the restore location was > > > > instead just > > > > assumed from the shadow stack pointer. This works > > > > because in > > > > normal > > > > allowed cases of calling sigreturn, the shadow stack > > > > pointer > > > > should be > > > > right at the restore token at that time. There is no > > > > alternate shadow > > > > stack support. If an alt shadow stack is added later we > > > > would > > > > need to > > > > > > So how is that going to work? altstack is not an esoteric corner > > > case. > > > > My understanding is that the main usages for the signal stack were > > handling stack overflows and corruption. Since the shadow stack > > only > > contains return addresses rather than large stack allocations, and > > is > > not generally writable or pivotable, I thought there was a good > > possibility an alt shadow stack would not end up being especially > > useful. Does it seem like reasonable guesswork? > > It's also used for things like DOSEMU that execute in a weird context > and then trap back out to the outer program using a signal handler > and an altstack. Also, imagine someone writing a SIGSEGV handler > specifically intended to handle shadow stack overflow. Interesting, thanks. I had been thinking that an alt shadow stack would require a new interface that would mostly just sit dormant taking up space. But probably an (obvious) better way would be to just have the signalstack() syscall automatically create a new shadow stack, like the rest of the series does automatically for new threads. I’ll think I’ll see how that would look. > > The shadow stack can be pivoted using RSTORSSP. Yes, I just meant that the ability to pivot or modify is restricted (in RSTORSSP's case by restore token checks) and so with less ability to interact with it, it could be less likely for there to be corruptions. This if of course just speculation.
From: Edgecombe, Rick P > Sent: 04 February 2022 01:08 > Hi Thomas, > > Thanks for feedback on the plan. > > On Thu, 2022-02-03 at 22:07 +0100, Thomas Gleixner wrote: > > > Until now, the enabling effort was trying to support both Shadow > > > Stack and IBT. > > > This history will focus on a few areas of the shadow stack > > > development history > > > that I thought stood out. > > > > > > Signals > > > ------- > > > Originally signals placed the location of the shadow stack > > > restore > > > token inside the saved state on the stack. This was > > > problematic from a > > > past ABI promises perspective. So the restore location was > > > instead just > > > assumed from the shadow stack pointer. This works because in > > > normal > > > allowed cases of calling sigreturn, the shadow stack pointer > > > should be > > > right at the restore token at that time. There is no > > > alternate shadow > > > stack support. If an alt shadow stack is added later we > > > would > > > need to > > > > So how is that going to work? altstack is not an esoteric corner > > case. > > My understanding is that the main usages for the signal stack were > handling stack overflows and corruption. Since the shadow stack only > contains return addresses rather than large stack allocations, and is > not generally writable or pivotable, I thought there was a good > possibility an alt shadow stack would not end up being especially > useful. Does it seem like reasonable guesswork? The other 'problem' is that it is valid to longjump out of a signal handler. These days you have to use siglongjmp() not longjmp() but it is still used. It is probably also valid to use siglongjmp() to jump from a nested signal handler into the outer handler. Given both signal handlers can have their own stack, there can be three stacks involved. I think the shadow stack pointer has to be in ucontext - which also means the application can change it before returning from a signal. In much the same way as all the segment registers can be changed leading to all the nasty bugs when the final 'return to user' code traps in kernel when loading invalid segment registers or executing iret. Hmmm... do shadow stacks mean that longjmp() has to be a system call? David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Sat, Feb 5, 2022 at 5:27 AM David Laight <David.Laight@aculab.com> wrote: > > From: Edgecombe, Rick P > > Sent: 04 February 2022 01:08 > > Hi Thomas, > > > > Thanks for feedback on the plan. > > > > On Thu, 2022-02-03 at 22:07 +0100, Thomas Gleixner wrote: > > > > Until now, the enabling effort was trying to support both Shadow > > > > Stack and IBT. > > > > This history will focus on a few areas of the shadow stack > > > > development history > > > > that I thought stood out. > > > > > > > > Signals > > > > ------- > > > > Originally signals placed the location of the shadow stack > > > > restore > > > > token inside the saved state on the stack. This was > > > > problematic from a > > > > past ABI promises perspective. So the restore location was > > > > instead just > > > > assumed from the shadow stack pointer. This works because in > > > > normal > > > > allowed cases of calling sigreturn, the shadow stack pointer > > > > should be > > > > right at the restore token at that time. There is no > > > > alternate shadow > > > > stack support. If an alt shadow stack is added later we > > > > would > > > > need to > > > > > > So how is that going to work? altstack is not an esoteric corner > > > case. > > > > My understanding is that the main usages for the signal stack were > > handling stack overflows and corruption. Since the shadow stack only > > contains return addresses rather than large stack allocations, and is > > not generally writable or pivotable, I thought there was a good > > possibility an alt shadow stack would not end up being especially > > useful. Does it seem like reasonable guesswork? > > The other 'problem' is that it is valid to longjump out of a signal handler. > These days you have to use siglongjmp() not longjmp() but it is still used. > > It is probably also valid to use siglongjmp() to jump from a nested > signal handler into the outer handler. > Given both signal handlers can have their own stack, there can be three > stacks involved. > > I think the shadow stack pointer has to be in ucontext - which also > means the application can change it before returning from a signal. > In much the same way as all the segment registers can be changed > leading to all the nasty bugs when the final 'return to user' code > traps in kernel when loading invalid segment registers or executing iret. > > Hmmm... do shadow stacks mean that longjmp() has to be a system call? No. setjmp/longjmp save and restore shadow stack pointer. -- H.J.
On Sat, 2022-02-05 at 05:29 -0800, H.J. Lu wrote: > On Sat, Feb 5, 2022 at 5:27 AM David Laight <David.Laight@aculab.com> > wrote: > > > > From: Edgecombe, Rick P > > > Sent: 04 February 2022 01:08 > > > Hi Thomas, > > > > > > Thanks for feedback on the plan. > > > > > > On Thu, 2022-02-03 at 22:07 +0100, Thomas Gleixner wrote: > > > > > Until now, the enabling effort was trying to support both > > > > > Shadow > > > > > Stack and IBT. > > > > > This history will focus on a few areas of the shadow stack > > > > > development history > > > > > that I thought stood out. > > > > > > > > > > Signals > > > > > ------- > > > > > Originally signals placed the location of the shadow > > > > > stack > > > > > restore > > > > > token inside the saved state on the stack. This was > > > > > problematic from a > > > > > past ABI promises perspective. So the restore location > > > > > was > > > > > instead just > > > > > assumed from the shadow stack pointer. This works > > > > > because in > > > > > normal > > > > > allowed cases of calling sigreturn, the shadow stack > > > > > pointer > > > > > should be > > > > > right at the restore token at that time. There is no > > > > > alternate shadow > > > > > stack support. If an alt shadow stack is added later > > > > > we > > > > > would > > > > > need to > > > > > > > > So how is that going to work? altstack is not an esoteric > > > > corner > > > > case. > > > > > > My understanding is that the main usages for the signal stack > > > were > > > handling stack overflows and corruption. Since the shadow stack > > > only > > > contains return addresses rather than large stack allocations, > > > and is > > > not generally writable or pivotable, I thought there was a good > > > possibility an alt shadow stack would not end up being especially > > > useful. Does it seem like reasonable guesswork? > > > > The other 'problem' is that it is valid to longjump out of a signal > > handler. > > These days you have to use siglongjmp() not longjmp() but it is > > still used. > > > > It is probably also valid to use siglongjmp() to jump from a nested > > signal handler into the outer handler. > > Given both signal handlers can have their own stack, there can be > > three > > stacks involved. So the scenario is? 1. Handle signal 1 2. sigsetjmp() 3. signalstack() 4. Handle signal 2 on alt stack 5. siglongjmp() I'll check that it is covered by the tests, but I think it should work in this series that has no alt shadow stack. I have only done a high level overview of how the shadow stack stuff, that doesn't involve the kernel, works in glibc. Sounds like I'll need to do a deeper dive. > > > > I think the shadow stack pointer has to be in ucontext - which also > > means the application can change it before returning from a signal. Yes we might need to change it to support alt shadow stacks. Can you elaborate why you think it has to be in ucontext? I was thinking of looking at three options for storing the ssp: - Stored in the shadow stack like a token using WRUSS from the kernel. - Stored on the kernel side using a hashmap that maps ucontext or sigframe userspace address to ssp (this is of course similar to storing in ucontext, except that the user can’t change the ssp). - Stored writable in userspace in ucontext. But in this version, without alt shadow stacks, the shadow stack pointer is not stored in ucontext. This causes the limitation that userspace can only call sigreturn when it has returned back to a point where there is a restore token on the shadow stack (which was placed there by the kernel). This doesn’t mean it can’t switch to a different shadow stack or handle a nested signal, but it limits the possibility for calling sigreturn with a totally different sigframe (like CRIU and SROP attacks do). It should hopefully be a helpful, protective limitation for most apps and I'm hoping CRIU can be fixed without removing it. I am not aware of other limitations to signals (besides normal shadow stack enforcement), but I could be missing it. And people's skepticism is making me want to go back over it with more scrutiny. > > In much the same way as all the segment registers can be changed > > leading to all the nasty bugs when the final 'return to user' code > > traps in kernel when loading invalid segment registers or executing > > iret. I don't think this is as difficult to avoid because userspace ssp has its own register that should not be accessed at that point, but I have not given this aspect enough analysis. Thanks for bringing it up. > > > > Hmmm... do shadow stacks mean that longjmp() has to be a system > > call? > > No. setjmp/longjmp save and restore shadow stack pointer. > It sounds like it would help to write up in a lot more detail exactly how all the signal and specialer stack manipulation scenarios work in glibc.
On Sat, Feb 5, 2022 at 12:15 PM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote: > > On Sat, 2022-02-05 at 05:29 -0800, H.J. Lu wrote: > > On Sat, Feb 5, 2022 at 5:27 AM David Laight <David.Laight@aculab.com> > > wrote: > > > > > > From: Edgecombe, Rick P > > > > Sent: 04 February 2022 01:08 > > > > Hi Thomas, > > > > > > > > Thanks for feedback on the plan. > > > > > > > > On Thu, 2022-02-03 at 22:07 +0100, Thomas Gleixner wrote: > > > > > > Until now, the enabling effort was trying to support both > > > > > > Shadow > > > > > > Stack and IBT. > > > > > > This history will focus on a few areas of the shadow stack > > > > > > development history > > > > > > that I thought stood out. > > > > > > > > > > > > Signals > > > > > > ------- > > > > > > Originally signals placed the location of the shadow > > > > > > stack > > > > > > restore > > > > > > token inside the saved state on the stack. This was > > > > > > problematic from a > > > > > > past ABI promises perspective. So the restore location > > > > > > was > > > > > > instead just > > > > > > assumed from the shadow stack pointer. This works > > > > > > because in > > > > > > normal > > > > > > allowed cases of calling sigreturn, the shadow stack > > > > > > pointer > > > > > > should be > > > > > > right at the restore token at that time. There is no > > > > > > alternate shadow > > > > > > stack support. If an alt shadow stack is added later > > > > > > we > > > > > > would > > > > > > need to > > > > > > > > > > So how is that going to work? altstack is not an esoteric > > > > > corner > > > > > case. > > > > > > > > My understanding is that the main usages for the signal stack > > > > were > > > > handling stack overflows and corruption. Since the shadow stack > > > > only > > > > contains return addresses rather than large stack allocations, > > > > and is > > > > not generally writable or pivotable, I thought there was a good > > > > possibility an alt shadow stack would not end up being especially > > > > useful. Does it seem like reasonable guesswork? > > > > > > The other 'problem' is that it is valid to longjump out of a signal > > > handler. > > > These days you have to use siglongjmp() not longjmp() but it is > > > still used. > > > > > > It is probably also valid to use siglongjmp() to jump from a nested > > > signal handler into the outer handler. > > > Given both signal handlers can have their own stack, there can be > > > three > > > stacks involved. > > So the scenario is? > > 1. Handle signal 1 > 2. sigsetjmp() > 3. signalstack() > 4. Handle signal 2 on alt stack > 5. siglongjmp() > > I'll check that it is covered by the tests, but I think it should work > in this series that has no alt shadow stack. I have only done a high > level overview of how the shadow stack stuff, that doesn't involve the > kernel, works in glibc. Sounds like I'll need to do a deeper dive. > > > > > > > I think the shadow stack pointer has to be in ucontext - which also > > > means the application can change it before returning from a signal. > > Yes we might need to change it to support alt shadow stacks. Can you > elaborate why you think it has to be in ucontext? I was thinking of > looking at three options for storing the ssp: > - Stored in the shadow stack like a token using WRUSS from the kernel. > - Stored on the kernel side using a hashmap that maps ucontext or > sigframe userspace address to ssp (this is of course similar to > storing in ucontext, except that the user can’t change the ssp). > - Stored writable in userspace in ucontext. > > But in this version, without alt shadow stacks, the shadow stack > pointer is not stored in ucontext. This causes the limitation that > userspace can only call sigreturn when it has returned back to a point > where there is a restore token on the shadow stack (which was placed > there by the kernel). This doesn’t mean it can’t switch to a different > shadow stack or handle a nested signal, but it limits the possibility > for calling sigreturn with a totally different sigframe (like CRIU and > SROP attacks do). It should hopefully be a helpful, protective > limitation for most apps and I'm hoping CRIU can be fixed without > removing it. > > I am not aware of other limitations to signals (besides normal shadow > stack enforcement), but I could be missing it. And people's skepticism > is making me want to go back over it with more scrutiny. > > > > In much the same way as all the segment registers can be changed > > > leading to all the nasty bugs when the final 'return to user' code > > > traps in kernel when loading invalid segment registers or executing > > > iret. > > I don't think this is as difficult to avoid because userspace ssp has > its own register that should not be accessed at that point, but I have > not given this aspect enough analysis. Thanks for bringing it up. > > > > > > > Hmmm... do shadow stacks mean that longjmp() has to be a system > > > call? > > > > No. setjmp/longjmp save and restore shadow stack pointer. > > > > It sounds like it would help to write up in a lot more detail exactly > how all the signal and specialer stack manipulation scenarios work in > glibc. > setjmp/longjmp work on the same sigjmp_buf. Shadow stack pointer is saved and restored, just like any other callee-saved registers. -- H.J.
On 2/5/22 12:15, Edgecombe, Rick P wrote: > On Sat, 2022-02-05 at 05:29 -0800, H.J. Lu wrote: >> On Sat, Feb 5, 2022 at 5:27 AM David Laight <David.Laight@aculab.com> >> wrote: >>> >>> From: Edgecombe, Rick P >>>> Sent: 04 February 2022 01:08 >>>> Hi Thomas, >>>> >>>> Thanks for feedback on the plan. >>>> >>>> On Thu, 2022-02-03 at 22:07 +0100, Thomas Gleixner wrote: >>>>>> Until now, the enabling effort was trying to support both >>>>>> Shadow >>>>>> Stack and IBT. >>>>>> This history will focus on a few areas of the shadow stack >>>>>> development history >>>>>> that I thought stood out. >>>>>> >>>>>> Signals >>>>>> ------- >>>>>> Originally signals placed the location of the shadow >>>>>> stack >>>>>> restore >>>>>> token inside the saved state on the stack. This was >>>>>> problematic from a >>>>>> past ABI promises perspective. So the restore location >>>>>> was >>>>>> instead just >>>>>> assumed from the shadow stack pointer. This works >>>>>> because in >>>>>> normal >>>>>> allowed cases of calling sigreturn, the shadow stack >>>>>> pointer >>>>>> should be >>>>>> right at the restore token at that time. There is no >>>>>> alternate shadow >>>>>> stack support. If an alt shadow stack is added later >>>>>> we >>>>>> would >>>>>> need to >>>>> >>>>> So how is that going to work? altstack is not an esoteric >>>>> corner >>>>> case. >>>> >>>> My understanding is that the main usages for the signal stack >>>> were >>>> handling stack overflows and corruption. Since the shadow stack >>>> only >>>> contains return addresses rather than large stack allocations, >>>> and is >>>> not generally writable or pivotable, I thought there was a good >>>> possibility an alt shadow stack would not end up being especially >>>> useful. Does it seem like reasonable guesswork? >>> >>> The other 'problem' is that it is valid to longjump out of a signal >>> handler. >>> These days you have to use siglongjmp() not longjmp() but it is >>> still used. >>> >>> It is probably also valid to use siglongjmp() to jump from a nested >>> signal handler into the outer handler. >>> Given both signal handlers can have their own stack, there can be >>> three >>> stacks involved. > > So the scenario is? > > 1. Handle signal 1 > 2. sigsetjmp() > 3. signalstack() > 4. Handle signal 2 on alt stack > 5. siglongjmp() > > I'll check that it is covered by the tests, but I think it should work > in this series that has no alt shadow stack. I have only done a high > level overview of how the shadow stack stuff, that doesn't involve the > kernel, works in glibc. Sounds like I'll need to do a deeper dive. > >>> >>> I think the shadow stack pointer has to be in ucontext - which also >>> means the application can change it before returning from a signal. > > Yes we might need to change it to support alt shadow stacks. Can you > elaborate why you think it has to be in ucontext? I was thinking of > looking at three options for storing the ssp: > - Stored in the shadow stack like a token using WRUSS from the kernel. > - Stored on the kernel side using a hashmap that maps ucontext or > sigframe userspace address to ssp (this is of course similar to > storing in ucontext, except that the user can’t change the ssp). > - Stored writable in userspace in ucontext. > > But in this version, without alt shadow stacks, the shadow stack > pointer is not stored in ucontext. This causes the limitation that > userspace can only call sigreturn when it has returned back to a point > where there is a restore token on the shadow stack (which was placed > there by the kernel). I'll reply here and maybe cover multiple things. User code already needs to rewind the regular stack to call sigreturn -- sigreturn find the signal frame based on ESP/RSP. So if you call it from the wrong place, you go boom. I think that the Linux SHSTK ABI should have the property that no amount of tampering with just the ucontext and associated structures can cause sigreturn to redirect to the wrong IP -- there should be something on the shadow stack that also gets verified in sigreturn. IIRC the series does this, but it's been a while. The post-sigreturn SSP should be entirely implied by pre-sigreturn SSP (or perhaps something on the shadow stack), so, in the absence of an altshadowstack feature, no ucontext changes should be needed. We can also return from a signal or from more than one signal at once, as above, using siglongjmp. It seems like this should Just Work (tm), at least in the absence of altshadowstack. So this leaves altshadowstack. If we want to allow userspace to handle a shstk overflow, I think we need altshadowstack. And I can easily imagine signal handling in a coroutine or user-threading evironment (Go? UMCG or whatever it's called?) wanting this. As noted, this obnoxious Andy person didn't like putting any shstk-related extensions in the FPU state. For better or for worse, altshadowstack is (I think) fundamentally a new API. No amount of ucontext magic is going to materialize an entire shadow stack out of nowhere when someone calls sigaltstack(). So the questions are: should we support altshadowstack from day one and, if so, what should it look like? If we want to be clever, we could attempt to make altstadowstack compatible with RSTORSSP. Signal delivery pushes a restore token to the old stack (hah! what if the old stack is full?) and pushes the RSTORSSP busy magic to the new stack, and sigreturn inverts it. Code that wants to return without sigreturn does it manually with RSTORSSP. (Assuming that I've understood the arcane RSTORSSP sequence right. Intel wins major points for documentation quality here.) Or we could invent our own scheme. In either case, I don't immediately see any reason that the ucontext needs to contain a shadow stack pointer. There's a delightful wart to consider, though. siglongjmp, at least as currently envisioned, can't return off an altshadowstack: the whole point of the INCSSP distance restrictions to to avoid incrementing right off the top of the current stack, but siglongjmp off an altshadowstack fundamentally switches stacks. So either siglongjmp off an altshadowstack needs to be illegal or it needs to work differently. (By incssp-ing to the top of the altshadowstack, then switching, then incssp-ing some more? How does it even find the top of the current altshadowstack?) And the plot thickens if one tries to siglongjmp off two nested altshadowstack-using signals in a single call. Fortunately, since altshadowstack is a new API, it's not entirely crazy to have different rules. So I don't have a complete or even almost complete design in mind, but I think we do need to make a conscious decision either to design this right or to skip it for v1. As for CRIU, I don't think anyone really expects a new kernel, running new userspace that takes advantage of features in the new kernel, to work with old CRIU. Upgrading to a SHSTK kernel should still allow using CRIU with non-SHSTK userspace, but I don't see how it's possible for CRIU to handle SHSTK without updates. We should certainly do our best to make CRIU's life easy, though. This doesn’t mean it can’t switch to a different > shadow stack or handle a nested signal, but it limits the possibility > for calling sigreturn with a totally different sigframe (like CRIU and > SROP attacks do). It should hopefully be a helpful, protective > limitation for most apps and I'm hoping CRIU can be fixed without > removing it. > > I am not aware of other limitations to signals (besides normal shadow > stack enforcement), but I could be missing it. And people's skepticism > is making me want to go back over it with more scrutiny. > >>> In much the same way as all the segment registers can be changed >>> leading to all the nasty bugs when the final 'return to user' code >>> traps in kernel when loading invalid segment registers or executing >>> iret. > > I don't think this is as difficult to avoid because userspace ssp has > its own register that should not be accessed at that point, but I have > not given this aspect enough analysis. Thanks for bringing it up. > >>> >>> Hmmm... do shadow stacks mean that longjmp() has to be a system >>> call? >> >> No. setjmp/longjmp save and restore shadow stack pointer. >> > > It sounds like it would help to write up in a lot more detail exactly > how all the signal and specialer stack manipulation scenarios work in > glibc. >
On Mon, Feb 07 2022 at 17:31, Andy Lutomirski wrote:
> So this leaves altshadowstack. If we want to allow userspace to handle
> a shstk overflow, I think we need altshadowstack. And I can easily
> imagine signal handling in a coroutine or user-threading evironment (Go?
> UMCG or whatever it's called?) wanting this. As noted, this obnoxious
> Andy person didn't like putting any shstk-related extensions in the FPU
> state.
>
> For better or for worse, altshadowstack is (I think) fundamentally a new
> API. No amount of ucontext magic is going to materialize an entire
> shadow stack out of nowhere when someone calls sigaltstack(). So the
> questions are: should we support altshadowstack from day one and, if so,
> what should it look like?
I think we should support them from day one.
> So I don't have a complete or even almost complete design in mind, but I
> think we do need to make a conscious decision either to design this
> right or to skip it for v1.
Skipping it might create a fundamental design fail situation as it might
require changes to the shadow stack signal handling in general which
becomes a nightmare once a non-altstack API is exposed.
> As for CRIU, I don't think anyone really expects a new kernel, running
> new userspace that takes advantage of features in the new kernel, to
> work with old CRIU.
Yes, CRIU needs updates, but what ensures that CRIU managed user space
does not use SHSTK if CRIU is not updated yet?
> Upgrading to a SHSTK kernel should still allow using CRIU with
> non-SHSTK userspace, but I don't see how it's possible for CRIU to
> handle SHSTK without updates. We should certainly do our best to make
> CRIU's life easy, though.
Handling CRIU with SHSTK enabled has to be part of the overall design
otherwise we'll either end up with horrible hacks or with a requirement
to change the V1 UAPI....
Thanks,
tglx
On Tue, Feb 8, 2022, at 1:31 AM, Thomas Gleixner wrote: > On Mon, Feb 07 2022 at 17:31, Andy Lutomirski wrote: >> So this leaves altshadowstack. If we want to allow userspace to handle >> a shstk overflow, I think we need altshadowstack. And I can easily >> imagine signal handling in a coroutine or user-threading evironment (Go? >> UMCG or whatever it's called?) wanting this. As noted, this obnoxious >> Andy person didn't like putting any shstk-related extensions in the FPU >> state. >> >> For better or for worse, altshadowstack is (I think) fundamentally a new >> API. No amount of ucontext magic is going to materialize an entire >> shadow stack out of nowhere when someone calls sigaltstack(). So the >> questions are: should we support altshadowstack from day one and, if so, >> what should it look like? > > I think we should support them from day one. > >> So I don't have a complete or even almost complete design in mind, but I >> think we do need to make a conscious decision either to design this >> right or to skip it for v1. > > Skipping it might create a fundamental design fail situation as it might > require changes to the shadow stack signal handling in general which > becomes a nightmare once a non-altstack API is exposed. It would also expose a range of kernels in which shstk is on but programs that want altshadowstack don't have it. That would be annoying. > >> As for CRIU, I don't think anyone really expects a new kernel, running >> new userspace that takes advantage of features in the new kernel, to >> work with old CRIU. > > Yes, CRIU needs updates, but what ensures that CRIU managed user space > does not use SHSTK if CRIU is not updated yet? In some sense this is like any other feature. If a program uses timerfd but CRIU doesn't support timerfd, then it won't work. SHSTK is a bit unique because it's likely that all programs on a system will start using it all at once.
From: Edgecombe, Rick P > Sent: 05 February 2022 20:15 > > On Sat, 2022-02-05 at 05:29 -0800, H.J. Lu wrote: > > On Sat, Feb 5, 2022 at 5:27 AM David Laight <David.Laight@aculab.com> > > wrote: > > > > > > From: Edgecombe, Rick P > > > > Sent: 04 February 2022 01:08 > > > > Hi Thomas, > > > > > > > > Thanks for feedback on the plan. > > > > > > > > On Thu, 2022-02-03 at 22:07 +0100, Thomas Gleixner wrote: > > > > > > Until now, the enabling effort was trying to support both > > > > > > Shadow > > > > > > Stack and IBT. > > > > > > This history will focus on a few areas of the shadow stack > > > > > > development history > > > > > > that I thought stood out. > > > > > > > > > > > > Signals > > > > > > ------- > > > > > > Originally signals placed the location of the shadow > > > > > > stack > > > > > > restore > > > > > > token inside the saved state on the stack. This was > > > > > > problematic from a > > > > > > past ABI promises perspective. So the restore location > > > > > > was > > > > > > instead just > > > > > > assumed from the shadow stack pointer. This works > > > > > > because in > > > > > > normal > > > > > > allowed cases of calling sigreturn, the shadow stack > > > > > > pointer > > > > > > should be > > > > > > right at the restore token at that time. There is no > > > > > > alternate shadow > > > > > > stack support. If an alt shadow stack is added later > > > > > > we > > > > > > would > > > > > > need to > > > > > > > > > > So how is that going to work? altstack is not an esoteric > > > > > corner > > > > > case. > > > > > > > > My understanding is that the main usages for the signal stack > > > > were > > > > handling stack overflows and corruption. Since the shadow stack > > > > only > > > > contains return addresses rather than large stack allocations, > > > > and is > > > > not generally writable or pivotable, I thought there was a good > > > > possibility an alt shadow stack would not end up being especially > > > > useful. Does it seem like reasonable guesswork? > > > > > > The other 'problem' is that it is valid to longjump out of a signal > > > handler. > > > These days you have to use siglongjmp() not longjmp() but it is > > > still used. > > > > > > It is probably also valid to use siglongjmp() to jump from a nested > > > signal handler into the outer handler. > > > Given both signal handlers can have their own stack, there can be > > > three > > > stacks involved. > > So the scenario is? > > 1. Handle signal 1 > 2. sigsetjmp() > 3. signalstack() > 4. Handle signal 2 on alt stack > 5. siglongjmp() > > I'll check that it is covered by the tests, but I think it should work > in this series that has no alt shadow stack. I have only done a high > level overview of how the shadow stack stuff, that doesn't involve the > kernel, works in glibc. Sounds like I'll need to do a deeper dive. The posix/xopen definition for setjmp/longjmp doesn't require such longjmp requests to work. Although they still have to do something that doesn't break badly. Aborting the process is probably fine! > > > I think the shadow stack pointer has to be in ucontext - which also > > > means the application can change it before returning from a signal. > > Yes we might need to change it to support alt shadow stacks. Can you > elaborate why you think it has to be in ucontext? I was thinking of > looking at three options for storing the ssp: > - Stored in the shadow stack like a token using WRUSS from the kernel. > - Stored on the kernel side using a hashmap that maps ucontext or > sigframe userspace address to ssp (this is of course similar to > storing in ucontext, except that the user can’t change the ssp). > - Stored writable in userspace in ucontext. > > But in this version, without alt shadow stacks, the shadow stack > pointer is not stored in ucontext. This causes the limitation that > userspace can only call sigreturn when it has returned back to a point > where there is a restore token on the shadow stack (which was placed > there by the kernel). This doesn’t mean it can’t switch to a different > shadow stack or handle a nested signal, but it limits the possibility > for calling sigreturn with a totally different sigframe (like CRIU and > SROP attacks do). It should hopefully be a helpful, protective > limitation for most apps and I'm hoping CRIU can be fixed without > removing it. > > I am not aware of other limitations to signals (besides normal shadow > stack enforcement), but I could be missing it. And people's skepticism > is making me want to go back over it with more scrutiny. > > > > In much the same way as all the segment registers can be changed > > > leading to all the nasty bugs when the final 'return to user' code > > > traps in kernel when loading invalid segment registers or executing > > > iret. > > I don't think this is as difficult to avoid because userspace ssp has > its own register that should not be accessed at that point, but I have > not given this aspect enough analysis. Thanks for bringing it up. So the user ssp isn't saved (or restored) by the trap entry/exit. So it needs to be saved by the context switch code? Much like the user segment registers? So you are likely to get the same problems if restoring it can fault in kernel (eg for a non-canonical address). > > > Hmmm... do shadow stacks mean that longjmp() has to be a system > > > call? > > > > No. setjmp/longjmp save and restore shadow stack pointer. Ok, I was thinking that direct access to the user ssp would be a privileged operation. If it can be written you don't really have to worry about what code is trying to do - it can actually do what it likes. It just catches unintentional operations (like buffer overflows). Was there any 'spare' space in struct jmpbuf ? Otherwise you can only enable shadow stacks if everything has been recompiled - including any shared libraries the might be dlopen()ed. (or does the compiler invent an alloca() call somehow for a size that comes back from glibc?) I've never really considered how setjmp/longjmp handle callee saved register variables (apart from it being hard). The original pdp11 implementation probably only needed to save r6 and r7. What does happen to all the 'extended state' that XSAVE handles? IIRC all the AVX registers are caller saved (so should probably be zerod), but some of the SSE ones are callee saved, and one or two of the fpu flags are sticky and annoying enough the save/restore at the best of times. > It sounds like it would help to write up in a lot more detail exactly > how all the signal and specialer stack manipulation scenarios work in > glibc. Some cross references might have made people notice that the ucontext extensions for AVX512 (if not earlier ones) broke the minimal/default signal stack size. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Sun, Feb 6, 2022 at 5:42 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Edgecombe, Rick P
> > Sent: 05 February 2022 20:15
> >
> > On Sat, 2022-02-05 at 05:29 -0800, H.J. Lu wrote:
> > > On Sat, Feb 5, 2022 at 5:27 AM David Laight <David.Laight@aculab.com>
> > > wrote:
> > > >
> > > > From: Edgecombe, Rick P
> > > > > Sent: 04 February 2022 01:08
> > > > > Hi Thomas,
> > > > >
> > > > > Thanks for feedback on the plan.
> > > > >
> > > > > On Thu, 2022-02-03 at 22:07 +0100, Thomas Gleixner wrote:
> > > > > > > Until now, the enabling effort was trying to support both
> > > > > > > Shadow
> > > > > > > Stack and IBT.
> > > > > > > This history will focus on a few areas of the shadow stack
> > > > > > > development history
> > > > > > > that I thought stood out.
> > > > > > >
> > > > > > > Signals
> > > > > > > -------
> > > > > > > Originally signals placed the location of the shadow
> > > > > > > stack
> > > > > > > restore
> > > > > > > token inside the saved state on the stack. This was
> > > > > > > problematic from a
> > > > > > > past ABI promises perspective. So the restore location
> > > > > > > was
> > > > > > > instead just
> > > > > > > assumed from the shadow stack pointer. This works
> > > > > > > because in
> > > > > > > normal
> > > > > > > allowed cases of calling sigreturn, the shadow stack
> > > > > > > pointer
> > > > > > > should be
> > > > > > > right at the restore token at that time. There is no
> > > > > > > alternate shadow
> > > > > > > stack support. If an alt shadow stack is added later
> > > > > > > we
> > > > > > > would
> > > > > > > need to
> > > > > >
> > > > > > So how is that going to work? altstack is not an esoteric
> > > > > > corner
> > > > > > case.
> > > > >
> > > > > My understanding is that the main usages for the signal stack
> > > > > were
> > > > > handling stack overflows and corruption. Since the shadow stack
> > > > > only
> > > > > contains return addresses rather than large stack allocations,
> > > > > and is
> > > > > not generally writable or pivotable, I thought there was a good
> > > > > possibility an alt shadow stack would not end up being especially
> > > > > useful. Does it seem like reasonable guesswork?
> > > >
> > > > The other 'problem' is that it is valid to longjump out of a signal
> > > > handler.
> > > > These days you have to use siglongjmp() not longjmp() but it is
> > > > still used.
> > > >
> > > > It is probably also valid to use siglongjmp() to jump from a nested
> > > > signal handler into the outer handler.
> > > > Given both signal handlers can have their own stack, there can be
> > > > three
> > > > stacks involved.
> >
> > So the scenario is?
> >
> > 1. Handle signal 1
> > 2. sigsetjmp()
> > 3. signalstack()
> > 4. Handle signal 2 on alt stack
> > 5. siglongjmp()
> >
> > I'll check that it is covered by the tests, but I think it should work
> > in this series that has no alt shadow stack. I have only done a high
> > level overview of how the shadow stack stuff, that doesn't involve the
> > kernel, works in glibc. Sounds like I'll need to do a deeper dive.
>
> The posix/xopen definition for setjmp/longjmp doesn't require such
> longjmp requests to work.
>
> Although they still have to do something that doesn't break badly.
> Aborting the process is probably fine!
>
> > > > I think the shadow stack pointer has to be in ucontext - which also
> > > > means the application can change it before returning from a signal.
> >
> > Yes we might need to change it to support alt shadow stacks. Can you
> > elaborate why you think it has to be in ucontext? I was thinking of
> > looking at three options for storing the ssp:
> > - Stored in the shadow stack like a token using WRUSS from the kernel.
> > - Stored on the kernel side using a hashmap that maps ucontext or
> > sigframe userspace address to ssp (this is of course similar to
> > storing in ucontext, except that the user can’t change the ssp).
> > - Stored writable in userspace in ucontext.
> >
> > But in this version, without alt shadow stacks, the shadow stack
> > pointer is not stored in ucontext. This causes the limitation that
> > userspace can only call sigreturn when it has returned back to a point
> > where there is a restore token on the shadow stack (which was placed
> > there by the kernel). This doesn’t mean it can’t switch to a different
> > shadow stack or handle a nested signal, but it limits the possibility
> > for calling sigreturn with a totally different sigframe (like CRIU and
> > SROP attacks do). It should hopefully be a helpful, protective
> > limitation for most apps and I'm hoping CRIU can be fixed without
> > removing it.
> >
> > I am not aware of other limitations to signals (besides normal shadow
> > stack enforcement), but I could be missing it. And people's skepticism
> > is making me want to go back over it with more scrutiny.
> >
> > > > In much the same way as all the segment registers can be changed
> > > > leading to all the nasty bugs when the final 'return to user' code
> > > > traps in kernel when loading invalid segment registers or executing
> > > > iret.
> >
> > I don't think this is as difficult to avoid because userspace ssp has
> > its own register that should not be accessed at that point, but I have
> > not given this aspect enough analysis. Thanks for bringing it up.
>
> So the user ssp isn't saved (or restored) by the trap entry/exit.
> So it needs to be saved by the context switch code?
> Much like the user segment registers?
> So you are likely to get the same problems if restoring it can fault
> in kernel (eg for a non-canonical address).
>
> > > > Hmmm... do shadow stacks mean that longjmp() has to be a system
> > > > call?
> > >
> > > No. setjmp/longjmp save and restore shadow stack pointer.
>
> Ok, I was thinking that direct access to the user ssp would be
> a privileged operation.
User space can only pop shadow stack. longjmp does
#ifdef SHADOW_STACK_POINTER_OFFSET
# if IS_IN (libc) && defined SHARED && defined FEATURE_1_OFFSET
/* Check if Shadow Stack is enabled. */
testl $X86_FEATURE_1_SHSTK, %fs:FEATURE_1_OFFSET
jz L(skip_ssp)
# else
xorl %eax, %eax
# endif
/* Check and adjust the Shadow-Stack-Pointer. */
/* Get the current ssp. */
rdsspq %rax
/* And compare it with the saved ssp value. */
subq SHADOW_STACK_POINTER_OFFSET(%rdi), %rax
je L(skip_ssp)
/* Count the number of frames to adjust and adjust it
with incssp instruction. The instruction can adjust
the ssp by [0..255] value only thus use a loop if
the number of frames is bigger than 255. */
negq %rax
shrq $3, %rax
/* NB: We saved Shadow-Stack-Pointer of setjmp. Since we are
restoring Shadow-Stack-Pointer of setjmp's caller, we
need to unwind shadow stack by one more frame. */
addq $1, %rax
movl $255, %ebx
L(loop):
cmpq %rbx, %rax
cmovb %rax, %rbx
incsspq %rbx
subq %rbx, %rax
ja L(loop)
L(skip_ssp):
#endif
> If it can be written you don't really have to worry about what code
> is trying to do - it can actually do what it likes.
> It just catches unintentional operations (like buffer overflows).
>
> Was there any 'spare' space in struct jmpbuf ?
By pure luck, we have ONE spare space in sigjmp_buf.
> Otherwise you can only enable shadow stacks if everything has been
> recompiled - including any shared libraries the might be dlopen()ed.
> (or does the compiler invent an alloca() call somehow for a
> size that comes back from glibc?)
>
> I've never really considered how setjmp/longjmp handle callee saved
> register variables (apart from it being hard).
> The original pdp11 implementation probably only needed to save r6 and r7.
>
> What does happen to all the 'extended state' that XSAVE handles?
> IIRC all the AVX registers are caller saved (so should probably
> be zerod), but some of the SSE ones are callee saved, and one or
> two of the fpu flags are sticky and annoying enough the save/restore
> at the best of times.
>
> > It sounds like it would help to write up in a lot more detail exactly
> > how all the signal and specialer stack manipulation scenarios work in
> > glibc.
>
> Some cross references might have made people notice that the ucontext
> extensions for AVX512 (if not earlier ones) broke the minimal/default
> signal stack size.
>
> David
>
--
H.J.
* David Laight:
> Was there any 'spare' space in struct jmpbuf ?
jmp_buf in glibc looks like this:
(gdb) ptype/o jmp_buf
type = struct __jmp_buf_tag {
/* 0 | 64 */ __jmp_buf __jmpbuf;
/* 64 | 4 */ int __mask_was_saved;
/* XXX 4-byte hole */
/* 72 | 128 */ __sigset_t __saved_mask;
/* total size (bytes): 200 */
} [1]
(gdb) ptype/o __jmp_buf
type = long [8]
The glibc ABI reserves space for 1024 signals, something that Linux is
never going to implement. We can use that space to store a few extra
registers in __save_mask. There is a complication because the
pthread_cancel unwinding allocates only space for the __jmpbuf member.
Fortunately, we do not need to unwind the shadow stack for thread
cancellation, so we don't need that extra space in that case.
Thanks,
Florian
On Sun, 2022-02-06 at 13:42 +0000, David Laight wrote: > > I don't think this is as difficult to avoid because userspace ssp > > has > > its own register that should not be accessed at that point, but I > > have > > not given this aspect enough analysis. Thanks for bringing it up. > > So the user ssp isn't saved (or restored) by the trap entry/exit. > So it needs to be saved by the context switch code? > Much like the user segment registers? > So you are likely to get the same problems if restoring it can fault > in kernel (eg for a non-canonical address). PL3_SSP is lazily saved and restored by the FPU supervisor xsave code, which has its buffer in kernel memory. For the most part it is userspace instructions that use this register and they can only modify it in limited ways. It does look like IRET can cause a #CP if the PL3 SSP is not aligned, but only after RIP and CPL are set back to userspace. I'm not confident enough interpreting the specs to assert the specific behavior and will follow up internally to clarify.
(added more CRIU people) On Sun, Jan 30, 2022 at 01:18:03PM -0800, Rick Edgecombe wrote: > Hi, > > This is a slight reboot of the userspace CET series. I will be taking over the > series from Yu-cheng. Per some internal recommendations, I’ve reset the version > number and am calling it a new series. Hopefully, it doesn’t cause confusion. > > The new plan is to upstream only userspace Shadow Stack support at this point. > IBT can follow later, but for now I’ll focus solely on the most in-demand and > widely available (with the feature on AMD CPUs now) part of CET. > > I thought as part of this reset, it might be useful to more fully write-up the > design and summarize the history of the previous CET series. So this slightly > long cover letter does that. The "Updates" section has the changes, if anyone > doesn't want the history. > > > Why is Shadow Stack Wanted > ========================== > The main use case for userspace shadow stack is providing protection against > return oriented programming attacks. Fedora and Ubuntu already have many/most > packages enabled for shadow stack. The main missing piece is Linux kernel > support and there seems to be a high amount of interest in the ecosystem for > getting this feature supported. Besides security, Google has also done some > work on using shadow stack to improve performance and reliability of tracing. > > > Userspace Shadow Stack Implementation > ===================================== > Shadow stack works by maintaining a secondary (shadow) stack that cannot be > directly modified by applications. When executing a CALL instruction, the > processor pushes the return address to both the normal stack and to the special > permissioned shadow stack. Upon ret, the processor pops the shadow stack copy > and compares it to the normal stack copy. If the two differ, the processor > raises a control protection fault. This implementation supports shadow stack on > 64 bit kernels only, with support for 32 bit only via IA32 emulation. > > Shadow Stack Memory > ------------------- > The majority of this series deals with changes for handling the special > shadow stack memory permissions. This memory is specified by the > Dirty+RO PTE bits. A tricky aspect of this is that this combination was > previously used to specify COW memory. So Linux needs to handle COW > differently when shadow stack is in use. The solution is to use a > software PTE bit to denote COW memory, and take care to clear the dirty > bit when setting the memory RO. > > Setup and Upkeep of HW Registers > -------------------------------- > Using userspace CET requires a CR4 bit set, and also the manipulation > of two xsave managed MSRs. The kernel needs to modify these registers > during various operations like clone and signal handling. These > operations may happen when the registers are restored to the CPU, or > saved in an xsave buffer. Since the recent AMX triggered FPU overhaul > removed direct access to the xsave buffer, this series adds an > interface to operate on the supervisor xstate. > > New ABIs > -------- > This series introduces some new ABIs. The primary one is the shadow > stack itself. Since it is readable and the shadow stack pointer is > exposed to user space, applications can easily read and process the > shadow stack. And in fact the tracing usages plan to do exactly that. > > Most of the shadow stack contents are written by HW, but some of the > entries are added by the kernel. The main place for this is signals. As > part of handling the signal the kernel does some manual adjustment of > the shadow stack that userspace depends on. > > In addition to the contents of the shadow stack there is also user > visible behavior around when new shadow stacks are created and set in > the shadow stack pointer (SSP) register. This is relatively > straightforward – shadow stacks are created when new stacks are created > (thread creation, fork, etc). It is more or less what is required to > keep apps working. > > For situations when userspace creates a new stack (i.e. makecontext(), > fibers, etc), a new syscall is provided for creating shadow stack > memory. To make the shadow stack usable, it needs to have a restore > token written to the protected memory. So the syscall provides a way to > specificity this should be done by the kernel. > > When a shadow stack violation happens (when the return address of stack > not matching return address in shadow stack), a segfault is generated > with a new si_code specific to CET violations. > > Lastly, a new arch_prctl interface is created for controlling the > enablement of CET-like features. It is intended to also be used for > LAM. It operates on the feature status per-thread, so for process wide > enabling it is intended to be used early in things like dynamic > linker/loaders. However, it can be used later for per-thread enablement > of features like WRSS. > > WRSS > ---- > WRSS is an instruction that can write to shadow stacks. The HW provides > a way to enable this instruction for userspace use. Since shadow > stack’s are created initially protected, enabling WRSS allows any apps > that want to do unusual things with their stacks to have a way to > weaken protection and make things more flexible. A new feature bit is > defined to control enabling/disabling of WRSS. > > > History > ======= > The branding “CET” really consists of two features: “Shadow Stack” and > “Indirect Branch Tracking”. They both restrict previously allowed, but rarely > valid behaviors and require userspace to change to avoid these behaviors before > enabling the protection. These raw HW features need to be assembled into a > software solution across userspace and kernel in order to add security value. > The kernel part of this solution has evolved iteratively starting with a lengthy > RFC period. > > Until now, the enabling effort was trying to support both Shadow Stack and IBT. > This history will focus on a few areas of the shadow stack development history > that I thought stood out. > > Signals > ------- > Originally signals placed the location of the shadow stack restore > token inside the saved state on the stack. This was problematic from a > past ABI promises perspective. So the restore location was instead just > assumed from the shadow stack pointer. This works because in normal > allowed cases of calling sigreturn, the shadow stack pointer should be > right at the restore token at that time. There is no alternate shadow > stack support. If an alt shadow stack is added later we would need to > find a place to store the regular shadow stack token location. Options > could be to push something on the alt shadow stack, or to keep > something on the kernel side. So the current design keeps things simple > while slightly kicking the can down the road if alt shadow stacks > become a thing later. Siglongjmp is handled in glibc, using the incssp > instruction to unwind the shadow stack over the token. > > Shadow Stack Allocation > ----------------------- > makecontext() implementations need a way to create new shadow stacks > with restore token’s such that they can be pivoted to from userspace. > The first interface to do this was an arch_prctl(). It created a shadow > stack with a restore token pre-setup, since the kernel has an > instruction that can write to user shadow stacks. However, this > interface was abandoned for being strange. > > The next version created PROT_SHADOW_STACK. This interface had two > problems. One, it left no options but for userspace to create writable > memory, write a restore token, then mproctect() it PROT_SHADOW_STACK. > The writable window left the shadow stack exposed, weakening the > security. Second, it caused problems with the guard pages. Since the > memory was initially created writable it did not have a guard page, but > then was mprotected later to a type of memory that should have one. > This resulted in missing guard pages and confused rb_subtree_gap’s. > > This version introduces a new syscall that behaves similarly to the > initial arch_prctl() interface in that it has the kernel write the > restore token. > > Enabling Interface > ------------------ > For the entire history of the original CET series, the design was to > enable shadow stack automatically if the feature bit was detected in > the elf header. Then it was userspace’s responsibility to turn it off > via an arch_prctl() if it was not desired, and this was handled by the > glibc dynamic loader. Glibc’s standard behavior (when CET if configured > is to leave shadow stack enabled if the executable and all linked > libraries are marked with shadow stacks. > > Many distros (Fedora and others) have binaries already marked with > shadow stack, waiting for kernel support. Unfortunately their glibc > binaries expect the original arch_prctl() interface for allocating > shadow stacks, as those changes were pushed ahead of kernel support. > The net result of it all is, when updating to a kernel with shadow > stack these binaries would suddenly get shadow stack enabled and expect > the arch_prctl() interface to be there. And so calls to makecontext() > will fail, resulting in visible breakages. This series deals with this > problem as described below in "Updates". > > > Updates > ======= > These updates were mostly driven by public comments, but a lot of the design > elements are new. I would like some extra scrutiny on the updates. > > New syscall for Shadow Stack Allocation > --------------------------------------- > A new syscall is added for allocating shadow stacks to replace > PROT_SHADOW_STACK. Several options were considered, as described in the > “x86/cet/shstk: Introduce map_shadow_stack syscall”. > > Xsave Managed Supervisor State Modifications > -------------------------------------------- > The shadow stack feature requires the kernel to modify xsaves managed > state. On one of the last versions of Yu-cheng’s series Boris had > commented on the pattern it was using to do this not necessarily being > ideal. The pattern was to force a restore to the registers and always > do the modification there. Then Thomas did an overhaul of the fpu code, > part of which consisted of making raw access to the xsave buffer > private to the fpu code. So this series tries to expose access again, > and in a way that addresses Boris’ comments. > > The method is to provide functions like wmsrl/rdmsrl, but that can > direct the operation to the correct location (registers or buffer), > while giving the proper notice to the fpu subsystem so things don’t get > clobbered or corrupted. > > In the past a solution like this was discussed as part of the PASID > series, and Thomas was not in favor. In CET’s case there is a more > logic around the CET MSR’s than in PASID's, and wrapping this logic > minimizes near identical open coded logic needed to do this more > efficiently. In addition it resolves the above described problem of > having no access to the xsave buffer. So it is being put forward here > under the supposition that CET’s usage may lead to a different > conclusion, not to try to ignore past direction. > > The user interrupt series has similar needs as CET, and will also use > this internal interface if it’s found acceptable. > > Support for WRSS > ---------------- > Andy Lutomirski had asked if we change the shadow stack allocation API > such that userspace cannot create arbitrary shadow stacks, then we look > at exposing an interface to enable the WRSS instruction for userspace. > This way app’s that want to do unexpected things with shadow stacks > would still have the option to create shadow stacks with arbitrary > data. > > Switch Enabling Interface > ------------------------- > As described above there is a problem with userspace binaries waiting > to break as soon as the kernel supports CET. This needs to be prevented > by changing the interface such that the old binaries will not enable > shadow stack AND behave as if shadow stack is not enabled. They should > run normally without shadow stack protection. Creating a new feature > (SHSTK2) for shadow stack was explored. SHSTK would never be supported > by the kernel, and all the userspace build tools would be updated to > target SHSTK2 instead of SHSTK. So old SHSTK binaries would be cleanly > disabled. > > But there are existing downsides to automatic elf header processing > based enabling. The elf header feature spec is not defined by the > kernel and there are proposals to expand it to describe additional > logic. A simpler interface where the kernel is simply told what to > enable, and leaves all the decision making to userspace, is more > flexible for userspace and simpler for the kernel. There also already > needs to be an ARCH_X86_FEATURE_ENABLE arch_prctl() for WRSS (and > likely LAM will use it too), so it avoids there being two ways to turn > on these types of features. The only tricky part for shadow stack, is > that it has to be enabled very early. Wherever the shadow stack is > enabled, the app cannot return from that point, otherwise there will be > a shadow stack violation. It turns out glibc can enable shadow stack > this early, so it works nicely. So not automatically enabling any > features in the elf header will cleanly disable all old binaries, which > expect the kernel to enable CET features automatically. Then after the > kernel changes are upstream, glibc can be updated to use the new > interface. This is the solution implemented in this series. > > Expand Commit Logs > ------------------ > As part of spinning up on this series, I found some of the commit logs > did not describe the changes in enough detail for me understand their > purpose. I tried to expand the logs and comments, where I had to go > digging. Hopefully it’s useful. > > Limit to only Intel Processors > ------------------------------ > Shadow stack is supported on some AMD processors, but this revision > (with expanded HW usage and xsaves changes) has only has been tested on > Intel ones. So this series has a patch to limit shadow stack support to > Intel processors. Ideally the patch would not even make it to mainline, > and should be dropped as soon as this testing is done. It's included > just in case. > > > Future Work > =========== > Even though this is now exclusively a shadow stack series, there is still some > remaining shadow stack work to be done. > > Ptrace > ------ > Early in the series, there was a patch to allow IA32_U_CET and > IA32_PL3_SSP to be set. This patch was dropped and planned as a follow > up to basic support, and it remains the plan. It will be needed for > in-progress gdb support. > > CRIU Support > ------------ > In the past there was some speculation on the mailing list about > whether CRIU would need to be taught about CET. It turns out, it does. > The first issue hit is that CRIU calls sigreturn directly from its > “parasite code” that it injects into the dumper process. This violates > this shadow stack implementation’s protection that intends to prevent > attackers from doing this. > > With so many packages already enabled with shadow stack, there is > probably desire to make it work seamlessly. But in the meantime if > distros want to support shadow stack and CRIU, users could manually > disabled shadow stack via “GLIBC_TUNABLES=glibc.cpu.x86_shstk=off” for > a process they will wants to dump. It’s not ideal. > > I’d like to hear what people think about having shadow stack in the > kernel without this resolved. Nothing would change for any users until > they enable shadow stack in the kernel and update to a glibc configured > with CET. Should CRIU userspace be solved before kernel support? > > Selftests > --------- > There are some CET selftests being worked on and they are not included > here. > > Thanks, > > Rick > > Rick Edgecombe (7): > x86/mm: Prevent VM_WRITE shadow stacks > x86/fpu: Add helpers for modifying supervisor xstate > x86/fpu: Add unsafe xsave buffer helpers > x86/cet/shstk: Introduce map_shadow_stack syscall > selftests/x86: Add map_shadow_stack syscall test > x86/cet/shstk: Support wrss for userspace > x86/cpufeatures: Limit shadow stack to Intel CPUs > > Yu-cheng Yu (28): > Documentation/x86: Add CET description > x86/cet/shstk: Add Kconfig option for Shadow Stack > x86/cpufeatures: Add CET CPU feature flags for Control-flow > Enforcement Technology (CET) > x86/cpufeatures: Introduce CPU setup and option parsing for CET > x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states > x86/cet: Add control-protection fault handler > x86/mm: Remove _PAGE_DIRTY from kernel RO pages > x86/mm: Move pmd_write(), pud_write() up in the file > x86/mm: Introduce _PAGE_COW > drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS > x86/mm: Update pte_modify for _PAGE_COW > x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for > transition from _PAGE_DIRTY to _PAGE_COW > mm: Move VM_UFFD_MINOR_BIT from 37 to 38 > mm: Introduce VM_SHADOW_STACK for shadow stack memory > x86/mm: Check Shadow Stack page fault errors > x86/mm: Update maybe_mkwrite() for shadow stack > mm: Fixup places that call pte_mkwrite() directly > mm: Add guard pages around a shadow stack. > mm/mmap: Add shadow stack pages to memory accounting > mm: Update can_follow_write_pte() for shadow stack > mm/mprotect: Exclude shadow stack from preserve_write > mm: Re-introduce vm_flags to do_mmap() > x86/cet/shstk: Add user-mode shadow stack support > x86/process: Change copy_thread() argument 'arg' to 'stack_size' > x86/cet/shstk: Handle thread shadow stack > x86/cet/shstk: Introduce shadow stack token setup/verify routines > x86/cet/shstk: Handle signals for shadow stack > x86/cet/shstk: Add arch_prctl elf feature functions > > .../admin-guide/kernel-parameters.txt | 4 + > Documentation/filesystems/proc.rst | 1 + > Documentation/x86/cet.rst | 145 ++++++ > Documentation/x86/index.rst | 1 + > arch/arm/kernel/signal.c | 2 +- > arch/arm64/kernel/signal.c | 2 +- > arch/arm64/kernel/signal32.c | 2 +- > arch/sparc/kernel/signal32.c | 2 +- > arch/sparc/kernel/signal_64.c | 2 +- > arch/x86/Kconfig | 22 + > arch/x86/Kconfig.assembler | 5 + > arch/x86/entry/syscalls/syscall_32.tbl | 1 + > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > arch/x86/ia32/ia32_signal.c | 25 +- > arch/x86/include/asm/cet.h | 54 +++ > arch/x86/include/asm/cpufeatures.h | 1 + > arch/x86/include/asm/disabled-features.h | 8 +- > arch/x86/include/asm/fpu/api.h | 8 + > arch/x86/include/asm/fpu/types.h | 23 +- > arch/x86/include/asm/fpu/xstate.h | 6 +- > arch/x86/include/asm/idtentry.h | 4 + > arch/x86/include/asm/mman.h | 24 + > arch/x86/include/asm/mmu_context.h | 2 + > arch/x86/include/asm/msr-index.h | 20 + > arch/x86/include/asm/page_types.h | 7 + > arch/x86/include/asm/pgtable.h | 302 ++++++++++-- > arch/x86/include/asm/pgtable_types.h | 48 +- > arch/x86/include/asm/processor.h | 6 + > arch/x86/include/asm/special_insns.h | 30 ++ > arch/x86/include/asm/trap_pf.h | 2 + > arch/x86/include/uapi/asm/mman.h | 8 +- > arch/x86/include/uapi/asm/prctl.h | 10 + > arch/x86/include/uapi/asm/processor-flags.h | 2 + > arch/x86/kernel/Makefile | 1 + > arch/x86/kernel/cpu/common.c | 20 + > arch/x86/kernel/cpu/cpuid-deps.c | 1 + > arch/x86/kernel/elf_feature_prctl.c | 72 +++ > arch/x86/kernel/fpu/xstate.c | 167 ++++++- > arch/x86/kernel/idt.c | 4 + > arch/x86/kernel/process.c | 17 +- > arch/x86/kernel/process_64.c | 2 + > arch/x86/kernel/shstk.c | 446 ++++++++++++++++++ > arch/x86/kernel/signal.c | 13 + > arch/x86/kernel/signal_compat.c | 2 +- > arch/x86/kernel/traps.c | 62 +++ > arch/x86/mm/fault.c | 19 + > arch/x86/mm/mmap.c | 48 ++ > arch/x86/mm/pat/set_memory.c | 2 +- > arch/x86/mm/pgtable.c | 25 + > drivers/gpu/drm/i915/gvt/gtt.c | 2 +- > fs/aio.c | 2 +- > fs/proc/task_mmu.c | 3 + > include/linux/mm.h | 19 +- > include/linux/pgtable.h | 8 + > include/linux/syscalls.h | 1 + > include/uapi/asm-generic/siginfo.h | 3 +- > include/uapi/asm-generic/unistd.h | 2 +- > ipc/shm.c | 2 +- > kernel/sys_ni.c | 1 + > mm/gup.c | 16 +- > mm/huge_memory.c | 27 +- > mm/memory.c | 5 +- > mm/migrate.c | 3 +- > mm/mmap.c | 15 +- > mm/mprotect.c | 9 +- > mm/nommu.c | 4 +- > mm/util.c | 2 +- > tools/testing/selftests/x86/Makefile | 9 +- > .../selftests/x86/test_map_shadow_stack.c | 75 +++ > 69 files changed, 1797 insertions(+), 92 deletions(-) > create mode 100644 Documentation/x86/cet.rst > create mode 100644 arch/x86/include/asm/cet.h > create mode 100644 arch/x86/include/asm/mman.h > create mode 100644 arch/x86/kernel/elf_feature_prctl.c > create mode 100644 arch/x86/kernel/shstk.c > create mode 100644 tools/testing/selftests/x86/test_map_shadow_stack.c > > > base-commit: e783362eb54cd99b2cac8b3a9aeac942e6f6ac07 > -- > 2.17.1 -- Sincerely yours, Mike.
On Sun, Feb 06, 2022 at 08:42:03PM +0200, Mike Rapoport wrote: > (added more CRIU people) Thanks, Mike. > On Sun, Jan 30, 2022 at 01:18:03PM -0800, Rick Edgecombe wrote: > > This is a slight reboot of the userspace CET series. I will be taking over the > > series from Yu-cheng. Per some internal recommendations, I’ve reset the version > > number and am calling it a new series. Hopefully, it doesn’t cause confusion. > > > > The new plan is to upstream only userspace Shadow Stack support at this point. > > IBT can follow later, but for now I’ll focus solely on the most in-demand and > > widely available (with the feature on AMD CPUs now) part of CET. > > > > I thought as part of this reset, it might be useful to more fully write-up the > > design and summarize the history of the previous CET series. So this slightly > > long cover letter does that. The "Updates" section has the changes, if anyone > > doesn't want the history. [...] > > CRIU Support > > ------------ > > In the past there was some speculation on the mailing list about > > whether CRIU would need to be taught about CET. It turns out, it does. > > The first issue hit is that CRIU calls sigreturn directly from its > > “parasite code” that it injects into the dumper process. This violates > > this shadow stack implementation’s protection that intends to prevent > > attackers from doing this. > > > > With so many packages already enabled with shadow stack, there is > > probably desire to make it work seamlessly. But in the meantime if > > distros want to support shadow stack and CRIU, users could manually > > disabled shadow stack via “GLIBC_TUNABLES=glibc.cpu.x86_shstk=off” for > > a process they will wants to dump. It’s not ideal. > > > > I’d like to hear what people think about having shadow stack in the > > kernel without this resolved. Nothing would change for any users until > > they enable shadow stack in the kernel and update to a glibc configured > > with CET. Should CRIU userspace be solved before kernel support? From the CRIU side I can say that I would definitely like to see this resolved. CRIU just went through a similar exercise with rseq() being enabled in glibc and CI broke all around for us and other projects relying on CRIU. Although rseq() was around for a long time we were not aware of it but luckily 5.13 introduced a way to handle it for CRIU with ptrace. An environment variable existed but did not really help when CRIU is called somewhere in the middle of the container software stack. From my point of view a solution not involving an environment variable would definitely be preferred. Adrian
On 2/6/22 23:20, Adrian Reber wrote: >>> CRIU Support >>> ------------ >>> In the past there was some speculation on the mailing list about >>> whether CRIU would need to be taught about CET. It turns out, it does. >>> The first issue hit is that CRIU calls sigreturn directly from its >>> “parasite code” that it injects into the dumper process. This violates >>> this shadow stack implementation’s protection that intends to prevent >>> attackers from doing this. ... >>From the CRIU side I can say that I would definitely like to see this > resolved. CRIU just went through a similar exercise with rseq() being > enabled in glibc and CI broke all around for us and other projects > relying on CRIU. Although rseq() was around for a long time we were not > aware of it but luckily 5.13 introduced a way to handle it for CRIU with > ptrace. An environment variable existed but did not really help when > CRIU is called somewhere in the middle of the container software stack. > >>From my point of view a solution not involving an environment variable > would definitely be preferred. Have there been things like this for CRIU in the past? Something where CRIU needs control but that's also security-sensitive? Any thoughts on how you would _like_ to see this resolved?
On Mon, Feb 07, 2022 at 08:30:50AM -0800, Dave Hansen wrote: > On 2/6/22 23:20, Adrian Reber wrote: > >>> CRIU Support > >>> ------------ > >>> In the past there was some speculation on the mailing list about > >>> whether CRIU would need to be taught about CET. It turns out, it does. > >>> The first issue hit is that CRIU calls sigreturn directly from its > >>> “parasite code” that it injects into the dumper process. This violates > >>> this shadow stack implementation’s protection that intends to prevent > >>> attackers from doing this. > ... > >>From the CRIU side I can say that I would definitely like to see this > > resolved. CRIU just went through a similar exercise with rseq() being > > enabled in glibc and CI broke all around for us and other projects > > relying on CRIU. Although rseq() was around for a long time we were not > > aware of it but luckily 5.13 introduced a way to handle it for CRIU with > > ptrace. An environment variable existed but did not really help when > > CRIU is called somewhere in the middle of the container software stack. > > > >>From my point of view a solution not involving an environment variable > > would definitely be preferred. > > Have there been things like this for CRIU in the past? Something where > CRIU needs control but that's also security-sensitive? Generally CRIU requires (almost) root privileges to work, but I don't think it handles something as security sensitive and restrictive as shadow stacks. > Any thoughts on how you would _like_ to see this resolved? Ideally, CRIU will need a knob that will tell the kernel/CET machinery where the next RET will jump, along the lines of restore_signal_shadow_stack() AFAIU. But such a knob will immediately reduce the security value of the entire thing, and I don't have good ideas how to deal with it :( -- Sincerely yours, Mike.
On Tue, Feb 08, 2022 at 11:16:51AM +0200, Mike Rapoport wrote: > > > Any thoughts on how you would _like_ to see this resolved? > > Ideally, CRIU will need a knob that will tell the kernel/CET machinery > where the next RET will jump, along the lines of > restore_signal_shadow_stack() AFAIU. > > But such a knob will immediately reduce the security value of the entire > thing, and I don't have good ideas how to deal with it :( Probably a kind of latch in the task_struct which would trigger off once returt to a different address happened, thus we would be able to jump inside paratite code. Of course such trigger should be available under proper capability only.
On Tue, Feb 8, 2022, at 1:29 AM, Cyrill Gorcunov wrote: > On Tue, Feb 08, 2022 at 11:16:51AM +0200, Mike Rapoport wrote: >> >> > Any thoughts on how you would _like_ to see this resolved? >> >> Ideally, CRIU will need a knob that will tell the kernel/CET machinery >> where the next RET will jump, along the lines of >> restore_signal_shadow_stack() AFAIU. >> >> But such a knob will immediately reduce the security value of the entire >> thing, and I don't have good ideas how to deal with it :( > > Probably a kind of latch in the task_struct which would trigger off once > returt to a different address happened, thus we would be able to jump inside > paratite code. Of course such trigger should be available under proper > capability only. I'm not fully in touch with how parasite, etc works. Are we talking about save or restore? If it's restore, what exactly does CRIU need to do? Is it just that CRIU needs to return out from its resume code into the to-be-resumed program without tripping CET? Would it be acceptable for CRIU to require that at least one shstk slot be free at save time? Or do we need a mechanism to atomically switch to a completely full shadow stack at resume? Off the top of my head, a sigreturn (or sigreturn-like mechanism) that is intended for use for altshadowstack could safely verify a token on the altshdowstack, possibly compare to something in ucontext (or not -- this isn't clearly necessary) and switch back to the previous stack. CRIU could use that too. Obviously CRIU will need a way to populate the relevant stacks, but WRUSS can be used for that, and I think this is a fundamental requirement for CRIU -- CRIU restore absolutely needs a way to write the saved shadow stack data into the shadow stack. So I think the only special capability that CRIU really needs is WRUSS, and we need to wire that up anyway.
On Tue, Feb 08, 2022 at 08:21:20AM -0800, Andy Lutomirski wrote: > >> But such a knob will immediately reduce the security value of the entire > >> thing, and I don't have good ideas how to deal with it :( > > > > Probably a kind of latch in the task_struct which would trigger off once > > returt to a different address happened, thus we would be able to jump inside > > paratite code. Of course such trigger should be available under proper > > capability only. > > I'm not fully in touch with how parasite, etc works. Are we talking about save or restore? We use parasite code in question during checkpoint phase as far as I remember. push addr/lret trick is used to run "injected" code (code injection itself is done via ptrace) in compat mode at least. Dima, Andrei, I didn't look into this code for years already, do we still need to support compat mode at all? > If it's restore, what exactly does CRIU need to do? Is it just that CRIU needs to return > out from its resume code into the to-be-resumed program without tripping CET? Would it > be acceptable for CRIU to require that at least one shstk slot be free at save time? > Or do we need a mechanism to atomically switch to a completely full shadow stack at resume? > > Off the top of my head, a sigreturn (or sigreturn-like mechanism) that is intended for > use for altshadowstack could safely verify a token on the altshdowstack, possibly > compare to something in ucontext (or not -- this isn't clearly necessary) and switch > back to the previous stack. CRIU could use that too. Obviously CRIU will need a way > to populate the relevant stacks, but WRUSS can be used for that, and I think this > is a fundamental requirement for CRIU -- CRIU restore absolutely needs a way to write > the saved shadow stack data into the shadow stack. > > So I think the only special capability that CRIU really needs is WRUSS, and > we need to wire that up anyway. Thanks for these notes, Andy! I can't provide any sane answer here since didn't read tech spec for this feature yet :-)
[un-Cc'ed a lot of people, as the question is highly off-topic, so I
don't feel like the answer is of big interest to them, keeping x86
maintainer in]
On 2/8/22 17:02, Cyrill Gorcunov wrote:
> On Tue, Feb 08, 2022 at 08:21:20AM -0800, Andy Lutomirski wrote:
>>>> But such a knob will immediately reduce the security value of the entire
>>>> thing, and I don't have good ideas how to deal with it :(
>>>
>>> Probably a kind of latch in the task_struct which would trigger off once
>>> returt to a different address happened, thus we would be able to jump inside
>>> paratite code. Of course such trigger should be available under proper
>>> capability only.
>>
>> I'm not fully in touch with how parasite, etc works. Are we talking about save or restore?
>
> We use parasite code in question during checkpoint phase as far as I remember.
> push addr/lret trick is used to run "injected" code (code injection itself is
> done via ptrace) in compat mode at least. Dima, Andrei, I didn't look into this code
> for years already, do we still need to support compat mode at all?
Cyrill, I haven't been working on/with Virtuozzo people last 5 years, so
I don't know. As you're more connected to Vz, your question seems to
imply that ia32 C/R is no longer needed by Vz customers. If it's not
needed anymore - I'm all for stopping testing of it in CRIU.
The only thing I ask before you go and remove that is to ping the person
who paid some substantial amount of money on bugsbounty to get ia32
support in CRIU. Albeit, in the end I didn't get a cent out of it (VZ
managers insisted on receiving all of the money), I still feel
responsible to that person as the amount he paid was the biggest bounty
at that moment and I was the person, who presented him ia32 C/R as
working and being tested.
If you need his contacts - ping me, I'll search and find it.
Other than that - if no one needs ia32 C/R, let's do go ahead and drop
testing (and maybe some complicated code) of it.
Thanks,
Dmitry
On Tue, Feb 08, 2022 at 09:54:14PM +0000, Dmitry Safonov wrote: > [un-Cc'ed a lot of people, as the question is highly off-topic, so I > don't feel like the answer is of big interest to them, keeping x86 > maintainer in] > > On 2/8/22 17:02, Cyrill Gorcunov wrote: > >>> Probably a kind of latch in the task_struct which would trigger off once > >>> returt to a different address happened, thus we would be able to jump inside > >>> paratite code. Of course such trigger should be available under proper > >>> capability only. > >> > >> I'm not fully in touch with how parasite, etc works. Are we talking about save or restore? > > > > We use parasite code in question during checkpoint phase as far as I remember. > > push addr/lret trick is used to run "injected" code (code injection itself is > > done via ptrace) in compat mode at least. Dima, Andrei, I didn't look into this code > > for years already, do we still need to support compat mode at all? > > Cyrill, I haven't been working on/with Virtuozzo people last 5 years, so > I don't know. As you're more connected to Vz, your question seems to > imply that ia32 C/R is no longer needed by Vz customers. If it's not > needed anymore - I'm all for stopping testing of it in CRIU. Nope. I didn't see any sign that Vz is intended to drop ia32 suport. But Vz's criu instance is following vanilla's one, that is why I asked you and Andrew about ia32 support. This ia32 code snippet with stack manipulation simply popped out in my mind immediately when Andy asked how we deal with stack. Also we adjust stack in restorer code but I need some time to recall all thses details since as I said I didn't work with criu code for years already. Cyrill
On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote: > On Tue, Feb 08, 2022 at 08:21:20AM -0800, Andy Lutomirski wrote: > > > > But such a knob will immediately reduce the security value of > > > > the entire > > > > thing, and I don't have good ideas how to deal with it :( > > > > > > Probably a kind of latch in the task_struct which would trigger > > > off once > > > returt to a different address happened, thus we would be able to > > > jump inside > > > paratite code. Of course such trigger should be available under > > > proper > > > capability only. > > > > I'm not fully in touch with how parasite, etc works. Are we > > talking about save or restore? > > We use parasite code in question during checkpoint phase as far as I > remember. > push addr/lret trick is used to run "injected" code (code injection > itself is > done via ptrace) in compat mode at least. Dima, Andrei, I didn't look > into this code > for years already, do we still need to support compat mode at all? > > > If it's restore, what exactly does CRIU need to do? Is it just > > that CRIU needs to return > > out from its resume code into the to-be-resumed program without > > tripping CET? Would it > > be acceptable for CRIU to require that at least one shstk slot be > > free at save time? > > Or do we need a mechanism to atomically switch to a completely full > > shadow stack at resume? > > > > Off the top of my head, a sigreturn (or sigreturn-like mechanism) > > that is intended for > > use for altshadowstack could safely verify a token on the > > altshdowstack, possibly > > compare to something in ucontext (or not -- this isn't clearly > > necessary) and switch > > back to the previous stack. CRIU could use that too. Obviously > > CRIU will need a way > > to populate the relevant stacks, but WRUSS can be used for that, > > and I think this > > is a fundamental requirement for CRIU -- CRIU restore absolutely > > needs a way to write > > the saved shadow stack data into the shadow stack. Still wrapping my head around the CRIU save and restore steps, but another general approach might be to give ptrace the ability to temporarily pause/resume/set CET enablement and SSP for a stopped thread. Then injected code doesn't need to jump through any hoops or possibly run into road blocks. I'm not sure how much this opens things up if the thread has to be stopped... Cyrill, could it fit into the CRIU pause and resume flow? What action causes the final resuming of execution of the restored process for checkpointing and for restore? Wondering if we could somehow make CET re-enable exactly then. And I guess this also needs a way to create shadow stack allocations at a specific address to match where they were in the dumped process. That is missing in this series. > > > > So I think the only special capability that CRIU really needs is > > WRUSS, and > > we need to wire that up anyway. > > Thanks for these notes, Andy! I can't provide any sane answer here > since didn't > read tech spec for this feature yet :-)
On Wed, Feb 09, 2022 at 02:18:42AM +0000, Edgecombe, Rick P wrote: ... > > Still wrapping my head around the CRIU save and restore steps, but > another general approach might be to give ptrace the ability to > temporarily pause/resume/set CET enablement and SSP for a stopped > thread. Then injected code doesn't need to jump through any hoops or > possibly run into road blocks. I'm not sure how much this opens things > up if the thread has to be stopped... > > Cyrill, could it fit into the CRIU pause and resume flow? What action > causes the final resuming of execution of the restored process for > checkpointing and for restore? Wondering if we could somehow make CET > re-enable exactly then. > > And I guess this also needs a way to create shadow stack allocations at > a specific address to match where they were in the dumped process. That > is missing in this series. Thanks Rick! This sounds like an option. I need a couple of days to refresh my memory about criu internals. Let me CC a few current active criu developers (CC list is already big enough though:), maybe this will speedup the procedure.
Hi Rick, On Wed, Feb 09, 2022 at 02:18:42AM +0000, Edgecombe, Rick P wrote: > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote: > > On Tue, Feb 08, 2022 at 08:21:20AM -0800, Andy Lutomirski wrote: > > > > > But such a knob will immediately reduce the security value of > > > > > the entire > > > > > thing, and I don't have good ideas how to deal with it :( > > > > > > > > Probably a kind of latch in the task_struct which would trigger > > > > off once > > > > returt to a different address happened, thus we would be able to > > > > jump inside > > > > paratite code. Of course such trigger should be available under > > > > proper > > > > capability only. > > > > > > I'm not fully in touch with how parasite, etc works. Are we > > > talking about save or restore? > > > > We use parasite code in question during checkpoint phase as far as I > > remember. > > push addr/lret trick is used to run "injected" code (code injection > > itself is > > done via ptrace) in compat mode at least. Dima, Andrei, I didn't look > > into this code > > for years already, do we still need to support compat mode at all? > > > > > If it's restore, what exactly does CRIU need to do? Is it just > > > that CRIU needs to return > > > out from its resume code into the to-be-resumed program without > > > tripping CET? Would it > > > be acceptable for CRIU to require that at least one shstk slot be > > > free at save time? > > > Or do we need a mechanism to atomically switch to a completely full > > > shadow stack at resume? > > > > > > Off the top of my head, a sigreturn (or sigreturn-like mechanism) > > > that is intended for > > > use for altshadowstack could safely verify a token on the > > > altshdowstack, possibly > > > compare to something in ucontext (or not -- this isn't clearly > > > necessary) and switch > > > back to the previous stack. CRIU could use that too. Obviously > > > CRIU will need a way > > > to populate the relevant stacks, but WRUSS can be used for that, > > > and I think this > > > is a fundamental requirement for CRIU -- CRIU restore absolutely > > > needs a way to write > > > the saved shadow stack data into the shadow stack. > > Still wrapping my head around the CRIU save and restore steps, but > another general approach might be to give ptrace the ability to > temporarily pause/resume/set CET enablement and SSP for a stopped > thread. Then injected code doesn't need to jump through any hoops or > possibly run into road blocks. I'm not sure how much this opens things > up if the thread has to be stopped... IIRC, criu dump does something like this: * Stop the process being dumped (victim) with ptrace * Inject parasite code and data into the victim, again with ptrace. Among other things the parasite data contains a sigreturn frame with saved victim state. * Resume the victim process, which will run parasite code now. * When parasite finishes it uses that frame to sigreturn to normal victim execution So, my feeling is that for dump side WRUSS should be enough. > Cyrill, could it fit into the CRIU pause and resume flow? What action > causes the final resuming of execution of the restored process for > checkpointing and for restore? Wondering if we could somehow make CET > re-enable exactly then. > > And I guess this also needs a way to create shadow stack allocations at > a specific address to match where they were in the dumped process. That > is missing in this series. Yes, criu restore will need to recreate shadow stack mappings. Currently, we recreate the restored process (target) address space based on /proc/pid/maps and /proc/pid/smaps. CRIU preserves the virtual addresses and VMA flags. The relevant steps of restore process can be summarised as: * Clone() the target process tree * Recreate VMAs with the needed size and flags, but not necessarily at the correct place yet * Partially populate memory data from the saved images * Move VMAs to their exact addresses * Complete restoring the data * Create a frame for sigreturn and jump to the target. Here, the stack used after sigreturn contains the data that was captured during dump and it entirely different from what shadow stack will contain. There are several points when the target threads are stopped, so pausing/resuming CET may help. > > > So I think the only special capability that CRIU really needs is > > > WRUSS, and > > > we need to wire that up anyway. > > > > Thanks for these notes, Andy! I can't provide any sane answer here > > since didn't > > read tech spec for this feature yet :-) > > > > -- Sincerely yours, Mike.
On 2/8/22 18:18, Edgecombe, Rick P wrote: > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote: >> On Tue, Feb 08, 2022 at 08:21:20AM -0800, Andy Lutomirski wrote: >>>>> But such a knob will immediately reduce the security value of >>>>> the entire >>>>> thing, and I don't have good ideas how to deal with it :( >>>> >>>> Probably a kind of latch in the task_struct which would trigger >>>> off once >>>> returt to a different address happened, thus we would be able to >>>> jump inside >>>> paratite code. Of course such trigger should be available under >>>> proper >>>> capability only. >>> >>> I'm not fully in touch with how parasite, etc works. Are we >>> talking about save or restore? >> >> We use parasite code in question during checkpoint phase as far as I >> remember. >> push addr/lret trick is used to run "injected" code (code injection >> itself is >> done via ptrace) in compat mode at least. Dima, Andrei, I didn't look >> into this code >> for years already, do we still need to support compat mode at all? >> >>> If it's restore, what exactly does CRIU need to do? Is it just >>> that CRIU needs to return >>> out from its resume code into the to-be-resumed program without >>> tripping CET? Would it >>> be acceptable for CRIU to require that at least one shstk slot be >>> free at save time? >>> Or do we need a mechanism to atomically switch to a completely full >>> shadow stack at resume? >>> >>> Off the top of my head, a sigreturn (or sigreturn-like mechanism) >>> that is intended for >>> use for altshadowstack could safely verify a token on the >>> altshdowstack, possibly >>> compare to something in ucontext (or not -- this isn't clearly >>> necessary) and switch >>> back to the previous stack. CRIU could use that too. Obviously >>> CRIU will need a way >>> to populate the relevant stacks, but WRUSS can be used for that, >>> and I think this >>> is a fundamental requirement for CRIU -- CRIU restore absolutely >>> needs a way to write >>> the saved shadow stack data into the shadow stack. > > Still wrapping my head around the CRIU save and restore steps, but > another general approach might be to give ptrace the ability to > temporarily pause/resume/set CET enablement and SSP for a stopped > thread. Then injected code doesn't need to jump through any hoops or > possibly run into road blocks. I'm not sure how much this opens things > up if the thread has to be stopped... Hmm, that's maybe not insane. An alternative would be to add a bona fide ptrace call-a-function mechanism. I can think of two potentially usable variants: 1. Straight call. PTRACE_CALL_FUNCTION(addr) just emulates CALL addr, shadow stack push and all. 2. Signal-style. PTRACE_CALL_FUNCTION_SIGFRAME injects an actual signal frame just like a real signal is being delivered with the specified handler. There could be a variant to opt-in to also using a specified altstack and altshadowstack. 2 would be more expensive but would avoid the need for much in the way of asm magic. The injected code could be plain C (or Rust or Zig or whatever). All of this only really handles save, not restore. I don't understand restore enough to fully understand the issue. --Andy
On Wed, Feb 9, 2022 at 6:37 PM Andy Lutomirski <luto@kernel.org> wrote: > > On 2/8/22 18:18, Edgecombe, Rick P wrote: > > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote: > >> On Tue, Feb 08, 2022 at 08:21:20AM -0800, Andy Lutomirski wrote: > >>>>> But such a knob will immediately reduce the security value of > >>>>> the entire > >>>>> thing, and I don't have good ideas how to deal with it :( > >>>> > >>>> Probably a kind of latch in the task_struct which would trigger > >>>> off once > >>>> returt to a different address happened, thus we would be able to > >>>> jump inside > >>>> paratite code. Of course such trigger should be available under > >>>> proper > >>>> capability only. > >>> > >>> I'm not fully in touch with how parasite, etc works. Are we > >>> talking about save or restore? > >> > >> We use parasite code in question during checkpoint phase as far as I > >> remember. > >> push addr/lret trick is used to run "injected" code (code injection > >> itself is > >> done via ptrace) in compat mode at least. Dima, Andrei, I didn't look > >> into this code > >> for years already, do we still need to support compat mode at all? > >> > >>> If it's restore, what exactly does CRIU need to do? Is it just > >>> that CRIU needs to return > >>> out from its resume code into the to-be-resumed program without > >>> tripping CET? Would it > >>> be acceptable for CRIU to require that at least one shstk slot be > >>> free at save time? > >>> Or do we need a mechanism to atomically switch to a completely full > >>> shadow stack at resume? > >>> > >>> Off the top of my head, a sigreturn (or sigreturn-like mechanism) > >>> that is intended for > >>> use for altshadowstack could safely verify a token on the > >>> altshdowstack, possibly > >>> compare to something in ucontext (or not -- this isn't clearly > >>> necessary) and switch > >>> back to the previous stack. CRIU could use that too. Obviously > >>> CRIU will need a way > >>> to populate the relevant stacks, but WRUSS can be used for that, > >>> and I think this > >>> is a fundamental requirement for CRIU -- CRIU restore absolutely > >>> needs a way to write > >>> the saved shadow stack data into the shadow stack. > > > > Still wrapping my head around the CRIU save and restore steps, but > > another general approach might be to give ptrace the ability to > > temporarily pause/resume/set CET enablement and SSP for a stopped > > thread. Then injected code doesn't need to jump through any hoops or > > possibly run into road blocks. I'm not sure how much this opens things > > up if the thread has to be stopped... > > Hmm, that's maybe not insane. > > An alternative would be to add a bona fide ptrace call-a-function > mechanism. I can think of two potentially usable variants: > > 1. Straight call. PTRACE_CALL_FUNCTION(addr) just emulates CALL addr, > shadow stack push and all. > > 2. Signal-style. PTRACE_CALL_FUNCTION_SIGFRAME injects an actual signal > frame just like a real signal is being delivered with the specified > handler. There could be a variant to opt-in to also using a specified > altstack and altshadowstack. > > 2 would be more expensive but would avoid the need for much in the way > of asm magic. The injected code could be plain C (or Rust or Zig or > whatever). > > All of this only really handles save, not restore. I don't understand > restore enough to fully understand the issue. FWIW, CET enabled GDB can call a function in a CET enabled process. Adding Felix who may know more about it. -- H.J.
> -----Original Message----- > From: H.J. Lu <hjl.tools@gmail.com> > Sent: Donnerstag, 10. Februar 2022 03:54 > To: Lutomirski, Andy <luto@kernel.org>; Willgerodt, Felix > <felix.willgerodt@intel.com> > Cc: Edgecombe, Rick P <rick.p.edgecombe@intel.com>; gorcunov@gmail.com; > bsingharora@gmail.com; hpa@zytor.com; Syromiatnikov, Eugene > <esyr@redhat.com>; peterz@infradead.org; rdunlap@infradead.org; > keescook@chromium.org; 0x7f454c46@gmail.com; > dave.hansen@linux.intel.com; kirill.shutemov@linux.intel.com; Eranian, > Stephane <eranian@google.com>; linux-mm@kvack.org; adrian@lisas.de; > fweimer@redhat.com; nadav.amit@gmail.com; jannh@google.com; > avagin@gmail.com; linux-arch@vger.kernel.org; kcc@google.com; > bp@alien8.de; oleg@redhat.com; pavel@ucw.cz; linux-doc@vger.kernel.org; > arnd@arndb.de; Moreira, Joao <joao.moreira@intel.com>; tglx@linutronix.de; > mike.kravetz@oracle.com; x86@kernel.org; Yang, Weijiang > <weijiang.yang@intel.com>; rppt@kernel.org; Dave.Martin@arm.com; > john.allen@amd.com; mingo@redhat.com; Hansen, Dave > <dave.hansen@intel.com>; corbet@lwn.net; linux-kernel@vger.kernel.org; > linux-api@vger.kernel.org; Shankar, Ravi V <ravi.v.shankar@intel.com> > Subject: Re: [PATCH 00/35] Shadow stacks for userspace > > On Wed, Feb 9, 2022 at 6:37 PM Andy Lutomirski <luto@kernel.org> wrote: > > > > On 2/8/22 18:18, Edgecombe, Rick P wrote: > > > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote: > > >> On Tue, Feb 08, 2022 at 08:21:20AM -0800, Andy Lutomirski wrote: > > >>>>> But such a knob will immediately reduce the security value of > > >>>>> the entire > > >>>>> thing, and I don't have good ideas how to deal with it :( > > >>>> > > >>>> Probably a kind of latch in the task_struct which would trigger > > >>>> off once > > >>>> returt to a different address happened, thus we would be able to > > >>>> jump inside > > >>>> paratite code. Of course such trigger should be available under > > >>>> proper > > >>>> capability only. > > >>> > > >>> I'm not fully in touch with how parasite, etc works. Are we > > >>> talking about save or restore? > > >> > > >> We use parasite code in question during checkpoint phase as far as I > > >> remember. > > >> push addr/lret trick is used to run "injected" code (code injection > > >> itself is > > >> done via ptrace) in compat mode at least. Dima, Andrei, I didn't look > > >> into this code > > >> for years already, do we still need to support compat mode at all? > > >> > > >>> If it's restore, what exactly does CRIU need to do? Is it just > > >>> that CRIU needs to return > > >>> out from its resume code into the to-be-resumed program without > > >>> tripping CET? Would it > > >>> be acceptable for CRIU to require that at least one shstk slot be > > >>> free at save time? > > >>> Or do we need a mechanism to atomically switch to a completely full > > >>> shadow stack at resume? > > >>> > > >>> Off the top of my head, a sigreturn (or sigreturn-like mechanism) > > >>> that is intended for > > >>> use for altshadowstack could safely verify a token on the > > >>> altshdowstack, possibly > > >>> compare to something in ucontext (or not -- this isn't clearly > > >>> necessary) and switch > > >>> back to the previous stack. CRIU could use that too. Obviously > > >>> CRIU will need a way > > >>> to populate the relevant stacks, but WRUSS can be used for that, > > >>> and I think this > > >>> is a fundamental requirement for CRIU -- CRIU restore absolutely > > >>> needs a way to write > > >>> the saved shadow stack data into the shadow stack. > > > > > > Still wrapping my head around the CRIU save and restore steps, but > > > another general approach might be to give ptrace the ability to > > > temporarily pause/resume/set CET enablement and SSP for a stopped > > > thread. Then injected code doesn't need to jump through any hoops or > > > possibly run into road blocks. I'm not sure how much this opens things > > > up if the thread has to be stopped... > > > > Hmm, that's maybe not insane. > > > > An alternative would be to add a bona fide ptrace call-a-function > > mechanism. I can think of two potentially usable variants: > > > > 1. Straight call. PTRACE_CALL_FUNCTION(addr) just emulates CALL addr, > > shadow stack push and all. > > > > 2. Signal-style. PTRACE_CALL_FUNCTION_SIGFRAME injects an actual signal > > frame just like a real signal is being delivered with the specified > > handler. There could be a variant to opt-in to also using a specified > > altstack and altshadowstack. > > > > 2 would be more expensive but would avoid the need for much in the way > > of asm magic. The injected code could be plain C (or Rust or Zig or > > whatever). > > > > All of this only really handles save, not restore. I don't understand > > restore enough to fully understand the issue. > > FWIW, CET enabled GDB can call a function in a CET enabled process. > Adding Felix who may know more about it. > > > -- > H.J. I don't know much about CRIU or kernel code, so I will stick to explaining what our GDB patches for CET (not upstream yet) currently do. GDB does inferior calls by setting the PC to the function it wants to call and by manipulating the return address. It basically creates a dummy frame and runs that on top of where it currently is. To enable this for CET, our GDB CET patches push onto the shstk of the inferior by writing to the inferiors memory, if it isn't out of range, and by incrementing the SSP (using NT_X86_CET), both via ptrace. (GDB also has a command called 'return', which basically returns early from a function. Here GDB just decrements the SSP via ptrace.) This was based on earlier versions of the kernel patches. If the interface needs to change or if new interfaces would be available to do this better, that is fine from our pov. We didn't upstream this yet. Also, if you have any concerns with this approach please let me know. Regards, Felix Intel Deutschland GmbH Registered Address: Am Campeon 10, 85579 Neubiberg, Germany Tel: +49 89 99 8853-0, www.intel.de <http://www.intel.de> Managing Directors: Christin Eisenschmid, Sharon Heck, Tiffany Doon Silva Chairperson of the Supervisory Board: Nicole Lau Registered Office: Munich Commercial Register: Amtsgericht Muenchen HRB 186928
On Wed, Feb 09, 2022 at 06:37:53PM -0800, Andy Lutomirski wrote: > On 2/8/22 18:18, Edgecombe, Rick P wrote: > > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote: > > > On Tue, Feb 08, 2022 at 08:21:20AM -0800, Andy Lutomirski wrote: > > > > > > But such a knob will immediately reduce the security value of > > > > > > the entire > > > > > > thing, and I don't have good ideas how to deal with it :( > > > > > > > > > > Probably a kind of latch in the task_struct which would trigger > > > > > off once > > > > > returt to a different address happened, thus we would be able to > > > > > jump inside > > > > > paratite code. Of course such trigger should be available under > > > > > proper > > > > > capability only. > > > > > > > > I'm not fully in touch with how parasite, etc works. Are we > > > > talking about save or restore? > > > > > > We use parasite code in question during checkpoint phase as far as I > > > remember. > > > push addr/lret trick is used to run "injected" code (code injection > > > itself is > > > done via ptrace) in compat mode at least. Dima, Andrei, I didn't look > > > into this code > > > for years already, do we still need to support compat mode at all? > > > > > > > If it's restore, what exactly does CRIU need to do? Is it just > > > > that CRIU needs to return > > > > out from its resume code into the to-be-resumed program without > > > > tripping CET? Would it > > > > be acceptable for CRIU to require that at least one shstk slot be > > > > free at save time? > > > > Or do we need a mechanism to atomically switch to a completely full > > > > shadow stack at resume? > > > > > > > > Off the top of my head, a sigreturn (or sigreturn-like mechanism) > > > > that is intended for > > > > use for altshadowstack could safely verify a token on the > > > > altshdowstack, possibly > > > > compare to something in ucontext (or not -- this isn't clearly > > > > necessary) and switch > > > > back to the previous stack. CRIU could use that too. Obviously > > > > CRIU will need a way > > > > to populate the relevant stacks, but WRUSS can be used for that, > > > > and I think this > > > > is a fundamental requirement for CRIU -- CRIU restore absolutely > > > > needs a way to write > > > > the saved shadow stack data into the shadow stack. > > > > Still wrapping my head around the CRIU save and restore steps, but > > another general approach might be to give ptrace the ability to > > temporarily pause/resume/set CET enablement and SSP for a stopped > > thread. Then injected code doesn't need to jump through any hoops or > > possibly run into road blocks. I'm not sure how much this opens things > > up if the thread has to be stopped... > > Hmm, that's maybe not insane. > > An alternative would be to add a bona fide ptrace call-a-function mechanism. > I can think of two potentially usable variants: > > 1. Straight call. PTRACE_CALL_FUNCTION(addr) just emulates CALL addr, > shadow stack push and all. > > 2. Signal-style. PTRACE_CALL_FUNCTION_SIGFRAME injects an actual signal > frame just like a real signal is being delivered with the specified handler. > There could be a variant to opt-in to also using a specified altstack and > altshadowstack. I think this would be ideal. In CRIU, the parasite code is executed in the "daemon" mode and returns back via sigreturn. Right now, CRIU needs to generate a signal frame. If I understand your idea right, the signal frame will be generated by the kernel. > > 2 would be more expensive but would avoid the need for much in the way of > asm magic. The injected code could be plain C (or Rust or Zig or whatever). > > All of this only really handles save, not restore. I don't understand > restore enough to fully understand the issue. In a few words, it works like this: CRIU restores all required resources and prepares a signal frame with a target process state, then it switches to a small PIE blob, where it restores vma-s and calls rt_sigreturn. > > --Andy
On Thu, Feb 10, 2022 at 11:41:16PM -0800, avagin@gmail.com wrote: > On Wed, Feb 09, 2022 at 06:37:53PM -0800, Andy Lutomirski wrote: > > > > An alternative would be to add a bona fide ptrace call-a-function mechanism. > > I can think of two potentially usable variants: > > > > 1. Straight call. PTRACE_CALL_FUNCTION(addr) just emulates CALL addr, > > shadow stack push and all. > > > > 2. Signal-style. PTRACE_CALL_FUNCTION_SIGFRAME injects an actual signal > > frame just like a real signal is being delivered with the specified handler. > > There could be a variant to opt-in to also using a specified altstack and > > altshadowstack. > > I think this would be ideal. In CRIU, the parasite code is executed in > the "daemon" mode and returns back via sigreturn. Right now, CRIU needs > to generate a signal frame. If I understand your idea right, the signal > frame will be generated by the kernel. > > > > > 2 would be more expensive but would avoid the need for much in the way of > > asm magic. The injected code could be plain C (or Rust or Zig or whatever). > > > > All of this only really handles save, not restore. I don't understand > > restore enough to fully understand the issue. > > In a few words, it works like this: CRIU restores all required resources > and prepares a signal frame with a target process state, then it > switches to a small PIE blob, where it restores vma-s and calls > rt_sigreturn. I think it's also important to note that the stack is restored as a part of the process memory, i.e. its contents is read from the images. > > > > --Andy -- Sincerely yours, Mike.
On Wed, Feb 09, 2022 at 06:37:53PM -0800, Andy Lutomirski wrote: > On 2/8/22 18:18, Edgecombe, Rick P wrote: > > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote: > > > > Still wrapping my head around the CRIU save and restore steps, but > > another general approach might be to give ptrace the ability to > > temporarily pause/resume/set CET enablement and SSP for a stopped > > thread. Then injected code doesn't need to jump through any hoops or > > possibly run into road blocks. I'm not sure how much this opens things > > up if the thread has to be stopped... > > Hmm, that's maybe not insane. > > An alternative would be to add a bona fide ptrace call-a-function mechanism. > I can think of two potentially usable variants: > > 1. Straight call. PTRACE_CALL_FUNCTION(addr) just emulates CALL addr, > shadow stack push and all. > > 2. Signal-style. PTRACE_CALL_FUNCTION_SIGFRAME injects an actual signal > frame just like a real signal is being delivered with the specified handler. > There could be a variant to opt-in to also using a specified altstack and > altshadowstack. Using ptrace() will not solve CRIU's issue with sigreturn because sigreturn is called from the victim context rather than from the criu process that controls the dump and uses ptrace(). Even with the current shadow stack interface Rick proposed, CRIU can restore the victim using ptrace without any additional knobs, but we loose an important ability to "self-cure" the victim from the parasite in case anything goes wrong with criu control process. Moreover, the issue with backward compatibility is not with ptrace but with sigreturn and it seems that criu is not its only user. So I think we need a way to allow direct calls to sigreturn that will bypass check and restore of the shadow stack. I only know that there are sigreturn users except criu that show up in Debian codesearch, and I don't know how do they use it, but for full backward compatibility we'd need to have no-CET sigreturn as default and add a new, say UC_CHECK_SHSTK flag to rt_sigframe->uc.uc_flags or even a new syscall for libc signal handling. > 2 would be more expensive but would avoid the need for much in the way of > asm magic. The injected code could be plain C (or Rust or Zig or whatever). > > All of this only really handles save, not restore. I don't understand > restore enough to fully understand the issue. Restore is more complex, will get to it later. > --Andy -- Sincerely yours, Mike.
On Mon, Feb 28, 2022, at 12:27 PM, Mike Rapoport wrote: > On Wed, Feb 09, 2022 at 06:37:53PM -0800, Andy Lutomirski wrote: >> On 2/8/22 18:18, Edgecombe, Rick P wrote: >> > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote: >> > >> > Still wrapping my head around the CRIU save and restore steps, but >> > another general approach might be to give ptrace the ability to >> > temporarily pause/resume/set CET enablement and SSP for a stopped >> > thread. Then injected code doesn't need to jump through any hoops or >> > possibly run into road blocks. I'm not sure how much this opens things >> > up if the thread has to be stopped... >> >> Hmm, that's maybe not insane. >> >> An alternative would be to add a bona fide ptrace call-a-function mechanism. >> I can think of two potentially usable variants: >> >> 1. Straight call. PTRACE_CALL_FUNCTION(addr) just emulates CALL addr, >> shadow stack push and all. >> >> 2. Signal-style. PTRACE_CALL_FUNCTION_SIGFRAME injects an actual signal >> frame just like a real signal is being delivered with the specified handler. >> There could be a variant to opt-in to also using a specified altstack and >> altshadowstack. > > Using ptrace() will not solve CRIU's issue with sigreturn because sigreturn > is called from the victim context rather than from the criu process that > controls the dump and uses ptrace(). I'm not sure I follow. > > Even with the current shadow stack interface Rick proposed, CRIU can restore > the victim using ptrace without any additional knobs, but we loose an > important ability to "self-cure" the victim from the parasite in case > anything goes wrong with criu control process. > > Moreover, the issue with backward compatibility is not with ptrace but with > sigreturn and it seems that criu is not its only user. So we need an ability for a tracer to cause the tracee to call a function and to return successfully. Apparently a gdb branch can already do this with shstk, and my PTRACE_CALL_FUNCTION_SIGFRAME should also do the trick. I don't see why we need a sigretur-but-dont-verify -- we just need this mechanism to create a frame such that sigreturn actually works. --Andy
On Mon, Feb 28, 2022 at 12:30:41PM -0800, Andy Lutomirski wrote: > > > On Mon, Feb 28, 2022, at 12:27 PM, Mike Rapoport wrote: > > On Wed, Feb 09, 2022 at 06:37:53PM -0800, Andy Lutomirski wrote: > >> On 2/8/22 18:18, Edgecombe, Rick P wrote: > >> > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote: > >> > > > > > Even with the current shadow stack interface Rick proposed, CRIU can restore > > the victim using ptrace without any additional knobs, but we loose an > > important ability to "self-cure" the victim from the parasite in case > > anything goes wrong with criu control process. > > > > Moreover, the issue with backward compatibility is not with ptrace but with > > sigreturn and it seems that criu is not its only user. > > So we need an ability for a tracer to cause the tracee to call a function > and to return successfully. Apparently a gdb branch can already do this > with shstk, and my PTRACE_CALL_FUNCTION_SIGFRAME should also do the > trick. I don't see why we need a sigretur-but-dont-verify -- we just > need this mechanism to create a frame such that sigreturn actually works. If I understand correctly, PTRACE_CALL_FUNCTION_SIGFRAME() injects a frame into the tracee and makes the tracee call sigreturn. I.e. the tracee is stopped and this is used pretty much as PTRACE_CONT or PTRACE_SYSCALL. In such case this defeats the purpose of sigreturn in CRIU because it is called asynchronously by the tracee when the tracer is about to detach or even already detached. For synchronous use-case PTRACE_SETREGSET will be enough, the rest of the sigframe can be restored by other means. And with 'criu restore' there may be even no tracer by the time sigreturn is called. > --Andy -- Sincerely yours, Mike.
On Mon, Feb 28, 2022, at 1:30 PM, Mike Rapoport wrote: > On Mon, Feb 28, 2022 at 12:30:41PM -0800, Andy Lutomirski wrote: >> >> >> On Mon, Feb 28, 2022, at 12:27 PM, Mike Rapoport wrote: >> > On Wed, Feb 09, 2022 at 06:37:53PM -0800, Andy Lutomirski wrote: >> >> On 2/8/22 18:18, Edgecombe, Rick P wrote: >> >> > On Tue, 2022-02-08 at 20:02 +0300, Cyrill Gorcunov wrote: >> >> > >> > >> > Even with the current shadow stack interface Rick proposed, CRIU can restore >> > the victim using ptrace without any additional knobs, but we loose an >> > important ability to "self-cure" the victim from the parasite in case >> > anything goes wrong with criu control process. >> > >> > Moreover, the issue with backward compatibility is not with ptrace but with >> > sigreturn and it seems that criu is not its only user. >> >> So we need an ability for a tracer to cause the tracee to call a function >> and to return successfully. Apparently a gdb branch can already do this >> with shstk, and my PTRACE_CALL_FUNCTION_SIGFRAME should also do the >> trick. I don't see why we need a sigretur-but-dont-verify -- we just >> need this mechanism to create a frame such that sigreturn actually works. > > If I understand correctly, PTRACE_CALL_FUNCTION_SIGFRAME() injects a frame > into the tracee and makes the tracee call sigreturn. > I.e. the tracee is stopped and this is used pretty much as PTRACE_CONT or > PTRACE_SYSCALL. > > In such case this defeats the purpose of sigreturn in CRIU because it is > called asynchronously by the tracee when the tracer is about to detach or > even already detached. The intent of PTRACE_CALL_FUNCTION_SIGFRAME is push a signal frame onto the stack and call a function. That function should then be able to call sigreturn just like any normal signal handler. There should be no requirement that the tracer still be attached when this happens, although the code calling sigreturn still needs to be mapped. (Specifically, on modern arches, the user runtime is expected to provide a "restorer" that calls sigreturn. A hypotheticall PTRACE_CALL_FUNCTION_SIGFRAME would either need to call sigreturn directly or provide a restorer.)
© 2016 - 2025 Red Hat, Inc.