[PATCH] rseq: don't promote transient TLS faults to SIGSEGV
Posted by Yuanhe Shu an hour ago

On return to user space the rseq slow path writes the new cpu_id /
mm_cid into the user-space rseq TLS. rseq_update_usr() already
classifies its failures in rseq_event::fatal: the flag is set only
when corrupt user data is positively identified (e.g. a bad rseq_cs
signature or an out-of-bounds abort IP) and stays clear when the
access merely hit an unresolved page fault.

rseq_slowpath_update_usr() ignores that and calls force_sig(SIGSEGV)
on any failure, so a transient page fault on a still-registered rseq
area becomes a fatal SIGSEGV. This is reachable now that glibc >= 2.35
registers rseq for every thread by default: a memcg OOM victim can die
of SIGSEGV (si_code=SI_KERNEL, si_addr=NULL) shortly after fork,
before returning to user space, because the CoW of the inherited TLS
page cannot be charged to the OOM-locked memcg and the rseq write
faults.

With oom_score_adj=-1000 the OOM killer finds no killable task, so the
rseq SIGSEGV is the sole outcome. Otherwise the rseq SIGSEGV can be
delivered before the OOM killer queues SIGKILL, and the process exits
with status 139 (128 + SIGSEGV) instead of 137 (128 + SIGKILL),
breaking OOMKilled detection in container runtimes. LTP mm/oom03 and
mm/oom05 reproduce it on v7.1-rc6+, and a strace A/B with
glibc.pthread.rseq as the sole variable shows the SIGSEGV only when
rseq is registered.

Only raise SIGSEGV when rseq_event::fatal is set. A non-fatal fault
leaves the cached IDs untouched and is retried on a later return to
user space; a genuinely unmapped area keeps faulting and user space
takes SIGSEGV through its own access. All corruption and ROP-hardening
checks keep their SIGSEGV.

Signal delivery is left untouched: it must abort the interrupted
critical section before the handler runs and therefore cannot safely
defer a fault.

Signed-off-by: Yuanhe Shu <xiangzao@linux.alibaba.com>
---
Tested on v7.1-rc6+ (vanilla):
 - LTP mm/oom03 (14/14) and mm/oom05 (8/8): pass with the patch (the
   victim is reaped with SIGKILL); without it the rseq SIGSEGV makes
   the same cases fail.
 - strace A/B on the oom03 binary with glibc.pthread.rseq as the sole
   variable: 2 SIGSEGV (SI_KERNEL, si_addr=NULL) with rseq registered,
   0 without -- isolates the cause to the rseq slow path.
 - tools/testing/selftests/rseq: run_param_test.sh,
   run_syscall_errors_test.sh, run_legacy_check.sh and
   run_timeslice_test.sh all pass.

 kernel/rseq.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index e75e3a5e312c..38a19cef4ad0 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -302,11 +302,18 @@ static void rseq_slowpath_update_usr(struct pt_regs *regs)
 
 	if (unlikely(!rseq_update_usr(t, regs, &ids))) {
 		/*
-		 * Clear the errors just in case this might survive magically, but
-		 * leave the rest intact.
+		 * rseq_update_usr() sets rseq_event::fatal only on corrupt
+		 * user data, which keeps its SIGSEGV. A clear fatal bit is an
+		 * unresolved page fault on a still-registered rseq area (e.g.
+		 * a CoW that cannot be charged to an OOM-locked memcg): that
+		 * is transient, so leave the cached IDs untouched and retry on
+		 * a later return to user instead of killing the task.
 		 */
+		bool fatal = t->rseq.event.fatal;
+
 		t->rseq.event.error = 0;
-		force_sig(SIGSEGV);
+		if (fatal)
+			force_sig(SIGSEGV);
 	}
 }
 
-- 
2.39.5 (Apple Git-154)