On return to user space, the rseq slow path writes the new cpu_id and
mm_cid into the user-space rseq TLS area. rseq_update_usr() already
classifies its failures in rseq_event::fatal: the flag is set only
when corrupt user data is positively identified (e.g. a bad rseq_cs
signature or an out-of-bounds abort IP) and stays clear when the
access merely hit an unresolved page fault.

rseq_slowpath_update_usr() ignores that distinction and calls
force_sig(SIGSEGV) on any failure, so a transient page fault on a
still-registered rseq area becomes a fatal SIGSEGV. This is reachable
in practice because glibc >= 2.35 registers rseq for every thread by
default: a memcg OOM victim can die of SIGSEGV (si_code=SI_KERNEL,
si_addr=NULL) shortly after fork, before returning to user space,
because the CoW of the inherited TLS page cannot be charged to the
OOM-locked memcg and the rseq write faults.

With oom_score_adj=-1000 the OOM killer finds no killable task, so the
rseq SIGSEGV is the sole outcome. Otherwise the rseq SIGSEGV can be
delivered before the OOM killer queues SIGKILL, and the process exits
with status 139 (128 + SIGSEGV) instead of 137 (128 + SIGKILL),
breaking OOMKilled detection in container runtimes. LTP mm/oom03 and
mm/oom05 reproduce it on v7.1-rc6+, and a strace A/B comparison with
glibc.pthread.rseq as the sole variable shows the SIGSEGV only when
rseq is registered.

Only raise SIGSEGV when rseq_event::fatal is set. A non-fatal fault
leaves the cached IDs untouched and is retried on a later return to
user space; a genuinely unmapped area keeps faulting and user space
takes SIGSEGV through its own access. All corruption and ROP-hardening
checks keep their SIGSEGV.

Signal delivery is left untouched: it must abort the interrupted
critical section before the handler runs and therefore cannot safely
defer a fault.

Signed-off-by: Yuanhe Shu <xiangzao@linux.alibaba.com>
---
Tested on v7.1-rc6+ (vanilla):

 - LTP mm/oom03 (14/14) and mm/oom05 (8/8): pass with the patch (the
   victim is reaped with SIGKILL); without it the rseq SIGSEGV makes
   the same cases fail.
 - strace A/B on the oom03 binary with glibc.pthread.rseq as the sole
   variable: 2 SIGSEGV (SI_KERNEL, si_addr=NULL) with rseq registered,
   0 without, which isolates the cause to the rseq slow path.
 - tools/testing/selftests/rseq: run_param_test.sh,
   run_syscall_errors_test.sh, run_legacy_check.sh and
   run_timeslice_test.sh all pass.
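
For reference, the 139 vs 137 distinction that container runtimes see
can be sketched in user space as follows. This is only an illustration,
not part of the patch: shell_exit_code(), looks_oom_killed() and
wait_status_after_signal() are hypothetical helper names, and treating
"killed by SIGKILL" as the OOMKilled heuristic is an assumption about
how a supervisor might classify the exit.

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Shell-style exit code for a wait(2) status: 128 + signal number when
 * the child was killed by a signal, the plain exit code when it exited
 * normally, -1 otherwise (e.g. stopped).
 */
static int shell_exit_code(int wstatus)
{
	if (WIFSIGNALED(wstatus))
		return 128 + WTERMSIG(wstatus);
	if (WIFEXITED(wstatus))
		return WEXITSTATUS(wstatus);
	return -1;
}

/* Assumed heuristic: only a SIGKILL death is classified as OOM-killed. */
static bool looks_oom_killed(int wstatus)
{
	return WIFSIGNALED(wstatus) && WTERMSIG(wstatus) == SIGKILL;
}

/*
 * Fork a child that dies from 'sig' and return its wait status,
 * or -1 if fork() or waitpid() failed.
 */
static int wait_status_after_signal(int sig)
{
	int wstatus;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		signal(sig, SIG_DFL);	/* no effect for SIGKILL, harmless */
		raise(sig);
		_exit(0);		/* only reached if the signal did not kill us */
	}
	if (waitpid(pid, &wstatus, 0) != pid)
		return -1;
	return wstatus;
}
```

With these helpers a child killed by SIGSEGV maps to 139 and is not
classified as OOM-killed, while a child killed by SIGKILL maps to 137
and is.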
kernel/rseq.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
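
The policy implemented by the diff can be modeled outside the kernel
roughly as below. This is a simplified sketch under stated assumptions,
not kernel code: struct event_model, enum slowpath_action and
slowpath_policy() are hypothetical names, and the model assumes that
clearing the error state also clears the fatal bit, which is why the
bit is sampled first.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the parts of rseq_event that matter here. */
struct event_model {
	bool fatal;	/* corrupt user data was positively identified */
	bool error;	/* some failure was recorded */
};

enum slowpath_action {
	ACTION_NONE,	/* update succeeded, nothing to do */
	ACTION_RETRY,	/* transient fault: keep cached IDs, retry later */
	ACTION_SIGSEGV,	/* corrupt user data: kill the task */
};

/*
 * Decide what to do after an attempted update of the user-space rseq
 * area. 'update_ok' models the return value of the update helper.
 */
static enum slowpath_action slowpath_policy(bool update_ok,
					    struct event_model *ev)
{
	bool fatal;

	if (update_ok)
		return ACTION_NONE;

	/* Sample the fatal bit before the error state is cleared. */
	fatal = ev->fatal;

	/* Assumption: clearing the error state clears 'fatal' as well. */
	ev->fatal = false;
	ev->error = false;

	return fatal ? ACTION_SIGSEGV : ACTION_RETRY;
}
```

Only the failure-with-fatal case ends in SIGSEGV; a failure without the
fatal bit is left for a later return to user space.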
diff --git a/kernel/rseq.c b/kernel/rseq.c
index e75e3a5e312c..38a19cef4ad0 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -302,11 +302,18 @@ static void rseq_slowpath_update_usr(struct pt_regs *regs)
 	if (unlikely(!rseq_update_usr(t, regs, &ids))) {
 		/*
-		 * Clear the errors just in case this might survive magically, but
-		 * leave the rest intact.
+		 * rseq_update_usr() sets rseq_event::fatal only on corrupt
+		 * user data, which keeps its SIGSEGV. A clear fatal bit is an
+		 * unresolved page fault on a still-registered rseq area (e.g.
+		 * a CoW that cannot be charged to an OOM-locked memcg): that
+		 * is transient, so leave the cached IDs untouched and retry on
+		 * a later return to user instead of killing the task.
 		 */
+		bool fatal = t->rseq.event.fatal;
+
 		t->rseq.event.error = 0;
-		force_sig(SIGSEGV);
+		if (fatal)
+			force_sig(SIGSEGV);
 	}
 }
--
2.39.5 (Apple Git-154)