When testing kernel live patching with "modprobe livepatch-sample",
there is a timeout of over 15 seconds from "starting patching transition"
to "patching complete", and dmesg shows "unreliable stack" for user tasks
in debug mode. A similar issue occurs when executing
"rmmod livepatch-sample".

Like x86, arch_stack_walk_reliable() should return 0 for user tasks.
It is necessary to set regs->csr_prmd to task->thread.csr_prmd first,
then use user_mode() to check whether the task is in userspace.
Here is the call chain:

  klp_enable_patch()
    klp_try_complete_transition()
      klp_try_switch_task()
        klp_check_and_switch_task()
          klp_check_stack()
            stack_trace_save_tsk_reliable()
              arch_stack_walk_reliable()
With this patch, patching and unpatching each complete in about one second.
Before:
# modprobe livepatch-sample
# dmesg -T | tail -3
[Sat Sep 6 11:00:20 2025] livepatch: 'livepatch_sample': starting patching transition
[Sat Sep 6 11:00:35 2025] livepatch: signaling remaining tasks
[Sat Sep 6 11:00:36 2025] livepatch: 'livepatch_sample': patching complete
# echo 0 > /sys/kernel/livepatch/livepatch_sample/enabled
# rmmod livepatch_sample
rmmod: ERROR: Module livepatch_sample is in use
# rmmod livepatch_sample
# dmesg -T | tail -3
[Sat Sep 6 11:06:05 2025] livepatch: 'livepatch_sample': starting unpatching transition
[Sat Sep 6 11:06:20 2025] livepatch: signaling remaining tasks
[Sat Sep 6 11:06:21 2025] livepatch: 'livepatch_sample': unpatching complete
After:
# modprobe livepatch-sample
# dmesg -T | tail -2
[Sat Sep 6 11:19:00 2025] livepatch: 'livepatch_sample': starting patching transition
[Sat Sep 6 11:19:01 2025] livepatch: 'livepatch_sample': patching complete
# echo 0 > /sys/kernel/livepatch/livepatch_sample/enabled
# rmmod livepatch_sample
# dmesg -T | tail -2
[Sat Sep 6 11:21:10 2025] livepatch: 'livepatch_sample': starting unpatching transition
[Sat Sep 6 11:21:11 2025] livepatch: 'livepatch_sample': unpatching complete
While at it, do the same for arch_stack_walk() to avoid
potential issues.
Cc: stable@vger.kernel.org # v6.9+
Fixes: 199cc14cb4f1 ("LoongArch: Add kernel livepatching support")
Reported-by: Xi Zhang <zhangxi@kylinos.cn>
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
---
arch/loongarch/kernel/stacktrace.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/loongarch/kernel/stacktrace.c b/arch/loongarch/kernel/stacktrace.c
index 9a038d1070d7..0454cce3b667 100644
--- a/arch/loongarch/kernel/stacktrace.c
+++ b/arch/loongarch/kernel/stacktrace.c
@@ -30,10 +30,15 @@ void arch_stack_walk(stack_trace_consume_fn consume_entry, void *cookie,
 		}
 		regs->regs[1] = 0;
 		regs->regs[22] = 0;
+		regs->csr_prmd = task->thread.csr_prmd;
 	}

 	for (unwind_start(&state, task, regs);
 	     !unwind_done(&state); unwind_next_frame(&state)) {
+		/* Success path for user tasks */
+		if (user_mode(regs))
+			return;
+
 		addr = unwind_get_return_address(&state);
 		if (!addr || !consume_entry(cookie, addr))
 			break;
@@ -57,9 +62,14 @@ int arch_stack_walk_reliable(stack_trace_consume_fn consume_entry,
 	}
 	regs->regs[1] = 0;
 	regs->regs[22] = 0;
+	regs->csr_prmd = task->thread.csr_prmd;

 	for (unwind_start(&state, task, regs);
 	     !unwind_done(&state) && !unwind_error(&state); unwind_next_frame(&state)) {
+		/* Success path for user tasks */
+		if (user_mode(regs))
+			return 0;
+
 		addr = unwind_get_return_address(&state);

 		/*
--
2.42.0
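
For readers following the patch, the check it adds can be modelled outside the kernel roughly as below. This is only a sketch under stated assumptions: the mock_* structures and MOCK_PLV_* constants are hypothetical stand-ins introduced here for illustration, and the encoding (privilege level in the low two bits of PRMD, with 3 meaning user mode) is assumed to match what user_mode() tests on LoongArch. It shows why the saved csr_prmd has to be copied into the dummy regs before the check can give a meaningful answer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed encoding: privilege level in the low two bits, 3 = user mode. */
#define MOCK_PLV_USER 0x3u
#define MOCK_PLV_MASK 0x3u

/* Hypothetical, trimmed-down stand-ins for pt_regs and task_struct. */
struct mock_pt_regs {
	uint64_t csr_prmd;
};

struct mock_task {
	struct {
		uint64_t csr_prmd;
	} thread;
};

/* Same shape as a user_mode() test: compare the saved privilege level. */
static bool mock_user_mode(const struct mock_pt_regs *regs)
{
	return (regs->csr_prmd & MOCK_PLV_MASK) == MOCK_PLV_USER;
}

/*
 * Returns 0 when the task is in user mode (nothing to unwind, treated as
 * reliable) and 1 when a real unwind would still be needed.
 */
static int mock_reliable_precheck(const struct mock_task *task)
{
	struct mock_pt_regs regs;

	assert(task != NULL);
	/* Without this copy, regs.csr_prmd would be uninitialized stack data. */
	regs.csr_prmd = task->thread.csr_prmd;

	return mock_user_mode(&regs) ? 0 : 1;
}
```

With a saved PRMD whose low two bits are 3, the precheck reports 0, which corresponds to the "success path for user tasks" in the diff; any other privilege level falls through to the normal unwind.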
Hi,

On Tue, 9 Sep 2025, Tiezhu Yang wrote:

> When testing kernel live patching with "modprobe livepatch-sample",
> there is a timeout of over 15 seconds from "starting patching transition"
> to "patching complete", and dmesg shows "unreliable stack" for user tasks
> in debug mode. A similar issue occurs when executing
> "rmmod livepatch-sample".
>
> Like x86, arch_stack_walk_reliable() should return 0 for user tasks.
> It is necessary to set regs->csr_prmd to task->thread.csr_prmd first,
> then use user_mode() to check whether the task is in userspace.

it is a nice optimization for sure, but "unreliable stack" messages point
to the fact that the unwinding of these tasks is probably suboptimal and
could be improved, no?

It would also be nice to include these messages (not for all tasks) in the
commit message.

Regards
Miroslav
On 2025/9/11 9:44 PM, Miroslav Benes wrote:
> Hi,
>
> On Tue, 9 Sep 2025, Tiezhu Yang wrote:
>
>> When testing kernel live patching with "modprobe livepatch-sample",
>> there is a timeout of over 15 seconds from "starting patching transition"
>> to "patching complete", and dmesg shows "unreliable stack" for user tasks
>> in debug mode. A similar issue occurs when executing
>> "rmmod livepatch-sample".
>>
>> Like x86, arch_stack_walk_reliable() should return 0 for user tasks.
>> It is necessary to set regs->csr_prmd to task->thread.csr_prmd first,
>> then use user_mode() to check whether the task is in userspace.
>
> it is a nice optimization for sure, but "unreliable stack" messages point
> to the fact that the unwinding of these tasks is probably suboptimal and
> could be improved, no?

Yes, makes sense, I will fix "unreliable stack" in the next version.

> It would also be nice to include these messages (not for all tasks) in the
> commit message.

OK, will do it.

Thanks,
Tiezhu
On 2025-09-09 19:31, Tiezhu Yang wrote:

> When testing kernel live patching with "modprobe livepatch-sample",
> there is a timeout of over 15 seconds from "starting patching transition"
> to "patching complete", and dmesg shows "unreliable stack" for user tasks
> in debug mode. A similar issue occurs when executing
> "rmmod livepatch-sample".
>
> Like x86, arch_stack_walk_reliable() should return 0 for user tasks.
> It is necessary to set regs->csr_prmd to task->thread.csr_prmd first,
> then use user_mode() to check whether the task is in userspace.
>
> Here is the call chain:
>
>   klp_enable_patch()
>     klp_try_complete_transition()
>       klp_try_switch_task()
>         klp_check_and_switch_task()
>           klp_check_stack()
>             stack_trace_save_tsk_reliable()
>               arch_stack_walk_reliable()
>
> With this patch, patching and unpatching each complete in about one second.
>
> Before:
>
> # modprobe livepatch-sample
> # dmesg -T | tail -3
> [Sat Sep 6 11:00:20 2025] livepatch: 'livepatch_sample': starting patching transition
> [Sat Sep 6 11:00:35 2025] livepatch: signaling remaining tasks
> [Sat Sep 6 11:00:36 2025] livepatch: 'livepatch_sample': patching complete
>
> # echo 0 > /sys/kernel/livepatch/livepatch_sample/enabled
> # rmmod livepatch_sample
> rmmod: ERROR: Module livepatch_sample is in use
> # rmmod livepatch_sample
> # dmesg -T | tail -3
> [Sat Sep 6 11:06:05 2025] livepatch: 'livepatch_sample': starting unpatching transition
> [Sat Sep 6 11:06:20 2025] livepatch: signaling remaining tasks
> [Sat Sep 6 11:06:21 2025] livepatch: 'livepatch_sample': unpatching complete
>
> After:
>
> # modprobe livepatch-sample
> # dmesg -T | tail -2
> [Sat Sep 6 11:19:00 2025] livepatch: 'livepatch_sample': starting patching transition
> [Sat Sep 6 11:19:01 2025] livepatch: 'livepatch_sample': patching complete
>
> # echo 0 > /sys/kernel/livepatch/livepatch_sample/enabled
> # rmmod livepatch_sample
> # dmesg -T | tail -2
> [Sat Sep 6 11:21:10 2025] livepatch: 'livepatch_sample': starting unpatching transition
> [Sat Sep 6 11:21:11 2025] livepatch: 'livepatch_sample': unpatching complete
>
> While at it, do the same for arch_stack_walk() to avoid
> potential issues.
>
> Cc: stable@vger.kernel.org # v6.9+
> Fixes: 199cc14cb4f1 ("LoongArch: Add kernel livepatching support")
> Reported-by: Xi Zhang <zhangxi@kylinos.cn>
> Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
> ---
>  arch/loongarch/kernel/stacktrace.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/arch/loongarch/kernel/stacktrace.c b/arch/loongarch/kernel/stacktrace.c
> index 9a038d1070d7..0454cce3b667 100644
> --- a/arch/loongarch/kernel/stacktrace.c
> +++ b/arch/loongarch/kernel/stacktrace.c
> @@ -30,10 +30,15 @@ void arch_stack_walk(stack_trace_consume_fn consume_entry, void *cookie,
>  		}
>  		regs->regs[1] = 0;
>  		regs->regs[22] = 0;
> +		regs->csr_prmd = task->thread.csr_prmd;
>  	}
>
>  	for (unwind_start(&state, task, regs);
>  	     !unwind_done(&state); unwind_next_frame(&state)) {
> +		/* Success path for user tasks */
> +		if (user_mode(regs))
> +			return;
> +
>  		addr = unwind_get_return_address(&state);
>  		if (!addr || !consume_entry(cookie, addr))
>  			break;
> @@ -57,9 +62,14 @@ int arch_stack_walk_reliable(stack_trace_consume_fn consume_entry,
>  	}
>  	regs->regs[1] = 0;
>  	regs->regs[22] = 0;
> +	regs->csr_prmd = task->thread.csr_prmd;
>
>  	for (unwind_start(&state, task, regs);
>  	     !unwind_done(&state) && !unwind_error(&state); unwind_next_frame(&state)) {
> +		/* Success path for user tasks */
> +		if (user_mode(regs))
> +			return 0;
> +
>  		addr = unwind_get_return_address(&state);
>
>  		/*

Hi, Tiezhu,

We update the stack info via get_stack_info() when we meet ORC_TYPE_REGS in
unwind_next_frame(). And in arch_stack_walk(_reliable), we always
call unwind_done() before unwind_next_frame(). So is there an
error in get_stack_info() that causes regs to be user_mode while
the stack is not STACK_TYPE_UNKNOWN?
On 2025/9/10 9:11 AM, Jinyang He wrote:
> On 2025-09-09 19:31, Tiezhu Yang wrote:
>
>> When testing kernel live patching with "modprobe livepatch-sample",
>> there is a timeout of over 15 seconds from "starting patching transition"
>> to "patching complete", and dmesg shows "unreliable stack" for user tasks
>> in debug mode. A similar issue occurs when executing
>> "rmmod livepatch-sample".

...

>> @@ -57,9 +62,14 @@ int arch_stack_walk_reliable(stack_trace_consume_fn consume_entry,
>>  	}
>>  	regs->regs[1] = 0;
>>  	regs->regs[22] = 0;
>> +	regs->csr_prmd = task->thread.csr_prmd;
>>  	for (unwind_start(&state, task, regs);
>>  	     !unwind_done(&state) && !unwind_error(&state); unwind_next_frame(&state)) {
>> +		/* Success path for user tasks */
>> +		if (user_mode(regs))
>> +			return 0;
>> +
>>  		addr = unwind_get_return_address(&state);
>>  		/*
> Hi, Tiezhu,
>
> We update the stack info via get_stack_info() when we meet ORC_TYPE_REGS in
> unwind_next_frame(). And in arch_stack_walk(_reliable), we always
> call unwind_done() before unwind_next_frame(). So is there an
> error in get_stack_info() that causes regs to be user_mode while
> the stack is not STACK_TYPE_UNKNOWN?

When testing kernel live patching, the error code path in
unwind_next_frame() is:

	switch (orc->fp_reg) {
	case ORC_REG_PREV_SP:
		p = (unsigned long *)(state->sp + orc->fp_offset);
		if (!stack_access_ok(state, (unsigned long)p, sizeof(unsigned long)))
			goto err;

In this case, get_stack_info() does not return 0 because in_task_stack()
is not true, so it goes to err, with state->stack_info.type = STACK_TYPE_UNKNOWN
and state->error = true. In arch_stack_walk_reliable(), the loop then
breaks and it returns -EINVAL, thus causing the unreliable stack.

Maybe get_stack_info() could check whether the task is in userspace and set
state->stack_info.type = STACK_TYPE_UNKNOWN, but I think there is no need
to do that because it would have a similar effect to this patch.

Thanks,
Tiezhu
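
The failure path described in the reply above can be approximated with a small user-space model. This is a sketch only, not the kernel's unwinder: every mock_* name is hypothetical, the stack is reduced to a [stack_lo, stack_hi) range, and each "frame" is just an offset added to sp. It illustrates how one out-of-range access marks the state as errored and makes the reliable walk return -EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum mock_stack_type { MOCK_STACK_UNKNOWN, MOCK_STACK_TASK };

/* Hypothetical, heavily simplified unwind state. */
struct mock_unwind_state {
	enum mock_stack_type type;
	unsigned long stack_lo;	/* inclusive lower bound of the task stack */
	unsigned long stack_hi;	/* exclusive upper bound of the task stack */
	unsigned long sp;
	bool error;
};

/* True only if [addr, addr + len) lies inside the modelled task stack. */
static bool mock_stack_access_ok(const struct mock_unwind_state *s,
				 unsigned long addr, size_t len)
{
	assert(s->stack_lo <= s->stack_hi);
	return addr >= s->stack_lo && addr < s->stack_hi &&
	       len <= s->stack_hi - addr;
}

/* Advance one frame; on a bad access, record the error like the err label. */
static bool mock_next_frame(struct mock_unwind_state *s, unsigned long fp_offset)
{
	unsigned long p = s->sp + fp_offset;

	if (!mock_stack_access_ok(s, p, sizeof(unsigned long))) {
		s->error = true;
		s->type = MOCK_STACK_UNKNOWN;
		return false;
	}
	s->sp = p;
	return true;
}

/* Walk all offsets; any recorded error makes the whole trace unreliable. */
static int mock_walk_reliable(struct mock_unwind_state *s,
			      const unsigned long *offsets, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!mock_next_frame(s, offsets[i]))
			break;
	}
	return s->error ? -EINVAL : 0;
}
```

In this model a walk whose offsets stay inside the stack range returns 0, while a single offset that lands outside it yields -EINVAL, which is the situation klp_check_stack() would report as an unreliable stack.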
On 2025-09-11 19:49, Tiezhu Yang wrote:

> On 2025/9/10 9:11 AM, Jinyang He wrote:
>> On 2025-09-09 19:31, Tiezhu Yang wrote:
>>
>>> When testing kernel live patching with "modprobe livepatch-sample",
>>> there is a timeout of over 15 seconds from "starting patching transition"
>>> to "patching complete", and dmesg shows "unreliable stack" for user tasks
>>> in debug mode. A similar issue occurs when executing
>>> "rmmod livepatch-sample".
>
> ...
>
>>> @@ -57,9 +62,14 @@ int arch_stack_walk_reliable(stack_trace_consume_fn consume_entry,
>>>  	}
>>>  	regs->regs[1] = 0;
>>>  	regs->regs[22] = 0;
>>> +	regs->csr_prmd = task->thread.csr_prmd;
>>>  	for (unwind_start(&state, task, regs);
>>>  	     !unwind_done(&state) && !unwind_error(&state); unwind_next_frame(&state)) {
>>> +		/* Success path for user tasks */
>>> +		if (user_mode(regs))
>>> +			return 0;
>>> +
>>>  		addr = unwind_get_return_address(&state);
>>>  		/*
>> Hi, Tiezhu,
>>
>> We update the stack info via get_stack_info() when we meet ORC_TYPE_REGS in
>> unwind_next_frame(). And in arch_stack_walk(_reliable), we always
>> call unwind_done() before unwind_next_frame(). So is there an
>> error in get_stack_info() that causes regs to be user_mode while
>> the stack is not STACK_TYPE_UNKNOWN?
>
> When testing kernel live patching, the error code path in
> unwind_next_frame() is:
>
> 	switch (orc->fp_reg) {
> 	case ORC_REG_PREV_SP:
> 		p = (unsigned long *)(state->sp + orc->fp_offset);
> 		if (!stack_access_ok(state, (unsigned long)p, sizeof(unsigned long)))
> 			goto err;
>
> In this case, get_stack_info() does not return 0 because in_task_stack()
> is not true, so it goes to err, with state->stack_info.type = STACK_TYPE_UNKNOWN
> and state->error = true. In arch_stack_walk_reliable(), the loop then
> breaks and it returns -EINVAL, thus causing the unreliable stack.

The stop position of a complete stack backtrace on LoongArch should be
the top of the task stack or the point where the address satisfies
is_entry_func.
Otherwise, it is not a complete stack backtrace, and thus I think it
is an "unreliable stack".
I'm curious what the ORC info at this PC is.
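
The completeness criterion stated in the reply above can be written down as a tiny predicate. This is a hedged sketch rather than kernel code: the mock_layout and mock_final_frame types are hypothetical, and "entry function" is modelled as a simple [entry_start, entry_end) address range instead of the real is_entry_func logic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of where a full unwind is allowed to stop. */
struct mock_layout {
	unsigned long stack_top;	/* sp value at the top of the task stack */
	unsigned long entry_start;	/* start of the modelled entry code */
	unsigned long entry_end;	/* end (exclusive) of the modelled entry code */
};

/* Where the unwinder actually stopped. */
struct mock_final_frame {
	unsigned long sp;
	unsigned long pc;
};

/* Stand-in for an entry-function test on the final pc. */
static bool mock_is_entry_func(const struct mock_layout *l, unsigned long pc)
{
	assert(l->entry_start <= l->entry_end);
	return pc >= l->entry_start && pc < l->entry_end;
}

/* A trace counts as complete only if it stopped at one of the two end points. */
static bool mock_trace_is_complete(const struct mock_layout *l,
				   const struct mock_final_frame *f)
{
	return f->sp == l->stack_top || mock_is_entry_func(l, f->pc);
}
```

Under this model, a walk that stops anywhere else (for example because stack_access_ok() failed part-way up) is incomplete and should be reported as unreliable, which is the point being made in the discussion.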
On 2025/9/12 9:55 AM, Jinyang He wrote:
> On 2025-09-11 19:49, Tiezhu Yang wrote:
>
>> On 2025/9/10 9:11 AM, Jinyang He wrote:
>>> On 2025-09-09 19:31, Tiezhu Yang wrote:
>>>
>>>> When testing kernel live patching with "modprobe livepatch-sample",
>>>> there is a timeout of over 15 seconds from "starting patching transition"
>>>> to "patching complete", and dmesg shows "unreliable stack" for user tasks
>>>> in debug mode. A similar issue occurs when executing
>>>> "rmmod livepatch-sample".

...

>> In this case, get_stack_info() does not return 0 because in_task_stack()
>> is not true, so it goes to err, with state->stack_info.type = STACK_TYPE_UNKNOWN
>> and state->error = true. In arch_stack_walk_reliable(), the loop then
>> breaks and it returns -EINVAL, thus causing the unreliable stack.

> The stop position of a complete stack backtrace on LoongArch should be
> the top of the task stack or the point where the address satisfies
> is_entry_func.
> Otherwise, it is not a complete stack backtrace, and thus I think it
> is an "unreliable stack".
> I'm curious what the ORC info at this PC is.

The unwind process has a problem. I have found the root cause and am
working to fix the "unreliable stack" issue; it should and can find the
last frame, after which the user_mode() check is not necessary.

Thanks,
Tiezhu