When a 9P RPC request is interrupted by a signal, p9_client_rpc() clears
TIF_SIGPENDING and attempts to flush the pending request by calling
p9_client_flush(). Since commit 9523feac272c switched the wait from
wait_event_interruptible() to wait_event_killable(), only SIGKILL can
interrupt the wait. However, the flush path also clears TIF_SIGPENDING,
which prevents subsequent wait_event_killable() calls from detecting the
already-pending SIGKILL. This makes the task unkillable.
When a coredump is initiated by another thread, zap_process() sends
SIGKILL to all threads. If a thread is stuck in the flush path as
described above, it never exits, and the dumper thread waits forever in
coredump_wait(), resulting in a hung task.
The same issue exists in p9_client_zc_rpc().
Skip the flush and the P9_TFLUSH retry loop when a fatal signal is
pending so that the task can exit promptly.
Fixes: 9523feac272c ("net/9p: Switch to wait_event_killable()")
Reported-by: syzbot+6ed94e81a1492fe1d512@syzkaller.appspotmail.com
Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
---
net/9p/client.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/net/9p/client.c b/net/9p/client.c
index f0dcf252af7e..28742f19041e 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -599,7 +599,7 @@ p9_client_rpc(struct p9_client *c, int8_t type, const char *fmt, ...)
smp_rmb();
if (err == -ERESTARTSYS && c->status == Connected &&
- type == P9_TFLUSH) {
+ type == P9_TFLUSH && !fatal_signal_pending(current)) {
sigpending = 1;
clear_thread_flag(TIF_SIGPENDING);
goto again;
@@ -609,7 +609,8 @@ p9_client_rpc(struct p9_client *c, int8_t type, const char *fmt, ...)
p9_debug(P9_DEBUG_ERROR, "req_status error %d\n", req->t_err);
err = req->t_err;
}
- if (err == -ERESTARTSYS && c->status == Connected) {
+ if (err == -ERESTARTSYS && c->status == Connected &&
+ !fatal_signal_pending(current)) {
p9_debug(P9_DEBUG_MUX, "flushing\n");
sigpending = 1;
clear_thread_flag(TIF_SIGPENDING);
@@ -694,7 +695,8 @@ static struct p9_req_t *p9_client_zc_rpc(struct p9_client *c, int8_t type,
p9_debug(P9_DEBUG_ERROR, "req_status error %d\n", req->t_err);
err = req->t_err;
}
- if (err == -ERESTARTSYS && c->status == Connected) {
+ if (err == -ERESTARTSYS && c->status == Connected &&
+ !fatal_signal_pending(current)) {
p9_debug(P9_DEBUG_MUX, "flushing\n");
sigpending = 1;
clear_thread_flag(TIF_SIGPENDING);
--
2.52.0
Thanks for the patch,
Shigeru Yoshida wrote on Sun, Mar 22, 2026 at 08:18:47PM +0900:
> When a 9P RPC request is interrupted by a signal, p9_client_rpc() clears
> TIF_SIGPENDING and attempts to flush the pending request by calling
> p9_client_flush(). Since commit 9523feac272c switched the wait from
> wait_event_interruptible() to wait_event_killable(), only SIGKILL can
> interrupt the wait. However, the flush path also clears TIF_SIGPENDING,
> which prevents subsequent wait_event_killable() calls from detecting the
> already-pending SIGKILL. This makes the task unkillable.
Yes, the task is unkillable; that's a known issue -- but I don't see how
it's a new problem introduced by the switch to wait_event_killable(): as
far as I'm aware it was a problem even before the commit you listed.
Unfortunately we pretty much have to wait for the flush to complete for
consistency in some RPCs (I don't remember what exactly, before commit
728356dedeff ("9p: Add refcount to p9_req_t") there was an obvious use
after free, now there should not be a UAF problem but for some reason
I've always been waiting for async flush (something like [1], except it
introduced other regressions and never made it because of eternal Lack
Of Time))
If you've really thought it out and can convince me we can safely do
this, I'll give it another thought, but in practice it's only a problem
for broken servers (or syzkaller), so I'd rather prioritize not
regressing elsewhere unless there are good arguments...
If it helps fuzzing, though, I guess we could have this as a mount
option? But then you'd not be fuzzing the same code path as regular
apps, so I'm not convinced it's a good idea either...
[1] https://lore.kernel.org/all/1544532108-21689-3-git-send-email-asmadeus@codewreck.org/T/#md5b45e3a9fda626a3e48489e12eed27a455707f7
Thanks,
--
Dominique Martinet | Asmadeus
Dominique Martinet wrote on Sun, Mar 22, 2026 at 09:09:31PM +0900:
> Unfortunately we pretty much have to wait for the flush to complete for
> consistency in some RPCs (I don't remember what exactly, before commit
> 728356dedeff ("9p: Add refcount to p9_req_t") there was an obvious use
> after free, now there should not be a UAF problem but for some reason
> I've always been waiting for async flush (something like [1], except it
> introduced other regressions and never made it because of eternal Lack
> Of Time))
Just thinking about it some more, one problem even with refcount would
be that we'll be either leaking memory/the tag for the request (if we
just take the ref on return from p9_client_rpc and wait to receive the
response), or will free the tag and risk reusing it before the server
replies to the original request.
I really don't see how that could work decently without some kind of
async flush.
If you're interested in taking over, I had rebased the 2018 patches I
linked earlier; the 2023 rebase is here:
https://github.com/martinetd/linux/commits/9p-async-v2
I don't remember why I didn't send it for review again in 2023. Part of
the 2019 patches about async clunks definitely was a problem, because we
have to wait for the clunk happy path (some servers flush on clunk when
cache is involved), but the 2023 patches properly only do it on retry,
which is a net improvement, and the flush code should be workable in
theory, so this would "just" need another rebase (that should not be too
bad) and heavy testing...
Whatever we'd merge doesn't need to be based on these patches if you
prefer to reimplement it, I'm not concerned about authorship here, but I
don't think the current approach is appropriate.
Thanks,
--
Dominique Martinet | Asmadeus