xenbus: Use kref to track req lifetime

[PATCH] xenbus: Use kref to track req lifetime

Posted by Jason Andryuk 9 months ago

Marek reported seeing a NULL pointer fault in the xenbus_thread
callstack:
BUG: kernel NULL pointer dereference, address: 0000000000000000
RIP: e030:__wake_up_common+0x4c/0x180
Call Trace:
 <TASK>
 __wake_up_common_lock+0x82/0xd0
 process_msg+0x18e/0x2f0
 xenbus_thread+0x165/0x1c0

process_msg+0x18e is req->cb(req).  req->cb is set to xs_wake_up(), a
thin wrapper around wake_up(), or xenbus_dev_queue_reply().  It seems
like it was xs_wake_up() in this case.

It seems like req may have woken up the xs_wait_for_reply(), which
kfree()ed the req.  When xenbus_thread resumes, it faults on the zero-ed
data.

Linux Device Drivers 2nd edition states:
"Normally, a wake_up call can cause an immediate reschedule to happen,
meaning that other processes might run before wake_up returns."
... which would match the behaviour observed.

Change to keeping two krefs on each request.  One for the caller, and
one for xenbus_thread.  Each will kref_put() when finished, and the last
will free it.

This use of kref matches the description in
Documentation/core-api/kref.rst

Link: https://lore.kernel.org/xen-devel/ZO0WrR5J0xuwDIxW@mail-itl/
Reported-by: "Marek Marczykowski-Górecki" <marmarek@invisiblethingslab.com>
Fixes: fd8aa9095a95 ("xen: optimize xenbus driver for multiple concurrent xenstore accesses")
Cc: stable@vger.kernel.org
Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
---
Kinda RFC-ish as I don't know if it fixes Marek's issue.  This does seem
like the correct approach if we are seeing req free()ed out from under
xenbus_thread.

 drivers/xen/xenbus/xenbus.h              |  2 ++
 drivers/xen/xenbus/xenbus_comms.c        |  9 ++++-----
 drivers/xen/xenbus/xenbus_dev_frontend.c |  2 +-
 drivers/xen/xenbus/xenbus_xs.c           | 18 ++++++++++++++++--
 4 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/drivers/xen/xenbus/xenbus.h b/drivers/xen/xenbus/xenbus.h
index 13821e7e825e..9ac0427724a3 100644
--- a/drivers/xen/xenbus/xenbus.h
+++ b/drivers/xen/xenbus/xenbus.h
@@ -77,6 +77,7 @@ enum xb_req_state {
 struct xb_req_data {
 	struct list_head list;
 	wait_queue_head_t wq;
+	struct kref kref;
 	struct xsd_sockmsg msg;
 	uint32_t caller_req_id;
 	enum xsd_sockmsg_type type;
@@ -103,6 +104,7 @@ int xb_init_comms(void);
 void xb_deinit_comms(void);
 int xs_watch_msg(struct xs_watch_event *event);
 void xs_request_exit(struct xb_req_data *req);
+void xs_free_req(struct kref *kref);
 
 int xenbus_match(struct device *_dev, const struct device_driver *_drv);
 int xenbus_dev_probe(struct device *_dev);
diff --git a/drivers/xen/xenbus/xenbus_comms.c b/drivers/xen/xenbus/xenbus_comms.c
index e5fda0256feb..82df2da1b880 100644
--- a/drivers/xen/xenbus/xenbus_comms.c
+++ b/drivers/xen/xenbus/xenbus_comms.c
@@ -309,8 +309,8 @@ static int process_msg(void)
 			virt_wmb();
 			req->state = xb_req_state_got_reply;
 			req->cb(req);
-		} else
-			kfree(req);
+		}
+		kref_put(&req->kref, xs_free_req);
 	}
 
 	mutex_unlock(&xs_response_mutex);
@@ -386,14 +386,13 @@ static int process_writes(void)
 	state.req->msg.type = XS_ERROR;
 	state.req->err = err;
 	list_del(&state.req->list);
-	if (state.req->state == xb_req_state_aborted)
-		kfree(state.req);
-	else {
+	if (state.req->state != xb_req_state_aborted) {
 		/* write err, then update state */
 		virt_wmb();
 		state.req->state = xb_req_state_got_reply;
 		wake_up(&state.req->wq);
 	}
+	kref_put(&state.req->kref, xs_free_req);
 
 	mutex_unlock(&xb_write_mutex);
 
diff --git a/drivers/xen/xenbus/xenbus_dev_frontend.c b/drivers/xen/xenbus/xenbus_dev_frontend.c
index 46f8916597e5..f5c21ba64df5 100644
--- a/drivers/xen/xenbus/xenbus_dev_frontend.c
+++ b/drivers/xen/xenbus/xenbus_dev_frontend.c
@@ -406,7 +406,7 @@ void xenbus_dev_queue_reply(struct xb_req_data *req)
 	mutex_unlock(&u->reply_mutex);
 
 	kfree(req->body);
-	kfree(req);
+	kref_put(&req->kref, xs_free_req);
 
 	kref_put(&u->kref, xenbus_file_free);
 
diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
index d32c726f7a12..dcf9182c8451 100644
--- a/drivers/xen/xenbus/xenbus_xs.c
+++ b/drivers/xen/xenbus/xenbus_xs.c
@@ -112,6 +112,12 @@ static void xs_suspend_exit(void)
 	wake_up_all(&xs_state_enter_wq);
 }
 
+void xs_free_req(struct kref *kref)
+{
+	struct xb_req_data *req = container_of(kref, struct xb_req_data, kref);
+	kfree(req);
+}
+
 static uint32_t xs_request_enter(struct xb_req_data *req)
 {
 	uint32_t rq_id;
@@ -237,6 +243,12 @@ static void xs_send(struct xb_req_data *req, struct xsd_sockmsg *msg)
 	req->caller_req_id = req->msg.req_id;
 	req->msg.req_id = xs_request_enter(req);
 
+	/*
+	 * Take 2nd ref.  One for this thread, and the second for the
+	 * xenbus_thread.
+	 */
+	kref_get(&req->kref);
+
 	mutex_lock(&xb_write_mutex);
 	list_add_tail(&req->list, &xb_write_list);
 	notify = list_is_singular(&xb_write_list);
@@ -261,8 +273,8 @@ static void *xs_wait_for_reply(struct xb_req_data *req, struct xsd_sockmsg *msg)
 	if (req->state == xb_req_state_queued ||
 	    req->state == xb_req_state_wait_reply)
 		req->state = xb_req_state_aborted;
-	else
-		kfree(req);
+
+	kref_put(&req->kref, xs_free_req);
 	mutex_unlock(&xb_write_mutex);
 
 	return ret;
@@ -291,6 +303,7 @@ int xenbus_dev_request_and_reply(struct xsd_sockmsg *msg, void *par)
 	req->cb = xenbus_dev_queue_reply;
 	req->par = par;
 	req->user_req = true;
+	kref_init(&req->kref);
 
 	xs_send(req, msg);
 
@@ -319,6 +332,7 @@ static void *xs_talkv(struct xenbus_transaction t,
 	req->num_vecs = num_vecs;
 	req->cb = xs_wake_up;
 	req->user_req = false;
+	kref_init(&req->kref);
 
 	msg.req_id = 0;
 	msg.tx_id = t.id;
-- 
2.34.1

Re: [PATCH] xenbus: Use kref to track req lifetime

Posted by Marek Marczykowski-Górecki 9 months ago

On Tue, May 06, 2025 at 05:09:33PM -0400, Jason Andryuk wrote:
> Marek reported seeing a NULL pointer fault in the xenbus_thread
> callstack:
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> RIP: e030:__wake_up_common+0x4c/0x180
> Call Trace:
>  <TASK>
>  __wake_up_common_lock+0x82/0xd0
>  process_msg+0x18e/0x2f0
>  xenbus_thread+0x165/0x1c0
> 
> process_msg+0x18e is req->cb(req).  req->cb is set to xs_wake_up(), a
> thin wrapper around wake_up(), or xenbus_dev_queue_reply().  It seems
> like it was xs_wake_up() in this case.
> 
> It seems like req may have woken up the xs_wait_for_reply(), which
> kfree()ed the req.  When xenbus_thread resumes, it faults on the zero-ed
> data.
> 
> Linux Device Drivers 2nd edition states:
> "Normally, a wake_up call can cause an immediate reschedule to happen,
> meaning that other processes might run before wake_up returns."
> ... which would match the behaviour observed.
> 
> Change to keeping two krefs on each request.  One for the caller, and
> one for xenbus_thread.  Each will kref_put() when finished, and the last
> will free it.
> 
> This use of kref matches the description in
> Documentation/core-api/kref.rst
> 
> Link: https://lore.kernel.org/xen-devel/ZO0WrR5J0xuwDIxW@mail-itl/
> Reported-by: "Marek Marczykowski-Górecki" <marmarek@invisiblethingslab.com>
> Fixes: fd8aa9095a95 ("xen: optimize xenbus driver for multiple concurrent xenstore accesses")
> Cc: stable@vger.kernel.org
> Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
> ---
> Kinda RFC-ish as I don't know if it fixes Marek's issue.  This does seem
> like the correct approach if we are seeing req free()ed out from under
> xenbus_thread.

Thanks for the patch! I don't have easy way to test if it definitely
fixes the issues (due to poor reproduction rate), but it looks very
likely. I did run it through our CI and at least there it didn't crash
(but again, it doesn't happen often).

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

Re: [PATCH] xenbus: Use kref to track req lifetime

Posted by Jürgen Groß 9 months ago

On 06.05.25 23:09, Jason Andryuk wrote:
> Marek reported seeing a NULL pointer fault in the xenbus_thread
> callstack:
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> RIP: e030:__wake_up_common+0x4c/0x180
> Call Trace:
>   <TASK>
>   __wake_up_common_lock+0x82/0xd0
>   process_msg+0x18e/0x2f0
>   xenbus_thread+0x165/0x1c0
> 
> process_msg+0x18e is req->cb(req).  req->cb is set to xs_wake_up(), a
> thin wrapper around wake_up(), or xenbus_dev_queue_reply().  It seems
> like it was xs_wake_up() in this case.
> 
> It seems like req may have woken up the xs_wait_for_reply(), which
> kfree()ed the req.  When xenbus_thread resumes, it faults on the zero-ed
> data.
> 
> Linux Device Drivers 2nd edition states:
> "Normally, a wake_up call can cause an immediate reschedule to happen,
> meaning that other processes might run before wake_up returns."
> ... which would match the behaviour observed.
> 
> Change to keeping two krefs on each request.  One for the caller, and
> one for xenbus_thread.  Each will kref_put() when finished, and the last
> will free it.
> 
> This use of kref matches the description in
> Documentation/core-api/kref.rst
> 
> Link: https://lore.kernel.org/xen-devel/ZO0WrR5J0xuwDIxW@mail-itl/
> Reported-by: "Marek Marczykowski-Górecki" <marmarek@invisiblethingslab.com>
> Fixes: fd8aa9095a95 ("xen: optimize xenbus driver for multiple concurrent xenstore accesses")
> Cc: stable@vger.kernel.org
> Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>

Reviewed-by: Juergen Gross <jgross@suse.com>

> ---
> Kinda RFC-ish as I don't know if it fixes Marek's issue.  This does seem
> like the correct approach if we are seeing req free()ed out from under
> xenbus_thread.

I think your analysis is correct. When writing this code I didn't think
of wake_up() needing to access req->wq _after_ having woken up the waiter.


Juergen

Re: [PATCH] xenbus: Use kref to track req lifetime

Posted by Jason Andryuk 9 months ago

On 2025-05-07 05:27, Jürgen Groß wrote:
> On 06.05.25 23:09, Jason Andryuk wrote:
>> Marek reported seeing a NULL pointer fault in the xenbus_thread
>> callstack:
>> BUG: kernel NULL pointer dereference, address: 0000000000000000
>> RIP: e030:__wake_up_common+0x4c/0x180
>> Call Trace:
>>   <TASK>
>>   __wake_up_common_lock+0x82/0xd0
>>   process_msg+0x18e/0x2f0
>>   xenbus_thread+0x165/0x1c0
>>
>> process_msg+0x18e is req->cb(req).  req->cb is set to xs_wake_up(), a
>> thin wrapper around wake_up(), or xenbus_dev_queue_reply().  It seems
>> like it was xs_wake_up() in this case.
>>
>> It seems like req may have woken up the xs_wait_for_reply(), which
>> kfree()ed the req.  When xenbus_thread resumes, it faults on the zero-ed
>> data.
>>
>> Linux Device Drivers 2nd edition states:
>> "Normally, a wake_up call can cause an immediate reschedule to happen,
>> meaning that other processes might run before wake_up returns."
>> ... which would match the behaviour observed.
>>
>> Change to keeping two krefs on each request.  One for the caller, and
>> one for xenbus_thread.  Each will kref_put() when finished, and the last
>> will free it.
>>
>> This use of kref matches the description in
>> Documentation/core-api/kref.rst
>>
>> Link: https://lore.kernel.org/xen-devel/ZO0WrR5J0xuwDIxW@mail-itl/
>> Reported-by: "Marek Marczykowski-Górecki" 
>> <marmarek@invisiblethingslab.com>
>> Fixes: fd8aa9095a95 ("xen: optimize xenbus driver for multiple 
>> concurrent xenstore accesses")
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
> 
> Reviewed-by: Juergen Gross <jgross@suse.com>

Thanks

>> ---
>> Kinda RFC-ish as I don't know if it fixes Marek's issue.  This does seem
>> like the correct approach if we are seeing req free()ed out from under
>> xenbus_thread.
> 
> I think your analysis is correct. When writing this code I didn't think
> of wake_up() needing to access req->wq _after_ having woken up the waiter.

Yes, this was tricky.

One other thing that makes me think this is correct.  If this is the 
same underlying issue: 
https://lore.kernel.org/xen-devel/Z_lJTyVipJJEpWg2@mail-itl/

The failure is in the unlock:

pvqspinlock: lock 0xffff8881029af110 has corrupted value 0x0!
WARNING: CPU: 1 PID: 118 at kernel/locking/qspinlock_paravirt.h:504 
__pv_queued_spin_unlock_slowpath+0xdc/0x120

Which makes me think the req was fine entering wake_up(), and it's only 
found to be corrupt on the way out.

Regards,
Jason