A problem was found in stable 5.10, and its root cause is as below.

In the use of q_usage_counter of request_queue, blk_cleanup_queue uses
"wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter))"
to wait for q_usage_counter to become zero. However, if q_usage_counter
becomes zero quickly, percpu_ref_exit will run and ref->data will be
freed, and another process may then hit a null-deref problem
like below:
CPU0                                  CPU1
blk_mq_destroy_queue
 blk_freeze_queue
  blk_mq_freeze_queue_wait
                                      scsi_end_request
                                       percpu_ref_get
                                       ...
                                       percpu_ref_put
                                        atomic_long_sub_and_test
 blk_put_queue
  kobject_put
   kref_put
    blk_release_queue
     percpu_ref_exit
      ref->data -> NULL
                                       ref->data->release(ref) -> null-deref
As suggested by Ming Lei, fix it by getting the release method before
the reference count is decremented to zero.
Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
---
include/linux/percpu-refcount.h | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index d73a1c08c3e3..11e717c95acb 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -331,8 +331,11 @@ static inline void percpu_ref_put_many(struct percpu_ref *ref, unsigned long nr)
 
 	if (__ref_is_percpu(ref, &percpu_count))
 		this_cpu_sub(*percpu_count, nr);
-	else if (unlikely(atomic_long_sub_and_test(nr, &ref->data->count)))
-		ref->data->release(ref);
+	else {
+		percpu_ref_func_t *release = ref->data->release;
+		if (unlikely(atomic_long_sub_and_test(nr, &ref->data->count)))
+			release(ref);
+	}
 
 	rcu_read_unlock();
 }
--
2.31.1
Hello,

On Tue, Dec 06, 2022 at 05:09:39PM +0800, Zhong Jinghua wrote:
> A problem was found in stable 5.10, and its root cause is as below.
>
> In the use of q_usage_counter of request_queue, blk_cleanup_queue uses
> "wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter))"
> to wait for q_usage_counter to become zero. However, if q_usage_counter
> becomes zero quickly, percpu_ref_exit will run and ref->data will be
> freed, and another process may then hit a null-deref problem
> like below:
>
> CPU0                                  CPU1
> blk_mq_destroy_queue
>  blk_freeze_queue
>   blk_mq_freeze_queue_wait
>                                       scsi_end_request
>                                        percpu_ref_get
>                                        ...
>                                        percpu_ref_put
>                                         atomic_long_sub_and_test
>  blk_put_queue
>   kobject_put
>    kref_put
>     blk_release_queue
>      percpu_ref_exit
>       ref->data -> NULL
>                                        ref->data->release(ref) -> null-deref
>

I remember thinking about this a while ago. I don't think this fix works
as nicely as it may seem. Please correct me if I'm wrong.

q->q_usage_counter has the oddity that the lifetime of the percpu_ref
object isn't managed by the release function. The freeing is handled by
a separate path where it depends on the percpu_ref hitting 0. So here we
have 2 concurrent paths racing to run, with 1 destroying the object. We
probably need blk_release_queue() to wait on percpu_ref's release
finishing, not starting.

I think the above works in this specific case because there is a
call_rcu() in blk_release_queue(). If there wasn't a call_rcu(),
then by the same logic we could delay ref->data->release(ref) further
and that could potentially lead to a use after free.

Ideally, I think fixing the race in q->q_usage_counter's pattern is
better than masking it here, as I think we're being saved by the
call_rcu() call further down the object release path.

Thanks,
Dennis
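To make Dennis's "wait on release finishing, not starting" point concrete, a minimal sketch of such a pattern is below. It is not actual block-layer code: struct my_queue, the usage_released completion, and all helper names are hypothetical stand-ins for request_queue and its teardown path.

```c
#include <linux/completion.h>
#include <linux/kernel.h>
#include <linux/percpu-refcount.h>

/* Hypothetical owner of a percpu_ref, standing in for struct request_queue. */
struct my_queue {
	struct percpu_ref	usage_counter;
	struct completion	usage_released;	/* hypothetical field */
};

/* Release callback: runs once after the last reference is dropped. */
static void my_queue_usage_release(struct percpu_ref *ref)
{
	struct my_queue *q = container_of(ref, struct my_queue, usage_counter);

	/* Signal that ->release() has fully finished running. */
	complete(&q->usage_released);
}

static int my_queue_init(struct my_queue *q)
{
	init_completion(&q->usage_released);
	return percpu_ref_init(&q->usage_counter, my_queue_usage_release,
			       0, GFP_KERNEL);
}

/* Teardown path, standing in for blk_release_queue(). */
static void my_queue_teardown(struct my_queue *q)
{
	/*
	 * Wait for the release callback to have completed, not merely for
	 * the counter to read zero, before tearing down ref->data.
	 * (Killing the ref and dropping the final reference are elided.)
	 */
	wait_for_completion(&q->usage_released);
	percpu_ref_exit(&q->usage_counter);
}
```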
On Wed, Dec 7, 2022 at 9:08 AM Dennis Zhou <dennis@kernel.org> wrote:
>
> Hello,
>
> On Tue, Dec 06, 2022 at 05:09:39PM +0800, Zhong Jinghua wrote:
> > A problem was found in stable 5.10, and its root cause is as below.
> > [...]
>
> I remember thinking about this a while ago. I don't think this fix works
> as nicely as it may seem. Please correct me if I'm wrong.
>
> q->q_usage_counter has the oddity that the lifetime of the percpu_ref
> object isn't managed by the release function. The freeing is handled by
> a separate path where it depends on the percpu_ref hitting 0. So here we
> have 2 concurrent paths racing to run, with 1 destroying the object. We
> probably need blk_release_queue() to wait on percpu_ref's release
> finishing, not starting.
>
> I think the above works in this specific case because there is a
> call_rcu() in blk_release_queue(). If there wasn't a call_rcu(),
> then by the same logic we could delay ref->data->release(ref) further
> and that could potentially lead to a use after free.
>
> Ideally, I think fixing the race in q->q_usage_counter's pattern is
> better than masking it here, as I think we're being saved by the
> call_rcu() call further down the object release path.

The problem is actually in percpu_ref_is_zero(), which can return true
before ->release() is called. And any pattern of
wait_event(percpu_ref_is_zero) may imply such risk.

It may not be easy to fix the issue cleanly in the block layer, but it
can be solved in percpu-refcount simply by adding a ->release_lock
(spinlock) to the counter, for draining atomic_long_sub_and_test() &
->release() in percpu_ref_exit(). Or simply use percpu_ref_switch_lock.

Thanks,
Ming
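For concreteness, a rough sketch of the draining Ming describes is below. It is not the actual kernel code or a posted patch: the ->release_lock spinlock is an assumed new field in struct percpu_ref_data, its initialization is not shown, and percpu_ref_exit() is reduced to the part relevant here.

```c
#include <linux/percpu-refcount.h>
#include <linux/spinlock.h>

static inline void percpu_ref_put_many(struct percpu_ref *ref, unsigned long nr)
{
	unsigned long __percpu *percpu_count;

	rcu_read_lock();

	if (__ref_is_percpu(ref, &percpu_count)) {
		this_cpu_sub(*percpu_count, nr);
	} else {
		struct percpu_ref_data *data = ref->data;
		unsigned long flags;

		/*
		 * Make the final decrement and ->release() a single step
		 * with respect to percpu_ref_exit().  data->release_lock
		 * is an assumed new field, not an existing one.
		 */
		spin_lock_irqsave(&data->release_lock, flags);
		if (unlikely(atomic_long_sub_and_test(nr, &data->count)))
			data->release(ref);
		spin_unlock_irqrestore(&data->release_lock, flags);
	}

	rcu_read_unlock();
}

void percpu_ref_exit(struct percpu_ref *ref)
{
	struct percpu_ref_data *data = ref->data;
	unsigned long flags;

	if (!data)
		return;

	/*
	 * Drain a racing put that already dropped the count to zero but
	 * has not yet returned from ->release().
	 */
	spin_lock_irqsave(&data->release_lock, flags);
	spin_unlock_irqrestore(&data->release_lock, flags);

	/* ... existing teardown: free the percpu counter, kfree(data) ... */
}
```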
Hi,

On 2022/12/07 9:05, Dennis Zhou wrote:
> Hello,
>
> On Tue, Dec 06, 2022 at 05:09:39PM +0800, Zhong Jinghua wrote:
>> A problem was found in stable 5.10, and its root cause is as below.
>> [...]
>
> I remember thinking about this a while ago. I don't think this fix works
> as nicely as it may seem. Please correct me if I'm wrong.
>
> q->q_usage_counter has the oddity that the lifetime of the percpu_ref
> object isn't managed by the release function. The freeing is handled by
> a separate path where it depends on the percpu_ref hitting 0. So here we
> have 2 concurrent paths racing to run, with 1 destroying the object. We
> probably need blk_release_queue() to wait on percpu_ref's release
> finishing, not starting.
>
> I think the above works in this specific case because there is a
> call_rcu() in blk_release_queue(). If there wasn't a call_rcu(),
> then by the same logic we could delay ref->data->release(ref) further
> and that could potentially lead to a use after free.
>
> Ideally, I think fixing the race in q->q_usage_counter's pattern is
> better than masking it here, as I think we're being saved by the
> call_rcu() call further down the object release path.

Agree. BTW, Wensheng used to send a patch to fix this in the block layer:

https://www.spinics.net/lists/kernel/msg4615696.html

Thanks,
Kuai