blk-mq: check kobject state_in_sysfs before deleting in blk_mq_unregister_hctx

[PATCH] blk-mq: check kobject state_in_sysfs before deleting in blk_mq_unregister_hctx

Posted by linan666@huaweicloud.com 5 months, 2 weeks ago

From: Li Nan <linan122@huawei.com>

In __blk_mq_update_nr_hw_queues() the return value of
blk_mq_sysfs_register_hctxs() is not checked. If sysfs creation for hctx
fails, later changing the number of hw_queues or removing disk will
trigger the following warning:

  kernfs: can not remove 'nr_tags', no directory
  WARNING: CPU: 2 PID: 637 at fs/kernfs/dir.c:1707 kernfs_remove_by_name_ns+0x13f/0x160
  Call Trace:
   remove_files.isra.1+0x38/0xb0
   sysfs_remove_group+0x4d/0x100
   sysfs_remove_groups+0x31/0x60
   __kobject_del+0x23/0xf0
   kobject_del+0x17/0x40
   blk_mq_unregister_hctx+0x5d/0x80
   blk_mq_sysfs_unregister_hctxs+0x94/0xd0
   blk_mq_update_nr_hw_queues+0x124/0x760
   nullb_update_nr_hw_queues+0x71/0xf0 [null_blk]
   nullb_device_submit_queues_store+0x92/0x120 [null_blk]

kobject_del() was called unconditionally even if sysfs creation failed.
Fix it by checking the kobject creation status before deleting it.

Fixes: 477e19dedc9d ("blk-mq: adjust debugfs and sysfs register when updating nr_hw_queues")
Signed-off-by: Li Nan <linan122@huawei.com>
---
 block/blk-mq-sysfs.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 24656980f443..5c399ac562ea 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -150,9 +150,11 @@ static void blk_mq_unregister_hctx(struct blk_mq_hw_ctx *hctx)
 		return;
 
 	hctx_for_each_ctx(hctx, ctx, i)
-		kobject_del(&ctx->kobj);
+		if (ctx->kobj.state_in_sysfs)
+			kobject_del(&ctx->kobj);
 
-	kobject_del(&hctx->kobj);
+	if (hctx->kobj.state_in_sysfs)
+		kobject_del(&hctx->kobj);
 }
 
 static int blk_mq_register_hctx(struct blk_mq_hw_ctx *hctx)
-- 
2.39.2
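
The guard added by the patch can be modelled outside the kernel. The following is a minimal userspace sketch, not kernel code: every `mock_*` name is hypothetical, and the "warning" is just a counter standing in for the kernfs WARNING shown in the trace above. It assumes only what the patch itself relies on, namely that `state_in_sysfs` is set once the object has been added and is clear otherwise.

```c
/* Userspace model only: none of the mock_* names exist in the kernel. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct mock_kobject {
	bool state_in_sysfs;	/* mirrors the kobject bit the patch tests */
};

static int mock_warnings;	/* counts simulated "no directory" warnings */

/* Simulated registration: fails with -ENOMEM when 'fail' is true. */
static int mock_kobject_add(struct mock_kobject *kobj, bool fail)
{
	assert(kobj != NULL);
	if (fail)
		return -ENOMEM;	/* object never becomes visible */
	kobj->state_in_sysfs = true;
	return 0;
}

/* Simulated deletion: removing an entry that was never added warns. */
static void mock_kobject_del(struct mock_kobject *kobj)
{
	assert(kobj != NULL);
	if (!kobj->state_in_sysfs)
		mock_warnings++;
	kobj->state_in_sysfs = false;
}

/* Guarded teardown, following the shape of the patched function. */
static void mock_unregister(struct mock_kobject *kobj)
{
	assert(kobj != NULL);
	if (kobj->state_in_sysfs)
		mock_kobject_del(kobj);
}
```

In this model, calling `mock_unregister()` after a failed `mock_kobject_add()` leaves `mock_warnings` at zero, whereas calling `mock_kobject_del()` directly increments it, which is the unguarded behaviour the patch removes.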

Re: [PATCH] blk-mq: check kobject state_in_sysfs before deleting in blk_mq_unregister_hctx

Posted by Jens Axboe 5 months, 2 weeks ago

On Tue, 26 Aug 2025 16:48:54 +0800, linan666@huaweicloud.com wrote:
> In __blk_mq_update_nr_hw_queues() the return value of
> blk_mq_sysfs_register_hctxs() is not checked. If sysfs creation for hctx
> fails, later changing the number of hw_queues or removing disk will
> trigger the following warning:
> 
>   kernfs: can not remove 'nr_tags', no directory
>   WARNING: CPU: 2 PID: 637 at fs/kernfs/dir.c:1707 kernfs_remove_by_name_ns+0x13f/0x160
>   Call Trace:
>    remove_files.isra.1+0x38/0xb0
>    sysfs_remove_group+0x4d/0x100
>    sysfs_remove_groups+0x31/0x60
>    __kobject_del+0x23/0xf0
>    kobject_del+0x17/0x40
>    blk_mq_unregister_hctx+0x5d/0x80
>    blk_mq_sysfs_unregister_hctxs+0x94/0xd0
>    blk_mq_update_nr_hw_queues+0x124/0x760
>    nullb_update_nr_hw_queues+0x71/0xf0 [null_blk]
>    nullb_device_submit_queues_store+0x92/0x120 [null_blk]
> 
> [...]

Applied, thanks!

[1/1] blk-mq: check kobject state_in_sysfs before deleting in blk_mq_unregister_hctx
      commit: 4c7ef92f6d4d08a27d676e4c348f4e2922cab3ed

Best regards,
-- 
Jens Axboe

Re: [PATCH] blk-mq: check kobject state_in_sysfs before deleting in blk_mq_unregister_hctx

Posted by Ming Lei 5 months, 2 weeks ago

On Tue, Aug 26, 2025 at 04:48:54PM +0800, linan666@huaweicloud.com wrote:
> From: Li Nan <linan122@huawei.com>
> 
> In __blk_mq_update_nr_hw_queues() the return value of
> blk_mq_sysfs_register_hctxs() is not checked. If sysfs creation for hctx

It looks like we should check its return value and handle the failure in
both the call site and blk_mq_sysfs_register_hctxs().

> fails, later changing the number of hw_queues or removing disk will
> trigger the following warning:
> 
>   kernfs: can not remove 'nr_tags', no directory
>   WARNING: CPU: 2 PID: 637 at fs/kernfs/dir.c:1707 kernfs_remove_by_name_ns+0x13f/0x160
>   Call Trace:
>    remove_files.isra.1+0x38/0xb0
>    sysfs_remove_group+0x4d/0x100
>    sysfs_remove_groups+0x31/0x60
>    __kobject_del+0x23/0xf0
>    kobject_del+0x17/0x40
>    blk_mq_unregister_hctx+0x5d/0x80
>    blk_mq_sysfs_unregister_hctxs+0x94/0xd0
>    blk_mq_update_nr_hw_queues+0x124/0x760
>    nullb_update_nr_hw_queues+0x71/0xf0 [null_blk]
>    nullb_device_submit_queues_store+0x92/0x120 [null_blk]
> 
> kobject_del() was called unconditionally even if sysfs creation failed.
> Fix it by checking the kobject creation status before deleting it.
> 
> Fixes: 477e19dedc9d ("blk-mq: adjust debugfs and sysfs register when updating nr_hw_queues")
> Signed-off-by: Li Nan <linan122@huawei.com>
> ---
>  block/blk-mq-sysfs.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index 24656980f443..5c399ac562ea 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -150,9 +150,11 @@ static void blk_mq_unregister_hctx(struct blk_mq_hw_ctx *hctx)
>  		return;
>  
>  	hctx_for_each_ctx(hctx, ctx, i)
> -		kobject_del(&ctx->kobj);
> +		if (ctx->kobj.state_in_sysfs)
> +			kobject_del(&ctx->kobj);
>  
> -	kobject_del(&hctx->kobj);
> +	if (hctx->kobj.state_in_sysfs)
> +		kobject_del(&hctx->kobj);

It is bad to use kobject internal state in block layer.


Thanks,
Ming

Re: [PATCH] blk-mq: check kobject state_in_sysfs before deleting in blk_mq_unregister_hctx

Posted by Yu Kuai 5 months, 2 weeks ago

Hi,

On 2025/08/27 8:58, Ming Lei wrote:
> On Tue, Aug 26, 2025 at 04:48:54PM +0800, linan666@huaweicloud.com wrote:
>> From: Li Nan <linan122@huawei.com>
>>
>> In __blk_mq_update_nr_hw_queues() the return value of
>> blk_mq_sysfs_register_hctxs() is not checked. If sysfs creation for hctx
> 
> Looks we should check its return value and handle the failure in both
> the call site and blk_mq_sysfs_register_hctxs().

In __blk_mq_update_nr_hw_queues(), the old hctxs are already
unregistered, and the function returns void. Registering the new hctxs
failed because of a memory allocation failure. I really don't know how to
handle the failure here; do you have any suggestions?

Thanks,
Kuai

> 
>> fails, later changing the number of hw_queues or removing disk will
>> trigger the following warning:
>>
>>    kernfs: can not remove 'nr_tags', no directory
>>    WARNING: CPU: 2 PID: 637 at fs/kernfs/dir.c:1707 kernfs_remove_by_name_ns+0x13f/0x160
>>    Call Trace:
>>     remove_files.isra.1+0x38/0xb0
>>     sysfs_remove_group+0x4d/0x100
>>     sysfs_remove_groups+0x31/0x60
>>     __kobject_del+0x23/0xf0
>>     kobject_del+0x17/0x40
>>     blk_mq_unregister_hctx+0x5d/0x80
>>     blk_mq_sysfs_unregister_hctxs+0x94/0xd0
>>     blk_mq_update_nr_hw_queues+0x124/0x760
>>     nullb_update_nr_hw_queues+0x71/0xf0 [null_blk]
>>     nullb_device_submit_queues_store+0x92/0x120 [null_blk]
>>
>> kobject_del() was called unconditionally even if sysfs creation failed.
>> Fix it by checking the kobject creation status before deleting it.
>>
>> Fixes: 477e19dedc9d ("blk-mq: adjust debugfs and sysfs register when updating nr_hw_queues")
>> Signed-off-by: Li Nan <linan122@huawei.com>
>> ---
>>   block/blk-mq-sysfs.c | 6 ++++--
>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
>> index 24656980f443..5c399ac562ea 100644
>> --- a/block/blk-mq-sysfs.c
>> +++ b/block/blk-mq-sysfs.c
>> @@ -150,9 +150,11 @@ static void blk_mq_unregister_hctx(struct blk_mq_hw_ctx *hctx)
>>   		return;
>>   
>>   	hctx_for_each_ctx(hctx, ctx, i)
>> -		kobject_del(&ctx->kobj);
>> +		if (ctx->kobj.state_in_sysfs)
>> +			kobject_del(&ctx->kobj);
>>   
>> -	kobject_del(&hctx->kobj);
>> +	if (hctx->kobj.state_in_sysfs)
>> +		kobject_del(&hctx->kobj);
> 
> It is bad to use kobject internal state in block layer.
> 
> 
> Thanks,
> Ming

Re: [PATCH] blk-mq: check kobject state_in_sysfs before deleting in blk_mq_unregister_hctx

Posted by Ming Lei 5 months, 2 weeks ago

On Wed, Aug 27, 2025 at 09:04:45AM +0800, Yu Kuai wrote:
> Hi,
> 
> On 2025/08/27 8:58, Ming Lei wrote:
> > On Tue, Aug 26, 2025 at 04:48:54PM +0800, linan666@huaweicloud.com wrote:
> > > From: Li Nan <linan122@huawei.com>
> > > 
> > > In __blk_mq_update_nr_hw_queues() the return value of
> > > blk_mq_sysfs_register_hctxs() is not checked. If sysfs creation for hctx
> > 
> > Looks we should check its return value and handle the failure in both
> > the call site and blk_mq_sysfs_register_hctxs().
> 
> From __blk_mq_update_nr_hw_queues(), the old hctxs is already
> unregistered, and this function is void, we failed to register new hctxs
> because of memory allocation failure. I really don't know how to handle
> the failure here, do you have any suggestions?

This is an out-of-memory case. I think it is fine to do whatever it takes
to keep the queue state intact rather than leaving it `partially workable`,
such as:

- try updating nr_hw_queues to 1

- if that still fails, delete the disk & mark the queue as dead if a disk is attached

...

Thanks,
Ming
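
The fallback order suggested above can be sketched as a small userspace model. This is only an illustration under stated assumptions: the `mock_*` names, the callback type, and the "dead" state are hypothetical stand-ins, and nothing here reflects how blk_mq_update_nr_hw_queues() is actually structured.

```c
/* Userspace model only: all mock_* names are hypothetical. */
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum mock_queue_state { MOCK_QUEUE_LIVE, MOCK_QUEUE_DEAD };

struct mock_queue {
	unsigned int nr_hw_queues;
	enum mock_queue_state state;
};

/* Registration hook: returns 0 on success, negative errno on failure. */
typedef int (*mock_register_fn)(unsigned int nr_hw_queues);

/* Example hook: only a single hardware queue can be registered. */
static int mock_register_only_one(unsigned int nr_hw_queues)
{
	return nr_hw_queues == 1 ? 0 : -ENOMEM;
}

/* Example hook: registration always fails. */
static int mock_register_never(unsigned int nr_hw_queues)
{
	(void)nr_hw_queues;
	return -ENOMEM;
}

/*
 * Try the requested count first, then fall back to one queue, and
 * only mark the queue dead when both attempts fail.
 */
static void mock_update_nr_hw_queues(struct mock_queue *q, unsigned int nr,
				     mock_register_fn reg)
{
	assert(q != NULL && reg != NULL);

	if (reg(nr) == 0) {
		q->nr_hw_queues = nr;
		return;
	}
	if (nr != 1 && reg(1) == 0) {
		q->nr_hw_queues = 1;
		return;
	}
	q->nr_hw_queues = 0;
	q->state = MOCK_QUEUE_DEAD;
}
```

The point of the ordering is that the queue is never left half registered: it ends up with the requested count, with one queue, or explicitly dead.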

Re: [PATCH] blk-mq: check kobject state_in_sysfs before deleting in blk_mq_unregister_hctx

Posted by Li Nan 5 months, 2 weeks ago


On 2025/8/27 9:35, Ming Lei wrote:
> On Wed, Aug 27, 2025 at 09:04:45AM +0800, Yu Kuai wrote:
>> Hi,
>>
>> On 2025/08/27 8:58, Ming Lei wrote:
>>> On Tue, Aug 26, 2025 at 04:48:54PM +0800, linan666@huaweicloud.com wrote:
>>>> From: Li Nan <linan122@huawei.com>
>>>>
>>>> In __blk_mq_update_nr_hw_queues() the return value of
>>>> blk_mq_sysfs_register_hctxs() is not checked. If sysfs creation for hctx
>>>
>>> Looks we should check its return value and handle the failure in both
>>> the call site and blk_mq_sysfs_register_hctxs().
>>
>>  From __blk_mq_update_nr_hw_queues(), the old hctxs is already
>> unregistered, and this function is void, we failed to register new hctxs
>> because of memory allocation failure. I really don't know how to handle
>> the failure here, do you have any suggestions?
> 
> It is out of memory, I think it is fine to do whatever to leave queue state
> intact instead of making it `partial workable`, such as:
> 
> - try update nr_hw_queues to 1
> 
> - if it still fails, delete disk & mark queue as dead if disk is attached
> 

If we ignore these non-critical sysfs creation failures, the disk remains 
usable with no loss of functionality. Deleting the disk seems to escalate
the error?

-- 
Thanks,
Nan

Re: [PATCH] blk-mq: check kobject state_in_sysfs before deleting in blk_mq_unregister_hctx

Posted by Ming Lei 5 months, 2 weeks ago

On Wed, Aug 27, 2025 at 11:22:06AM +0800, Li Nan wrote:
> 
> 
> On 2025/8/27 9:35, Ming Lei wrote:
> > On Wed, Aug 27, 2025 at 09:04:45AM +0800, Yu Kuai wrote:
> > > Hi,
> > > 
> > > On 2025/08/27 8:58, Ming Lei wrote:
> > > > On Tue, Aug 26, 2025 at 04:48:54PM +0800, linan666@huaweicloud.com wrote:
> > > > > From: Li Nan <linan122@huawei.com>
> > > > > 
> > > > > In __blk_mq_update_nr_hw_queues() the return value of
> > > > > blk_mq_sysfs_register_hctxs() is not checked. If sysfs creation for hctx
> > > > 
> > > > Looks we should check its return value and handle the failure in both
> > > > the call site and blk_mq_sysfs_register_hctxs().
> > > 
> > >  From __blk_mq_update_nr_hw_queues(), the old hctxs is already
> > > unregistered, and this function is void, we failed to register new hctxs
> > > because of memory allocation failure. I really don't know how to handle
> > > the failure here, do you have any suggestions?
> > 
> > It is out of memory, I think it is fine to do whatever to leave queue state
> > intact instead of making it `partial workable`, such as:
> > 
> > - try update nr_hw_queues to 1
> > 
> > - if it still fails, delete disk & mark queue as dead if disk is attached
> > 
> 
> If we ignore these non-critical sysfs creation failures, the disk remains
> usable with no loss of functionality. Deleting the disk seems to escalate
> the error?

Ignoring the sysfs register failure is more like a workaround. And if the
issue needs to be fixed in this way, you have to document it.

In case of OOM, it usually means that the system isn't usable any more.
But this is a NOIO allocation, and the typical use case is error recovery in
nvme pci, so it may be only the NOIO allocation that cannot find enough
pages. Is that the reason for ignoring the sysfs register failure in
blk_mq_update_nr_hw_queues()?

But NVMe has been pretty fragile in this area by using non-owner queue
freeze and calling blk_mq_update_nr_hw_queues() on a frozen queue, so is it
really necessary to take it into account?

Thanks,
Ming

Re: [PATCH] blk-mq: check kobject state_in_sysfs before deleting in blk_mq_unregister_hctx

Posted by Li Nan 5 months, 2 weeks ago


On 2025/8/27 16:10, Ming Lei wrote:
> On Wed, Aug 27, 2025 at 11:22:06AM +0800, Li Nan wrote:
>>
>>
>> On 2025/8/27 9:35, Ming Lei wrote:
>>> On Wed, Aug 27, 2025 at 09:04:45AM +0800, Yu Kuai wrote:
>>>> Hi,
>>>>
>>>> On 2025/08/27 8:58, Ming Lei wrote:
>>>>> On Tue, Aug 26, 2025 at 04:48:54PM +0800, linan666@huaweicloud.com wrote:
>>>>>> From: Li Nan <linan122@huawei.com>
>>>>>>
>>>>>> In __blk_mq_update_nr_hw_queues() the return value of
>>>>>> blk_mq_sysfs_register_hctxs() is not checked. If sysfs creation for hctx
>>>>>
>>>>> Looks we should check its return value and handle the failure in both
>>>>> the call site and blk_mq_sysfs_register_hctxs().
>>>>
>>>>   From __blk_mq_update_nr_hw_queues(), the old hctxs is already
>>>> unregistered, and this function is void, we failed to register new hctxs
>>>> because of memory allocation failure. I really don't know how to handle
>>>> the failure here, do you have any suggestions?
>>>
>>> It is out of memory, I think it is fine to do whatever to leave queue state
>>> intact instead of making it `partial workable`, such as:
>>>
>>> - try update nr_hw_queues to 1
>>>
>>> - if it still fails, delete disk & mark queue as dead if disk is attached
>>>
>>
>> If we ignore these non-critical sysfs creation failures, the disk remains
>> usable with no loss of functionality. Deleting the disk seems to escalate
>> the error?
> 
> It is more like a workaround by ignoring the sysfs register failure. And if
> the issue need to be fixed in this way, you have to document it.
>
> In case of OOM, it usually means that the system isn't usable any more.
> But it is NOIO allocation and the typical use case is for error recovery in
> nvme pci, so there may not be enough pages for noio allocation only. That is
> the reason for ignoring sysfs register in blk_mq_update_nr_hw_queues()?
> 
> But NVMe has been pretty fragile in this area by using non-owner queue
> freeze, and call blk_mq_update_nr_hw_queues() on frozen queue, so it is
> really necessary to take it into account?

I agree with your points about NOIO and NVMe.

I hit this issue in null_blk during fuzz testing with memory-fault
injection. Changing the number of hardware queues under OOM is extremely 
rare in real-world usage. So I think adding a workaround and documenting it
is sufficient. What do you think?

> 
> Thanks,
> Ming
> 


-- 
Thanks,
Nan

Re: [PATCH] blk-mq: check kobject state_in_sysfs before deleting in blk_mq_unregister_hctx

Posted by Jens Axboe 5 months, 2 weeks ago

On 8/28/25 3:28 AM, Li Nan wrote:
> 
> 
> On 2025/8/27 16:10, Ming Lei wrote:
>> On Wed, Aug 27, 2025 at 11:22:06AM +0800, Li Nan wrote:
>>>
>>>
>>> On 2025/8/27 9:35, Ming Lei wrote:
>>>> On Wed, Aug 27, 2025 at 09:04:45AM +0800, Yu Kuai wrote:
>>>>> Hi,
>>>>>
>>>>> On 2025/08/27 8:58, Ming Lei wrote:
>>>>>> On Tue, Aug 26, 2025 at 04:48:54PM +0800, linan666@huaweicloud.com wrote:
>>>>>>> From: Li Nan <linan122@huawei.com>
>>>>>>>
>>>>>>> In __blk_mq_update_nr_hw_queues() the return value of
>>>>>>> blk_mq_sysfs_register_hctxs() is not checked. If sysfs creation for hctx
>>>>>>
>>>>>> Looks we should check its return value and handle the failure in both
>>>>>> the call site and blk_mq_sysfs_register_hctxs().
>>>>>
>>>>>   From __blk_mq_update_nr_hw_queues(), the old hctxs is already
>>>>> unregistered, and this function is void, we failed to register new hctxs
>>>>> because of memory allocation failure. I really don't know how to handle
>>>>> the failure here, do you have any suggestions?
>>>>
>>>> It is out of memory, I think it is fine to do whatever to leave queue state
>>>> intact instead of making it `partial workable`, such as:
>>>>
>>>> - try update nr_hw_queues to 1
>>>>
>>>> - if it still fails, delete disk & mark queue as dead if disk is attached
>>>>
>>>
>>> If we ignore these non-critical sysfs creation failures, the disk remains
>>> usable with no loss of functionality. Deleting the disk seems to escalate
>>> the error?
>>
>> It is more like a workaround by ignoring the sysfs register failure. And if
>> the issue need to be fixed in this way, you have to document it.
>>
>> In case of OOM, it usually means that the system isn't usable any more.
>> But it is NOIO allocation and the typical use case is for error recovery in
>> nvme pci, so there may not be enough pages for noio allocation only. That is
>> the reason for ignoring sysfs register in blk_mq_update_nr_hw_queues()?
>>
>> But NVMe has been pretty fragile in this area by using non-owner queue
>> freeze, and call blk_mq_update_nr_hw_queues() on frozen queue, so it is
>> really necessary to take it into account?
> 
> I agree with your points about NOIO and NVMe.
> 
> I hit this issue in null_blk during fuzz testing with memory-fault
> injection. Changing the number of hardware queues under OOM is
> extremely rare in real-world usage. So I think adding a workaround and
> documenting it is sufficient. What do you think?

Working around it is fine, as it isn't a situation we really need to
worry about. But let's please not do it by poking at kobject internals.

-- 
Jens Axboe

Re: [PATCH] blk-mq: check kobject state_in_sysfs before deleting in blk_mq_unregister_hctx

Posted by Yu Kuai 5 months, 2 weeks ago

Hi,

On 2025/08/29 1:23, Jens Axboe wrote:
> On 8/28/25 3:28 AM, Li Nan wrote:
>>
>>
>> On 2025/8/27 16:10, Ming Lei wrote:
>>> On Wed, Aug 27, 2025 at 11:22:06AM +0800, Li Nan wrote:
>>>>
>>>>
>>>> On 2025/8/27 9:35, Ming Lei wrote:
>>>>> On Wed, Aug 27, 2025 at 09:04:45AM +0800, Yu Kuai wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 2025/08/27 8:58, Ming Lei wrote:
>>>>>>> On Tue, Aug 26, 2025 at 04:48:54PM +0800, linan666@huaweicloud.com wrote:
>>>>>>>> From: Li Nan <linan122@huawei.com>
>>>>>>>>
>>>>>>>> In __blk_mq_update_nr_hw_queues() the return value of
>>>>>>>> blk_mq_sysfs_register_hctxs() is not checked. If sysfs creation for hctx
>>>>>>>
>>>>>>> Looks we should check its return value and handle the failure in both
>>>>>>> the call site and blk_mq_sysfs_register_hctxs().
>>>>>>
>>>>>>    From __blk_mq_update_nr_hw_queues(), the old hctxs is already
>>>>>> unregistered, and this function is void, we failed to register new hctxs
>>>>>> because of memory allocation failure. I really don't know how to handle
>>>>>> the failure here, do you have any suggestions?
>>>>>
>>>>> It is out of memory, I think it is fine to do whatever to leave queue state
>>>>> intact instead of making it `partial workable`, such as:
>>>>>
>>>>> - try update nr_hw_queues to 1
>>>>>
>>>>> - if it still fails, delete disk & mark queue as dead if disk is attached
>>>>>
>>>>
>>>> If we ignore these non-critical sysfs creation failures, the disk remains
>>>> usable with no loss of functionality. Deleting the disk seems to escalate
>>>> the error?
>>>
>>> It is more like a workaround by ignoring the sysfs register failure. And if
>>> the issue need to be fixed in this way, you have to document it.
>>>
>>> In case of OOM, it usually means that the system isn't usable any more.
>>> But it is NOIO allocation and the typical use case is for error recovery in
>>> nvme pci, so there may not be enough pages for noio allocation only. That is
>>> the reason for ignoring sysfs register in blk_mq_update_nr_hw_queues()?
>>>
>>> But NVMe has been pretty fragile in this area by using non-owner queue
>>> freeze, and call blk_mq_update_nr_hw_queues() on frozen queue, so it is
>>> really necessary to take it into account?
>>
>> I agree with your points about NOIO and NVMe.
>>
>> I hit this issue in null_blk during fuzz testing with memory-fault
>> injection. Changing the number of hardware queues under OOM is
>> extremely rare in real-world usage. So I think adding a workaround and
>> documenting it is sufficient. What do you think?
> 
> Working around it is fine, as it isn't a situation we really need to
> worry about. But let's please not do it by poking at kobject internals.
> 

This is already done in some places, such as sysfs_slab_unlink().

Or would we prefer to add a new hctx->state flag like BLK_MQ_S_REGISTERED?

Thanks,
Kuai
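
For comparison, the alternative raised above (a dedicated registration flag owned by the hctx rather than the kobject's own bit) could look roughly like the userspace sketch below. `BLK_MQ_S_REGISTERED` is only a proposal in this thread, so the flag value, the `mock_*` names, and the plain bitmask (instead of kernel bit operations) are all assumptions made for illustration.

```c
/* Userspace model only: all mock_* / MOCK_* names are hypothetical. */
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for the proposed flag; the bit position is an assumption. */
#define MOCK_S_REGISTERED (1UL << 0)

struct mock_hctx {
	unsigned long state;	/* flag word owned by the block layer */
	int sysfs_entries;	/* number of simulated sysfs entries */
};

/* Set the flag only after registration has actually succeeded. */
static int mock_register_hctx(struct mock_hctx *hctx, int simulate_failure)
{
	assert(hctx != NULL);
	if (simulate_failure)
		return -ENOMEM;
	hctx->sysfs_entries++;
	hctx->state |= MOCK_S_REGISTERED;
	return 0;
}

/* Teardown is a no-op unless the flag says registration happened. */
static void mock_unregister_hctx(struct mock_hctx *hctx)
{
	assert(hctx != NULL);
	if (!(hctx->state & MOCK_S_REGISTERED))
		return;
	hctx->sysfs_entries--;
	hctx->state &= ~MOCK_S_REGISTERED;
}
```

The trade-off is the one debated in the thread: a private flag avoids reading kobject internals, at the cost of a second piece of state that has to be kept in sync with what sysfs actually contains.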

Re: [PATCH] blk-mq: check kobject state_in_sysfs before deleting in blk_mq_unregister_hctx

Posted by Jens Axboe 5 months, 2 weeks ago

On 8/28/25 7:09 PM, Yu Kuai wrote:
> Hi,
> 
> On 2025/08/29 1:23, Jens Axboe wrote:
>> On 8/28/25 3:28 AM, Li Nan wrote:
>>>
>>>
>>> On 2025/8/27 16:10, Ming Lei wrote:
>>>> On Wed, Aug 27, 2025 at 11:22:06AM +0800, Li Nan wrote:
>>>>>
>>>>>
>>>>> On 2025/8/27 9:35, Ming Lei wrote:
>>>>>> On Wed, Aug 27, 2025 at 09:04:45AM +0800, Yu Kuai wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 2025/08/27 8:58, Ming Lei wrote:
>>>>>>>> On Tue, Aug 26, 2025 at 04:48:54PM +0800, linan666@huaweicloud.com wrote:
>>>>>>>>> From: Li Nan <linan122@huawei.com>
>>>>>>>>>
>>>>>>>>> In __blk_mq_update_nr_hw_queues() the return value of
>>>>>>>>> blk_mq_sysfs_register_hctxs() is not checked. If sysfs creation for hctx
>>>>>>>>
>>>>>>>> Looks we should check its return value and handle the failure in both
>>>>>>>> the call site and blk_mq_sysfs_register_hctxs().
>>>>>>>
>>>>>>>    From __blk_mq_update_nr_hw_queues(), the old hctxs is already
>>>>>>> unregistered, and this function is void, we failed to register new hctxs
>>>>>>> because of memory allocation failure. I really don't know how to handle
>>>>>>> the failure here, do you have any suggestions?
>>>>>>
>>>>>> It is out of memory, I think it is fine to do whatever to leave queue state
>>>>>> intact instead of making it `partial workable`, such as:
>>>>>>
>>>>>> - try update nr_hw_queues to 1
>>>>>>
>>>>>> - if it still fails, delete disk & mark queue as dead if disk is attached
>>>>>>
>>>>>
>>>>> If we ignore these non-critical sysfs creation failures, the disk remains
>>>>> usable with no loss of functionality. Deleting the disk seems to escalate
>>>>> the error?
>>>>
>>>> It is more like a workaround by ignoring the sysfs register failure. And if
>>>> the issue need to be fixed in this way, you have to document it.
>>>>
>>>> In case of OOM, it usually means that the system isn't usable any more.
>>>> But it is NOIO allocation and the typical use case is for error recovery in
>>>> nvme pci, so there may not be enough pages for noio allocation only. That is
>>>> the reason for ignoring sysfs register in blk_mq_update_nr_hw_queues()?
>>>>
>>>> But NVMe has been pretty fragile in this area by using non-owner queue
>>>> freeze, and call blk_mq_update_nr_hw_queues() on frozen queue, so it is
>>>> really necessary to take it into account?
>>>
>>> I agree with your points about NOIO and NVMe.
>>>
>>> I hit this issue in null_blk during fuzz testing with memory-fault
>>> injection. Changing the number of hardware queues under OOM is
>>> extremely rare in real-world usage. So I think adding a workaround and
>>> documenting it is sufficient. What do you think?
>>
>> Working around it is fine, as it isn't a situation we really need to
>> worry about. But let's please not do it by poking at kobject internals.
>>
> 
> This is already done in some places, such as sysfs_slab_unlink().
> 
> Or would we prefer to add a new hctx->state flag like BLK_MQ_S_REGISTERED?

If it's already used in a few spots, then I guess we should just be
using it as well rather than have a state around it. So I guess it's
fine. I'll just grab the patch.

-- 
Jens Axboe

Re: [PATCH] blk-mq: check kobject state_in_sysfs before deleting in blk_mq_unregister_hctx

Posted by Ming Lei 5 months, 2 weeks ago

On Thu, Aug 28, 2025 at 05:28:26PM +0800, Li Nan wrote:
> 
> 
> On 2025/8/27 16:10, Ming Lei wrote:
> > On Wed, Aug 27, 2025 at 11:22:06AM +0800, Li Nan wrote:
> > > 
> > > 
> > > On 2025/8/27 9:35, Ming Lei wrote:
> > > > On Wed, Aug 27, 2025 at 09:04:45AM +0800, Yu Kuai wrote:
> > > > > Hi,
> > > > > 
> > > > > On 2025/08/27 8:58, Ming Lei wrote:
> > > > > > On Tue, Aug 26, 2025 at 04:48:54PM +0800, linan666@huaweicloud.com wrote:
> > > > > > > From: Li Nan <linan122@huawei.com>
> > > > > > > 
> > > > > > > In __blk_mq_update_nr_hw_queues() the return value of
> > > > > > > blk_mq_sysfs_register_hctxs() is not checked. If sysfs creation for hctx
> > > > > > 
> > > > > > Looks we should check its return value and handle the failure in both
> > > > > > the call site and blk_mq_sysfs_register_hctxs().
> > > > > 
> > > > >   From __blk_mq_update_nr_hw_queues(), the old hctxs is already
> > > > > unregistered, and this function is void, we failed to register new hctxs
> > > > > because of memory allocation failure. I really don't know how to handle
> > > > > the failure here, do you have any suggestions?
> > > > 
> > > > It is out of memory, I think it is fine to do whatever to leave queue state
> > > > intact instead of making it `partial workable`, such as:
> > > > 
> > > > - try update nr_hw_queues to 1
> > > > 
> > > > - if it still fails, delete disk & mark queue as dead if disk is attached
> > > > 
> > > 
> > > If we ignore these non-critical sysfs creation failures, the disk remains
> > > usable with no loss of functionality. Deleting the disk seems to escalate
> > > the error?
> > 
> > It is more like a workaround by ignoring the sysfs register failure. And if
> > the issue need to be fixed in this way, you have to document it.
> >
> > In case of OOM, it usually means that the system isn't usable any more.
> > But it is NOIO allocation and the typical use case is for error recovery in
> > nvme pci, so there may not be enough pages for noio allocation only. That is
> > the reason for ignoring sysfs register in blk_mq_update_nr_hw_queues()?
> > 
> > But NVMe has been pretty fragile in this area by using non-owner queue
> > freeze, and call blk_mq_update_nr_hw_queues() on frozen queue, so it is
> > really necessary to take it into account?
> 
> I agree with your points about NOIO and NVMe.
> 
> I hit this issue in null_blk during fuzz testing with memory-fault
> injection. Changing the number of hardware queues under OOM is extremely
> rare in real-world usage. So I think adding a workaround and documenting it
> is sufficient. What do you think?

Looks fine to me.


Thanks, 
Ming

Re: [PATCH] blk-mq: check kobject state_in_sysfs before deleting in blk_mq_unregister_hctx

Posted by Yu Kuai 5 months, 2 weeks ago

On 2025/08/26 16:48, linan666@huaweicloud.com wrote:
> From: Li Nan <linan122@huawei.com>
> 
> In __blk_mq_update_nr_hw_queues() the return value of
> blk_mq_sysfs_register_hctxs() is not checked. If sysfs creation for hctx
> fails, later changing the number of hw_queues or removing disk will
> trigger the following warning:
> 
>    kernfs: can not remove 'nr_tags', no directory
>    WARNING: CPU: 2 PID: 637 at fs/kernfs/dir.c:1707 kernfs_remove_by_name_ns+0x13f/0x160
>    Call Trace:
>     remove_files.isra.1+0x38/0xb0
>     sysfs_remove_group+0x4d/0x100
>     sysfs_remove_groups+0x31/0x60
>     __kobject_del+0x23/0xf0
>     kobject_del+0x17/0x40
>     blk_mq_unregister_hctx+0x5d/0x80
>     blk_mq_sysfs_unregister_hctxs+0x94/0xd0
>     blk_mq_update_nr_hw_queues+0x124/0x760
>     nullb_update_nr_hw_queues+0x71/0xf0 [null_blk]
>     nullb_device_submit_queues_store+0x92/0x120 [null_blk]
> 
> kobject_del() was called unconditionally even if sysfs creation failed.
> Fix it by checking the kobject creation status before deleting it.
> 
> Fixes: 477e19dedc9d ("blk-mq: adjust debugfs and sysfs register when updating nr_hw_queues")
> Signed-off-by: Li Nan <linan122@huawei.com>
> ---
>   block/blk-mq-sysfs.c | 6 ++++--
>   1 file changed, 4 insertions(+), 2 deletions(-)
> 
LGTM
Reviewed-by: Yu Kuai <yukuai3@huawei.com>