From: Yu Kuai <yukuai3@huawei.com>
request_queue->nr_requests can be changed by:
a) switching elevator by update nr_hw_queues
b) switching elevator by elevator sysfs attribute
c) configue queue sysfs attribute nr_requests
Current lock order is:
1) update_nr_hwq_lock, case a,b
2) freeze_queue
3) elevator_lock, cas a,b,c
And update nr_requests is seriablized by elevator_lock() already,
however, in the case c), we'll have to allocate new sched_tags if
nr_requests grow, and do this with elevator_lock held and queue
freezed has the risk of deadlock.
Hence use update_nr_hwq_lock instead, make it possible to allocate
memory if tags grow, meanwhile also prevent nr_requests to be changed
concurrently.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-sysfs.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index f99519f7a820..7ea15bf68b4b 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -68,13 +68,14 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
int ret, err;
unsigned int memflags;
struct request_queue *q = disk->queue;
+ struct blk_mq_tag_set *set = q->tag_set;
ret = queue_var_store(&nr, page, count);
if (ret < 0)
return ret;
- memflags = blk_mq_freeze_queue(q);
- mutex_lock(&q->elevator_lock);
+ /* serialize updating nr_requests with switching elevator */
+ down_write(&set->update_nr_hwq_lock);
if (nr == q->nr_requests)
goto unlock;
@@ -89,13 +90,18 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
goto unlock;
}
+ memflags = blk_mq_freeze_queue(q);
+ mutex_lock(&q->elevator_lock);
+
err = blk_mq_update_nr_requests(disk->queue, nr);
if (err)
ret = err;
-unlock:
mutex_unlock(&q->elevator_lock);
blk_mq_unfreeze_queue(q, memflags);
+
+unlock:
+ up_write(&set->update_nr_hwq_lock);
return ret;
}
--
2.39.2
On 9/8/25 11:45 AM, Yu Kuai wrote: > From: Yu Kuai <yukuai3@huawei.com> > > request_queue->nr_requests can be changed by: > > a) switching elevator by update nr_hw_queues > b) switching elevator by elevator sysfs attribute > c) configue queue sysfs attribute nr_requests > > Current lock order is: > > 1) update_nr_hwq_lock, case a,b > 2) freeze_queue > 3) elevator_lock, cas a,b,c > > And update nr_requests is seriablized by elevator_lock() already, > however, in the case c), we'll have to allocate new sched_tags if > nr_requests grow, and do this with elevator_lock held and queue > freezed has the risk of deadlock. > > Hence use update_nr_hwq_lock instead, make it possible to allocate > memory if tags grow, meanwhile also prevent nr_requests to be changed > concurrently. > > Signed-off-by: Yu Kuai <yukuai3@huawei.com> > --- > block/blk-sysfs.c | 12 +++++++++--- > 1 file changed, 9 insertions(+), 3 deletions(-) > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c > index f99519f7a820..7ea15bf68b4b 100644 > --- a/block/blk-sysfs.c > +++ b/block/blk-sysfs.c > @@ -68,13 +68,14 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count) > int ret, err; > unsigned int memflags; > struct request_queue *q = disk->queue; > + struct blk_mq_tag_set *set = q->tag_set; > > ret = queue_var_store(&nr, page, count); > if (ret < 0) > return ret; > > - memflags = blk_mq_freeze_queue(q); > - mutex_lock(&q->elevator_lock); > + /* serialize updating nr_requests with switching elevator */ > + down_write(&set->update_nr_hwq_lock); > > if (nr == q->nr_requests) > goto unlock; > @@ -89,13 +90,18 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count) > goto unlock; > } > > + memflags = blk_mq_freeze_queue(q); > + mutex_lock(&q->elevator_lock); > + > err = blk_mq_update_nr_requests(disk->queue, nr); > if (err) > ret = err; > > -unlock: > mutex_unlock(&q->elevator_lock); > blk_mq_unfreeze_queue(q, memflags); > + > +unlock: > + up_write(&set->update_nr_hwq_lock); > return ret; > } > As you moved ->elevtor_lock here and thus directly access q->elevator without holding ->elevator_lock, please add a comment explaining why is it safe to access q->elevator without holding ->elevator_lock. So with that change, please add: Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
On 9/8/25 11:45 AM, Yu Kuai wrote: > From: Yu Kuai <yukuai3@huawei.com> > > request_queue->nr_requests can be changed by: > > a) switching elevator by update nr_hw_queues > b) switching elevator by elevator sysfs attribute > c) configue queue sysfs attribute nr_requests > > Current lock order is: > > 1) update_nr_hwq_lock, case a,b > 2) freeze_queue > 3) elevator_lock, cas a,b,c > > And update nr_requests is seriablized by elevator_lock() already, > however, in the case c), we'll have to allocate new sched_tags if > nr_requests grow, and do this with elevator_lock held and queue > freezed has the risk of deadlock. > > Hence use update_nr_hwq_lock instead, make it possible to allocate > memory if tags grow, meanwhile also prevent nr_requests to be changed > concurrently. > > Signed-off-by: Yu Kuai <yukuai3@huawei.com> > --- > block/blk-sysfs.c | 12 +++++++++--- > 1 file changed, 9 insertions(+), 3 deletions(-) > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c > index f99519f7a820..7ea15bf68b4b 100644 > --- a/block/blk-sysfs.c > +++ b/block/blk-sysfs.c > @@ -68,13 +68,14 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count) > int ret, err; > unsigned int memflags; > struct request_queue *q = disk->queue; > + struct blk_mq_tag_set *set = q->tag_set; > > ret = queue_var_store(&nr, page, count); > if (ret < 0) > return ret; > > - memflags = blk_mq_freeze_queue(q); > - mutex_lock(&q->elevator_lock); > + /* serialize updating nr_requests with switching elevator */ > + down_write(&set->update_nr_hwq_lock); > For serializing update of nr_requests with switching elevator, we should use disable_elv_switch(). So with this change we don't need to acquire ->update_nr_hwq_lock in writer context while running blk_mq_update_nr_requests but instead it can run acquiring ->update_nr_hwq_lock in the reader context. So the code flow should be, disable_elv_switch => this would set QUEUE_FLAG_NO_ELV_SWITCH ... down_read ->update_nr_hwq_lock acquire ->freeze_lock acquire ->elevator_lock; ... ... release ->elevator_lock; release ->freeze_lock clear QUEUE_FLAG_NO_ELV_SWITCH up_read ->update_nr_hwq_lock Thanks, --Nilay
Hi, 在 2025/09/09 14:29, Nilay Shroff 写道: > > > On 9/8/25 11:45 AM, Yu Kuai wrote: >> From: Yu Kuai <yukuai3@huawei.com> >> >> request_queue->nr_requests can be changed by: >> >> a) switching elevator by update nr_hw_queues >> b) switching elevator by elevator sysfs attribute >> c) configue queue sysfs attribute nr_requests >> >> Current lock order is: >> >> 1) update_nr_hwq_lock, case a,b >> 2) freeze_queue >> 3) elevator_lock, cas a,b,c >> >> And update nr_requests is seriablized by elevator_lock() already, >> however, in the case c), we'll have to allocate new sched_tags if >> nr_requests grow, and do this with elevator_lock held and queue >> freezed has the risk of deadlock. >> >> Hence use update_nr_hwq_lock instead, make it possible to allocate >> memory if tags grow, meanwhile also prevent nr_requests to be changed >> concurrently. >> >> Signed-off-by: Yu Kuai <yukuai3@huawei.com> >> --- >> block/blk-sysfs.c | 12 +++++++++--- >> 1 file changed, 9 insertions(+), 3 deletions(-) >> >> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c >> index f99519f7a820..7ea15bf68b4b 100644 >> --- a/block/blk-sysfs.c >> +++ b/block/blk-sysfs.c >> @@ -68,13 +68,14 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count) >> int ret, err; >> unsigned int memflags; >> struct request_queue *q = disk->queue; >> + struct blk_mq_tag_set *set = q->tag_set; >> >> ret = queue_var_store(&nr, page, count); >> if (ret < 0) >> return ret; >> >> - memflags = blk_mq_freeze_queue(q); >> - mutex_lock(&q->elevator_lock); >> + /* serialize updating nr_requests with switching elevator */ >> + down_write(&set->update_nr_hwq_lock); >> > For serializing update of nr_requests with switching elevator, > we should use disable_elv_switch(). So with this change we > don't need to acquire ->update_nr_hwq_lock in writer context > while running blk_mq_update_nr_requests but instead it can run > acquiring ->update_nr_hwq_lock in the reader context. > > So the code flow should be, > > disable_elv_switch => this would set QUEUE_FLAG_NO_ELV_SWITCH > ... > down_read ->update_nr_hwq_lock > acquire ->freeze_lock > acquire ->elevator_lock; > ... > ... > release ->elevator_lock; > release ->freeze_lock > > clear QUEUE_FLAG_NO_ELV_SWITCH > up_read ->update_nr_hwq_lock > Yes, this make sense, however, there is also an implied condition that we should serialize queue_requests_store() with itself, what if a concurrent caller succeed the disable_elv_switch() before the down_read() in this way? t1: disable_elv_switch t2: disable_elv_switch down_read down_read Thanks, Kuai > Thanks, > --Nilay > > > . >
On 9/9/25 12:08 PM, Yu Kuai wrote: > Hi, > > 在 2025/09/09 14:29, Nilay Shroff 写道: >> >> >> On 9/8/25 11:45 AM, Yu Kuai wrote: >>> From: Yu Kuai <yukuai3@huawei.com> >>> >>> request_queue->nr_requests can be changed by: >>> >>> a) switching elevator by update nr_hw_queues >>> b) switching elevator by elevator sysfs attribute >>> c) configue queue sysfs attribute nr_requests >>> >>> Current lock order is: >>> >>> 1) update_nr_hwq_lock, case a,b >>> 2) freeze_queue >>> 3) elevator_lock, cas a,b,c >>> >>> And update nr_requests is seriablized by elevator_lock() already, >>> however, in the case c), we'll have to allocate new sched_tags if >>> nr_requests grow, and do this with elevator_lock held and queue >>> freezed has the risk of deadlock. >>> >>> Hence use update_nr_hwq_lock instead, make it possible to allocate >>> memory if tags grow, meanwhile also prevent nr_requests to be changed >>> concurrently. >>> >>> Signed-off-by: Yu Kuai <yukuai3@huawei.com> >>> --- >>> block/blk-sysfs.c | 12 +++++++++--- >>> 1 file changed, 9 insertions(+), 3 deletions(-) >>> >>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c >>> index f99519f7a820..7ea15bf68b4b 100644 >>> --- a/block/blk-sysfs.c >>> +++ b/block/blk-sysfs.c >>> @@ -68,13 +68,14 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count) >>> int ret, err; >>> unsigned int memflags; >>> struct request_queue *q = disk->queue; >>> + struct blk_mq_tag_set *set = q->tag_set; >>> ret = queue_var_store(&nr, page, count); >>> if (ret < 0) >>> return ret; >>> - memflags = blk_mq_freeze_queue(q); >>> - mutex_lock(&q->elevator_lock); >>> + /* serialize updating nr_requests with switching elevator */ >>> + down_write(&set->update_nr_hwq_lock); >>> >> For serializing update of nr_requests with switching elevator, >> we should use disable_elv_switch(). So with this change we >> don't need to acquire ->update_nr_hwq_lock in writer context >> while running blk_mq_update_nr_requests but instead it can run >> acquiring ->update_nr_hwq_lock in the reader context. >> >> So the code flow should be, >> >> disable_elv_switch => this would set QUEUE_FLAG_NO_ELV_SWITCH >> ... >> down_read ->update_nr_hwq_lock >> acquire ->freeze_lock >> acquire ->elevator_lock; >> ... >> ... >> release ->elevator_lock; >> release ->freeze_lock >> >> clear QUEUE_FLAG_NO_ELV_SWITCH >> up_read ->update_nr_hwq_lock >> > > Yes, this make sense, however, there is also an implied condition that > we should serialize queue_requests_store() with itself, what if a > concurrent caller succeed the disable_elv_switch() before the > down_read() in this way? > > t1: > disable_elv_switch > t2: > disable_elv_switch > > down_read down_read > I believe this is already protected with the kernfs internal mutex locks. So you shouldn't be able to run two sysfs store operations concurrently on the same attribute file. Thanks, --Nilay
Hi, 在 2025/09/09 14:52, Nilay Shroff 写道: > > > On 9/9/25 12:08 PM, Yu Kuai wrote: >> Hi, >> >> 在 2025/09/09 14:29, Nilay Shroff 写道: >>> >>> >>> On 9/8/25 11:45 AM, Yu Kuai wrote: >>>> From: Yu Kuai <yukuai3@huawei.com> >>>> >>>> request_queue->nr_requests can be changed by: >>>> >>>> a) switching elevator by update nr_hw_queues >>>> b) switching elevator by elevator sysfs attribute >>>> c) configue queue sysfs attribute nr_requests >>>> >>>> Current lock order is: >>>> >>>> 1) update_nr_hwq_lock, case a,b >>>> 2) freeze_queue >>>> 3) elevator_lock, cas a,b,c >>>> >>>> And update nr_requests is seriablized by elevator_lock() already, >>>> however, in the case c), we'll have to allocate new sched_tags if >>>> nr_requests grow, and do this with elevator_lock held and queue >>>> freezed has the risk of deadlock. >>>> >>>> Hence use update_nr_hwq_lock instead, make it possible to allocate >>>> memory if tags grow, meanwhile also prevent nr_requests to be changed >>>> concurrently. >>>> >>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com> >>>> --- >>>> block/blk-sysfs.c | 12 +++++++++--- >>>> 1 file changed, 9 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c >>>> index f99519f7a820..7ea15bf68b4b 100644 >>>> --- a/block/blk-sysfs.c >>>> +++ b/block/blk-sysfs.c >>>> @@ -68,13 +68,14 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count) >>>> int ret, err; >>>> unsigned int memflags; >>>> struct request_queue *q = disk->queue; >>>> + struct blk_mq_tag_set *set = q->tag_set; >>>> ret = queue_var_store(&nr, page, count); >>>> if (ret < 0) >>>> return ret; >>>> - memflags = blk_mq_freeze_queue(q); >>>> - mutex_lock(&q->elevator_lock); >>>> + /* serialize updating nr_requests with switching elevator */ >>>> + down_write(&set->update_nr_hwq_lock); >>>> >>> For serializing update of nr_requests with switching elevator, >>> we should use disable_elv_switch(). So with this change we >>> don't need to acquire ->update_nr_hwq_lock in writer context >>> while running blk_mq_update_nr_requests but instead it can run >>> acquiring ->update_nr_hwq_lock in the reader context. >>> >>> So the code flow should be, >>> >>> disable_elv_switch => this would set QUEUE_FLAG_NO_ELV_SWITCH >>> ... >>> down_read ->update_nr_hwq_lock >>> acquire ->freeze_lock >>> acquire ->elevator_lock; >>> ... >>> ... >>> release ->elevator_lock; >>> release ->freeze_lock >>> >>> clear QUEUE_FLAG_NO_ELV_SWITCH >>> up_read ->update_nr_hwq_lock >>> >> >> Yes, this make sense, however, there is also an implied condition that >> we should serialize queue_requests_store() with itself, what if a >> concurrent caller succeed the disable_elv_switch() before the >> down_read() in this way? >> >> t1: >> disable_elv_switch >> t2: >> disable_elv_switch >> >> down_read down_read >> > I believe this is already protected with the kernfs internal > mutex locks. So you shouldn't be able to run two sysfs store > operations concurrently on the same attribute file. > There really is no such internal lock, the call stack is: kernfs_fop_write_iter sysfs_kf_write queue_attr_store There is only a file level mutex kernfs_open_file->lock from the top function kernfs_fop_write_iter(), however, this lock is not the same if we open the same attribute file from different context. Thanks, Kuai > Thanks, > --Nilay > > . >
On 9/9/25 12:46 PM, Yu Kuai wrote: > Hi, > > 在 2025/09/09 14:52, Nilay Shroff 写道: >> >> >> On 9/9/25 12:08 PM, Yu Kuai wrote: >>> Hi, >>> >>> 在 2025/09/09 14:29, Nilay Shroff 写道: >>>> >>>> >>>> On 9/8/25 11:45 AM, Yu Kuai wrote: >>>>> From: Yu Kuai <yukuai3@huawei.com> >>>>> >>>>> request_queue->nr_requests can be changed by: >>>>> >>>>> a) switching elevator by update nr_hw_queues >>>>> b) switching elevator by elevator sysfs attribute >>>>> c) configue queue sysfs attribute nr_requests >>>>> >>>>> Current lock order is: >>>>> >>>>> 1) update_nr_hwq_lock, case a,b >>>>> 2) freeze_queue >>>>> 3) elevator_lock, cas a,b,c >>>>> >>>>> And update nr_requests is seriablized by elevator_lock() already, >>>>> however, in the case c), we'll have to allocate new sched_tags if >>>>> nr_requests grow, and do this with elevator_lock held and queue >>>>> freezed has the risk of deadlock. >>>>> >>>>> Hence use update_nr_hwq_lock instead, make it possible to allocate >>>>> memory if tags grow, meanwhile also prevent nr_requests to be changed >>>>> concurrently. >>>>> >>>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com> >>>>> --- >>>>> block/blk-sysfs.c | 12 +++++++++--- >>>>> 1 file changed, 9 insertions(+), 3 deletions(-) >>>>> >>>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c >>>>> index f99519f7a820..7ea15bf68b4b 100644 >>>>> --- a/block/blk-sysfs.c >>>>> +++ b/block/blk-sysfs.c >>>>> @@ -68,13 +68,14 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count) >>>>> int ret, err; >>>>> unsigned int memflags; >>>>> struct request_queue *q = disk->queue; >>>>> + struct blk_mq_tag_set *set = q->tag_set; >>>>> ret = queue_var_store(&nr, page, count); >>>>> if (ret < 0) >>>>> return ret; >>>>> - memflags = blk_mq_freeze_queue(q); >>>>> - mutex_lock(&q->elevator_lock); >>>>> + /* serialize updating nr_requests with switching elevator */ >>>>> + down_write(&set->update_nr_hwq_lock); >>>>> >>>> For serializing update of nr_requests with switching elevator, >>>> we should use disable_elv_switch(). So with this change we >>>> don't need to acquire ->update_nr_hwq_lock in writer context >>>> while running blk_mq_update_nr_requests but instead it can run >>>> acquiring ->update_nr_hwq_lock in the reader context. >>>> >>>> So the code flow should be, >>>> >>>> disable_elv_switch => this would set QUEUE_FLAG_NO_ELV_SWITCH >>>> ... >>>> down_read ->update_nr_hwq_lock >>>> acquire ->freeze_lock >>>> acquire ->elevator_lock; >>>> ... >>>> ... >>>> release ->elevator_lock; >>>> release ->freeze_lock >>>> >>>> clear QUEUE_FLAG_NO_ELV_SWITCH >>>> up_read ->update_nr_hwq_lock >>>> >>> >>> Yes, this make sense, however, there is also an implied condition that >>> we should serialize queue_requests_store() with itself, what if a >>> concurrent caller succeed the disable_elv_switch() before the >>> down_read() in this way? >>> >>> t1: >>> disable_elv_switch >>> t2: >>> disable_elv_switch >>> >>> down_read down_read >>> >> I believe this is already protected with the kernfs internal >> mutex locks. So you shouldn't be able to run two sysfs store >> operations concurrently on the same attribute file. >> > > There really is no such internal lock, the call stack is: > > kernfs_fop_write_iter > sysfs_kf_write > queue_attr_store > > There is only a file level mutex kernfs_open_file->lock from the top > function kernfs_fop_write_iter(), however, this lock is not the same > if we open the same attribute file from different context. > Oh yes this lock only protects if the same fd is being written concurrently. However if we open the same sysfs file from different process contexts then fd would be different and so this lock wouldn't protect the simultaneous update of sysfs attribute. Having said that, looking through the code again it seems that q->nr_requests update is protected with ->elevator_lock (including both the elevator switch code and your proposed changes in this patchset). So my question is do we really need to synchronize nr_requests update code with elevator swiupdate_nr_hwq_locktch code? So in another words what if we don't at all use ->update_nr_hwq_lock in queue_requests_store? Thanks --Nilay
Hi, 在 2025/09/09 17:26, Nilay Shroff 写道: > > > On 9/9/25 12:46 PM, Yu Kuai wrote: >> Hi, >> >> 在 2025/09/09 14:52, Nilay Shroff 写道: >>> >>> >>> On 9/9/25 12:08 PM, Yu Kuai wrote: >>>> Hi, >>>> >>>> 在 2025/09/09 14:29, Nilay Shroff 写道: >>>>> >>>>> >>>>> On 9/8/25 11:45 AM, Yu Kuai wrote: >>>>>> From: Yu Kuai <yukuai3@huawei.com> >>>>>> >>>>>> request_queue->nr_requests can be changed by: >>>>>> >>>>>> a) switching elevator by update nr_hw_queues >>>>>> b) switching elevator by elevator sysfs attribute >>>>>> c) configue queue sysfs attribute nr_requests >>>>>> >>>>>> Current lock order is: >>>>>> >>>>>> 1) update_nr_hwq_lock, case a,b >>>>>> 2) freeze_queue >>>>>> 3) elevator_lock, cas a,b,c >>>>>> >>>>>> And update nr_requests is seriablized by elevator_lock() already, >>>>>> however, in the case c), we'll have to allocate new sched_tags if >>>>>> nr_requests grow, and do this with elevator_lock held and queue >>>>>> freezed has the risk of deadlock. >>>>>> >>>>>> Hence use update_nr_hwq_lock instead, make it possible to allocate >>>>>> memory if tags grow, meanwhile also prevent nr_requests to be changed >>>>>> concurrently. >>>>>> >>>>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com> >>>>>> --- >>>>>> block/blk-sysfs.c | 12 +++++++++--- >>>>>> 1 file changed, 9 insertions(+), 3 deletions(-) >>>>>> >>>>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c >>>>>> index f99519f7a820..7ea15bf68b4b 100644 >>>>>> --- a/block/blk-sysfs.c >>>>>> +++ b/block/blk-sysfs.c >>>>>> @@ -68,13 +68,14 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count) >>>>>> int ret, err; >>>>>> unsigned int memflags; >>>>>> struct request_queue *q = disk->queue; >>>>>> + struct blk_mq_tag_set *set = q->tag_set; >>>>>> ret = queue_var_store(&nr, page, count); >>>>>> if (ret < 0) >>>>>> return ret; >>>>>> - memflags = blk_mq_freeze_queue(q); >>>>>> - mutex_lock(&q->elevator_lock); >>>>>> + /* serialize updating nr_requests with switching elevator */ >>>>>> + down_write(&set->update_nr_hwq_lock); >>>>>> >>>>> For serializing update of nr_requests with switching elevator, >>>>> we should use disable_elv_switch(). So with this change we >>>>> don't need to acquire ->update_nr_hwq_lock in writer context >>>>> while running blk_mq_update_nr_requests but instead it can run >>>>> acquiring ->update_nr_hwq_lock in the reader context. >>>>> >>>>> So the code flow should be, >>>>> >>>>> disable_elv_switch => this would set QUEUE_FLAG_NO_ELV_SWITCH >>>>> ... >>>>> down_read ->update_nr_hwq_lock >>>>> acquire ->freeze_lock >>>>> acquire ->elevator_lock; >>>>> ... >>>>> ... >>>>> release ->elevator_lock; >>>>> release ->freeze_lock >>>>> >>>>> clear QUEUE_FLAG_NO_ELV_SWITCH >>>>> up_read ->update_nr_hwq_lock >>>>> >>>> >>>> Yes, this make sense, however, there is also an implied condition that >>>> we should serialize queue_requests_store() with itself, what if a >>>> concurrent caller succeed the disable_elv_switch() before the >>>> down_read() in this way? >>>> >>>> t1: >>>> disable_elv_switch >>>> t2: >>>> disable_elv_switch >>>> >>>> down_read down_read >>>> >>> I believe this is already protected with the kernfs internal >>> mutex locks. So you shouldn't be able to run two sysfs store >>> operations concurrently on the same attribute file. >>> >> >> There really is no such internal lock, the call stack is: >> >> kernfs_fop_write_iter >> sysfs_kf_write >> queue_attr_store >> >> There is only a file level mutex kernfs_open_file->lock from the top >> function kernfs_fop_write_iter(), however, this lock is not the same >> if we open the same attribute file from different context. >> > Oh yes this lock only protects if the same fd is being written > concurrently. However if we open the same sysfs file from different process > contexts then fd would be different and so this lock wouldn't protect > the simultaneous update of sysfs attribute. Having said that, > looking through the code again it seems that q->nr_requests update > is protected with ->elevator_lock (including both the elevator switch > code and your proposed changes in this patchset). So my question is > do we really need to synchronize nr_requests update code with elevator > swiupdate_nr_hwq_locktch code? So in another words what if we don't at > all use ->update_nr_hwq_lock in queue_requests_store? 1) lock update_nr_hwq_lock, then no one can change nr_queuests 2) checking input nr_reqeusts 3) if grow, allocate memory Main idea here is we can checking if nr_requests grow and then allocate mermory, without concern that nr_requests can be changed after memory allocation. BTW, I think this sysfs attr is really a slow path, and it's fine to grab the write lock. Thanks, Kuai > > Thanks > --Nilay > . >
On 9/9/25 3:06 PM, Yu Kuai wrote: > Hi, > > 在 2025/09/09 17:26, Nilay Shroff 写道: >> >> >> On 9/9/25 12:46 PM, Yu Kuai wrote: >>> Hi, >>> >>> 在 2025/09/09 14:52, Nilay Shroff 写道: >>>> >>>> >>>> On 9/9/25 12:08 PM, Yu Kuai wrote: >>>>> Hi, >>>>> >>>>> 在 2025/09/09 14:29, Nilay Shroff 写道: >>>>>> >>>>>> >>>>>> On 9/8/25 11:45 AM, Yu Kuai wrote: >>>>>>> From: Yu Kuai <yukuai3@huawei.com> >>>>>>> >>>>>>> request_queue->nr_requests can be changed by: >>>>>>> >>>>>>> a) switching elevator by update nr_hw_queues >>>>>>> b) switching elevator by elevator sysfs attribute >>>>>>> c) configue queue sysfs attribute nr_requests >>>>>>> >>>>>>> Current lock order is: >>>>>>> >>>>>>> 1) update_nr_hwq_lock, case a,b >>>>>>> 2) freeze_queue >>>>>>> 3) elevator_lock, cas a,b,c >>>>>>> >>>>>>> And update nr_requests is seriablized by elevator_lock() already, >>>>>>> however, in the case c), we'll have to allocate new sched_tags if >>>>>>> nr_requests grow, and do this with elevator_lock held and queue >>>>>>> freezed has the risk of deadlock. >>>>>>> >>>>>>> Hence use update_nr_hwq_lock instead, make it possible to allocate >>>>>>> memory if tags grow, meanwhile also prevent nr_requests to be changed >>>>>>> concurrently. >>>>>>> >>>>>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com> >>>>>>> --- >>>>>>> block/blk-sysfs.c | 12 +++++++++--- >>>>>>> 1 file changed, 9 insertions(+), 3 deletions(-) >>>>>>> >>>>>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c >>>>>>> index f99519f7a820..7ea15bf68b4b 100644 >>>>>>> --- a/block/blk-sysfs.c >>>>>>> +++ b/block/blk-sysfs.c >>>>>>> @@ -68,13 +68,14 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count) >>>>>>> int ret, err; >>>>>>> unsigned int memflags; >>>>>>> struct request_queue *q = disk->queue; >>>>>>> + struct blk_mq_tag_set *set = q->tag_set; >>>>>>> ret = queue_var_store(&nr, page, count); >>>>>>> if (ret < 0) >>>>>>> return ret; >>>>>>> - memflags = blk_mq_freeze_queue(q); >>>>>>> - mutex_lock(&q->elevator_lock); >>>>>>> + /* serialize updating nr_requests with switching elevator */ >>>>>>> + down_write(&set->update_nr_hwq_lock); >>>>>>> >>>>>> For serializing update of nr_requests with switching elevator, >>>>>> we should use disable_elv_switch(). So with this change we >>>>>> don't need to acquire ->update_nr_hwq_lock in writer context >>>>>> while running blk_mq_update_nr_requests but instead it can run >>>>>> acquiring ->update_nr_hwq_lock in the reader context. >>>>>> >>>>>> So the code flow should be, >>>>>> >>>>>> disable_elv_switch => this would set QUEUE_FLAG_NO_ELV_SWITCH >>>>>> ... >>>>>> down_read ->update_nr_hwq_lock >>>>>> acquire ->freeze_lock >>>>>> acquire ->elevator_lock; >>>>>> ... >>>>>> ... >>>>>> release ->elevator_lock; >>>>>> release ->freeze_lock >>>>>> >>>>>> clear QUEUE_FLAG_NO_ELV_SWITCH >>>>>> up_read ->update_nr_hwq_lock >>>>>> >>>>> >>>>> Yes, this make sense, however, there is also an implied condition that >>>>> we should serialize queue_requests_store() with itself, what if a >>>>> concurrent caller succeed the disable_elv_switch() before the >>>>> down_read() in this way? >>>>> >>>>> t1: >>>>> disable_elv_switch >>>>> t2: >>>>> disable_elv_switch >>>>> >>>>> down_read down_read >>>>> >>>> I believe this is already protected with the kernfs internal >>>> mutex locks. So you shouldn't be able to run two sysfs store >>>> operations concurrently on the same attribute file. >>>> >>> >>> There really is no such internal lock, the call stack is: >>> >>> kernfs_fop_write_iter >>> sysfs_kf_write >>> queue_attr_store >>> >>> There is only a file level mutex kernfs_open_file->lock from the top >>> function kernfs_fop_write_iter(), however, this lock is not the same >>> if we open the same attribute file from different context. >>> >> Oh yes this lock only protects if the same fd is being written >> concurrently. However if we open the same sysfs file from different process >> contexts then fd would be different and so this lock wouldn't protect >> the simultaneous update of sysfs attribute. Having said that, >> looking through the code again it seems that q->nr_requests update >> is protected with ->elevator_lock (including both the elevator switch >> code and your proposed changes in this patchset). So my question is >> do we really need to synchronize nr_requests update code with elevator >> swiupdate_nr_hwq_locktch code? So in another words what if we don't at >> all use ->update_nr_hwq_lock in queue_requests_store? > > 1) lock update_nr_hwq_lock, then no one can change nr_queuests > 2) checking input nr_reqeusts > 3) if grow, allocate memory > > Main idea here is we can checking if nr_requests grow and then allocate > mermory, without concern that nr_requests can be changed after memory > allocation. > If nr_requests changes after memory allocation we're still good because eventually we'd only have one consistent value of nr_requests. For instance, if process A is updating nr_requests to 128 and sched switch is updating nr_requests to 256 simultaneously then we'd either see nr_requests set to 128 or 256 in the end depending on who runs last. We wouldn't get into a situation where we find some inconsistent update to nr_requests, isn't it? > BTW, I think this sysfs attr is really a slow path, and it's fine to > grab the write lock. > Yep you're right. But I think we should avoid locks if possible. Thanks, --Nilay
Hi, 在 2025/9/9 18:11, Nilay Shroff 写道: > > On 9/9/25 3:06 PM, Yu Kuai wrote: >> Hi, >> >> 在 2025/09/09 17:26, Nilay Shroff 写道: >>> >>> On 9/9/25 12:46 PM, Yu Kuai wrote: >>>> Hi, >>>> >>>> 在 2025/09/09 14:52, Nilay Shroff 写道: >>>>> >>>>> On 9/9/25 12:08 PM, Yu Kuai wrote: >>>>>> Hi, >>>>>> >>>>>> 在 2025/09/09 14:29, Nilay Shroff 写道: >>>>>>> >>>>>>> On 9/8/25 11:45 AM, Yu Kuai wrote: >>>>>>>> From: Yu Kuai <yukuai3@huawei.com> >>>>>>>> >>>>>>>> request_queue->nr_requests can be changed by: >>>>>>>> >>>>>>>> a) switching elevator by update nr_hw_queues >>>>>>>> b) switching elevator by elevator sysfs attribute >>>>>>>> c) configue queue sysfs attribute nr_requests >>>>>>>> >>>>>>>> Current lock order is: >>>>>>>> >>>>>>>> 1) update_nr_hwq_lock, case a,b >>>>>>>> 2) freeze_queue >>>>>>>> 3) elevator_lock, cas a,b,c >>>>>>>> >>>>>>>> And update nr_requests is seriablized by elevator_lock() already, >>>>>>>> however, in the case c), we'll have to allocate new sched_tags if >>>>>>>> nr_requests grow, and do this with elevator_lock held and queue >>>>>>>> freezed has the risk of deadlock. >>>>>>>> >>>>>>>> Hence use update_nr_hwq_lock instead, make it possible to allocate >>>>>>>> memory if tags grow, meanwhile also prevent nr_requests to be changed >>>>>>>> concurrently. >>>>>>>> >>>>>>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com> >>>>>>>> --- >>>>>>>> block/blk-sysfs.c | 12 +++++++++--- >>>>>>>> 1 file changed, 9 insertions(+), 3 deletions(-) >>>>>>>> >>>>>>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c >>>>>>>> index f99519f7a820..7ea15bf68b4b 100644 >>>>>>>> --- a/block/blk-sysfs.c >>>>>>>> +++ b/block/blk-sysfs.c >>>>>>>> @@ -68,13 +68,14 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count) >>>>>>>> int ret, err; >>>>>>>> unsigned int memflags; >>>>>>>> struct request_queue *q = disk->queue; >>>>>>>> + struct blk_mq_tag_set *set = q->tag_set; >>>>>>>> ret = queue_var_store(&nr, page, count); >>>>>>>> if (ret < 0) >>>>>>>> return ret; >>>>>>>> - memflags = blk_mq_freeze_queue(q); >>>>>>>> - mutex_lock(&q->elevator_lock); >>>>>>>> + /* serialize updating nr_requests with switching elevator */ >>>>>>>> + down_write(&set->update_nr_hwq_lock); >>>>>>>> >>>>>>> For serializing update of nr_requests with switching elevator, >>>>>>> we should use disable_elv_switch(). So with this change we >>>>>>> don't need to acquire ->update_nr_hwq_lock in writer context >>>>>>> while running blk_mq_update_nr_requests but instead it can run >>>>>>> acquiring ->update_nr_hwq_lock in the reader context. >>>>>>> >>>>>>> So the code flow should be, >>>>>>> >>>>>>> disable_elv_switch => this would set QUEUE_FLAG_NO_ELV_SWITCH >>>>>>> ... >>>>>>> down_read ->update_nr_hwq_lock >>>>>>> acquire ->freeze_lock >>>>>>> acquire ->elevator_lock; >>>>>>> ... >>>>>>> ... >>>>>>> release ->elevator_lock; >>>>>>> release ->freeze_lock >>>>>>> >>>>>>> clear QUEUE_FLAG_NO_ELV_SWITCH >>>>>>> up_read ->update_nr_hwq_lock >>>>>>> >>>>>> Yes, this make sense, however, there is also an implied condition that >>>>>> we should serialize queue_requests_store() with itself, what if a >>>>>> concurrent caller succeed the disable_elv_switch() before the >>>>>> down_read() in this way? >>>>>> >>>>>> t1: >>>>>> disable_elv_switch >>>>>> t2: >>>>>> disable_elv_switch >>>>>> >>>>>> down_read down_read >>>>>> >>>>> I believe this is already protected with the kernfs internal >>>>> mutex locks. So you shouldn't be able to run two sysfs store >>>>> operations concurrently on the same attribute file. >>>>> >>>> There really is no such internal lock, the call stack is: >>>> >>>> kernfs_fop_write_iter >>>> sysfs_kf_write >>>> queue_attr_store >>>> >>>> There is only a file level mutex kernfs_open_file->lock from the top >>>> function kernfs_fop_write_iter(), however, this lock is not the same >>>> if we open the same attribute file from different context. >>>> >>> Oh yes this lock only protects if the same fd is being written >>> concurrently. However if we open the same sysfs file from different process >>> contexts then fd would be different and so this lock wouldn't protect >>> the simultaneous update of sysfs attribute. Having said that, >>> looking through the code again it seems that q->nr_requests update >>> is protected with ->elevator_lock (including both the elevator switch >>> code and your proposed changes in this patchset). So my question is >>> do we really need to synchronize nr_requests update code with elevator >>> swiupdate_nr_hwq_locktch code? So in another words what if we don't at >>> all use ->update_nr_hwq_lock in queue_requests_store? >> 1) lock update_nr_hwq_lock, then no one can change nr_queuests >> 2) checking input nr_reqeusts >> 3) if grow, allocate memory >> >> Main idea here is we can checking if nr_requests grow and then allocate >> mermory, without concern that nr_requests can be changed after memory >> allocation. >> > If nr_requests changes after memory allocation we're still good because > eventually we'd only have one consistent value of nr_requests. For > instance, if process A is updating nr_requests to 128 and sched switch > is updating nr_requests to 256 simultaneously then we'd either see > nr_requests set to 128 or 256 in the end depending on who runs last. > We wouldn't get into a situation where we find some inconsistent update > to nr_requests, isn't it? Then we'll have to allocate memory for every input nr_requests now, we don't know for sure if tag will grow in advance this way. And even with this, we still have to hold read lock before allocating memory, to prevent nr hctxs to change. >> BTW, I think this sysfs attr is really a slow path, and it's fine to >> grab the write lock. >> > Yep you're right. But I think we should avoid locks if possible. Yes, we should avoid locks, but we should also keep code as simple as possible, I think. Thanks, Kuai > > Thanks, > --Nilay > >
On 9/9/25 4:12 PM, Yu Kuai wrote: > Hi, > > 在 2025/9/9 18:11, Nilay Shroff 写道: >> >> On 9/9/25 3:06 PM, Yu Kuai wrote: >>> Hi, >>> >>> 在 2025/09/09 17:26, Nilay Shroff 写道: >>>> >>>> On 9/9/25 12:46 PM, Yu Kuai wrote: >>>>> Hi, >>>>> >>>>> 在 2025/09/09 14:52, Nilay Shroff 写道: >>>>>> >>>>>> On 9/9/25 12:08 PM, Yu Kuai wrote: >>>>>>> Hi, >>>>>>> >>>>>>> 在 2025/09/09 14:29, Nilay Shroff 写道: >>>>>>>> >>>>>>>> On 9/8/25 11:45 AM, Yu Kuai wrote: >>>>>>>>> From: Yu Kuai <yukuai3@huawei.com> >>>>>>>>> >>>>>>>>> request_queue->nr_requests can be changed by: >>>>>>>>> >>>>>>>>> a) switching elevator by update nr_hw_queues >>>>>>>>> b) switching elevator by elevator sysfs attribute >>>>>>>>> c) configue queue sysfs attribute nr_requests >>>>>>>>> >>>>>>>>> Current lock order is: >>>>>>>>> >>>>>>>>> 1) update_nr_hwq_lock, case a,b >>>>>>>>> 2) freeze_queue >>>>>>>>> 3) elevator_lock, cas a,b,c >>>>>>>>> >>>>>>>>> And update nr_requests is seriablized by elevator_lock() already, >>>>>>>>> however, in the case c), we'll have to allocate new sched_tags if >>>>>>>>> nr_requests grow, and do this with elevator_lock held and queue >>>>>>>>> freezed has the risk of deadlock. >>>>>>>>> >>>>>>>>> Hence use update_nr_hwq_lock instead, make it possible to allocate >>>>>>>>> memory if tags grow, meanwhile also prevent nr_requests to be changed >>>>>>>>> concurrently. >>>>>>>>> >>>>>>>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com> >>>>>>>>> --- >>>>>>>>> block/blk-sysfs.c | 12 +++++++++--- >>>>>>>>> 1 file changed, 9 insertions(+), 3 deletions(-) >>>>>>>>> >>>>>>>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c >>>>>>>>> index f99519f7a820..7ea15bf68b4b 100644 >>>>>>>>> --- a/block/blk-sysfs.c >>>>>>>>> +++ b/block/blk-sysfs.c >>>>>>>>> @@ -68,13 +68,14 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count) >>>>>>>>> int ret, err; >>>>>>>>> unsigned int memflags; >>>>>>>>> struct request_queue *q = disk->queue; >>>>>>>>> + struct blk_mq_tag_set *set = q->tag_set; >>>>>>>>> ret = queue_var_store(&nr, page, count); >>>>>>>>> if (ret < 0) >>>>>>>>> return ret; >>>>>>>>> - memflags = blk_mq_freeze_queue(q); >>>>>>>>> - mutex_lock(&q->elevator_lock); >>>>>>>>> + /* serialize updating nr_requests with switching elevator */ >>>>>>>>> + down_write(&set->update_nr_hwq_lock); >>>>>>>>> >>>>>>>> For serializing update of nr_requests with switching elevator, >>>>>>>> we should use disable_elv_switch(). So with this change we >>>>>>>> don't need to acquire ->update_nr_hwq_lock in writer context >>>>>>>> while running blk_mq_update_nr_requests but instead it can run >>>>>>>> acquiring ->update_nr_hwq_lock in the reader context. >>>>>>>> >>>>>>>> So the code flow should be, >>>>>>>> >>>>>>>> disable_elv_switch => this would set QUEUE_FLAG_NO_ELV_SWITCH >>>>>>>> ... >>>>>>>> down_read ->update_nr_hwq_lock >>>>>>>> acquire ->freeze_lock >>>>>>>> acquire ->elevator_lock; >>>>>>>> ... >>>>>>>> ... >>>>>>>> release ->elevator_lock; >>>>>>>> release ->freeze_lock >>>>>>>> >>>>>>>> clear QUEUE_FLAG_NO_ELV_SWITCH >>>>>>>> up_read ->update_nr_hwq_lock >>>>>>>> >>>>>>> Yes, this make sense, however, there is also an implied condition that >>>>>>> we should serialize queue_requests_store() with itself, what if a >>>>>>> concurrent caller succeed the disable_elv_switch() before the >>>>>>> down_read() in this way? >>>>>>> >>>>>>> t1: >>>>>>> disable_elv_switch >>>>>>> t2: >>>>>>> disable_elv_switch >>>>>>> >>>>>>> down_read down_read >>>>>>> >>>>>> I believe this is already protected with the kernfs internal >>>>>> mutex locks. So you shouldn't be able to run two sysfs store >>>>>> operations concurrently on the same attribute file. >>>>>> >>>>> There really is no such internal lock, the call stack is: >>>>> >>>>> kernfs_fop_write_iter >>>>> sysfs_kf_write >>>>> queue_attr_store >>>>> >>>>> There is only a file level mutex kernfs_open_file->lock from the top >>>>> function kernfs_fop_write_iter(), however, this lock is not the same >>>>> if we open the same attribute file from different context. >>>>> >>>> Oh yes this lock only protects if the same fd is being written >>>> concurrently. However if we open the same sysfs file from different process >>>> contexts then fd would be different and so this lock wouldn't protect >>>> the simultaneous update of sysfs attribute. Having said that, >>>> looking through the code again it seems that q->nr_requests update >>>> is protected with ->elevator_lock (including both the elevator switch >>>> code and your proposed changes in this patchset). So my question is >>>> do we really need to synchronize nr_requests update code with elevator >>>> swiupdate_nr_hwq_locktch code? So in another words what if we don't at >>>> all use ->update_nr_hwq_lock in queue_requests_store? >>> 1) lock update_nr_hwq_lock, then no one can change nr_queuests >>> 2) checking input nr_reqeusts >>> 3) if grow, allocate memory >>> >>> Main idea here is we can checking if nr_requests grow and then allocate >>> mermory, without concern that nr_requests can be changed after memory >>> allocation. >>> >> If nr_requests changes after memory allocation we're still good because >> eventually we'd only have one consistent value of nr_requests. For >> instance, if process A is updating nr_requests to 128 and sched switch >> is updating nr_requests to 256 simultaneously then we'd either see >> nr_requests set to 128 or 256 in the end depending on who runs last. >> We wouldn't get into a situation where we find some inconsistent update >> to nr_requests, isn't it? > > Then we'll have to allocate memory for every input nr_requests now, we don't > know for sure if tag will grow in advance this way. And even with this, we > still have to hold read lock before allocating memory, to prevent nr hctxs > to change. > Hmm ok so we still have to acquire read lock and we can't avoid ->update_nr_hwq_lock. And that should be okay, as typically the semantics of rw_semaphore is a multiple readers and single writer lock mechanism. With the proposed patch now we've two contexts acquiring ->update_nr_hwq_lock in writer mode but for this particular case, I'm okay just to avoid unnecessary complexities making nr_requests update a reader context. And yes, as you mentioned, this code runs in slow path and we may not starve reader or writer as the code it protects is not big or complex. So with that said,I'd add review tag in another email. Thanks, --Nilay
© 2016 - 2025 Red Hat, Inc.