From: hexue <xue01.he@samsung.com>
In the existing plug mechanism, tags are allocated in batches based on
the number of requests. However, testing has shown that the plug only
attempts batch allocation of tags once at the beginning of a batch of
I/O operations. Since the tag_mask does not always have enough available
tags to satisfy the requested number, a full batch allocation is not
guaranteed to succeed each time. The remaining tags are then allocated
individually (which occurs frequently), leading to multiple single-tag
allocation overheads.

This patch allows the remaining I/O operations to retry batch
allocation of tags, reducing the overhead caused by multiple
individual tag allocations.
------------------------------------------------------------------------
Test results

During testing of the PCIe Gen4 SSD Samsung PM9A3, the perf tool showed
a CPU usage improvement: the original __blk_mq_alloc_requests() function
accounted for 1.39% of CPU time, which decreased to 0.82% after the
modification.
Additionally, performance variations were observed on different devices.
workload: randread
blocksize: 4k
threads: 1
------------------------------------------------------------------------
                PCIe Gen3 SSD   PCIe Gen4 SSD   PCIe Gen5 SSD
native kernel   553k iops       633k iops       793k iops
modified        553k iops       635k iops       801k iops
With Optane SSDs, the performance is as follows:

two devices, one thread
cmd: sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
-n1 -r4 /dev/nvme0n1 /dev/nvme1n1

base: 6.4 Million IOPS
patch: 6.49 Million IOPS

two devices, two threads
cmd: sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
-n1 -r4 /dev/nvme0n1 /dev/nvme1n1

base: 7.34 Million IOPS
patch: 7.48 Million IOPS
-------------------------------------------------------------------------
Signed-off-by: hexue <xue01.he@samsung.com>
---
block/blk-mq.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b67d6c02eceb..1fb280764b76 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -587,9 +587,9 @@ static struct request *blk_mq_rq_cache_fill(struct request_queue *q,
 	if (blk_queue_enter(q, flags))
 		return NULL;
 
-	plug->nr_ios = 1;
-
 	rq = __blk_mq_alloc_requests(&data);
+	plug->nr_ios = data.nr_tags;
+
 	if (unlikely(!rq))
 		blk_queue_exit(q);
 	return rq;
@@ -3034,11 +3034,13 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
 
 	if (plug) {
 		data.nr_tags = plug->nr_ios;
-		plug->nr_ios = 1;
 		data.cached_rqs = &plug->cached_rqs;
 	}
 
 	rq = __blk_mq_alloc_requests(&data);
+	if (plug)
+		plug->nr_ios = data.nr_tags;
+
 	if (unlikely(!rq))
 		rq_qos_cleanup(q, bio);
 	return rq;
--
2.34.1
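
To make the mechanism described in the commit message concrete: a batch
grab from a single tag word can return fewer bits than requested, and the
patch carries the unserved remainder forward in plug->nr_ios instead of
resetting it to 1. Below is a minimal user-space sketch of that idea
(illustrative names and values only, not the kernel implementation; it
uses GCC/Clang bit builtins):

#include <stdio.h>
#include <stdint.h>

/* Grab up to nr_tags free bits from "word", starting at its first free bit. */
static int grab_batch(uint64_t *word, int nr_tags)
{
	if (*word == ~0ULL)
		return 0;                               /* no free tags at all */

	int nr = __builtin_ctzll(~*word);               /* first free bit */
	uint64_t want = (nr_tags >= 64 ? ~0ULL : (1ULL << nr_tags) - 1) << nr;
	uint64_t got = want & ~*word;                   /* only the bits that were free */

	*word |= got;                                   /* mark them busy */
	return __builtin_popcountll(got);               /* may be < nr_tags */
}

int main(void)
{
	uint64_t tag_word = 0x0000000000000f50ULL;      /* some tags already in use */
	int nr_ios = 32;                                /* the plug wants a batch of 32 */
	int got = grab_batch(&tag_word, nr_ios);

	printf("asked for %d tags, batch returned %d\n", nr_ios, got);
	/*
	 * Old behaviour: plug->nr_ios was reset to 1, so the remaining
	 * (nr_ios - got) requests each fell back to single-tag allocation.
	 * With the patch, the remainder is kept so a later allocation can
	 * retry a batch of (nr_ios - got).
	 */
	printf("remaining to allocate: %d\n", nr_ios - got);
	return 0;
}
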
Hi,

On 2025/09/01 16:22, Xue He wrote:
> From: hexue <xue01.he@samsung.com>
>
> In the existing plug mechanism, tags are allocated in batches based on
> the number of requests. However, testing has shown that the plug only
> attempts batch allocation of tags once at the beginning of a batch of
> I/O operations. Since the tag_mask does not always have enough available
> tags to satisfy the requested number, a full batch allocation is not
> guaranteed to succeed each time. The remaining tags are then allocated
> individually (which occurs frequently), leading to multiple single-tag
> allocation overheads.
>
> This patch allows the remaining I/O operations to retry batch
> allocation of tags, reducing the overhead caused by multiple
> individual tag allocations.
>
> ------------------------------------------------------------------------
> Test results
>
> During testing of the PCIe Gen4 SSD Samsung PM9A3, the perf tool showed
> a CPU usage improvement: the original __blk_mq_alloc_requests() function
> accounted for 1.39% of CPU time, which decreased to 0.82% after the
> modification.
>
> Additionally, performance variations were observed on different devices.
> workload: randread
> blocksize: 4k
> threads: 1
> ------------------------------------------------------------------------
>                 PCIe Gen3 SSD   PCIe Gen4 SSD   PCIe Gen5 SSD
> native kernel   553k iops       633k iops       793k iops
> modified        553k iops       635k iops       801k iops
>
> With Optane SSDs, the performance is as follows:
>
> two devices, one thread
> cmd: sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
> -n1 -r4 /dev/nvme0n1 /dev/nvme1n1
>

How many hw_queues, and how many tags in each hw_queue, does your nvme
have? I feel it's unlikely that tags can be exhausted; usually the cpu
will become the bottleneck first.

> base: 6.4 Million IOPS
> patch: 6.49 Million IOPS
>
> two devices, two threads
> cmd: sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
> -n1 -r4 /dev/nvme0n1 /dev/nvme1n1
>
> base: 7.34 Million IOPS
> patch: 7.48 Million IOPS
> -------------------------------------------------------------------------
>
> Signed-off-by: hexue <xue01.he@samsung.com>
> ---
>  block/blk-mq.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b67d6c02eceb..1fb280764b76 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -587,9 +587,9 @@ static struct request *blk_mq_rq_cache_fill(struct request_queue *q,
>  	if (blk_queue_enter(q, flags))
>  		return NULL;
>
> -	plug->nr_ios = 1;
> -
>  	rq = __blk_mq_alloc_requests(&data);
> +	plug->nr_ios = data.nr_tags;
> +
>  	if (unlikely(!rq))
>  		blk_queue_exit(q);
>  	return rq;
> @@ -3034,11 +3034,13 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
>
>  	if (plug) {
>  		data.nr_tags = plug->nr_ios;
> -		plug->nr_ios = 1;
>  		data.cached_rqs = &plug->cached_rqs;
>  	}
>
>  	rq = __blk_mq_alloc_requests(&data);
> +	if (plug)
> +		plug->nr_ios = data.nr_tags;
> +
>  	if (unlikely(!rq))
>  		rq_qos_cleanup(q, bio);
>  	return rq;
>

In __blk_mq_alloc_requests(), if __blk_mq_alloc_requests_batch() failed,
data->nr_tags is set to 1, so plug->nr_ios = data.nr_tags will still set
plug->nr_ios to 1 in this case.

What am I missing?

Thanks,
Kuai

On 2025/09/02 08:47 AM, Yu Kuai wrote:
> On 2025/09/01 16:22, Xue He wrote:
......
>> This patch allows the remaining I/O operations to retry batch
>> allocation of tags, reducing the overhead caused by multiple
>> individual tag allocations.
>>
>> ------------------------------------------------------------------------
>> Test results
>>
>> During testing of the PCIe Gen4 SSD Samsung PM9A3, the perf tool showed
>> a CPU usage improvement: the original __blk_mq_alloc_requests() function
>> accounted for 1.39% of CPU time, which decreased to 0.82% after the
>> modification.
>>
>> Additionally, performance variations were observed on different devices.
>> workload: randread
>> blocksize: 4k
>> threads: 1
>> ------------------------------------------------------------------------
>>                 PCIe Gen3 SSD   PCIe Gen4 SSD   PCIe Gen5 SSD
>> native kernel   553k iops       633k iops       793k iops
>> modified        553k iops       635k iops       801k iops
>>
>> With Optane SSDs, the performance is as follows:
>>
>> two devices, one thread
>> cmd: sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
>> -n1 -r4 /dev/nvme0n1 /dev/nvme1n1
>>
>
> How many hw_queues, and how many tags in each hw_queue, does your nvme
> have? I feel it's unlikely that tags can be exhausted; usually the cpu
> will become the bottleneck first.

The information of my nvme is as follows:
number of CPUs: 16
memory: 16G
nvme nvme0: 16/0/16 default/read/poll queues
cat /sys/class/nvme/nvme0/nvme0n1/queue/nr_requests
1023

In more precise terms, I think it is not that the tags are fully exhausted,
but rather that after scanning the bitmap for free bits, the remaining
contiguous bits are insufficient to meet the requirement (available, but
not enough). The specific function involved is __sbitmap_queue_get_batch()
in lib/sbitmap.c:

	get_mask = ((1UL << nr_tags) - 1) << nr;
	if (nr_tags > 1) {
		printk("before %ld\n", get_mask);
	}
	while (!atomic_long_try_cmpxchg(ptr, &val,
				get_mask | val))
		;
	get_mask = (get_mask & ~val) >> nr;

Here, during the batch acquisition of contiguous free bits, an atomic
operation is performed, and the tag_mask actually obtained can differ from
the one originally requested.

Am I missing something?
>> base: 6.4 Million IOPS
>> patch: 6.49 Million IOPS
>>
>> two devices, two threads
>> cmd: sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
>> -n1 -r4 /dev/nvme0n1 /dev/nvme1n1
>>
>> base: 7.34 Million IOPS
>> patch: 7.48 Million IOPS
>> -------------------------------------------------------------------------
>>
>> Signed-off-by: hexue <xue01.he@samsung.com>
>> ---
>>  block/blk-mq.c | 8 +++++---
>>  1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index b67d6c02eceb..1fb280764b76 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -587,9 +587,9 @@ static struct request *blk_mq_rq_cache_fill(struct request_queue *q,
>>  	if (blk_queue_enter(q, flags))
>>  		return NULL;
>>
>> -	plug->nr_ios = 1;
>> -
>>  	rq = __blk_mq_alloc_requests(&data);
>> +	plug->nr_ios = data.nr_tags;
>> +
>>  	if (unlikely(!rq))
>>  		blk_queue_exit(q);
>>  	return rq;
>> @@ -3034,11 +3034,13 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
>>
>>  	if (plug) {
>>  		data.nr_tags = plug->nr_ios;
>> -		plug->nr_ios = 1;
>>  		data.cached_rqs = &plug->cached_rqs;
>>  	}
>>
>>  	rq = __blk_mq_alloc_requests(&data);
>> +	if (plug)
>> +		plug->nr_ios = data.nr_tags;
>> +
>>  	if (unlikely(!rq))
>>  		rq_qos_cleanup(q, bio);
>>  	return rq;
>>
>
> In __blk_mq_alloc_requests(), if __blk_mq_alloc_requests_batch() failed,
> data->nr_tags is set to 1, so plug->nr_ios = data.nr_tags will still set
> plug->nr_ios to 1 in this case.
>
> What am I missing?

Yes, you are right: if __blk_mq_alloc_requests_batch() fails, it will be
set to 1. However, in this case it did not fail to execute; instead, the
allocated number of tags was insufficient, as only a partial number were
allocated. Therefore, the function is considered to have executed
successfully.

> Thanks,
> Kuai

Thanks,
Xue

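
The arithmetic quoted from __sbitmap_queue_get_batch() above can be
reproduced in isolation. The following stand-alone program is a sketch
with made-up word contents, using C11 atomics in place of the kernel's
atomic_long_try_cmpxchg(); it shows how the final mask can carry fewer
bits than the nr_tags that were requested:

#include <stdio.h>
#include <stdatomic.h>

int main(void)
{
	_Atomic unsigned long word = 0x6cUL;  /* bits 2,3,5,6 already allocated */
	unsigned int nr_tags = 8;             /* caller asks for 8 tags */
	unsigned int nr = 0;                  /* first free bit in the word */
	unsigned long get_mask = ((1UL << nr_tags) - 1) << nr;
	unsigned long val = atomic_load(&word);

	/* user-space stand-in for the kernel's atomic_long_try_cmpxchg() loop */
	while (!atomic_compare_exchange_weak(&word, &val, get_mask | val))
		;
	get_mask = (get_mask & ~val) >> nr;   /* keep only the bits that were free */

	printf("requested %u tags, mask 0x%lx -> got %d tags\n",
	       nr_tags, get_mask, __builtin_popcountl(get_mask));
	return 0;
}

With the values assumed here, the window of 8 bits overlaps 4 bits that
were already set, so only 4 tags come back even though the word still has
plenty of free bits further up: exactly the partial-batch case being
discussed.
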
Hi,

On 2025/09/03 16:41, Xue He wrote:
> On 2025/09/02 08:47 AM, Yu Kuai wrote:
>> On 2025/09/01 16:22, Xue He wrote:
> ......
>>> This patch allows the remaining I/O operations to retry batch
>>> allocation of tags, reducing the overhead caused by multiple
>>> individual tag allocations.
>>>
>>> ------------------------------------------------------------------------
>>> Test results
>>>
>>> During testing of the PCIe Gen4 SSD Samsung PM9A3, the perf tool showed
>>> a CPU usage improvement: the original __blk_mq_alloc_requests() function
>>> accounted for 1.39% of CPU time, which decreased to 0.82% after the
>>> modification.
>>>
>>> Additionally, performance variations were observed on different devices.
>>> workload: randread
>>> blocksize: 4k
>>> threads: 1
>>> ------------------------------------------------------------------------
>>>                 PCIe Gen3 SSD   PCIe Gen4 SSD   PCIe Gen5 SSD
>>> native kernel   553k iops       633k iops       793k iops
>>> modified        553k iops       635k iops       801k iops
>>>
>>> With Optane SSDs, the performance is as follows:
>>>
>>> two devices, one thread
>>> cmd: sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
>>> -n1 -r4 /dev/nvme0n1 /dev/nvme1n1
>>>
>>
>> How many hw_queues, and how many tags in each hw_queue, does your nvme
>> have? I feel it's unlikely that tags can be exhausted; usually the cpu
>> will become the bottleneck first.
>
> The information of my nvme is as follows:
> number of CPUs: 16
> memory: 16G
> nvme nvme0: 16/0/16 default/read/poll queues
> cat /sys/class/nvme/nvme0/nvme0n1/queue/nr_requests
> 1023
>
> In more precise terms, I think it is not that the tags are fully exhausted,
> but rather that after scanning the bitmap for free bits, the remaining
> contiguous bits are insufficient to meet the requirement (available, but
> not enough). The specific function involved is __sbitmap_queue_get_batch()
> in lib/sbitmap.c:
>
> 	get_mask = ((1UL << nr_tags) - 1) << nr;
> 	if (nr_tags > 1) {
> 		printk("before %ld\n", get_mask);
> 	}
> 	while (!atomic_long_try_cmpxchg(ptr, &val,
> 				get_mask | val))
> 		;
> 	get_mask = (get_mask & ~val) >> nr;
>
> Here, during the batch acquisition of contiguous free bits, an atomic
> operation is performed, and the tag_mask actually obtained can differ from
> the one originally requested.

Yes, this function will likely obtain fewer tags than nr_tags: the mask
always starts from the first zero bit and spans nr_tags bits, and
sbitmap_deferred_clear() is called unconditionally, so it's likely there
are non-zero bits within this range.

Just wondering, did you consider fixing this directly in
__blk_mq_alloc_requests_batch()?

 - call sbitmap_deferred_clear() and retry on allocation failure, so
that the whole word can be used even if previously allocated requests are
done, especially for nvme with huge tag depths;
 - retry blk_mq_get_tags() until data->nr_tags is zero;

>
> Am I missing something?
>
>>> base: 6.4 Million IOPS
>>> patch: 6.49 Million IOPS
>>>
>>> two devices, two threads
>>> cmd: sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
>>> -n1 -r4 /dev/nvme0n1 /dev/nvme1n1
>>>
>>> base: 7.34 Million IOPS
>>> patch: 7.48 Million IOPS
>>> -------------------------------------------------------------------------
>>>
>>> Signed-off-by: hexue <xue01.he@samsung.com>
>>> ---
>>>  block/blk-mq.c | 8 +++++---
>>>  1 file changed, 5 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index b67d6c02eceb..1fb280764b76 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -587,9 +587,9 @@ static struct request *blk_mq_rq_cache_fill(struct request_queue *q,
>>>  	if (blk_queue_enter(q, flags))
>>>  		return NULL;
>>>
>>> -	plug->nr_ios = 1;
>>> -
>>>  	rq = __blk_mq_alloc_requests(&data);
>>> +	plug->nr_ios = data.nr_tags;
>>> +
>>>  	if (unlikely(!rq))
>>>  		blk_queue_exit(q);
>>>  	return rq;
>>> @@ -3034,11 +3034,13 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
>>>
>>>  	if (plug) {
>>>  		data.nr_tags = plug->nr_ios;
>>> -		plug->nr_ios = 1;
>>>  		data.cached_rqs = &plug->cached_rqs;
>>>  	}
>>>
>>>  	rq = __blk_mq_alloc_requests(&data);
>>> +	if (plug)
>>> +		plug->nr_ios = data.nr_tags;
>>> +
>>>  	if (unlikely(!rq))
>>>  		rq_qos_cleanup(q, bio);
>>>  	return rq;
>>>
>>
>> In __blk_mq_alloc_requests(), if __blk_mq_alloc_requests_batch() failed,
>> data->nr_tags is set to 1, so plug->nr_ios = data.nr_tags will still set
>> plug->nr_ios to 1 in this case.
>>
>> What am I missing?
>
> Yes, you are right: if __blk_mq_alloc_requests_batch() fails, it will be
> set to 1. However, in this case it did not fail to execute; instead, the
> allocated number of tags was insufficient, as only a partial number were
> allocated. Therefore, the function is considered to have executed
> successfully.

Thanks for the explanation, I understand this now.

Thanks,
Kuai

>> Thanks,
>> Kuai
>
> Thanks,
> Xue

On 2025/09/03 18:35 PM, Yu Kuai wrote:
> On 2025/09/03 16:41 PM, Xue He wrote:
>> On 2025/09/02 08:47 AM, Yu Kuai wrote:
>>> On 2025/09/01 16:22, Xue He wrote:
>> ......
>
> Yes, this function will likely obtain fewer tags than nr_tags: the mask
> always starts from the first zero bit and spans nr_tags bits, and
> sbitmap_deferred_clear() is called unconditionally, so it's likely there
> are non-zero bits within this range.
>
> Just wondering, did you consider fixing this directly in
> __blk_mq_alloc_requests_batch()?
>
>  - call sbitmap_deferred_clear() and retry on allocation failure, so
> that the whole word can be used even if previously allocated requests are
> done, especially for nvme with huge tag depths;
>  - retry blk_mq_get_tags() until data->nr_tags is zero;

Hi, Yu Kuai, I'm not entirely sure if I understand correctly, but during
each tag allocation, sbitmap_deferred_clear() is typically called first,
as seen in the __sbitmap_queue_get_batch() function:

	for (i = 0; i < sb->map_nr; i++) {
		struct sbitmap_word *map = &sb->map[index];
		unsigned long get_mask;
		unsigned int map_depth = __map_depth(sb, index);
		unsigned long val;

		sbitmap_deferred_clear(map, 0, 0, 0);

------------------------------------------------------------------------
So I tried to re-call blk_mq_get_tags() until data->nr_tags is zero, like
this:

-	int i, nr = 0;

-	tag_mask = blk_mq_get_tags(data, data->nr_tags, &tag_offset);
-	if (unlikely(!tag_mask))
-		return NULL;
-
-	tags = blk_mq_tags_from_data(data);
-	for (i = 0; tag_mask; i++) {
-		if (!(tag_mask & (1UL << i)))
-			continue;
-		tag = tag_offset + i;
-		prefetch(tags->static_rqs[tag]);
-		tag_mask &= ~(1UL << i);
-		rq = blk_mq_rq_ctx_init(data, tags, tag);
-		rq_list_add_head(data->cached_rqs, rq);
-		nr++;
-	}
-	if (!(data->rq_flags & RQF_SCHED_TAGS))
-		blk_mq_add_active_requests(data->hctx, nr);
-	/* caller already holds a reference, add for remainder */
-	percpu_ref_get_many(&data->q->q_usage_counter, nr - 1);
-	data->nr_tags -= nr;
+	do {
+		int i, nr = 0;
+		tag_mask = blk_mq_get_tags(data, data->nr_tags, &tag_offset);
+		if (unlikely(!tag_mask))
+			return NULL;
+		tags = blk_mq_tags_from_data(data);
+		for (i = 0; tag_mask; i++) {
+			if (!(tag_mask & (1UL << i)))
+				continue;
+			tag = tag_offset + i;
+			prefetch(tags->static_rqs[tag]);
+			tag_mask &= ~(1UL << i);
+			rq = blk_mq_rq_ctx_init(data, tags, tag);
+			rq_list_add_head(data->cached_rqs, rq);
+			nr++;
+		}
+		if (!(data->rq_flags & RQF_SCHED_TAGS))
+			blk_mq_add_active_requests(data->hctx, nr);
+		/* caller already holds a reference, add for remainder */
+		percpu_ref_get_many(&data->q->q_usage_counter, nr - 1);
+		data->nr_tags -= nr;
+	} while (data->nr_tags);

I added a loop structure, and it also achieves good results like before,
but I have a question: although the loop will retry tag allocation
when the required number of tags is not met, there is a risk of an
infinite loop, right? However, I couldn't think of a safer condition
to terminate the loop. Do you have any suggestions?

Thanks,
Xue

Hi,

On 2025/09/12 11:06, Xue He wrote:
> On 2025/09/03 18:35 PM, Yu Kuai wrote:
>> On 2025/09/03 16:41 PM, Xue He wrote:
>>> On 2025/09/02 08:47 AM, Yu Kuai wrote:
>>>> On 2025/09/01 16:22, Xue He wrote:
>>> ......
>>
>> Yes, this function will likely obtain fewer tags than nr_tags: the mask
>> always starts from the first zero bit and spans nr_tags bits, and
>> sbitmap_deferred_clear() is called unconditionally, so it's likely there
>> are non-zero bits within this range.
>>
>> Just wondering, did you consider fixing this directly in
>> __blk_mq_alloc_requests_batch()?
>>
>>  - call sbitmap_deferred_clear() and retry on allocation failure, so
>> that the whole word can be used even if previously allocated requests are
>> done, especially for nvme with huge tag depths;
>>  - retry blk_mq_get_tags() until data->nr_tags is zero;
>
> Hi, Yu Kuai, I'm not entirely sure if I understand correctly, but during
> each tag allocation, sbitmap_deferred_clear() is typically called first,
> as seen in the __sbitmap_queue_get_batch() function:
>
> 	for (i = 0; i < sb->map_nr; i++) {
> 		struct sbitmap_word *map = &sb->map[index];
> 		unsigned long get_mask;
> 		unsigned int map_depth = __map_depth(sb, index);
> 		unsigned long val;
>
> 		sbitmap_deferred_clear(map, 0, 0, 0);

Yes, this is called first each time, and I don't feel this is good for
performance anyway. I think it can be delayed until after a first try,
like in sbitmap_find_bit_in_word().

> ------------------------------------------------------------------------
> So I tried to re-call blk_mq_get_tags() until data->nr_tags is zero, like
> this:
>
> -	int i, nr = 0;
>
> -	tag_mask = blk_mq_get_tags(data, data->nr_tags, &tag_offset);
> -	if (unlikely(!tag_mask))
> -		return NULL;
> -
> -	tags = blk_mq_tags_from_data(data);
> -	for (i = 0; tag_mask; i++) {
> -		if (!(tag_mask & (1UL << i)))
> -			continue;
> -		tag = tag_offset + i;
> -		prefetch(tags->static_rqs[tag]);
> -		tag_mask &= ~(1UL << i);
> -		rq = blk_mq_rq_ctx_init(data, tags, tag);
> -		rq_list_add_head(data->cached_rqs, rq);
> -		nr++;
> -	}
> -	if (!(data->rq_flags & RQF_SCHED_TAGS))
> -		blk_mq_add_active_requests(data->hctx, nr);
> -	/* caller already holds a reference, add for remainder */
> -	percpu_ref_get_many(&data->q->q_usage_counter, nr - 1);
> -	data->nr_tags -= nr;
> +	do {
> +		int i, nr = 0;
> +		tag_mask = blk_mq_get_tags(data, data->nr_tags, &tag_offset);
> +		if (unlikely(!tag_mask))
> +			return NULL;
> +		tags = blk_mq_tags_from_data(data);
> +		for (i = 0; tag_mask; i++) {
> +			if (!(tag_mask & (1UL << i)))
> +				continue;
> +			tag = tag_offset + i;
> +			prefetch(tags->static_rqs[tag]);
> +			tag_mask &= ~(1UL << i);
> +			rq = blk_mq_rq_ctx_init(data, tags, tag);
> +			rq_list_add_head(data->cached_rqs, rq);
> +			nr++;
> +		}
> +		if (!(data->rq_flags & RQF_SCHED_TAGS))
> +			blk_mq_add_active_requests(data->hctx, nr);
> +		/* caller already holds a reference, add for remainder */
> +		percpu_ref_get_many(&data->q->q_usage_counter, nr - 1);
> +		data->nr_tags -= nr;
> +	} while (data->nr_tags);
>
> I added a loop structure, and it also achieves good results like before,
> but I have a question: although the loop will retry tag allocation
> when the required number of tags is not met, there is a risk of an
> infinite loop, right? However, I couldn't think of a safer condition
> to terminate the loop. Do you have any suggestions?

Yes, this is what I have in mind. Why do you think there can be an
infinite loop? We should allocate at least one tag via blk_mq_get_tags()
in each loop iteration, or return directly.

Thanks,
Kuai

> Thanks,
> Xue

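
The termination argument above can be seen in a stripped-down model of the
proposed loop: every pass either obtains at least one bit, so the remaining
count strictly decreases, or it gets an empty mask and bails out. A
simplified user-space sketch of that shape (illustrative only, not the v2
patch; GCC/Clang builtins assumed):

#include <stdio.h>
#include <stdint.h>

/* stand-in for blk_mq_get_tags(): grab up to nr_tags bits from one word */
static uint64_t get_tags(uint64_t *word, int nr_tags)
{
	if (*word == ~0ULL)
		return 0;                               /* nothing free */

	int nr = __builtin_ctzll(~*word);               /* first free bit */
	uint64_t want = (nr_tags >= 64 ? ~0ULL : (1ULL << nr_tags) - 1) << nr;
	uint64_t got = want & ~*word;                   /* includes bit nr, so never empty */

	*word |= got;
	return got;
}

int main(void)
{
	uint64_t tag_word = 0x00000000ff00ff00ULL;      /* fragmented free space */
	int nr_tags = 24;                               /* batch of 24 wanted */

	do {
		uint64_t mask = get_tags(&tag_word, nr_tags);

		if (!mask)                              /* allocation failed: stop */
			break;

		int nr = __builtin_popcountll(mask);    /* nr >= 1 on this path */
		nr_tags -= nr;                          /* strictly decreasing */
		printf("batch got %d tags, %d still needed\n", nr, nr_tags);
	} while (nr_tags);

	return 0;
}

With the assumed word, the first pass returns 16 of the 24 requested tags
and the second pass returns the remaining 8, after which the loop exits.
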
On 2025/09/15 10:22 AM, Yu Kuai wrote:
> On 2025/09/12 11:06, Xue He wrote:
>> On 2025/09/03 18:35 PM, Yu Kuai wrote:
>>> On 2025/09/03 16:41 PM, Xue He wrote:
>>>> On 2025/09/02 08:47 AM, Yu Kuai wrote:
>>>>> On 2025/09/01 16:22, Xue He wrote:
>>>> ......
>>
>> I added a loop structure, and it also achieves good results like before,
>> but I have a question: although the loop will retry tag allocation
>> when the required number of tags is not met, there is a risk of an
>> infinite loop, right? However, I couldn't think of a safer condition
>> to terminate the loop. Do you have any suggestions?
>
> Yes, this is what I have in mind. Why do you think there can be an
> infinite loop? We should allocate at least one tag via blk_mq_get_tags()
> in each loop iteration, or return directly.

I understand your point now. I will send a v2 patch.

Thanks,
Xue

On 2025/09/03 18:35 PM, Yu Kuai wrote:
> On 2025/09/03 16:41 PM, Xue He wrote:
>> On 2025/09/02 08:47 AM, Yu Kuai wrote:
>>> On 2025/09/01 16:22, Xue He wrote:
>> ......
>>
>> The information of my nvme is as follows:
>> number of CPUs: 16
>> memory: 16G
>> nvme nvme0: 16/0/16 default/read/poll queues
>> cat /sys/class/nvme/nvme0/nvme0n1/queue/nr_requests
>> 1023
>>
>> In more precise terms, I think it is not that the tags are fully exhausted,
>> but rather that after scanning the bitmap for free bits, the remaining
>> contiguous bits are insufficient to meet the requirement (available, but
>> not enough). The specific function involved is __sbitmap_queue_get_batch()
>> in lib/sbitmap.c:
>>
>> 	get_mask = ((1UL << nr_tags) - 1) << nr;
>> 	if (nr_tags > 1) {
>> 		printk("before %ld\n", get_mask);
>> 	}
>> 	while (!atomic_long_try_cmpxchg(ptr, &val,
>> 				get_mask | val))
>> 		;
>> 	get_mask = (get_mask & ~val) >> nr;
>>
>> Here, during the batch acquisition of contiguous free bits, an atomic
>> operation is performed, and the tag_mask actually obtained can differ from
>> the one originally requested.
>
> Yes, this function will likely obtain fewer tags than nr_tags: the mask
> always starts from the first zero bit and spans nr_tags bits, and
> sbitmap_deferred_clear() is called unconditionally, so it's likely there
> are non-zero bits within this range.
>
> Just wondering, did you consider fixing this directly in
> __blk_mq_alloc_requests_batch()?
>
>  - call sbitmap_deferred_clear() and retry on allocation failure, so
> that the whole word can be used even if previously allocated requests are
> done, especially for nvme with huge tag depths;
>  - retry blk_mq_get_tags() until data->nr_tags is zero;

I haven't tried this yet, as I'm concerned that if it spins here, it might
introduce more latency. Anyway, I may try to implement this idea and do
some tests to observe the results.

Thanks.