From: hexue <xue01.he@samsung.com>
In the existing plug mechanism, tags are allocated in batches based on
the number of requests. However, testing has shown that the plug only
attempts batch allocation of tags once at the beginning of a batch of
I/O operations. Since the tag_mask does not always have enough available
tags to satisfy the requested number, a full batch allocation is not
guaranteed to succeed each time. The remaining tags are then allocated
individually (which occurs frequently), leading to multiple single-tag
allocation overheads.

This patch allows the remaining I/O operations to retry batch
allocation of tags, reducing the overhead caused by multiple
individual tag allocations.
------------------------------------------------------------------------
Test results

During testing of the PCIe Gen4 SSD Samsung PM9A3, the perf tool showed
a CPU usage improvement: the original __blk_mq_alloc_requests() function
accounted for 1.39% of CPU time, which decreased to 0.82% after the
modification.
Additionally, performance variations were observed on different devices.
workload: randread
blocksize: 4k
threads: 1
------------------------------------------------------------------------
                PCIe Gen3 SSD   PCIe Gen4 SSD   PCIe Gen5 SSD
native kernel   553k iops       633k iops       793k iops
modified        553k iops       635k iops       801k iops
With Optane SSDs, the performance is as follows:

two devices, one thread
cmd: sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
-n1 -r4 /dev/nvme0n1 /dev/nvme1n1

base: 6.4 Million IOPS
patch: 6.49 Million IOPS

two devices, two threads
cmd: sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
-n1 -r4 /dev/nvme0n1 /dev/nvme1n1

base: 7.34 Million IOPS
patch: 7.48 Million IOPS
-------------------------------------------------------------------------
Signed-off-by: hexue <xue01.he@samsung.com>
---
block/blk-mq.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b67d6c02eceb..1fb280764b76 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -587,9 +587,9 @@ static struct request *blk_mq_rq_cache_fill(struct request_queue *q,
 	if (blk_queue_enter(q, flags))
 		return NULL;
 
-	plug->nr_ios = 1;
-
 	rq = __blk_mq_alloc_requests(&data);
+	plug->nr_ios = data.nr_tags;
+
 	if (unlikely(!rq))
 		blk_queue_exit(q);
 	return rq;
@@ -3034,11 +3034,13 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
 
 	if (plug) {
 		data.nr_tags = plug->nr_ios;
-		plug->nr_ios = 1;
 		data.cached_rqs = &plug->cached_rqs;
 	}
 
 	rq = __blk_mq_alloc_requests(&data);
+	if (plug)
+		plug->nr_ios = data.nr_tags;
+
 	if (unlikely(!rq))
 		rq_qos_cleanup(q, bio);
 	return rq;
--
2.34.1
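
To make the mechanism described in the commit message concrete: a batch
grab from a single tag word can return fewer bits than requested, and the
patch carries the unserved remainder forward in plug->nr_ios instead of
resetting it to 1. Below is a minimal user-space sketch of that idea
(illustrative names and values only, not the kernel implementation; it
uses GCC/Clang bit builtins):

#include <stdio.h>
#include <stdint.h>

/* Grab up to nr_tags free bits from "word", starting at its first free bit. */
static int grab_batch(uint64_t *word, int nr_tags)
{
	if (*word == ~0ULL)
		return 0;                               /* no free tags at all */

	int nr = __builtin_ctzll(~*word);               /* first free bit */
	uint64_t want = (nr_tags >= 64 ? ~0ULL : (1ULL << nr_tags) - 1) << nr;
	uint64_t got = want & ~*word;                   /* only the bits that were free */

	*word |= got;                                   /* mark them busy */
	return __builtin_popcountll(got);               /* may be < nr_tags */
}

int main(void)
{
	uint64_t tag_word = 0x0000000000000f50ULL;      /* some tags already in use */
	int nr_ios = 32;                                /* the plug wants a batch of 32 */
	int got = grab_batch(&tag_word, nr_ios);

	printf("asked for %d tags, batch returned %d\n", nr_ios, got);
	/*
	 * Old behaviour: plug->nr_ios was reset to 1, so the remaining
	 * (nr_ios - got) requests each fell back to single-tag allocation.
	 * With the patch, the remainder is kept so a later allocation can
	 * retry a batch of (nr_ios - got).
	 */
	printf("remaining to allocate: %d\n", nr_ios - got);
	return 0;
}
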
Hi,

On 2025/09/01 16:22, Xue He wrote:
> From: hexue <xue01.he@samsung.com>
>
> In the existing plug mechanism, tags are allocated in batches based on
> the number of requests. However, testing has shown that the plug only
> attempts batch allocation of tags once at the beginning of a batch of
> I/O operations. Since the tag_mask does not always have enough available
> tags to satisfy the requested number, a full batch allocation is not
> guaranteed to succeed each time. The remaining tags are then allocated
> individually (which occurs frequently), leading to multiple single-tag
> allocation overheads.
>
> This patch allows the remaining I/O operations to retry batch
> allocation of tags, reducing the overhead caused by multiple
> individual tag allocations.
>
> ------------------------------------------------------------------------
> Test results
>
> During testing of the PCIe Gen4 SSD Samsung PM9A3, the perf tool showed
> a CPU usage improvement: the original __blk_mq_alloc_requests() function
> accounted for 1.39% of CPU time, which decreased to 0.82% after the
> modification.
>
> Additionally, performance variations were observed on different devices.
> workload: randread
> blocksize: 4k
> threads: 1
> ------------------------------------------------------------------------
>                 PCIe Gen3 SSD   PCIe Gen4 SSD   PCIe Gen5 SSD
> native kernel   553k iops       633k iops       793k iops
> modified        553k iops       635k iops       801k iops
>
> With Optane SSDs, the performance is as follows:
>
> two devices, one thread
> cmd: sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
> -n1 -r4 /dev/nvme0n1 /dev/nvme1n1
>

How many hw_queues, and how many tags in each hw_queue, does your nvme
have? I feel it's unlikely that tags can be exhausted; usually the cpu
will become the bottleneck first.

> base: 6.4 Million IOPS
> patch: 6.49 Million IOPS
>
> two devices, two threads
> cmd: sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
> -n1 -r4 /dev/nvme0n1 /dev/nvme1n1
>
> base: 7.34 Million IOPS
> patch: 7.48 Million IOPS
> -------------------------------------------------------------------------
>
> Signed-off-by: hexue <xue01.he@samsung.com>
> ---
>  block/blk-mq.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b67d6c02eceb..1fb280764b76 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -587,9 +587,9 @@ static struct request *blk_mq_rq_cache_fill(struct request_queue *q,
>  	if (blk_queue_enter(q, flags))
>  		return NULL;
>
> -	plug->nr_ios = 1;
> -
>  	rq = __blk_mq_alloc_requests(&data);
> +	plug->nr_ios = data.nr_tags;
> +
>  	if (unlikely(!rq))
>  		blk_queue_exit(q);
>  	return rq;
> @@ -3034,11 +3034,13 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
>
>  	if (plug) {
>  		data.nr_tags = plug->nr_ios;
> -		plug->nr_ios = 1;
>  		data.cached_rqs = &plug->cached_rqs;
>  	}
>
>  	rq = __blk_mq_alloc_requests(&data);
> +	if (plug)
> +		plug->nr_ios = data.nr_tags;
> +
>  	if (unlikely(!rq))
>  		rq_qos_cleanup(q, bio);
>  	return rq;
>

In __blk_mq_alloc_requests(), if __blk_mq_alloc_requests_batch() failed,
data->nr_tags is set to 1, so plug->nr_ios = data.nr_tags will still set
plug->nr_ios to 1 in this case.

What am I missing?

Thanks,
Kuai

On 2025/09/02 08:47 AM, Yu Kuai wrote:
> On 2025/09/01 16:22, Xue He wrote:
......
>> This patch allows the remaining I/O operations to retry batch
>> allocation of tags, reducing the overhead caused by multiple
>> individual tag allocations.
>>
>> ------------------------------------------------------------------------
>> Test results
>>
>> During testing of the PCIe Gen4 SSD Samsung PM9A3, the perf tool showed
>> a CPU usage improvement: the original __blk_mq_alloc_requests() function
>> accounted for 1.39% of CPU time, which decreased to 0.82% after the
>> modification.
>>
>> Additionally, performance variations were observed on different devices.
>> workload: randread
>> blocksize: 4k
>> threads: 1
>> ------------------------------------------------------------------------
>>                 PCIe Gen3 SSD   PCIe Gen4 SSD   PCIe Gen5 SSD
>> native kernel   553k iops       633k iops       793k iops
>> modified        553k iops       635k iops       801k iops
>>
>> With Optane SSDs, the performance is as follows:
>>
>> two devices, one thread
>> cmd: sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
>> -n1 -r4 /dev/nvme0n1 /dev/nvme1n1
>>
>
> How many hw_queues, and how many tags in each hw_queue, does your nvme
> have? I feel it's unlikely that tags can be exhausted; usually the cpu
> will become the bottleneck first.

The information of my nvme is as follows:
number of CPUs: 16
memory: 16G
nvme nvme0: 16/0/16 default/read/poll queues
cat /sys/class/nvme/nvme0/nvme0n1/queue/nr_requests
1023

In more precise terms, I think it is not that the tags are fully exhausted,
but rather that after scanning the bitmap for free bits, the remaining
contiguous bits are insufficient to meet the requirement (available, but
not enough). The specific function involved is __sbitmap_queue_get_batch()
in lib/sbitmap.c:

	get_mask = ((1UL << nr_tags) - 1) << nr;
	if (nr_tags > 1) {
		printk("before %ld\n", get_mask);
	}
	while (!atomic_long_try_cmpxchg(ptr, &val,
				get_mask | val))
		;
	get_mask = (get_mask & ~val) >> nr;

Here, during the batch acquisition of contiguous free bits, an atomic
operation is performed, and the tag_mask actually obtained can differ from
the one originally requested.

Am I missing something?
>> base: 6.4 Million IOPS
>> patch: 6.49 Million IOPS
>>
>> two devices, two threads
>> cmd: sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
>> -n1 -r4 /dev/nvme0n1 /dev/nvme1n1
>>
>> base: 7.34 Million IOPS
>> patch: 7.48 Million IOPS
>> -------------------------------------------------------------------------
>>
>> Signed-off-by: hexue <xue01.he@samsung.com>
>> ---
>>  block/blk-mq.c | 8 +++++---
>>  1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index b67d6c02eceb..1fb280764b76 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -587,9 +587,9 @@ static struct request *blk_mq_rq_cache_fill(struct request_queue *q,
>>  	if (blk_queue_enter(q, flags))
>>  		return NULL;
>>
>> -	plug->nr_ios = 1;
>> -
>>  	rq = __blk_mq_alloc_requests(&data);
>> +	plug->nr_ios = data.nr_tags;
>> +
>>  	if (unlikely(!rq))
>>  		blk_queue_exit(q);
>>  	return rq;
>> @@ -3034,11 +3034,13 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
>>
>>  	if (plug) {
>>  		data.nr_tags = plug->nr_ios;
>> -		plug->nr_ios = 1;
>>  		data.cached_rqs = &plug->cached_rqs;
>>  	}
>>
>>  	rq = __blk_mq_alloc_requests(&data);
>> +	if (plug)
>> +		plug->nr_ios = data.nr_tags;
>> +
>>  	if (unlikely(!rq))
>>  		rq_qos_cleanup(q, bio);
>>  	return rq;
>>
>
> In __blk_mq_alloc_requests(), if __blk_mq_alloc_requests_batch() failed,
> data->nr_tags is set to 1, so plug->nr_ios = data.nr_tags will still set
> plug->nr_ios to 1 in this case.
>
> What am I missing?

Yes, you are right: if __blk_mq_alloc_requests_batch() fails, it will be
set to 1. However, in this case it did not fail to execute; instead, the
allocated number of tags was insufficient, as only a partial number were
allocated. Therefore, the function is considered to have executed
successfully.

> Thanks,
> Kuai

Thanks,
Xue

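
The arithmetic quoted from __sbitmap_queue_get_batch() above can be
reproduced in isolation. The following stand-alone program is a sketch
with made-up word contents, using C11 atomics in place of the kernel's
atomic_long_try_cmpxchg(); it shows how the final mask can carry fewer
bits than the nr_tags that were requested:

#include <stdio.h>
#include <stdatomic.h>

int main(void)
{
	_Atomic unsigned long word = 0x6cUL;  /* bits 2,3,5,6 already allocated */
	unsigned int nr_tags = 8;             /* caller asks for 8 tags */
	unsigned int nr = 0;                  /* first free bit in the word */
	unsigned long get_mask = ((1UL << nr_tags) - 1) << nr;
	unsigned long val = atomic_load(&word);

	/* user-space stand-in for the kernel's atomic_long_try_cmpxchg() loop */
	while (!atomic_compare_exchange_weak(&word, &val, get_mask | val))
		;
	get_mask = (get_mask & ~val) >> nr;   /* keep only the bits that were free */

	printf("requested %u tags, mask 0x%lx -> got %d tags\n",
	       nr_tags, get_mask, __builtin_popcountl(get_mask));
	return 0;
}

With the values assumed here, the window of 8 bits overlaps 4 bits that
were already set, so only 4 tags come back even though the word still has
plenty of free bits further up: exactly the partial-batch case being
discussed.
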
Hi,

On 2025/09/03 16:41, Xue He wrote:
> On 2025/09/02 08:47 AM, Yu Kuai wrote:
>> On 2025/09/01 16:22, Xue He wrote:
> ......
>>> This patch allows the remaining I/O operations to retry batch
>>> allocation of tags, reducing the overhead caused by multiple
>>> individual tag allocations.
>>>
>>> ------------------------------------------------------------------------
>>> Test results
>>>
>>> During testing of the PCIe Gen4 SSD Samsung PM9A3, the perf tool showed
>>> a CPU usage improvement: the original __blk_mq_alloc_requests() function
>>> accounted for 1.39% of CPU time, which decreased to 0.82% after the
>>> modification.
>>>
>>> Additionally, performance variations were observed on different devices.
>>> workload: randread
>>> blocksize: 4k
>>> threads: 1
>>> ------------------------------------------------------------------------
>>>                 PCIe Gen3 SSD   PCIe Gen4 SSD   PCIe Gen5 SSD
>>> native kernel   553k iops       633k iops       793k iops
>>> modified        553k iops       635k iops       801k iops
>>>
>>> With Optane SSDs, the performance is as follows:
>>>
>>> two devices, one thread
>>> cmd: sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
>>> -n1 -r4 /dev/nvme0n1 /dev/nvme1n1
>>>
>>
>> How many hw_queues, and how many tags in each hw_queue, does your nvme
>> have? I feel it's unlikely that tags can be exhausted; usually the cpu
>> will become the bottleneck first.
>
> The information of my nvme is as follows:
> number of CPUs: 16
> memory: 16G
> nvme nvme0: 16/0/16 default/read/poll queues
> cat /sys/class/nvme/nvme0/nvme0n1/queue/nr_requests
> 1023
>
> In more precise terms, I think it is not that the tags are fully exhausted,
> but rather that after scanning the bitmap for free bits, the remaining
> contiguous bits are insufficient to meet the requirement (available, but
> not enough). The specific function involved is __sbitmap_queue_get_batch()
> in lib/sbitmap.c:
>
> 	get_mask = ((1UL << nr_tags) - 1) << nr;
> 	if (nr_tags > 1) {
> 		printk("before %ld\n", get_mask);
> 	}
> 	while (!atomic_long_try_cmpxchg(ptr, &val,
> 				get_mask | val))
> 		;
> 	get_mask = (get_mask & ~val) >> nr;
>
> Here, during the batch acquisition of contiguous free bits, an atomic
> operation is performed, and the tag_mask actually obtained can differ from
> the one originally requested.

Yes, this function will likely obtain fewer tags than nr_tags: the mask
always starts from the first zero bit and spans nr_tags bits, and
sbitmap_deferred_clear() is called unconditionally, so it's likely there
are non-zero bits within this range.

Just wondering, did you consider fixing this directly in
__blk_mq_alloc_requests_batch()?

 - call sbitmap_deferred_clear() and retry on allocation failure, so
that the whole word can be used even if previously allocated requests are
done, especially for nvme with huge tag depths;
 - retry blk_mq_get_tags() until data->nr_tags is zero;

>
> Am I missing something?
>
>>> base: 6.4 Million IOPS
>>> patch: 6.49 Million IOPS
>>>
>>> two devices, two threads
>>> cmd: sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
>>> -n1 -r4 /dev/nvme0n1 /dev/nvme1n1
>>>
>>> base: 7.34 Million IOPS
>>> patch: 7.48 Million IOPS
>>> -------------------------------------------------------------------------
>>>
>>> Signed-off-by: hexue <xue01.he@samsung.com>
>>> ---
>>>  block/blk-mq.c | 8 +++++---
>>>  1 file changed, 5 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index b67d6c02eceb..1fb280764b76 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -587,9 +587,9 @@ static struct request *blk_mq_rq_cache_fill(struct request_queue *q,
>>>  	if (blk_queue_enter(q, flags))
>>>  		return NULL;
>>>
>>> -	plug->nr_ios = 1;
>>> -
>>>  	rq = __blk_mq_alloc_requests(&data);
>>> +	plug->nr_ios = data.nr_tags;
>>> +
>>>  	if (unlikely(!rq))
>>>  		blk_queue_exit(q);
>>>  	return rq;
>>> @@ -3034,11 +3034,13 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
>>>
>>>  	if (plug) {
>>>  		data.nr_tags = plug->nr_ios;
>>> -		plug->nr_ios = 1;
>>>  		data.cached_rqs = &plug->cached_rqs;
>>>  	}
>>>
>>>  	rq = __blk_mq_alloc_requests(&data);
>>> +	if (plug)
>>> +		plug->nr_ios = data.nr_tags;
>>> +
>>>  	if (unlikely(!rq))
>>>  		rq_qos_cleanup(q, bio);
>>>  	return rq;
>>>
>>
>> In __blk_mq_alloc_requests(), if __blk_mq_alloc_requests_batch() failed,
>> data->nr_tags is set to 1, so plug->nr_ios = data.nr_tags will still set
>> plug->nr_ios to 1 in this case.
>>
>> What am I missing?
>
> Yes, you are right: if __blk_mq_alloc_requests_batch() fails, it will be
> set to 1. However, in this case it did not fail to execute; instead, the
> allocated number of tags was insufficient, as only a partial number were
> allocated. Therefore, the function is considered to have executed
> successfully.

Thanks for the explanation, I understand this now.

Thanks,
Kuai

>> Thanks,
>> Kuai
>
> Thanks,
> Xue

On 2025/09/03 18:35 PM, Yu Kuai wrote:
> On 2025/09/03 16:41 PM, Xue He wrote:
>> On 2025/09/02 08:47 AM, Yu Kuai wrote:
>>> On 2025/09/01 16:22, Xue He wrote:
>> ......
>
> Yes, this function will likely obtain fewer tags than nr_tags: the mask
> always starts from the first zero bit and spans nr_tags bits, and
> sbitmap_deferred_clear() is called unconditionally, so it's likely there
> are non-zero bits within this range.
>
> Just wondering, did you consider fixing this directly in
> __blk_mq_alloc_requests_batch()?
>
>  - call sbitmap_deferred_clear() and retry on allocation failure, so
> that the whole word can be used even if previously allocated requests are
> done, especially for nvme with huge tag depths;
>  - retry blk_mq_get_tags() until data->nr_tags is zero;

Hi, Yu Kuai, I'm not entirely sure if I understand correctly, but during
each tag allocation, sbitmap_deferred_clear() is typically called first,
as seen in the __sbitmap_queue_get_batch() function:

	for (i = 0; i < sb->map_nr; i++) {
		struct sbitmap_word *map = &sb->map[index];
		unsigned long get_mask;
		unsigned int map_depth = __map_depth(sb, index);
		unsigned long val;

		sbitmap_deferred_clear(map, 0, 0, 0);

------------------------------------------------------------------------
So I tried to re-call blk_mq_get_tags() until data->nr_tags is zero, like
this:

-	int i, nr = 0;

-	tag_mask = blk_mq_get_tags(data, data->nr_tags, &tag_offset);
-	if (unlikely(!tag_mask))
-		return NULL;
-
-	tags = blk_mq_tags_from_data(data);
-	for (i = 0; tag_mask; i++) {
-		if (!(tag_mask & (1UL << i)))
-			continue;
-		tag = tag_offset + i;
-		prefetch(tags->static_rqs[tag]);
-		tag_mask &= ~(1UL << i);
-		rq = blk_mq_rq_ctx_init(data, tags, tag);
-		rq_list_add_head(data->cached_rqs, rq);
-		nr++;
-	}
-	if (!(data->rq_flags & RQF_SCHED_TAGS))
-		blk_mq_add_active_requests(data->hctx, nr);
-	/* caller already holds a reference, add for remainder */
-	percpu_ref_get_many(&data->q->q_usage_counter, nr - 1);
-	data->nr_tags -= nr;
+	do {
+		int i, nr = 0;
+		tag_mask = blk_mq_get_tags(data, data->nr_tags, &tag_offset);
+		if (unlikely(!tag_mask))
+			return NULL;
+		tags = blk_mq_tags_from_data(data);
+		for (i = 0; tag_mask; i++) {
+			if (!(tag_mask & (1UL << i)))
+				continue;
+			tag = tag_offset + i;
+			prefetch(tags->static_rqs[tag]);
+			tag_mask &= ~(1UL << i);
+			rq = blk_mq_rq_ctx_init(data, tags, tag);
+			rq_list_add_head(data->cached_rqs, rq);
+			nr++;
+		}
+		if (!(data->rq_flags & RQF_SCHED_TAGS))
+			blk_mq_add_active_requests(data->hctx, nr);
+		/* caller already holds a reference, add for remainder */
+		percpu_ref_get_many(&data->q->q_usage_counter, nr - 1);
+		data->nr_tags -= nr;
+	} while (data->nr_tags);

I added a loop structure, and it also achieves good results like before,
but I have a question: although the loop will retry tag allocation
when the required number of tags is not met, there is a risk of an
infinite loop, right? However, I couldn't think of a safer condition
to terminate the loop. Do you have any suggestions?

Thanks,
Xue

Hi,

On 2025/09/12 11:06, Xue He wrote:
> On 2025/09/03 18:35 PM, Yu Kuai wrote:
>> On 2025/09/03 16:41 PM, Xue He wrote:
>>> On 2025/09/02 08:47 AM, Yu Kuai wrote:
>>>> On 2025/09/01 16:22, Xue He wrote:
>>> ......
>>
>> Yes, this function will likely obtain fewer tags than nr_tags: the mask
>> always starts from the first zero bit and spans nr_tags bits, and
>> sbitmap_deferred_clear() is called unconditionally, so it's likely there
>> are non-zero bits within this range.
>>
>> Just wondering, did you consider fixing this directly in
>> __blk_mq_alloc_requests_batch()?
>>
>>  - call sbitmap_deferred_clear() and retry on allocation failure, so
>> that the whole word can be used even if previously allocated requests are
>> done, especially for nvme with huge tag depths;
>>  - retry blk_mq_get_tags() until data->nr_tags is zero;
>
> Hi, Yu Kuai, I'm not entirely sure if I understand correctly, but during
> each tag allocation, sbitmap_deferred_clear() is typically called first,
> as seen in the __sbitmap_queue_get_batch() function:
>
> 	for (i = 0; i < sb->map_nr; i++) {
> 		struct sbitmap_word *map = &sb->map[index];
> 		unsigned long get_mask;
> 		unsigned int map_depth = __map_depth(sb, index);
> 		unsigned long val;
>
> 		sbitmap_deferred_clear(map, 0, 0, 0);

Yes, this is called first each time, and I don't feel this is good for
performance anyway. I think it can be delayed until after a first try,
like in sbitmap_find_bit_in_word().

> ------------------------------------------------------------------------
> So I tried to re-call blk_mq_get_tags() until data->nr_tags is zero, like
> this:
>
> -	int i, nr = 0;
>
> -	tag_mask = blk_mq_get_tags(data, data->nr_tags, &tag_offset);
> -	if (unlikely(!tag_mask))
> -		return NULL;
> -
> -	tags = blk_mq_tags_from_data(data);
> -	for (i = 0; tag_mask; i++) {
> -		if (!(tag_mask & (1UL << i)))
> -			continue;
> -		tag = tag_offset + i;
> -		prefetch(tags->static_rqs[tag]);
> -		tag_mask &= ~(1UL << i);
> -		rq = blk_mq_rq_ctx_init(data, tags, tag);
> -		rq_list_add_head(data->cached_rqs, rq);
> -		nr++;
> -	}
> -	if (!(data->rq_flags & RQF_SCHED_TAGS))
> -		blk_mq_add_active_requests(data->hctx, nr);
> -	/* caller already holds a reference, add for remainder */
> -	percpu_ref_get_many(&data->q->q_usage_counter, nr - 1);
> -	data->nr_tags -= nr;
> +	do {
> +		int i, nr = 0;
> +		tag_mask = blk_mq_get_tags(data, data->nr_tags, &tag_offset);
> +		if (unlikely(!tag_mask))
> +			return NULL;
> +		tags = blk_mq_tags_from_data(data);
> +		for (i = 0; tag_mask; i++) {
> +			if (!(tag_mask & (1UL << i)))
> +				continue;
> +			tag = tag_offset + i;
> +			prefetch(tags->static_rqs[tag]);
> +			tag_mask &= ~(1UL << i);
> +			rq = blk_mq_rq_ctx_init(data, tags, tag);
> +			rq_list_add_head(data->cached_rqs, rq);
> +			nr++;
> +		}
> +		if (!(data->rq_flags & RQF_SCHED_TAGS))
> +			blk_mq_add_active_requests(data->hctx, nr);
> +		/* caller already holds a reference, add for remainder */
> +		percpu_ref_get_many(&data->q->q_usage_counter, nr - 1);
> +		data->nr_tags -= nr;
> +	} while (data->nr_tags);
>
> I added a loop structure, and it also achieves good results like before,
> but I have a question: although the loop will retry tag allocation
> when the required number of tags is not met, there is a risk of an
> infinite loop, right? However, I couldn't think of a safer condition
> to terminate the loop. Do you have any suggestions?

Yes, this is what I have in mind. Why do you think there can be an
infinite loop? We should allocate at least one tag via blk_mq_get_tags()
in each loop iteration, or return directly.

Thanks,
Kuai

> Thanks,
> Xue

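
The termination argument above can be seen in a stripped-down model of the
proposed loop: every pass either obtains at least one bit, so the remaining
count strictly decreases, or it gets an empty mask and bails out. A
simplified user-space sketch of that shape (illustrative only, not the v2
patch; GCC/Clang builtins assumed):

#include <stdio.h>
#include <stdint.h>

/* stand-in for blk_mq_get_tags(): grab up to nr_tags bits from one word */
static uint64_t get_tags(uint64_t *word, int nr_tags)
{
	if (*word == ~0ULL)
		return 0;                               /* nothing free */

	int nr = __builtin_ctzll(~*word);               /* first free bit */
	uint64_t want = (nr_tags >= 64 ? ~0ULL : (1ULL << nr_tags) - 1) << nr;
	uint64_t got = want & ~*word;                   /* includes bit nr, so never empty */

	*word |= got;
	return got;
}

int main(void)
{
	uint64_t tag_word = 0x00000000ff00ff00ULL;      /* fragmented free space */
	int nr_tags = 24;                               /* batch of 24 wanted */

	do {
		uint64_t mask = get_tags(&tag_word, nr_tags);

		if (!mask)                              /* allocation failed: stop */
			break;

		int nr = __builtin_popcountll(mask);    /* nr >= 1 on this path */
		nr_tags -= nr;                          /* strictly decreasing */
		printf("batch got %d tags, %d still needed\n", nr, nr_tags);
	} while (nr_tags);

	return 0;
}

With the assumed word, the first pass returns 16 of the 24 requested tags
and the second pass returns the remaining 8, after which the loop exits.
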
On 2025/09/15 10:22 AM, Yu Kuai wrote:
> On 2025/09/12 11:06, Xue He wrote:
>> On 2025/09/03 18:35 PM, Yu Kuai wrote:
>>> On 2025/09/03 16:41 PM, Xue He wrote:
>>>> On 2025/09/02 08:47 AM, Yu Kuai wrote:
>>>>> On 2025/09/01 16:22, Xue He wrote:
>>>> ......
>>
>> I added a loop structure, and it also achieves good results like before,
>> but I have a question: although the loop will retry tag allocation
>> when the required number of tags is not met, there is a risk of an
>> infinite loop, right? However, I couldn't think of a safer condition
>> to terminate the loop. Do you have any suggestions?
>
> Yes, this is what I have in mind. Why do you think there can be an
> infinite loop? We should allocate at least one tag via blk_mq_get_tags()
> in each loop iteration, or return directly.

I understand your point now. I will send a v2 patch.

Thanks,
Xue

On 2025/09/03 18:35 PM, Yu Kuai wrote:
> On 2025/09/03 16:41 PM, Xue He wrote:
>> On 2025/09/02 08:47 AM, Yu Kuai wrote:
>>> On 2025/09/01 16:22, Xue He wrote:
>> ......
>>
>> The information of my nvme is as follows:
>> number of CPUs: 16
>> memory: 16G
>> nvme nvme0: 16/0/16 default/read/poll queues
>> cat /sys/class/nvme/nvme0/nvme0n1/queue/nr_requests
>> 1023
>>
>> In more precise terms, I think it is not that the tags are fully exhausted,
>> but rather that after scanning the bitmap for free bits, the remaining
>> contiguous bits are insufficient to meet the requirement (available, but
>> not enough). The specific function involved is __sbitmap_queue_get_batch()
>> in lib/sbitmap.c:
>>
>> 	get_mask = ((1UL << nr_tags) - 1) << nr;
>> 	if (nr_tags > 1) {
>> 		printk("before %ld\n", get_mask);
>> 	}
>> 	while (!atomic_long_try_cmpxchg(ptr, &val,
>> 				get_mask | val))
>> 		;
>> 	get_mask = (get_mask & ~val) >> nr;
>>
>> Here, during the batch acquisition of contiguous free bits, an atomic
>> operation is performed, and the tag_mask actually obtained can differ from
>> the one originally requested.
>
> Yes, this function will likely obtain fewer tags than nr_tags: the mask
> always starts from the first zero bit and spans nr_tags bits, and
> sbitmap_deferred_clear() is called unconditionally, so it's likely there
> are non-zero bits within this range.
>
> Just wondering, did you consider fixing this directly in
> __blk_mq_alloc_requests_batch()?
>
>  - call sbitmap_deferred_clear() and retry on allocation failure, so
> that the whole word can be used even if previously allocated requests are
> done, especially for nvme with huge tag depths;
>  - retry blk_mq_get_tags() until data->nr_tags is zero;

I haven't tried this yet, as I'm concerned that if it spins here, it might
introduce more latency. Anyway, I may try to implement this idea and do
some tests to observe the results.

Thanks.