Raise the QEMU NVMe controller's MDTS limit from 9 (2 MiB) to 11 (8 MiB)
to match the Linux host driver's NVME_MAX_BYTES.

Commit 53493c1f83 ("hw/nvme: cap MDTS value for internal limitation")
needed the 2 MiB cap because dma_blk_io() submitted the full sglist
in one preadv()/pwritev() call, which the host kernel rejects when the
iovec count exceeds IOV_MAX.

This series moves the IOV_MAX bound down into dma_blk_cb(), which
batches in IOV_MAX chunks when necessary.
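
The batching idea can be sketched in isolation as follows. This is only an
illustration, not the code from the patches: sketch_count_batches() and
SKETCH_IOV_MAX are hypothetical names, and the sketch only counts batches
where the real callback would submit I/O and re-enter on completion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for IOV_MAX (Linux uses 1024). */
#define SKETCH_IOV_MAX 1024

/*
 * Split nsegs scatter/gather segments into batches of at most max_iov
 * entries. Returns the number of batches and stores the size of the
 * largest batch in *largest. A real implementation would submit each
 * batch with preadv()/pwritev() and continue from the completion path.
 */
static size_t sketch_count_batches(size_t nsegs, size_t max_iov,
                                   size_t *largest)
{
    size_t batches = 0;
    size_t done = 0;

    assert(max_iov > 0);
    *largest = 0;

    while (done < nsegs) {
        size_t batch = nsegs - done;

        if (batch > max_iov) {
            batch = max_iov;    /* never exceed the per-call iovec limit */
        }
        if (batch > *largest) {
            *largest = batch;
        }
        done += batch;
        batches++;
    }
    return batches;
}
```

Assuming a worst case of one segment per 4 KiB page, an 8 MiB transfer has
2048 segments, so sketch_count_batches(2048, SKETCH_IOV_MAX, &largest)
yields two batches of 1024 entries each.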

dma_blk_cb() now stops accumulating at IOV_MAX and submits the next
chunk via the existing re-entry path. With that in place, the DMA-path
nvme_map_addr() guard is removed, MDTS moves to 11 via a new
NVME_MDTS_MAX, and mdts=0 is coerced to the same limit. Linux's host
driver clamps at NVME_MAX_BYTES = 8 MiB regardless of MDTS, so honoring
"unlimited" buys nothing.

Verification with blkalgn and fio:

  fio \
    --name=mdts-stress \
    --filename=/mnt/mdts/stress.fio \
    --rw=randwrite \
    --bs=8M \
    --ioengine=psync \
    --direct=0 \
    --numjobs=8 \
    --time_based \
    --runtime=600 \
    --fsync=8 \
    --end_fsync=1 \
    --group_reporting \
    --refill_buffers \
    --norandommap \
    --offset_align=32k

  blkalgn-libbpf --disk=nvme3n1 --trace
  I/O Granularity Histogram for Device nvme3n1 (lbads: 12 - 4096 bytes)
  Total I/Os: 134993
       Bytes : count    distribution
           0 : 12378    |****                                    |
       32768 : 3622     |*                                       |
       65536 : 2        |                                        |
     1048576 : 71       |                                        |
     2097152 : 827      |                                        |
     2588672 : 50       |                                        |
     3670016 : 62       |                                        |
     4194304 : 357      |                                        |
     4718592 : 62       |                                        |
     5799936 : 50       |                                        |
     6291456 : 644      |                                        |
     7340032 : 71       |                                        |
     8388608 : 116797   |****************************************|

  I/O Alignment Histogram for Device nvme3n1
       Bytes               : count    distribution
           0 -> 1          : 0        |                                        |
           2 -> 3          : 0        |                                        |
           4 -> 7          : 0        |                                        |
           8 -> 15         : 0        |                                        |
          16 -> 31         : 0        |                                        |
          32 -> 63         : 0        |                                        |
          64 -> 127        : 0        |                                        |
         128 -> 255        : 0        |                                        |
         256 -> 511        : 0        |                                        |
         512 -> 1023       : 12378    |**********                              |
        1024 -> 2047       : 0        |                                        |
        2048 -> 4095       : 0        |                                        |
        4096 -> 8191       : 0        |                                        |
        8192 -> 16383      : 0        |                                        |
       16384 -> 32767      : 0        |                                        |
       32768 -> 65535      : 49360    |****************************************|
       65536 -> 131071     : 2        |                                        |
      131072 -> 262143     : 0        |                                        |
      262144 -> 524287     : 0        |                                        |
      524288 -> 1048575    : 36747    |*****************************           |
     1048576 -> 2097151    : 36506    |*****************************           |

  /sys/block/nvme3n1/queue/
    max_segments        = 256
    max_segment_size    = 4294967295
    max_hw_sectors_kb   = 8192
    max_sectors_kb      = 8192
    virt_boundary_mask  = 0
    logical_block_size  = 4096
    physical_block_size = 4096
    minimum_io_size     = 4096
    optimal_io_size     = 0
    dma_alignment       = 3
    chunk_sectors       = 0
    nr_requests         = 1023

  nvme id-ctrl /dev/nvme3n1
    mdts      : 11
    cntrltype : 1
    sgls      : 0x80001
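
For reference, the reported mdts value and the 8192 KiB max_hw_sectors_kb
can be related with a small calculation. MDTS is a power-of-two exponent in
units of the controller's minimum memory page size. The sketch below
assumes a 4 KiB minimum page size, which is what the 9 = 2 MiB and
11 = 8 MiB figures in this cover letter imply. sketch_mdts_to_bytes() is a
hypothetical helper name, not a QEMU or kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert an MDTS exponent to a byte limit:
 *   bytes = min_page_size * 2^mdts
 * min_page_size is an assumption supplied by the caller (4096 here).
 */
static uint64_t sketch_mdts_to_bytes(unsigned int mdts,
                                     uint64_t min_page_size)
{
    assert(mdts < 32);  /* keep the shift well within 64 bits */
    return min_page_size << mdts;
}
```

With these assumptions, sketch_mdts_to_bytes(9, 4096) gives 2097152 bytes
(2 MiB) and sketch_mdts_to_bytes(11, 4096) gives 8388608 bytes (8 MiB, or
8192 KiB).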

  xfs_info /mnt/mdts
  meta-data=/dev/nvme3n1           isize=512    agcount=4, agsize=163840 blks
           =                       sectsz=32768 attr=2, projid32bit=1
           =                       crc=1        finobt=1, sparse=1, rmapbt=1
           =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
           =                       exchange=0   metadir=0
  data     =                       bsize=32768  blocks=655360, imaxpct=25
           =                       sunit=0      swidth=0 blks
  naming   =version 2              bsize=32768  ascii-ci=0, ftype=1, parent=0
  log      =internal log           bsize=32768  blocks=5446, version=2
           =                       sectsz=32768 sunit=1 blks, lazy-count=1
  realtime =none                   extsz=32768  blocks=0, rtextents=0
           =                       rgcount=0    rgsize=0 extents
           =                       zoned=0      start=0 reserved=0

Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
---
Daniel Gomez (3):
      dma-helpers: chunk dma_blk_cb at IOV_MAX
      hw/nvme: drop DMA-path IOV_MAX guard
      hw/nvme: allow mdts up to 8192 KiB

 hw/nvme/ctrl.c       | 15 +++++++++------
 system/dma-helpers.c |  3 ++-
 2 files changed, 11 insertions(+), 7 deletions(-)
---
base-commit: 98b060da3a4f92b2a994ead5b16a87e783baf77c
change-id: 20260528-align-nvme-mdts-with-linux-67d618f6730b
Best regards,
--
Daniel Gomez <da.gomez@samsung.com>