From: Keith Busch <kbusch@kernel.org>
We're already iterating every segment, so check these for a valid IO at
the same time.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
block/blk-merge.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 70d704615be52..81bdad915699a 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -298,6 +298,9 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
unsigned nsegs = 0, bytes = 0;
bio_for_each_bvec(bv, bio, iter) {
+ if (bv.bv_offset & lim->dma_alignment)
+ return -EINVAL;
+
/*
* If the queue doesn't support SG gaps and adding this
* offset would create a gap, disallow it.
@@ -341,6 +344,8 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
* we do not use the full hardware limits.
*/
bytes = ALIGN_DOWN(bytes, bio_split_alignment(bio, lim));
+ if (!bytes)
+ return -EINVAL;
/*
* Bio splitting may cause subtle trouble such as hang when doing sync
--
2.47.3
On Tue, Aug 05, 2025 at 07:11:17AM -0700, Keith Busch wrote: > bio_for_each_bvec(bv, bio, iter) { > + if (bv.bv_offset & lim->dma_alignment) > + return -EINVAL; I have a question about this part here because testing with various scsi is showing odd results. NVMe wants this check to actually be this: if ((bv.bv_offset | bv.bv_len) & lim->dma_alignment) because the dma alignment defines not only the starting address offset, but also the length. NVMe's default alignment is 4 bytes. But I can't make that change because many scsi devices don't set the dma alignment and get the default 511 value. This is fine for the memory address offset, but the lengths sent for various inquriy commands are much smaller, like 4 and 32 byte lengths. That length wouldn't pass the dma alignment granularity, so I think the default value is far too conservative. Does the address start size need to be a different limit than minimum length? I feel like they should be the same, but maybe that's just an nvme thing.
Hi Keith! > NVMe wants this check to actually be this: > > if ((bv.bv_offset | bv.bv_len) & lim->dma_alignment) > > because the dma alignment defines not only the starting address offset, > but also the length. NVMe's default alignment is 4 bytes. dma_alignment defines the alignment of the DMA starting address. We don't have a dedicated queue limit for the transfer length granularity described by NVMe. -- Martin K. Petersen
On Wed, Aug 13, 2025 at 05:23:47PM -0400, Martin K. Petersen wrote: > dma_alignment defines the alignment of the DMA starting address. We > don't have a dedicated queue limit for the transfer length granularity > described by NVMe. Darn, but thanks for confirming. I'll see if I can get by without needing a new limit, or look into adding one if not. Worst case, I can also let the device return the error, though I think we prefer not to send an IO that we know should fail.
On Wed, Aug 13, 2025 at 03:52:44PM -0600, Keith Busch wrote: > On Wed, Aug 13, 2025 at 05:23:47PM -0400, Martin K. Petersen wrote: > > dma_alignment defines the alignment of the DMA starting address. We > > don't have a dedicated queue limit for the transfer length granularity > > described by NVMe. > > Darn, but thanks for confirming. I'll see if I can get by without > needing a new limit, or look into adding one if not. Worst case, I can > also let the device return the error, though I think we prefer not to > send an IO that we know should fail. Allowing an unprivileged user application to trivially trigger an I/O error sounds like a bad idea. But I think simply applying the dma_alignment to the length of READ/WRITE commands might be work, while ignoring it for other types of commands.
On 8/13/25 1:06 PM, Keith Busch wrote: > But I can't make that change because many scsi devices don't set the dma > alignment and get the default 511 value. This is fine for the memory > address offset, but the lengths sent for various inquriy commands are > much smaller, like 4 and 32 byte lengths. That length wouldn't pass the > dma alignment granularity, so I think the default value is far too > conservative. Does the address start size need to be a different limit > than minimum length? I feel like they should be the same, but maybe > that's just an nvme thing. Hi Keith, Maybe I misunderstood your question. It seems to me that the SCSI core sets the DMA alignment by default to four bytes. From drivers/scsi/hosts.c: /* 32-byte (dword) is a common minimum for HBAs. */ if (sht->dma_alignment) shost->dma_alignment = sht->dma_alignment; else shost->dma_alignment = 3; Bart.
On Wed, Aug 13, 2025 at 01:41:49PM -0700, Bart Van Assche wrote: > On 8/13/25 1:06 PM, Keith Busch wrote: > > But I can't make that change because many scsi devices don't set the dma > > alignment and get the default 511 value. This is fine for the memory > > address offset, but the lengths sent for various inquriy commands are > > much smaller, like 4 and 32 byte lengths. That length wouldn't pass the > > dma alignment granularity, so I think the default value is far too > > conservative. Does the address start size need to be a different limit > > than minimum length? I feel like they should be the same, but maybe > > that's just an nvme thing. > > Hi Keith, > > Maybe I misunderstood your question. It seems to me that the SCSI core > sets the DMA alignment by default to four bytes. From > drivers/scsi/hosts.c: Thanks, I think you got my meaning. I'm using the AHCI driver. It looks like ata_scsi_dev_config() overrides the dma_alignment to sector_size - 1, and that pattern goes way back, almost 20 years ago, so maybe I can't change it.
On 8/14/25 6:01 AM, Keith Busch wrote: > On Wed, Aug 13, 2025 at 01:41:49PM -0700, Bart Van Assche wrote: >> On 8/13/25 1:06 PM, Keith Busch wrote: >>> But I can't make that change because many scsi devices don't set the dma >>> alignment and get the default 511 value. This is fine for the memory >>> address offset, but the lengths sent for various inquriy commands are >>> much smaller, like 4 and 32 byte lengths. That length wouldn't pass the >>> dma alignment granularity, so I think the default value is far too >>> conservative. Does the address start size need to be a different limit >>> than minimum length? I feel like they should be the same, but maybe >>> that's just an nvme thing. >> >> Hi Keith, >> >> Maybe I misunderstood your question. It seems to me that the SCSI core >> sets the DMA alignment by default to four bytes. From >> drivers/scsi/hosts.c: > > Thanks, I think you got my meaning. > > I'm using the AHCI driver. It looks like ata_scsi_dev_config() overrides > the dma_alignment to sector_size - 1, and that pattern goes way back, > almost 20 years ago, so maybe I can't change it. That is probably buggy now in the sense that the scsi layer should be able to send any command with a size not aligned to the LBA size or ATA sector (512 B) and libata-scsi SAT should do the translation using an internal 512B aligned command size. What makes a mess here is that SCSI allows having a media-access command specifying a transfer size that is not aligned on the LBA size. The transfer will be "short" in that case, which is perfectly fine with SCSI. But ATA does not allow that. It is all or nothing and the command size thus must always be aligned to the LBA size. I think that dma_alignment was abused to check that. But I think it should not be too hard to check the alignment in libata-scsi when translating the command. SAS HBAs should be doing something similar too. Have never exactly tested that though, and I am afraid how many SAS HBAs will not like unaligned command to ATA devices... We also have the different alignment for management commands (most of which use 512B sector size) and media access commands which use the actual device LBA size alignment. So it is a mess :) -- Damien Le Moal Western Digital Research
On 8/13/25 2:01 PM, Keith Busch wrote: > I'm using the AHCI driver. It looks like ata_scsi_dev_config() overrides > the dma_alignment to sector_size - 1, and that pattern goes way back, > almost 20 years ago, so maybe I can't change it. From an AHCI specification (https://www.intel.com/content/dam/www/public/us/en/documents/technical-specifications/serial-ata-ahci-spec-rev1-3-1.pdf): "Data Base Address (DBA): Indicates the 32-bit physical address of the data block. The block must be word aligned, indicated by bit 0 being reserved." Please note that I have no experience with programming AHCI controllers. Bart.
On Tue, Aug 05, 2025 at 07:11:17AM -0700, Keith Busch wrote: > @@ -341,6 +344,8 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim, > * we do not use the full hardware limits. > */ > bytes = ALIGN_DOWN(bytes, bio_split_alignment(bio, lim)); > + if (!bytes) > + return -EINVAL; How is this related to the other hunk and the patch description?
On Sun, Aug 10, 2025 at 07:37:36AM -0700, Christoph Hellwig wrote: > On Tue, Aug 05, 2025 at 07:11:17AM -0700, Keith Busch wrote: > > @@ -341,6 +344,8 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim, > > * we do not use the full hardware limits. > > */ > > bytes = ALIGN_DOWN(bytes, bio_split_alignment(bio, lim)); > > + if (!bytes) > > + return -EINVAL; > > How is this related to the other hunk and the patch description? The patchset allows you to submit an io with vectors that are partial logical blocks. Misuse could create a bio that exceeds the device max vectors or introduces virtual boundary gaps, requiring a split into something that is smaller than a block size. This check catches that. Quick example: nvme with a 4k logical block size, and the usual 4k virtual boundary. Send an io with four vectors iov_len=1k. The total size is block sized, but there's no way that could split into a valid io. There's a test specifically for this in my reply about xfstests.
On Sun, Aug 10, 2025 at 09:39:50AM -0600, Keith Busch wrote: > > > bytes = ALIGN_DOWN(bytes, bio_split_alignment(bio, lim)); > > > + if (!bytes) > > > + return -EINVAL; > > > > How is this related to the other hunk and the patch description? > > The patchset allows you to submit an io with vectors that are partial > logical blocks. Misuse could create a bio that exceeds the device max > vectors or introduces virtual boundary gaps, requiring a split into > something that is smaller than a block size. This check catches that. > > Quick example: nvme with a 4k logical block size, and the usual 4k > virtual boundary. Send an io with four vectors iov_len=1k. The total > size is block sized, but there's no way that could split into a valid > io. There's a test specifically for this in my reply about xfstests. Can you turn the above into a comment explaining the check?
© 2016 - 2025 Red Hat, Inc.