The block layer's integrity code currently sets the seed (initial
reference tag) in units of 512-byte sectors but increments it in units
of integrity intervals. Not only do the T10 DIF formats require ref tags
to be the lower bits of the logical block address, but mixing the two
units means the ref tags used for a particular logical block vary based
on its offset within a read/write request. This looks to be a
longstanding bug affecting block devices that support integrity with
block sizes > 512 bytes; I'm surprised it wasn't noticed before.
Also fix the newly added fs_bio_integrity_verify() to pass
bio_integrity_verify() a struct bvec_iter representing the data instead
of the integrity. Most of the integrity data is currently being skipped.
v3:
- Drop bi and bip arguments to bip_set_seed() (Christoph)
v2:
- Reorder fixes before refactoring commits
- Use u64, SECTOR_SHIFT (Christoph)
- Don't take sector_t in bip_set_seed() (Christoph)
Caleb Sander Mateos (6):
block: use integrity interval instead of sector as seed
bio-integrity-fs: pass data iter to bio_integrity_verify()
blk-integrity: take u64 in bio_integrity_intervals()
bio-integrity-fs: use integrity interval instead of sector as seed
t10-pi: use bio_integrity_intervals() helper
blk-integrity: avoid sector_t in bip_{get,set}_seed()
block/bio-integrity-fs.c | 8 ++++++--
block/bio-integrity.c | 4 ++--
block/t10-pi.c | 7 ++++---
drivers/nvme/target/io-cmd-bdev.c | 3 +--
drivers/target/target_core_iblock.c | 3 +--
include/linux/bio-integrity.h | 11 -----------
include/linux/blk-integrity.h | 27 ++++++++++++++++++++-------
include/linux/bvec.h | 1 +
8 files changed, 35 insertions(+), 29 deletions(-)
--
2.45.2
Hi Caleb!
> The block layer's integrity code currently sets the seed (initial
> reference tag) in units of 512-byte sectors but increments it in units
> of integrity intervals
I don't necessarily agree with the premise that the seed needs to be
expressed in any particular unit. The seed is a start value, nothing
more.
We happen to set it to the block number in the block layer since we need
to be able to know what to compare against on completion (for Type 1 +
the restrictive Linux implementation of Type 2). But that does not imply
that the seed needs to be specified in any particular unit. Submitters
set the seed to whichever value makes sense to them (i.e. it could be
the offset within a file as opposed to the eventual LBA on the backend
device). And then that seed is incremented by 1 for each integrity
interval of data in the PI sent to/received from the device. The
conversion between the submitter's view of what the first ref tag should
be (i.e. seed) and what is required by the hardware (for instance lower
32 bits of device LBA) is the reason we perform remapping. The seed is
intentionally different in the submitter's protection envelope compared
to the device's protection envelope.
Using the block layer block number as seed was just a convenience since
that provided a predictable value for any I/O that had its PI
autogenerated. I never intended for the actual LBA to be used as seed
value on a 4Kn device. Initially we just used 0 as the seed. Leveraging
the block number just added a bit of additional protection.
I confess I haven't tested 4Kn in a while since things sort of converged
on 512e. But I used to run nightly tests on a SCSI storage with 4Kn
blocks just fine.
> This looks to be a longstanding bug affecting block devices that
> support integrity with block sizes > 512 bytes; I'm surprised it
> wasn't noticed before.
Are you seeing this with NVMe or SCSI?
--
Martin K. Petersen
On Thu, Apr 16, 2026 at 8:26 PM Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>
>
> Hi Caleb!
>
> > The block layer's integrity code currently sets the seed (initial
> > reference tag) in units of 512-byte sectors but increments it in units
> > of integrity intervals
>
> I don't necessarily agree with the premise that the seed needs to be
> expressed in any particular unit. The seed is a start value, nothing
> more.
NVM Command Set specification 1.1 section 5.3.3 requires the reference
tag to increment by 1 per logical block, so that seems to determine
the increment unit:
> If the Reference Tag Check bit of the PRCHK field is set to ‘1’ and the namespace is
> formatted for Type 1 or Type 2 protection, then the controller compares the Logical Block
> Reference Tag to the computed reference tag. The computed reference tag depends on
> the Protection Type:
> ▪ If the namespace is formatted for Type 1 protection, the value of the computed
> reference tag for the first logical block of the command is the value contained in
> the Initial Logical Block Reference Tag (ILBRT) or Expected Initial Logical Block
> Reference Tag (EILBRT) field in the command, and the computed reference tag is
> incremented for each subsequent logical block. The controller shall complete the
> command with a status of Invalid Protection Information if the ILBRT field or the
> EILBRT field does not match the value of the least significant bits of the SLBA field
> sized to the number of bits in the Logical Block Reference Tag (refer to section
> 5.3.1.4).
> Note: Unlike SCSI Protection Information Type 1 protection which implicitly uses
> the least significant four bytes of the LBA, the controller always uses the ILBRT or
> EILBRT field and requires the host to initialize the ILBRT or EILBRT field to the
> least significant bits of the LBA sized to the number of bits in the Logical Block
> Reference Tag when Type 1 protection is used.
> ▪ If the namespace is formatted for Type 2 protection, the value of the computed
> reference tag for the first logical block of the command is the value contained in
> the Initial Logical Block Reference Tag (ILBRT) or Expected Initial Logical Block
> Reference Tag (EILBRT) field in the command, and the computed reference tag is
> incremented for each subsequent logical block.
The ref tag used for a particular block needs to be consistent. And
since reftag(block N) can be computed as the reftag(M) + N - M if
block N is accessed as part of an I/O that begins at block M, the
function must be of the form reftag(block N) = N + c for some constant
c. Thus, the ref tag seed needs to be computed in units of logical
blocks (integrity intervals); no other unit (e.g. 512-byte sectors)
works.
To see the issue with the current approach, consider an example
accessing LBA 1 on a device with a 4 KB block size. If the block is
written as part of a write that begins at LBA 0, its ref tag in the
generated PI will be 1 (sector 0 + 1 integrity interval). If it's
later read by a read starting at LBA 1, its expected ref tag will be 8
(sector 8 + 0 integrity intervals), and the auto-integrity code will
fail the read due to a reftag mismatch. This seems completely
unworkable for a block storage device.
>
> We happen to set it to the block number in the block layer since we need
> to be able to know what to compare against on completion (for Type 1 +
> the restrictive Linux implementation of Type 2). But that does not imply
> that the seed needs to be specified in any particular unit. Submitters
> set the seed to whichever value makes sense to them (i.e. it could be
> the offset within a file as opposed to the eventual LBA on the backend
I agree, the seed doesn't need to match the final LBA, but it does
need to be in *units* of logical blocks, plus some constant offset.
> device). And then that seed is incremented by 1 for each integrity
> interval of data in the PI sent to/received from the device. The
> conversion between the submitter's view of what the first ref tag should
> be (i.e. seed) and what is required by the hardware (for instance lower
> 32 bits of device LBA) is the reason we perform remapping. The seed is
> intentionally different in the submitter's protection envelope compared
> to the device's protection envelope.
>
> Using the block layer block number as seed was just a convenience since
> that provided a predictable value for any I/O that had its PI
> autogenerated. I never intended for the actual LBA to be used as seed
> value on a 4Kn device. Initially we just used 0 as the seed. Leveraging
> the block number just added a bit of additional protection.
>
> I confess I haven't tested 4Kn in a while since things sort of converged
> on 512e. But I used to run nightly tests on a SCSI storage with 4Kn
> blocks just fine.
>
> > This looks to be a longstanding bug affecting block devices that
> > support integrity with block sizes > 512 bytes; I'm surprised it
> > wasn't noticed before.
>
> Are you seeing this with NVMe or SCSI?
With a ublk device. It should affect any block device that supports
integrity and has a logical block size > 512.
Best,
Caleb
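
The LBA 1 example in the message above can be checked with a few lines of
arithmetic. The following is a minimal userspace sketch, not kernel code:
the helper names ref_tag_sector_seed() and ref_tag_interval_seed() are
hypothetical, offsets are given in bytes, and the sketch deliberately
ignores any ref tag remapping done further down the stack.

```c
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors */

/*
 * Ref tag assigned to the integrity interval at byte offset `off` when
 * the I/O starts at byte offset `start`. The seed is the starting
 * *sector*, but the increment is counted in integrity intervals
 * (mixed units).
 */
static inline uint64_t ref_tag_sector_seed(uint64_t start, uint64_t off,
					   unsigned int interval_exp)
{
	uint64_t seed = start >> SECTOR_SHIFT;
	uint64_t index = (off - start) >> interval_exp;

	return seed + index;
}

/* Same computation with the seed expressed in integrity intervals. */
static inline uint64_t ref_tag_interval_seed(uint64_t start, uint64_t off,
					     unsigned int interval_exp)
{
	uint64_t seed = start >> interval_exp;
	uint64_t index = (off - start) >> interval_exp;

	return seed + index;
}
```

With interval_exp = 12 (4 KB intervals), the block at byte offset 4096
gets ref tag 1 when the I/O starts at offset 0 and ref tag 8 when the I/O
starts at offset 4096 under the sector seed. The interval seed yields 1
in both cases.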
Hi Caleb!
> NVM Command Set specification 1.1 section 5.3.3 requires the reference
> tag to increment by 1 per logical block, so that seems to determine
> the increment unit:
SCSI allows PI to be interleaved at intervals smaller than the logical
block size. This was done for PI compatibility in mixed environments
with both 512[en] and 4Kn disks. Interleaving allows 8 bytes of PI per
512 bytes of data on devices using 4 KB logical blocks. That is the
reason why we use the term "integrity interval" instead of assuming
logical block size.
> The ref tag used for a particular block needs to be consistent. And
> since reftag(block N) can be computed as the reftag(M) + N - M if
> block N is accessed as part of an I/O that begins at block M, the
> function must be of the form reftag(block N) = N + c for some constant
> c. Thus, the ref tag seed needs to be computed in units of logical
> blocks (integrity intervals); no other unit (e.g. 512-byte sectors)
> works.
Whoever attaches the PI decides on the seed value. In the case of the
block layer it made sense to use block layer sector number since that
value is inevitably going to be the same for a future read.
Note that with MD, DM, and partitioning in the mix, the sector number
seen by whoever submits the I/O is going to be different from the LBAs
on the target devices which eventually receive the I/O. Nobody says
there is a computable constant offset. Think scattered LVM extent
allocations. Or RAID stripes placed at mismatched LBA offsets.
> To see the issue with the current approach, consider an example
> accessing LBA 1 on a device with a 4 KB block size. If the block is
> written as part of a write that begins at LBA 0, its ref tag in the
> generated PI will be 1 (sector 0 + 1 integrity interval). If it's
> later read by a read starting at LBA 1, its expected ref tag will be 8
> (sector 8 + 0 integrity intervals), and the auto-integrity code will
> fail the read due to a reftag mismatch.
Something is broken, then. Because the ref tag in the received PI should
have been remapped to start at 8 in that case.
> I agree, the seed doesn't need to match the final LBA, but it does
> need to be in *units* of logical blocks, plus some constant offset.
Your concept of "unit" still sends the wrong message. The seed is an
integer value used to initialize a counter or hardware register. The
seed only has meaning to whichever entity submits the I/O. To everything
else it is a value used for remapping ref tags from the I/O submitter's
point of view to whichever interpretation is mandated by the storage
hardware's PI format.
> With a ublk device. It should affect any block device that supports
> integrity and has a logical block size > 512.
It sounds like the seed value is set incorrectly for reads in your
configuration.
--
Martin K. Petersen
On Mon, Apr 20, 2026 at 7:09 PM Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>
>
> Hi Caleb!
>
> > NVM Command Set specification 1.1 section 5.3.3 requires the reference
> > tag to increment by 1 per logical block, so that seems to determine
> > the increment unit:
>
> SCSI allows PI to be interleaved at intervals smaller than the logical
> block size. This was done for PI compatibility in mixed environments
> with both 512[en] and 4Kn disks. Interleaving allows 8 bytes of PI per
> 512 bytes of data on devices using 4 KB logical blocks. That is the
> reason why we use the term "integrity interval" instead of assuming
> logical block size.
Thanks for the explanation, I'm not too familiar with SCSI. I meant to
refer to integrity intervals in my explanation if they differ from the
logical block size.
>
> > The ref tag used for a particular block needs to be consistent. And
> > since reftag(block N) can be computed as the reftag(M) + N - M if
> > block N is accessed as part of an I/O that begins at block M, the
> > function must be of the form reftag(block N) = N + c for some constant
> > c. Thus, the ref tag seed needs to be computed in units of logical
> > blocks (integrity intervals); no other unit (e.g. 512-byte sectors)
> > works.
>
> Whoever attaches the PI decides on the seed value. In the case of the
> block layer it made sense to use block layer sector number since that
> value is inevitably going to be the same for a future read.
I'm not following "going to be the same for a future read". The block
can be read back by an I/O with a different starting
offset/sector/seed, as my example illustrates. When the integrity
interval size differs from the sector size (512 bytes), mixing the two
units results in a different ref tag seed for the block depending on
the starting offset of the I/O.
>
> Note that with MD, DM, and partitioning in the mix, the sector number
> seen by whoever submits the I/O is going to be different from the LBAs
> on the target devices which eventually receive the I/O. Nobody says
> there is a computable constant offset. Think scattered LVM extent
> allocations. Or RAID stripes placed at mismatched LBA offsets.
The constant offset relationship still needs to hold over any
contiguous range of a backing block device that can be accessed by a
single I/O. For example, with partitions, it's not possible for a
single I/O to cross a partition boundary, so each partition can have a
different constant offset between the ref tags and absolute integrity
interval numbers. With RAID, each shard can have a different constant
offset. etc.
>
> > To see the issue with the current approach, consider an example
> > accessing LBA 1 on a device with a 4 KB block size. If the block is
> > written as part of a write that begins at LBA 0, its ref tag in the
> > generated PI will be 1 (sector 0 + 1 integrity interval). If it's
> > later read by a read starting at LBA 1, its expected ref tag will be 8
> > (sector 8 + 0 integrity intervals), and the auto-integrity code will
> > fail the read due to a reftag mismatch.
>
> Something is broken, then. Because the ref tag in the received PI should
> have been remapped to start at 8 in that case.
Ah, I missed the remapping piece. Thanks for pointing that out. I
guess I was testing with a ublk device that doesn't advertise
BLK_INTEGRITY_REF_TAG. Since commit 203247c5cb97 ("blk-integrity:
support arbitrary buffer alignment"), the ref tag is unconditionally
set in the PI from the (sector) seed, but the remapping is conditional
on BLK_INTEGRITY_REF_TAG. That explains why I was seeing ref tags in
the PI that didn't match the integrity interval numbers.
So seems like patch 1 ("block: use integrity interval instead of
sector as seed") doesn't need a Fixes tag. Still, I'm confused why the
auto-integrity code bothers setting the seed to the sector number in
the first place if it's going to be remapped later. Why not just leave
the seed zeroed?
Best,
Caleb
>
> > I agree, the seed doesn't need to match the final LBA, but it does
> > need to be in *units* of logical blocks, plus some constant offset.
>
> Your concept of "unit" still sends the wrong message. The seed is an
> integer value used to initialize a counter or hardware register. The
> seed only has meaning to whichever entity submits the I/O. To everything
> else it is a value used for remapping ref tags from the I/O submitter's
> point of view to whichever interpretation is mandated by the storage
> hardware's PI format.
>
> > With a ublk device. It should affect any block device that supports
> > integrity and has a logical block size > 512.
>
> It sounds like the seed value is set incorrectly for reads in your
> configuration.
>
> --
> Martin K. Petersen
On Thu, Apr 23, 2026 at 11:02 AM Caleb Sander Mateos
<csander@purestorage.com> wrote:
>
> On Mon, Apr 20, 2026 at 7:09 PM Martin K. Petersen
> <martin.petersen@oracle.com> wrote:
> >
> >
> > Hi Caleb!
> >
> > > NVM Command Set specification 1.1 section 5.3.3 requires the reference
> > > tag to increment by 1 per logical block, so that seems to determine
> > > the increment unit:
> >
> > SCSI allows PI to be interleaved at intervals smaller than the logical
> > block size. This was done for PI compatibility in mixed environments
> > with both 512[en] and 4Kn disks. Interleaving allows 8 bytes of PI per
> > 512 bytes of data on devices using 4 KB logical blocks. That is the
> > reason why we use the term "integrity interval" instead of assuming
> > logical block size.
>
> Thanks for the explanation, I'm not too familiar with SCSI. I meant to
> refer to integrity intervals in my explanation if they differ from the
> logical block size.
>
> >
> > > The ref tag used for a particular block needs to be consistent. And
> > > since reftag(block N) can be computed as the reftag(M) + N - M if
> > > block N is accessed as part of an I/O that begins at block M, the
> > > function must be of the form reftag(block N) = N + c for some constant
> > > c. Thus, the ref tag seed needs to be computed in units of logical
> > > blocks (integrity intervals); no other unit (e.g. 512-byte sectors)
> > > works.
> >
> > Whoever attaches the PI decides on the seed value. In the case of the
> > block layer it made sense to use block layer sector number since that
> > value is inevitably going to be the same for a future read.
>
> I'm not following "going to be the same for a future read". The block
> can be read back by an I/O with a different starting
> offset/sector/seed, as my example illustrates. When the integrity
> interval size differs from the sector size (512 bytes), mixing the two
> units results in a different ref tag seed for the block depending on
> the starting offset of the I/O.
>
> >
> > Note that with MD, DM, and partitioning in the mix, the sector number
> > seen by whoever submits the I/O is going to be different from the LBAs
> > on the target devices which eventually receive the I/O. Nobody says
> > there is a computable constant offset. Think scattered LVM extent
> > allocations. Or RAID stripes placed at mismatched LBA offsets.
>
> The constant offset relationship still needs to hold over any
> contiguous range of a backing block device that can be accessed by a
> single I/O. For example, with partitions, it's not possible for a
> single I/O to cross a partition boundary, so each partition can have a
> different constant offset between the ref tags and absolute integrity
> interval numbers. With RAID, each shard can have a different constant
> offset. etc.
>
> >
> > > To see the issue with the current approach, consider an example
> > > accessing LBA 1 on a device with a 4 KB block size. If the block is
> > > written as part of a write that begins at LBA 0, its ref tag in the
> > > generated PI will be 1 (sector 0 + 1 integrity interval). If it's
> > > later read by a read starting at LBA 1, its expected ref tag will be 8
> > > (sector 8 + 0 integrity intervals), and the auto-integrity code will
> > > fail the read due to a reftag mismatch.
> >
> > Something is broken, then. Because the ref tag in the received PI should
> > have been remapped to start at 8 in that case.
>
> Ah, I missed the remapping piece. Thanks for pointing that out. I
> guess I was testing with a ublk device that doesn't advertise
> BLK_INTEGRITY_REF_TAG. Since commit 203247c5cb97 ("blk-integrity:
> support arbitrary buffer alignment"), the ref tag is unconditionally
> set in the PI from the (sector) seed, but the remapping is conditional
> on BLK_INTEGRITY_REF_TAG. That explains why I was seeing ref tags in
> the PI that didn't match the integrity interval numbers.
>
> So seems like patch 1 ("block: use integrity interval instead of
> sector as seed") doesn't need a Fixes tag. Still, I'm confused why the
> auto-integrity code bothers setting the seed to the sector number in
> the first place if it's going to be remapped later. Why not just leave
> the seed zeroed?
Martin,
I would appreciate a response here. Would you be okay with patch 1 if
the Fixes tags were dropped? Do you think we can get rid of the ref
tag seed initialization entirely if the ref tags get remapped later
anyways? Even if patch 1 is not required for correctness, patch 2 is a
fix for a separate issue introduced in the 7.1 merge window and has
reviews from Christoph and Anuj. I would prefer not to hold up that
fix over this ref tag seed discussion.
Best,
Caleb
Hi Caleb!
Sorry about the delay. Been away for a few weeks...
>> So seems like patch 1 ("block: use integrity interval instead of
>> sector as seed") doesn't need a Fixes tag. Still, I'm confused why
>> the auto-integrity code bothers setting the seed to the sector number
>> in the first place if it's going to be remapped later. Why not just
>> leave the seed zeroed?
It adds a bit of extra protection in the sense that there is one more
parameter that can be validated. The premise of the integrity
infrastructure is that things in the two supplied buffers (data + PI) as
well as the control path (bip in the block layer case plus the SCSI or
NVMe command fields) all need to agree for the I/O to go through.
It is valid to generate the PI starting with 0. But that is
indistinguishable from "the seed value was not initialized".
> I would appreciate a response here. Would you be okay with patch 1 if
> the Fixes tags were dropped?
I am afraid I still don't completely understand why things are broken.
For writes, the meaning of the bip seed is: "This is the value you
should expect in the ref tag for the first integrity interval in the PI
buffer I prepared". With block layer autoprotect, the seed is set before
generating the PI and thus implicitly affects the generation.
When the write operation subsequently reaches the bottom of the stack,
we will check that the first ref tag in the PI buffer matches the
supplied seed value. And then proceed to remap the ref tags for each
protection interval to the target LBA + n since that is what the storage
requires (ignoring the odd Type 2 interval mismatch for now).
For reads, the meaning of the bip seed is: "This is what I expect to
receive in the ref tag for the first integrity interval in the PI
buffer". At the bottom of the stack we will receive PI from the storage
and that will contain ref tags matching the lower 32 bits of the LBA
since that is what the hardware returns. And we will then remap all
those ref tags starting with whichever bip seed value was requested by
the caller. It doesn't matter whether the requested seed value was 0,
10, or 42. The ref tags are remapped to whatever the caller wants them
to be.
I tend to think of the seed as a register you program with the value you
want. And then hardware or software remaps between what the storage
device's protection envelope requires and what the application (or in
this case the block layer) requested. With SCSI + DIX 1.1, the seed
literally controls a remapping register in the HBA ASIC. In NVMe we have
ILBRT/EILBRT.
--
Martin K. Petersen
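
The read-side remapping described in the message above can be sketched in
userspace C. This is a simplified illustration under stated assumptions:
the function name remap_read_ref_tags() is hypothetical, ref tags are
modelled as a flat array of 32-bit values rather than PI tuples, and
remapping is assumed to be enabled for the device.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Remap the ref tags of a completed read from the device's view (low 32
 * bits of the physical interval number) to the submitter's view
 * (seed + n). Tags that do not carry the expected device value are left
 * untouched so that a later verify step can flag them.
 * Returns the number of tags that were remapped.
 */
static inline size_t remap_read_ref_tags(uint32_t *tags, size_t count,
					 uint64_t phys_start, uint64_t seed)
{
	size_t remapped = 0;

	for (size_t n = 0; n < count; n++) {
		if (tags[n] == (uint32_t)(phys_start + n)) {
			tags[n] = (uint32_t)(seed + n);
			remapped++;
		}
	}
	return remapped;
}
```

In this model the caller may request any seed (0, 10, or 42): after the
remap, the first tag equals the seed and each following tag is one
higher, independent of where the data lives on the device.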
On Tue, May 12, 2026 at 7:16 PM Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>
>
> Hi Caleb!
>
> Sorry about the delay. Been away for a few weeks...
>
> >> So seems like patch 1 ("block: use integrity interval instead of
> >> sector as seed") doesn't need a Fixes tag. Still, I'm confused why
> >> the auto-integrity code bothers setting the seed to the sector number
> >> in the first place if it's going to be remapped later. Why not just
> >> leave the seed zeroed?
>
> It adds a bit of extra protection in the sense that there is one more
> parameter that can be validated. The premise of the integrity
> infrastructure is that things in the two supplied buffers (data + PI) as
> well as the control path (bip in the block layer case plus the SCSI or
> NVMe command fields) all need to agree for the I/O to go through.
>
> It is valid to generate the PI starting with 0. But that is
> indistinguishable from "the seed value was not initialized".
>
> > I would appreciate a response here. Would you be okay with patch 1 if
> > the Fixes tags were dropped?
>
> I am afraid I still don't completely understand why things are broken.
Nothing is broken, I just mean that the seed value stored in
bip_iter.bi_sector is strange in that it's initialized in units of
512-byte sectors but incremented in units of integrity intervals. As
you point out, the remapping step makes the initial seed value
irrelevant, but I was certainly confused by it when I printed it
during some debugging. I can update the commit message to clarify the
rationale for the change.
>
> For writes, the meaning of the bip seed is: "This is the value you
> should expect in the ref tag for the first integrity interval in the PI
> buffer I prepared". With block layer autoprotect, the seed is set before
> generating the PI and thus implicitly affects the generation.
>
> When the write operation subsequently reaches the bottom of the stack,
> we will check that the first ref tag in the PI buffer matches the
> supplied seed value. And then proceed to remap the ref tags for each
> protection interval to the target LBA + n since that is what the storage
> requires (ignoring the odd Type 2 interval mismatch for now).
>
> For reads, the meaning of the bip seed is: "This is what I expect to
> receive in the ref tag for the first integrity interval in the PI
> buffer". At the bottom of the stack we will receive PI from the storage
> and that will contain ref tags matching the lower 32 bits of the LBA
> since that is what the hardware returns. And we will then remap all
> those ref tags starting with whichever bip seed value was requested by
> the caller. It doesn't matter whether the requested seed value was 0,
> 10, or 42. The ref tags are remapped to whatever the caller wants them
> to be.
>
> I tend to think of the seed as a register you program with the value you
> want. And then hardware or software remaps between what the storage
> device's protection envelope requires and what the application (or in
> this case the block layer) requested. With SCSI + DIX 1.1, the seed
> literally controls a remapping register in the HBA ASIC. In NVMe we have
> ILBRT/EILBRT.
What I find confusing is that the seed value stored in
bip_iter.bi_sector isn't what's actually passed to the SCSI/NVMe
device. It's only used in blk_integrity_iterate() and
__blk_reftag_remap() to generate/verify/remap the reftags in the
integrity/PI buffer. However, the (E)ILBRT field (taking NVMe as an
example) comes from the physical block device offset rather than the
reftag seed. See t10_pi_ref_tag(), which returns blk_rq_pos()
converted to integrity intervals. It looks like this works because the
remap step ensures the reftags passed in the integrity buffer match
the physical integrity interval numbers, but this means the device is
comparing physical integrity interval numbers rather than reftag
seeds. My point is that if the remap step undoes the effect of the
seed by setting all the reftags in the integrity buffer to their
physical integrity interval, I don't see why the block integrity code
bothers setting a seed in the first place.
But it sounds like this may be a longer discussion, so I will split
out the two fixes for 7.1 into a separate series.
Thanks,
Caleb
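
The observation that the remap step cancels out the seed on the write
path can be illustrated with a small userspace sketch. It is a
simplification under stated assumptions: generate_ref_tags() and
remap_write_ref_tags() are hypothetical names, ref tags are modelled as a
plain uint32_t array, and remapping is assumed to be enabled.

```c
#include <stddef.h>
#include <stdint.h>

/* Generate ref tags seed, seed + 1, ... as the submitter would. */
static inline void generate_ref_tags(uint32_t *tags, size_t count,
				     uint64_t seed)
{
	for (size_t n = 0; n < count; n++)
		tags[n] = (uint32_t)(seed + n);
}

/*
 * Before the write reaches the device, check each tag against the
 * submitter's seed and rewrite it to the physical interval number.
 * Returns the number of tags that matched and were remapped.
 */
static inline size_t remap_write_ref_tags(uint32_t *tags, size_t count,
					  uint64_t seed, uint64_t phys_start)
{
	size_t remapped = 0;

	for (size_t n = 0; n < count; n++) {
		if (tags[n] == (uint32_t)(seed + n)) {
			tags[n] = (uint32_t)(phys_start + n);
			remapped++;
		}
	}
	return remapped;
}
```

Whatever seed the submitter picks, the tags that reach the device in this
model are phys_start + n. What the seed still contributes, per Martin's
earlier reply, is one more value that must agree between the PI buffer
and the control path before the I/O goes through.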