[PATCH RFC 0/3] hw/block/nvme: dif-based end-to-end data protection support

Klaus Jensen posted 3 patches 3 years, 4 months ago
Failed in applying to current master (apply log)
hw/block/nvme-ns.h    |  22 +-
hw/block/nvme.h       |  36 +++
include/block/nvme.h  |  24 +-
hw/block/nvme-ns.c    |  66 ++++-
hw/block/nvme.c       | 616 ++++++++++++++++++++++++++++++++++++++----
hw/block/trace-events |  10 +
6 files changed, 704 insertions(+), 70 deletions(-)
[PATCH RFC 0/3] hw/block/nvme: dif-based end-to-end data protection support
Posted by Klaus Jensen 3 years, 4 months ago
From: Klaus Jensen <k.jensen@samsung.com>

This series adds support for extended LBAs and end-to-end data
protection. Marked RFC, since there are a bunch of issues that could use
some discussion.

Storing metadata bytes contiguously with the logical block data and
creating a physically extended logical block basically breaks the DULBE
and deallocation support I just added. Formatting a namespace with
protection information requires the app- and reftags of deallocated or
unwritten blocks to be 0xffff and 0xffffffff respectively; this could be
used to reintroduce DULBE support in that case, albeit at a somewhat
higher cost than the block status flag-based approach.

There is basically three ways of storing metadata (and maybe a forth,
but that is probably quite the endeavour):

  1. Storing metadata as extended blocks directly on the blockdev. That
     is the approach used in this RFC.

  2. Use a separate blockdev. Incidentially, this is also the easiest
     and most straightforward solution to support MPTR-based "separate
     metadata". This also allows DULBE and block deallocation to be
     supported using the existing approach.

  3. A hybrid of 1 and 2 where the metadata is stored contiguously at
    the end of the nvme-ns blockdev.

Option 1 obviously works well with DIF-based protection information and
extended LBAs since it maps one to one. Option 2 works flawlessly with
MPTR-based metadata, but extended LBAs can be "emulated" at the cost of
a bunch of scatter/gather operations.

The 4th option is extending an existing image format (QCOW2) or create
something on top of RAW to supports metadata bytes per block. But both
approaches require full API support through the block layer. And
probably a lot of other stuff that I did not think about.

Anyway, we would love some comments on this.

Gollu Appalanaidu (2):
  nvme: add support for extended LBAs
  hw/block/nvme: end-to-end data protection

Klaus Jensen (1):
  hw/block/nvme: refactor nvme_dma

 hw/block/nvme-ns.h    |  22 +-
 hw/block/nvme.h       |  36 +++
 include/block/nvme.h  |  24 +-
 hw/block/nvme-ns.c    |  66 ++++-
 hw/block/nvme.c       | 616 ++++++++++++++++++++++++++++++++++++++----
 hw/block/trace-events |  10 +
 6 files changed, 704 insertions(+), 70 deletions(-)

-- 
2.29.2


Re: [PATCH RFC 0/3] hw/block/nvme: dif-based end-to-end data protection support
Posted by Keith Busch 3 years, 4 months ago
On Thu, Dec 17, 2020 at 10:02:19PM +0100, Klaus Jensen wrote:
> From: Klaus Jensen <k.jensen@samsung.com>
> 
> This series adds support for extended LBAs and end-to-end data
> protection. Marked RFC, since there are a bunch of issues that could use
> some discussion.
> 
> Storing metadata bytes contiguously with the logical block data and
> creating a physically extended logical block basically breaks the DULBE
> and deallocation support I just added. Formatting a namespace with
> protection information requires the app- and reftags of deallocated or
> unwritten blocks to be 0xffff and 0xffffffff respectively; this could be
> used to reintroduce DULBE support in that case, albeit at a somewhat
> higher cost than the block status flag-based approach.
> 
> There is basically three ways of storing metadata (and maybe a forth,
> but that is probably quite the endeavour):
> 
>   1. Storing metadata as extended blocks directly on the blockdev. That
>      is the approach used in this RFC.
> 
>   2. Use a separate blockdev. Incidentially, this is also the easiest
>      and most straightforward solution to support MPTR-based "separate
>      metadata". This also allows DULBE and block deallocation to be
>      supported using the existing approach.
> 
>   3. A hybrid of 1 and 2 where the metadata is stored contiguously at
>     the end of the nvme-ns blockdev.
> 
> Option 1 obviously works well with DIF-based protection information and
> extended LBAs since it maps one to one. Option 2 works flawlessly with
> MPTR-based metadata, but extended LBAs can be "emulated" at the cost of
> a bunch of scatter/gather operations.

Are there any actual users of extended metadata that we care about? I'm
aware of only a few niche places that can even access an extended
metadata format. There's not kernel support in any major OS that I know
of.

Option 2 sounds fine.

If option 3 means that you're still using MPTR, but just sequester space
at the end of the backing block device for meta-data purposes, then that
is fine too. You can even resize it dynamically if you want to support
different metadata sizes.

> The 4th option is extending an existing image format (QCOW2) or create
> something on top of RAW to supports metadata bytes per block. But both
> approaches require full API support through the block layer. And
> probably a lot of other stuff that I did not think about.

It definitely sounds appealing to push the feature to a lower level if
you're really willing to see that through.

In any case, calculating T10 CRCs is *really* slow unless you have
special hardware and software support for it.

Re: [PATCH RFC 0/3] hw/block/nvme: dif-based end-to-end data protection support
Posted by Klaus Jensen 3 years, 4 months ago
On Dec 17 13:14, Keith Busch wrote:
> On Thu, Dec 17, 2020 at 10:02:19PM +0100, Klaus Jensen wrote:
> > From: Klaus Jensen <k.jensen@samsung.com>
> > 
> > This series adds support for extended LBAs and end-to-end data
> > protection. Marked RFC, since there are a bunch of issues that could use
> > some discussion.
> > 
> > Storing metadata bytes contiguously with the logical block data and
> > creating a physically extended logical block basically breaks the DULBE
> > and deallocation support I just added. Formatting a namespace with
> > protection information requires the app- and reftags of deallocated or
> > unwritten blocks to be 0xffff and 0xffffffff respectively; this could be
> > used to reintroduce DULBE support in that case, albeit at a somewhat
> > higher cost than the block status flag-based approach.
> > 
> > There is basically three ways of storing metadata (and maybe a forth,
> > but that is probably quite the endeavour):
> > 
> >   1. Storing metadata as extended blocks directly on the blockdev. That
> >      is the approach used in this RFC.
> > 
> >   2. Use a separate blockdev. Incidentially, this is also the easiest
> >      and most straightforward solution to support MPTR-based "separate
> >      metadata". This also allows DULBE and block deallocation to be
> >      supported using the existing approach.
> > 
> >   3. A hybrid of 1 and 2 where the metadata is stored contiguously at
> >     the end of the nvme-ns blockdev.
> > 
> > Option 1 obviously works well with DIF-based protection information and
> > extended LBAs since it maps one to one. Option 2 works flawlessly with
> > MPTR-based metadata, but extended LBAs can be "emulated" at the cost of
> > a bunch of scatter/gather operations.
> 
> Are there any actual users of extended metadata that we care about? I'm
> aware of only a few niche places that can even access an extended
> metadata format. There's not kernel support in any major OS that I know
> of.
> 

Yes, there are definitely actual users in enterprise storage. But the
main use case here is testing (using extended LBAs with SPDK for
instance).

> Option 2 sounds fine.
> 
> If option 3 means that you're still using MPTR, but just sequester space
> at the end of the backing block device for meta-data purposes, then that
> is fine too. You can even resize it dynamically if you want to support
> different metadata sizes.

Heh, I tend to think that my English vocabulary is pretty decent but I
had to look up 'sequester'. I just learned a new word today \o/

Yes. I actually also like option 3. Technically option 2 does not break
image interoperability between devices (ide, virtio), but you would
leave out a bunch of metadata that your application might depend on, so
I don't see any way to not break interoperability really.

And I would then be just fine with "emulating" extended LBAs at the cost
of more I/O. Because I would like the device to support that mode of
operation as well. We have not implemented this, but my gut feeling says
that it can be done.

> 
> > The 4th option is extending an existing image format (QCOW2) or create
> > something on top of RAW to supports metadata bytes per block. But both
> > approaches require full API support through the block layer. And
> > probably a lot of other stuff that I did not think about.
> 
> It definitely sounds appealing to push the feature to a lower level if
> you're really willing to see that through.
> 

Yes, its super appealing and I would like to have some input from the
block layer guys on this. That is, if anyone has ever explored it?

> In any case, calculating T10 CRCs is *really* slow unless you have
> special hardware and software support for it.
> 

Yeah. I know this is super slow. For for emulation and testing purposes
I think it is a nice feature for the device to optionally offer.
Re: [PATCH RFC 0/3] hw/block/nvme: dif-based end-to-end data protection support
Posted by Keith Busch 3 years, 4 months ago
On Fri, Dec 18, 2020 at 10:39:01AM +0100, Klaus Jensen wrote:
> On Dec 17 13:14, Keith Busch wrote:
> > On Thu, Dec 17, 2020 at 10:02:19PM +0100, Klaus Jensen wrote:
> > 
> > Are there any actual users of extended metadata that we care about? I'm
> > aware of only a few niche places that can even access an extended
> > metadata format. There's not kernel support in any major OS that I know
> > of.
> > 
> 
> Yes, there are definitely actual users in enterprise storage. But the
> main use case here is testing (using extended LBAs with SPDK for
> instance).

Fair enough.

And just to make sure we're coverging on the same nomenclature, spec
suggests "extended" metadata means the interleaved format that does not
use the MPTR field. Metadata with the MPTR field is referred to as
"separate". I'm only mentioning this because I've been in confused
conversations where "extended LBA" interchangably meant either method.
 
> Yes. I actually also like option 3. Technically option 2 does not break
> image interoperability between devices (ide, virtio), but you would
> leave out a bunch of metadata that your application might depend on, so
> I don't see any way to not break interoperability really.

Either is fine. If you're switching metadata modes through your qemu
parameters, you could think of this as a firmware change, which doesn't
guarantee the same LBA formats.

> And I would then be just fine with "emulating" extended LBAs at the cost
> of more I/O. Because I would like the device to support that mode of
> operation as well. We have not implemented this, but my gut feeling says
> that it can be done.

It can. My qemu tree from way back did this, but infradead.org lost it
all and never recovered. I didn't retain a local copy either, but
starting from scratch is probably the best course anyway.

> > In any case, calculating T10 CRCs is *really* slow unless you have
> > special hardware and software support for it.
> > 
> 
> Yeah. I know this is super slow. For for emulation and testing purposes
> I think it is a nice feature for the device to optionally offer.

Bonus if you want to implement this with PCLMULQDQ support in x86-64
hosts. For reference, here's the linux kernel's implementation:

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/crypto/crct10dif-pcl-asm_64.S

I wouldn't necessarily tie metadata support to T10DIF, though, since
it has uses beyond protection info.