[Qemu-devel] [PATCH 0/8] nvme: v1.3, sgls, metadata and new 'ocssd' device

Klaus Birkelund Jensen posted 8 patches 4 years, 11 months ago
Test asan passed
Test docker-mingw@fedora failed
Test docker-clang@ubuntu passed
Test checkpatch failed
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20190517084234.26923-1-klaus@birkelund.eu
Maintainers: Eric Blake <eblake@redhat.com>, Kevin Wolf <kwolf@redhat.com>, "Michael S. Tsirkin" <mst@redhat.com>, Max Reitz <mreitz@redhat.com>, Marcel Apfelbaum <marcel.apfelbaum@gmail.com>, Markus Armbruster <armbru@redhat.com>, Keith Busch <keith.busch@intel.com>, Fam Zheng <fam@euphon.net>
MAINTAINERS                |   14 +-
Makefile.objs              |    1 +
block.c                    |    2 +-
block/Makefile.objs        |    2 +-
block/nvme.c               |   20 +-
block/ocssd.c              |  690 ++++++++++
hw/block/Makefile.objs     |    2 +-
hw/block/nvme.c            | 1405 -------------------
hw/block/nvme.h            |   92 --
hw/block/nvme/nvme.c       | 2485 +++++++++++++++++++++++++++++++++
hw/block/nvme/ocssd.c      | 2647 ++++++++++++++++++++++++++++++++++++
hw/block/nvme/ocssd.h      |  140 ++
hw/block/nvme/trace-events |  136 ++
hw/block/trace-events      |   91 --
include/block/block_int.h  |    3 +
include/block/nvme.h       |  152 ++-
include/block/ocssd.h      |  231 ++++
include/hw/block/nvme.h    |  233 ++++
include/hw/pci/pci_ids.h   |    2 +
qapi/block-core.json       |   47 +-
20 files changed, 6774 insertions(+), 1621 deletions(-)
create mode 100644 block/ocssd.c
delete mode 100644 hw/block/nvme.c
delete mode 100644 hw/block/nvme.h
create mode 100644 hw/block/nvme/nvme.c
create mode 100644 hw/block/nvme/ocssd.c
create mode 100644 hw/block/nvme/ocssd.h
create mode 100644 hw/block/nvme/trace-events
create mode 100644 include/block/ocssd.h
create mode 100644 include/hw/block/nvme.h
[Qemu-devel] [PATCH 0/8] nvme: v1.3, sgls, metadata and new 'ocssd' device
Posted by Klaus Birkelund Jensen 4 years, 11 months ago
Hi,

This series of patches contains a number of refactorings to the emulated
nvme device, adds additional features, such as support for metadata and
scatter gather lists, and bumps the supported NVMe version to 1.3.
Lastly, it contains a new 'ocssd' device.

The motivation for the first seven patches is to set everything up for
the final patch that adds a new 'ocssd' device and associated block
driver that implements the OpenChannel 2.0 specification[1]. Many of us
in the OpenChannel comunity have used a qemu fork[2] for emulation of
OpenChannel devices. The fork is itself based on Keith's qemu-nvme
tree[3] and we recently merged mainline qemu into it, but the result is
still a "hybrid" nvme device that supports both conventional nvme and
the OCSSD 2.0 spec through a 'dialect' mechanism. Merging instead of
rebasing also created a pretty messy commit history and my efforts to
try and rebase our work onto mainline was getting hairy to say the
least. And I was never really happy with the dialect approach anyway.

I have instead prepared this series of fresh patches that incrementally
adds additional features to the nvme device to bring it into shape for
finally introducing a new (and separate) 'ocssd' device that emulates an
OpenChannel 2.0 device by reusing core functionality from the nvme
device. Providing a separate ocssd device ensures that no ocssd specific
stuff creeps into the nvme device.

The ocssd device is backed by a new 'ocssd' block driver that holds
internal meta data and keeps state permanent across power cycles. In the
future I think we could use the same approach for the nvme device to
keep internal metadata such as utilization and deallocated blocks. For
now, the nvme device does not support the Deallocated and Unwritten
Logical Block Error (DULBE) feature or the Data Set Management command
as this would require such support.

I have tried to make the patches to the nvme device in this series as
digestible as possible, but I undestand that commit 310fcd5965e5 ("nvme:
bump supported spec to 1.3") is pretty huge. I can try to chop it up if
required, but the changes pretty much needs to be done in bulk to
actually implement v1.3.

This version was recently used to find a bug in use of SGLs in the Linux
kernel, so I believe there is some value in introducing these new
features. As for the ocssd device I believe that it is time it is
included upstream and not kept seperately. I have knowledge of at least
one other qemu fork implementing OCSSD 2.0 used by the SPDK team and I
think we could all benefit from using a common implementation. The ocssd
device is feature complete with respect to the OCSSD 2.0 spec (mandatory
as well as optional features).

  [1]: http://lightnvm.io/docs/OCSSD-2_0-20180129.pdf
  [2]: https://github.com/OpenChannelSSD/qemu-nvme
  [3]: http://git.infradead.org/users/kbusch/qemu-nvme.git


Klaus Birkelund Jensen (8):
  nvme: move device parameters to separate struct
  nvme: bump supported spec to 1.3
  nvme: simplify PRP mappings
  nvme: allow multiple i/o's per request
  nvme: add support for metadata
  nvme: add support for scatter gather lists
  nvme: keep a copy of the NVMe command in request
  nvme: add an OpenChannel 2.0 NVMe device (ocssd)

 MAINTAINERS                |   14 +-
 Makefile.objs              |    1 +
 block.c                    |    2 +-
 block/Makefile.objs        |    2 +-
 block/nvme.c               |   20 +-
 block/ocssd.c              |  690 ++++++++++
 hw/block/Makefile.objs     |    2 +-
 hw/block/nvme.c            | 1405 -------------------
 hw/block/nvme.h            |   92 --
 hw/block/nvme/nvme.c       | 2485 +++++++++++++++++++++++++++++++++
 hw/block/nvme/ocssd.c      | 2647 ++++++++++++++++++++++++++++++++++++
 hw/block/nvme/ocssd.h      |  140 ++
 hw/block/nvme/trace-events |  136 ++
 hw/block/trace-events      |   91 --
 include/block/block_int.h  |    3 +
 include/block/nvme.h       |  152 ++-
 include/block/ocssd.h      |  231 ++++
 include/hw/block/nvme.h    |  233 ++++
 include/hw/pci/pci_ids.h   |    2 +
 qapi/block-core.json       |   47 +-
 20 files changed, 6774 insertions(+), 1621 deletions(-)
 create mode 100644 block/ocssd.c
 delete mode 100644 hw/block/nvme.c
 delete mode 100644 hw/block/nvme.h
 create mode 100644 hw/block/nvme/nvme.c
 create mode 100644 hw/block/nvme/ocssd.c
 create mode 100644 hw/block/nvme/ocssd.h
 create mode 100644 hw/block/nvme/trace-events
 create mode 100644 include/block/ocssd.h
 create mode 100644 include/hw/block/nvme.h

-- 
2.21.0

Re: [Qemu-devel] [PATCH 0/8] nvme: v1.3, sgls, metadata and new 'ocssd' device
Posted by Kevin Wolf 4 years, 11 months ago
Am 17.05.2019 um 10:42 hat Klaus Birkelund Jensen geschrieben:
> Hi,
> 
> This series of patches contains a number of refactorings to the emulated
> nvme device, adds additional features, such as support for metadata and
> scatter gather lists, and bumps the supported NVMe version to 1.3.
> Lastly, it contains a new 'ocssd' device.
> 
> The motivation for the first seven patches is to set everything up for
> the final patch that adds a new 'ocssd' device and associated block
> driver that implements the OpenChannel 2.0 specification[1]. Many of us
> in the OpenChannel comunity have used a qemu fork[2] for emulation of
> OpenChannel devices. The fork is itself based on Keith's qemu-nvme
> tree[3] and we recently merged mainline qemu into it, but the result is
> still a "hybrid" nvme device that supports both conventional nvme and
> the OCSSD 2.0 spec through a 'dialect' mechanism. Merging instead of
> rebasing also created a pretty messy commit history and my efforts to
> try and rebase our work onto mainline was getting hairy to say the
> least. And I was never really happy with the dialect approach anyway.
> 
> I have instead prepared this series of fresh patches that incrementally
> adds additional features to the nvme device to bring it into shape for
> finally introducing a new (and separate) 'ocssd' device that emulates an
> OpenChannel 2.0 device by reusing core functionality from the nvme
> device. Providing a separate ocssd device ensures that no ocssd specific
> stuff creeps into the nvme device.
> 
> The ocssd device is backed by a new 'ocssd' block driver that holds
> internal meta data and keeps state permanent across power cycles. In the
> future I think we could use the same approach for the nvme device to
> keep internal metadata such as utilization and deallocated blocks.

A backend driver that is specific for a guest device model (i.e. the
device model requires this driver, and the backend is useless without
the device) sounds like a very questionable design.

Metadata like OcssdFormatHeader that is considered part of the image
data, which means that the _actual_ image content without metadata isn't
directly accessible any more feels like a bad idea, too. Simple things
like what a resize operation means (change only the actual disk size as
usual, or is the new size disk + metadata?) become confusing. Attaching
an image to a different device becomes impossible.

The block format driver doesn't seem to actually add much functionality
to a specially crafted raw image: It provides a convenient way to create
such special images and it dumps some values in 'qemu-img info', but the
actual interpretation of the data is left to the device model.

Looking at the options it does provide, my impression is that these
should really be qdev properties, and the place to store them
persistently is something like the libvirt XML. The device doesn't
change any of the values, so there is nothing that QEMU actually needs
to store. What you invented is a one-off way to pass a config file to a
device, but only for one specific device type.

I think this needs to use a much more standard approach to be mergable.

Markus (CCed) as the maintainer for the configuration mechanisms may
have an opinion on this, too.

> For now, the nvme device does not support the Deallocated and
> Unwritten Logical Block Error (DULBE) feature or the Data Set
> Management command as this would require such support.

Doesn't bdrv_co_block_status() provide all the information you need for
that?

Kevin

Re: [Qemu-devel] [PATCH 0/8] nvme: v1.3, sgls, metadata and new 'ocssd' device
Posted by Klaus Birkelund 4 years, 11 months ago
On Mon, May 20, 2019 at 03:01:24PM +0200, Kevin Wolf wrote:
> Am 17.05.2019 um 10:42 hat Klaus Birkelund Jensen geschrieben:
> > Hi,
> > 
> > This series of patches contains a number of refactorings to the emulated
> > nvme device, adds additional features, such as support for metadata and
> > scatter gather lists, and bumps the supported NVMe version to 1.3.
> > Lastly, it contains a new 'ocssd' device.
> > 
> > The motivation for the first seven patches is to set everything up for
> > the final patch that adds a new 'ocssd' device and associated block
> > driver that implements the OpenChannel 2.0 specification[1]. Many of us
> > in the OpenChannel comunity have used a qemu fork[2] for emulation of
> > OpenChannel devices. The fork is itself based on Keith's qemu-nvme
> > tree[3] and we recently merged mainline qemu into it, but the result is
> > still a "hybrid" nvme device that supports both conventional nvme and
> > the OCSSD 2.0 spec through a 'dialect' mechanism. Merging instead of
> > rebasing also created a pretty messy commit history and my efforts to
> > try and rebase our work onto mainline was getting hairy to say the
> > least. And I was never really happy with the dialect approach anyway.
> > 
> > I have instead prepared this series of fresh patches that incrementally
> > adds additional features to the nvme device to bring it into shape for
> > finally introducing a new (and separate) 'ocssd' device that emulates an
> > OpenChannel 2.0 device by reusing core functionality from the nvme
> > device. Providing a separate ocssd device ensures that no ocssd specific
> > stuff creeps into the nvme device.
> > 
> > The ocssd device is backed by a new 'ocssd' block driver that holds
> > internal meta data and keeps state permanent across power cycles. In the
> > future I think we could use the same approach for the nvme device to
> > keep internal metadata such as utilization and deallocated blocks.
> 
> A backend driver that is specific for a guest device model (i.e. the
> device model requires this driver, and the backend is useless without
> the device) sounds like a very questionable design.
> 
> Metadata like OcssdFormatHeader that is considered part of the image
> data, which means that the _actual_ image content without metadata isn't
> directly accessible any more feels like a bad idea, too. Simple things
> like what a resize operation means (change only the actual disk size as
> usual, or is the new size disk + metadata?) become confusing. Attaching
> an image to a different device becomes impossible.
> 
> The block format driver doesn't seem to actually add much functionality
> to a specially crafted raw image: It provides a convenient way to create
> such special images and it dumps some values in 'qemu-img info', but the
> actual interpretation of the data is left to the device model.
> 
> Looking at the options it does provide, my impression is that these
> should really be qdev properties, and the place to store them
> persistently is something like the libvirt XML. The device doesn't
> change any of the values, so there is nothing that QEMU actually needs
> to store. What you invented is a one-off way to pass a config file to a
> device, but only for one specific device type.
> 
> I think this needs to use a much more standard approach to be mergable.
> 
> Markus (CCed) as the maintainer for the configuration mechanisms may
> have an opinion on this, too.

Hi Kevin,

Thank you for going through my motivations. I see what you mean. And
yes, the main reason I did it like that was for the convenience of being
able to `qemu-img create`'ing the image. I'll reconsider how to do this.

> 
> > For now, the nvme device does not support the Deallocated and
> > Unwritten Logical Block Error (DULBE) feature or the Data Set
> > Management command as this would require such support.
> 
> Doesn't bdrv_co_block_status() provide all the information you need for
> that?
> 

That does look useful. I'll look into it.

Thanks!