[PATCH v2 00/15] hw/nvme: SR-IOV with Virtualization Enhancements
Posted by Łukasz Gieryk 2 years, 5 months ago
This is the updated version of SR-IOV support for the NVMe device.

Changes since v1:
 - Dropped the "pcie: Set default and supported MaxReadReq to 512" patch.
   The original author agrees it may no longer be needed for recent
   kernels.
 - Dropped the "pcie: Add callback preceding SR-IOV VFs update" patch.
   A customized pc->config_write callback is used instead.
 - Split the "hw/nvme: Calculate BAR attributes in a function" patch into
   cleanup and update parts.
 - Reworked the configuration options related to SR-IOV.
 - Fixed nvme_update_msixcap_ts() for platforms without MSI-X support.
 - Updated handling of SUBSYS_SLOT_RSVD values in nvme_subsys_ctrl().
 - Updated error codes returned from the Virtualization Management
   command (DNR is set; a short illustration follows this list).
 - Fixed a typedef/enum name mismatch.
 - A few other minor tweaks and improvements.
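
For the curious: the error paths now follow the usual hw/nvme
convention of OR-ing the DNR bit into the status code. A minimal
sketch (illustrative only, not the exact v2 code):

static uint16_t nvme_virt_mgmt(NvmeCtrl *n, NvmeRequest *req)
{
    uint32_t dw10 = le32_to_cpu(req->cmd.cdw10);
    uint8_t act = dw10 & 0xf;

    switch (act) {
    /* ... handle the supported actions ... */
    default:
        /* unknown action: fail the command and tell the host not to
         * retry it (Do Not Retry) */
        return NVME_INVALID_FIELD | NVME_DNR;
    }
}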

List of known gaps and nice-to-haves:

1) Interaction of secondary controllers with namespaces does not fully
follow the spec

The limitation: a VF has to be visible on the PCI bus first; only then
can such a VF have a namespace attached.

The problem is that the mapping between a controller and its attached
namespaces is stored in the NvmeCtrl object, and an unrealized VF
doesn’t have this object allocated. There are multiple ways to address
the issue, but none of those we’ve considered so far is elegant. The
fact that the namespace-related code is a busy area (pending patches,
possible rework) doesn’t help either.
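
For context, the lookup in question reduces to roughly the following
(simplified sketch, not the exact code):

static inline NvmeCtrl *nvme_subsys_ctrl(NvmeSubsystem *subsys,
                                         uint32_t cntlid)
{
    if (!subsys || cntlid >= ARRAY_SIZE(subsys->ctrls)) {
        return NULL;
    }
    /* the slot of a disabled (unrealized) VF holds no NvmeCtrl object,
     * so there is nothing meaningful to return for its cntlid */
    return subsys->ctrls[cntlid];
}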


2) VFs report and support the same features as the parent PF

For security reasons, the user of a VF should not be allowed to
interact with other VFs. Essentially, VFs should not announce or
support Namespace Management, Namespace Attachment, the corresponding
Identify commands, and possibly other features as well.
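
A fix could gate the relevant admin opcodes on the controller type,
along these lines (hypothetical sketch; upstream hw/nvme currently
implements only Namespace Attachment):

static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest *req)
{
    /* ... */
    if (pci_is_vf(PCI_DEVICE(n))) {
        switch (req->cmd.opcode) {
        case NVME_ADM_CMD_NS_ATTACHMENT:
            /* secondary controllers must not manage namespaces */
            return NVME_INVALID_OPCODE | NVME_DNR;
        }
    }
    /* ... dispatch as usual ... */
}

The Identify data (e.g. the OACS field) would have to be masked
accordingly for VFs, so these commands are not announced either.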


3) PMR and CMB must be disabled when SR-IOV is active

The main concern is whether PMR/CMB should even be implemented for a
device that supports SR-IOV. If the answer is yes, then another
question arises: how should the feature behave? Simulating different
devices may require different implementations.

It's too early to get into such details, so we’ve decided to disallow
both features altogether if SR-IOV is enabled.
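
Concretely, the combination is rejected at realize time, roughly like
this (sketch; the parameter and field names may differ slightly from
the actual patch):

static void nvme_check_constraints(NvmeCtrl *n, Error **errp)
{
    NvmeParams *params = &n->params;
    /* ... */
    if (params->sriov_max_vfs) {
        if (n->pmr.dev || params->cmb_size_mb) {
            error_setg(errp, "PMR/CMB is not supported with SR-IOV");
            return;
        }
    }
    /* ... */
}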


4) The Private Resources Mode

The NVMe Spec defines Flexible Resources as an optional feature. VFs can
alternatively support a fixed number of queues/interrupts.

This SR-IOV implementation supports Flexible Resources with the
Virtualization Management command, and a fallback to Private Resources
is not available. Support for such fallback, if there’s demand, can be
implemented later.


5) Support for Namespace Management command

A device with virtualization enhancements must support the Namespace
Management command. The command is not absolutely necessary to use
SR-IOV, but for a more complete set of features you may want to
cherry-pick this patch:
https://lists.gnu.org/archive/html/qemu-devel/2021-08/msg03107.html
together with this fix:
https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg06734.html


Knut Omang (2):
  pcie: Add support for Single Root I/O Virtualization (SR/IOV)
  pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt

Lukasz Maniak (4):
  hw/nvme: Add support for SR-IOV
  hw/nvme: Add support for Primary Controller Capabilities
  hw/nvme: Add support for Secondary Controller List
  docs: Add documentation for SR-IOV and Virtualization Enhancements

Łukasz Gieryk (9):
  pcie: Add helpers to the SR/IOV API
  pcie: Add 1.2 version token for the Power Management Capability
  hw/nvme: Implement the Function Level Reset
  hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime
  hw/nvme: Remove reg_size variable and update BAR0 size calculation
  hw/nvme: Calculate BAR attributes in a function
  hw/nvme: Initialize capability structures for primary/secondary
    controllers
  hw/nvme: Add support for the Virtualization Management command
  hw/nvme: Update the initalization place for the AER queue

 docs/pcie_sriov.txt          | 115 +++++++
 docs/system/devices/nvme.rst |  35 ++
 hw/nvme/ctrl.c               | 634 ++++++++++++++++++++++++++++++++---
 hw/nvme/ns.c                 |   2 +-
 hw/nvme/nvme.h               |  51 ++-
 hw/nvme/subsys.c             |  74 +++-
 hw/nvme/trace-events         |   6 +
 hw/pci/meson.build           |   1 +
 hw/pci/pci.c                 |  97 ++++--
 hw/pci/pcie.c                |   5 +
 hw/pci/pcie_sriov.c          | 301 +++++++++++++++++
 hw/pci/trace-events          |   5 +
 include/block/nvme.h         |  65 ++++
 include/hw/pci/pci.h         |  12 +-
 include/hw/pci/pci_ids.h     |   1 +
 include/hw/pci/pci_regs.h    |   1 +
 include/hw/pci/pcie.h        |   6 +
 include/hw/pci/pcie_sriov.h  |  75 +++++
 include/qemu/typedefs.h      |   2 +
 19 files changed, 1407 insertions(+), 81 deletions(-)
 create mode 100644 docs/pcie_sriov.txt
 create mode 100644 hw/pci/pcie_sriov.c
 create mode 100644 include/hw/pci/pcie_sriov.h

-- 
2.25.1


Re: [PATCH v2 00/15] hw/nvme: SR-IOV with Virtualization Enhancements
Posted by Klaus Jensen 2 years, 5 months ago
Hi Lukasz,

I've been through this. I have a couple of review comments, but overall
looks good for inclusion in nvme-next. Would be nice to get this in
early in the cycle so it can mature there for 7.0.

I'd like us to mark this support experimental, so we can easily make
changes to how the parameters work, since I'm not sure we completely
agree on that yet.

By the way, in the future, please add me and Keith as CCs on the entire
series so we get CC'ed on replies to the cover-letter ;)

On Nov 16 16:34, Łukasz Gieryk wrote:
> This is the updated version of SR-IOV support for the NVMe device.
> 
> Changes since v1:
>  - Dropped the "pcie: Set default and supported MaxReadReq to 512" patch.
>    The original author agrees it may no longer be needed for recent
>    kernels.
>  - Dropped the "pcie: Add callback preceding SR-IOV VFs update" patch.
>    A customized pc->config_write callback is used instead.
>  - Split the "hw/nvme: Calculate BAR attributes in a function" patch into
>    cleanup and update parts.
>  - Reworked the configuration options related to SR-IOV.
>  - Fixed nvme_update_msixcap_ts() for platforms without MSI-X support.
>  - Updated handling of SUBSYS_SLOT_RSVD values in nvme_subsys_ctrl().
>  - Updated error codes returned from the Virtualization Management
>    command (DNR is set).
>  - Fixed a typedef/enum name mismatch.
>  - A few other minor tweaks and improvements.
> 
> List of known gaps and nice-to-haves:
> 
> 1) Interaction of secondary controllers with namespaces does not
> fully follow the spec
> 
> The limitation: a VF has to be visible on the PCI bus first; only
> then can such a VF have a namespace attached.
> 

Looking at the spec I'm not even sure what the expected behavior is
supposed to be, can you elaborate? I rebased this on latest, and with
Hannes changes, shared namespaces will be attached by default, which
seems to be reasonable.

> The problem is that the mapping between a controller and its attached
> namespaces is stored in the NvmeCtrl object, and an unrealized VF
> doesn’t have this object allocated. There are multiple ways to address
> the issue, but none of those we’ve considered so far is elegant. The
> fact that the namespace-related code is a busy area (pending patches,
> possible rework) doesn’t help either.
> 
> 
> 2) VFs report and support the same features as the parent PF
> 
> For security reasons, the user of a VF should not be allowed to
> interact with other VFs. Essentially, VFs should not announce or
> support Namespace Management, Namespace Attachment, the corresponding
> Identify commands, and possibly other features as well.
> 

This can be fixed relatively easily, I think. I have a patch that
already does this for Administrative controller types, and it should be
applicable here as well. I can follow up with that.

> 
> 3) PMR and CMB must be disabled when SR-IOV is active
> 
> The main concern is whether PMR/CMB should even be implemented for a
> device that supports SR-IOV. If the answer is yes, then another
> question arises: how should the feature behave? Simulating different
> devices may require different implementations.
> 
> It's too early to get into such details, so we’ve decided to disallow
> both features altogether if SR-IOV is enabled.
> 

Reasonable enough for now.

> 
> 4) The Private Resources Mode
> 
> The NVMe Spec defines Flexible Resources as an optional feature. VFs can
> alternatively support a fixed number of queues/interrupts.
> 
> This SR-IOV implementation supports Flexible Resources with the
> Virtualization Management command, and a fallback to Private Resources
> is not available. Support for such fallback, if there’s demand, can be
> implemented later.
> 

Acceptable.

> 
> 5) Support for Namespace Management command
> 
> A device with virtualization enhancements must support the Namespace
> Management command. The command is not absolutely necessary to use
> SR-IOV, but for a more complete set of features you may want to
> cherry-pick this patch:
> https://lists.gnu.org/archive/html/qemu-devel/2021-08/msg03107.html
> together with this fix:
> https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg06734.html
> 

I think we can live with just Namespace Attachment support for now,
until the above patches are accepted.
Re: [PATCH v2 00/15] hw/nvme: SR-IOV with Virtualization Enhancements
Posted by Łukasz Gieryk 2 years, 5 months ago
On Wed, Nov 24, 2021 at 09:03:06AM +0100, Klaus Jensen wrote:
> Hi Lukasz,
> 
> I've been through this. I have a couple of review comments, but overall
> looks good for inclusion in nvme-next. Would be nice to get this in
> early in the cycle so it can mature there for 7.0.

We (I’m speaking on behalf of the other Lukasz) are really happy to
read that. We will do our best to make it happen.

> 
> I'd like us to mark this support experimental, so we can easily make
> changes to how the parameters work, since I'm not sure we completely
> agree on that yet.
> 
> By the way, in the future, please add me and Keith as CCs on the entire
> series so we get CC'ed on replies to the cover-letter ;)
> 

> > List of known gaps and nice-to-haves:
> > 
> > 1) Interaction of secondary controllers with namespaces does not
> > fully follow the spec
> > 
> > The limitation: a VF has to be visible on the PCI bus first; only
> > then can such a VF have a namespace attached.
> > 
> 
> Looking at the spec I'm not even sure what the expected behavior is
> supposed to be, can you elaborate? I rebased this on latest, and with
> Hannes' changes, shared namespaces will be attached by default, which
> seems to be reasonable.

An example flow:
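
# (virt-mgmt flags, for reference: -c = cntlid, -r = resource type
#  (0 = VQ, 1 = VI), -n = number of resources, -a = action
#  (1 = primary flexible allocation, 8 = secondary assign,
#   9 = secondary online))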

# Release flexible resources from PF (assuming it’s /dev/nvme0)
nvme virt-mgmt -c 0 -r 0 -n 0 -a 1 /dev/nvme0
nvme virt-mgmt -c 0 -r 1 -n 0 -a 1 /dev/nvme0
echo 1 > /sys/class/nvme/nvme0/reset_controller
# Bind sane minimums to VF1 (cntlid=1) and set it online
nvme virt-mgmt -c 1 -r 0 -n 5 -a 8 /dev/nvme0
nvme virt-mgmt -c 1 -r 1 -n 5 -a 8 /dev/nvme0
nvme virt-mgmt -c 1 -a 9 /dev/nvme0
# Enable 2 VFs
echo 2 > /sys/bus/pci/devices/<PF’s id>/sriov_numvfs
# PF, VF1 and VF2 are visible on PCI
lspci | grep Non-Volatile
# The NVMe driver is bound to PF and VF1 (the only online VF)
nvme list -v
# VFs shall eventually not support Ns Management/Attachment commands,
# and namespaces should be attached to VFs (i.e., their secondary
# controllers) through the PF.
# A namespace can be attached to VF1, VF2
nvme attach-ns /dev/nvme0 -c 1 -n X
nvme attach-ns /dev/nvme0 -c 2 -n X
# According to the spec this should also succeed, but today it won’t
nvme attach-ns /dev/nvme0 -c 3 -n X

VF3’s NvmeCtrl object is not yet allocated, so today there’s nothing
for nvme_subsys_ctrl() to return for cntlid=3, besides NULL (the
current behavior) or SUBSYS_SLOT_RSVD.

The relevant use cases (an admin configuring disabled VFs, and
information about attached namespaces persisting while VFs are
disabled) are not that critical, but of course it’s a discrepancy from
what a real device can handle.

In my opinion, to handle these cases correctly, the information about
attached namespaces could be moved to the subsystem (see the sketch
below). Could you share your thoughts on whether such an approach would
make sense?
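
Purely as a hypothetical sketch (the field names are invented):

typedef struct NvmeSubsystem {
    /* ... */
    NvmeNamespace *namespaces[NVME_MAX_NAMESPACES + 1];
    /* per-cntlid attachment bitmaps, so the attachment state keeps
     * existing while the VF owning a given cntlid is disabled */
    unsigned long *attached_nsids[NVME_MAX_CONTROLLERS];
} NvmeSubsystem;

Namespace Attachment would then update subsys->attached_nsids[cntlid]
directly, instead of requiring a realized NvmeCtrl for that cntlid.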


Re: [PATCH v2 00/15] hw/nvme: SR-IOV with Virtualization Enhancements
Posted by Klaus Jensen 2 years, 4 months ago
On Nov 25 15:15, Łukasz Gieryk wrote:
> On Wed, Nov 24, 2021 at 09:03:06AM +0100, Klaus Jensen wrote:
> > Hi Lukasz,
> > 
> > I've been through this. I have a couple of review comments, but overall
> > looks good for inclusion in nvme-next. Would be nice to get this in
> > early in the cycle so it can mature there for 7.0.
> 
> We (I’m speaking on behalf of the other Lukasz) are really happy to
> read that. We will do our best to make it happen.
> 

Keith, do you have any comments on this series - otherwise I'd like to
stage this for nvme-next come January.

> > 
> > I'd like us to mark this support experimental, so we can easily make
> > changes to how the parameters work, since I'm not sure we completely
> > agree on that yet.
> > 
> > By the way, in the future, please add me and Keith as CCs on the entire
> > series so we get CC'ed on replies to the cover-letter ;)
> > 
> 
> > > List of known gaps and nice-to-haves:
> > > 
> > > 1) Interaction of secondary controllers with namespaces does not
> > > fully follow the spec
> > > 
> > > The limitation: a VF has to be visible on the PCI bus first; only
> > > then can such a VF have a namespace attached.
> > > 
> > 
> > Looking at the spec I'm not even sure what the expected behavior is
> > supposed to be, can you elaborate? I rebased this on latest, and with
> > Hannes' changes, shared namespaces will be attached by default, which
> > seems to be reasonable.
> 
> An example flow:
> 
> # Release flexible resources from PF (assuming it’s /dev/nvme0)
> nvme virt-mgmt -c 0 -r 0 -n 0 -a 1 /dev/nvme0
> nvme virt-mgmt -c 0 -r 1 -n 0 -a 1 /dev/nvme0
> echo 1 > /sys/class/nvme/nvme0/reset_controller
> # Bind sane minimums to VF1 (cntlid=1) and set it online
> nvme virt-mgmt -c 1 -r 0 -n 5 -a 8 /dev/nvme0
> nvme virt-mgmt -c 1 -r 1 -n 5 -a 8 /dev/nvme0
> nvme virt-mgmt -c 1 -a 9 /dev/nvme0
> # Enable 2 VFs
> echo 2 > /sys/bus/pci/devices/<PF’s id>/sriov_numvfs
> # PF, VF1 and VF2 are visible on PCI
> lspci | grep Non-Volatile
> # The NVMe driver is bound to PF and VF1 (the only online VF)
> nvme list -v
> # VFs shall eventually not support Ns Management/Attachment commands,
> # and namespaces should be attached to VFs (i.e., their secondary
> # controllers) through the PF.
> # A namespace can be attached to VF1, VF2
> nvme attach-ns /dev/nvme0 -c 1 -n X
> nvme attach-ns /dev/nvme0 -c 2 -n X
> # According to the spec this should also succeed, but today it won’t
> nvme attach-ns /dev/nvme0 -c 3 -n X
> 
> VF3’s NvmeCtrl object is not yet allocated, so today there’s nothing
> for nvme_subsys_ctrl() to return for cntlid=3, besides NULL (the
> current behavior) or SUBSYS_SLOT_RSVD.
> 
> The relevant use cases (an admin configuring disabled VFs, and
> information about attached namespaces persisting while VFs are
> disabled) are not that critical, but of course it’s a discrepancy from
> what a real device can handle.
> 
> In my opinion, to handle these cases correctly, the information about
> attached namespaces could be moved to the subsystem. Could you share
> your thoughts on whether such an approach would make sense?
> 

Thanks for the explanation.

I actually already had this sort-of conversation with Hannes[1].
Different issue, but same solution (that is, having a per-CNTLID
namespace list maintained in the subsystem).

I have a refactor series coming up that will address this, so for now
I'm not worried about this not being implemented exactly as the spec
defines.

  [1]: https://lore.kernel.org/all/20210909094308.122038-1-hare@suse.de/t/#u
Re: [PATCH v2 00/15] hw/nvme: SR-IOV with Virtualization Enhancements
Posted by Łukasz Gieryk 2 years, 4 months ago
On Mon, Dec 20, 2021 at 08:12:06AM +0100, Klaus Jensen wrote:
> On Nov 25 15:15, Łukasz Gieryk wrote:
> > On Wed, Nov 24, 2021 at 09:03:06AM +0100, Klaus Jensen wrote:
> > > Hi Lukasz,
> > > 
> > > I've been through this. I have a couple of review comments, but overall
> > > looks good for inclusion in nvme-next. Would be nice to get this in
> > > early in the cycle so it can mature there for 7.0.
> > 
> > We (I’m speaking on behalf of the other Lukasz) are really happy to
> > read that. We will do our best to make it happen.
> > 
> 
> Keith, do you have any comments on this series - otherwise I'd like to
> stage this for nvme-next come January.
> 

Please wait with the staging - we are about to release v3.

We have a few fixes that will make the device more compliant with the
NVMe and SR-IOV specs.