This RFC implements Controller Data Queue (CDQ) support in the NVMe
driver, a variation of my original RFC sent last July [2]. It exposes an
ioctl interface for userspace to create, configure, and delete CDQs
backed by DMA-mapped user memory with eventfd notification. In this
version I explore how the CDQ protocol logic might live outside the
kernel; the ioctl serves as a testing tool but is not necessarily the
final interface.
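As a rough illustration of the eventfd half of such an interface, here is a
minimal userspace sketch. It is not taken from the series: the struct
cdq_setup_sketch layout and every cdq_sketch_* name are hypothetical, and the
driver's signal is faked with a plain write() so the example runs without CDQ
hardware or the patches applied. Only eventfd(), read() and write() are real
interfaces.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * HYPOTHETICAL: stand-in for whatever argument the CDQ create ioctl takes.
 * Field names are illustrative only and do not come from the patches.
 */
struct cdq_setup_sketch {
	void *buf;          /* user memory the driver would DMA-map for entries */
	uint32_t entry_nr;  /* number of entries in the queue */
	uint32_t entry_sz;  /* size of one entry in bytes */
	int efd;            /* eventfd the driver would signal on new entries */
};

/* Create the eventfd that userspace would hand to the driver. */
int cdq_sketch_open_eventfd(void)
{
	return eventfd(0, EFD_CLOEXEC);
}

/*
 * Stand-in for the driver side: add n to the eventfd counter.
 * In a real setup the kernel would signal the eventfd instead.
 */
int cdq_sketch_fake_signal(int efd, uint64_t n)
{
	return write(efd, &n, sizeof(n)) == (ssize_t)sizeof(n) ? 0 : -1;
}

/*
 * Block until the eventfd is signalled. Returns the accumulated count
 * and, because the eventfd is not in semaphore mode, resets it to zero.
 */
uint64_t cdq_sketch_wait(int efd)
{
	uint64_t cnt = 0;
	ssize_t ret = read(efd, &cnt, sizeof(cnt));

	assert(ret == (ssize_t)sizeof(cnt));
	return cnt;
}
```

A consumer would call cdq_sketch_wait() in a loop (or poll the descriptor) and
then walk the new entries in the shared buffer; how entries are laid out and
acknowledged is up to the CDQ protocol and is deliberately left out here.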
This RFC exists within a broader goal, which is to enable NVMe namespace
migration. The timing feels right as hardware with CDQ capability
exists, NVMe fully specifies the feature and there is growing interest
in Live Migration which by extension includes CDQ.
There is, however, no clear consensus on how NVMe Live Migration should
land in the Linux kernel. The 2022 discussion [1] explored a VFIO-based
approach but reached no conclusion, likely because the specification was
not yet mature.
To move CDQ forward, I would like to understand where the LM logic belongs. I
currently see two options (of which I have no particular preference):
1. VFIO: Implement NVMe LM following the VFIO state machine, similar to what
was proposed in 2022.
2. VM manager interface: Bypass VFIO and implement LM logic in the interface
between the VM manager (e.g., QEMU) and the NVMe driver.
One aspect that has not received much attention in previous discussions
is namespace migration as prior work focused on migrating state and not
the actual data. Migrating potential terabytes is IMO a distinct use
case worth considering. LSF/MM/BPF is in a week. I hope this series
encourages folks to revisit their positions, give their opinions and set
the stage for face2face discussions.
Best
PS: I'm including the regular NVMe contacts and the folks that seemed to
have strong opinions in [2]. I always find it difficult to decide who to
include in these so let me know if you want to be removed in the future
or if I have missed someone.
[1] https://lore.kernel.org/20221206055816.292304-1-lei.rao@intel.com
[2] https://lore.kernel.org/20250714-jag-cdq-v1-0-01e027d256d5@kernel.org
Signed-off-by: Joel Granados <joel.granados@kernel.org>
---
Joel Granados (5):
      nvme: Add CDQ data structures to nvme spec header
      nvme: Add CDQ data structures to host driver
      nvme: Add NVME_AER_ONE_SHOT callback handler
      nvme: Implement CDQ core functionality
      nvme: Add CDQ ioctl interface

 drivers/nvme/host/core.c        | 312 ++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/ioctl.c       |  53 ++++++-
 drivers/nvme/host/nvme.h        |  20 +++
 include/linux/nvme.h            |  50 ++++++-
 include/uapi/linux/nvme_ioctl.h |  29 ++++
 5 files changed, 462 insertions(+), 2 deletions(-)
---
base-commit: 028ef9c96e96197026887c0f092424679298aae8
change-id: 20260424-jag-cdq-lkml-cd9b7c79983d
Best regards,
--
Joel Granados <joel.granados@kernel.org>
On Fri, Apr 24, 2026 at 01:37:50PM +0200, Joel Granados wrote:

> There is, however, no clear consensus on how NVMe Live Migration should
> land in the Linux kernel. The 2022 discussion [1] explored a VFIO-based
> approach but reached no conclusion, likely because the specification was
> not yet mature.

Yes it was paused until the spec matures, then I expect it to go
forward.

> To move CDQ forward, I would like to understand where the LM logic belongs. I
> currently see two options (of which I have no particular preference):
>
> 1. VFIO: Implement NVMe LM following the VFIO state machine, similar to what
>    was proposed in 2022.
> 2. VM manager interface: Bypass VFIO and implement LM logic in the interface
>    between the VM manager (e.g., QEMU) and the NVMe driver.

I imagined it to be split between VFIO for the pci and volatile guest
state and something else for the namespace setup and media migration.

Media migration is only needed for local drives, so there are use cases
that don't need this component.

We have many drivers fitting into the VFIO scheme now and good VMM
coverage, I don't see a reason to throw it out.

> One aspect that has not received much attention in previous discussions
> is namespace migration as prior work focused on migrating state and not
> the actual data. Migrating potential terabytes is IMO a distinct use
> case worth considering.

Yes

Though IDK if just plumbing the entire CDQ to userspace is the right
choice for NVMe.. We don't know what future specs will add to CDQ, it
may not be appropriate to treat it so insecurely.

Jason
On Fri, Apr 24, 2026 at 10:06:15AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 24, 2026 at 01:37:50PM +0200, Joel Granados wrote:
>
> > There is, however, no clear consensus on how NVMe Live Migration should
> > land in the Linux kernel. The 2022 discussion [1] explored a VFIO-based
> > approach but reached no conclusion, likely because the specification was
> > not yet mature.
>
> Yes it was paused until the spec matures, then I expect it to go
> forward.
>
> > To move CDQ forward, I would like to understand where the LM logic belongs. I
> > currently see two options (of which I have no particular preference):
> >
> > 1. VFIO: Implement NVMe LM following the VFIO state machine, similar to what
> >    was proposed in 2022.
> > 2. VM manager interface: Bypass VFIO and implement LM logic in the interface
> >    between the VM manager (e.g., QEMU) and the NVMe driver.
>
> I imagined it to be split between VFIO for the pci and volatile guest
> state and something else for the namespace setup and media migration.

That is an option. If we end up with an approach that supports namespace
migration, I'm happy :)

> Media migration is only needed for local drives, so there are use cases
> that don't need this component.

Indeed, and there should be an approach that supports those use cases as
well.

>
> We have many drivers fitting into the VFIO scheme now and good VMM
> coverage, I don't see a reason to throw it out.

And this is why I included it as one of the ways to implement it.

>
> > One aspect that has not received much attention in previous discussions
> > is namespace migration as prior work focused on migrating state and not
> > the actual data. Migrating potential terabytes is IMO a distinct use
> > case worth considering.
>
> Yes
>
> Though IDK if just plumbing the entire CDQ to userspace is the right
> choice for NVMe.. We don't know what future specs will add to CDQ, it
> may not be appropriate to treat it so insecurely.

Agreed, this RFC is just one of many ways of doing it. My original one
is fully contained inside the NVMe driver. One thing that is clear from
the tests I have run is that it is easy to move the CDQ logic in and out
of user space (depending on what is needed).

Best

--

Joel Granados
On Fri, Apr 24, 2026 at 10:06:15AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 24, 2026 at 01:37:50PM +0200, Joel Granados wrote:
>
> > There is, however, no clear consensus on how NVMe Live Migration should
> > land in the Linux kernel. The 2022 discussion [1] explored a VFIO-based
> > approach but reached no conclusion, likely because the specification was
> > not yet mature.
>
> Yes it was paused until the spec matures, then I expect it to go
> forward.

And it will happen in the nvme software working group. Which should be
up and running if Samsung hadn't done everything in their power to
torpedo it. Because of that I do not expect Samsung to have any major
impact in how this will be implemented in Linux.

Note that we also can't discuss any of this at LSF/MM in public, so
Joel's side channel loading it onto the schedule should be removed as
well.
On Fri, Apr 24, 2026 at 03:24:23PM +0200, Christoph Hellwig wrote:
> On Fri, Apr 24, 2026 at 10:06:15AM -0300, Jason Gunthorpe wrote:
> > On Fri, Apr 24, 2026 at 01:37:50PM +0200, Joel Granados wrote:
> >
> > > There is, however, no clear consensus on how NVMe Live Migration should
> > > land in the Linux kernel. The 2022 discussion [1] explored a VFIO-based
> > > approach but reached no conclusion, likely because the specification was
> > > not yet mature.
> >
> > Yes it was paused until the spec matures, then I expect it to go
> > forward.
>
> And it will happen in the nvme software working group. Which should be
> up and running if Samsung hadn't done everything in their power to
> torpedo

There is nothing that indicates to me that Samsung "torpedoed" the
creation of the nvme SW working group.

> it. Because of that I do not expect Samsung to have any major
> impact in how this will be implemented in Linux.
>
> Note that we also can't discuss any of this at LSF/MM in public, so

I see no reason not to have the discussion at LSF/MM. It is the perfect
venue to unpack the (potential) interaction between the vfio and nvme
drivers. Regardless of what is currently brewing in NVMe, there is no
reason why the fundamental architecture cannot be discussed.

Of course, we need to be careful about what is mentioned in public, but
I see that as a detail that does not prevent us from having the
conversation.

Best

--

Joel Granados