[PATCH RFC 0/8] nvme: Add Controller Data Queue to the nvme driver

Joel Granados posted 8 patches 2 months, 3 weeks ago
This series introduces support for Controller Data Queues (CDQs) in the
NVMe driver. CDQs allow an NVME controller to post information to the
host through a single completion queue. This series adds data structures,
helpers, and the user interface required to create, read, and delete CDQs.

Motivation
==========
The main motivation is to enable Controller Data Queues as described in
the 2.2 revision of the NVME base specification. This series places the
kernel as an intermediary between the NVME controller producing CDQ
entries and the user space process consuming them. It is general enough
to encompass different use cases that require controller initiated
communication delivered outside the regular I/O traffic streams (like
LBA tracking for example).

What is done
============
* Added nvme_admin_cdq opcode and NVME_FEAT_CDQ feature flag
* Defined a new struct nvme_cdq command for create/delete operations
* Added a cdq_nvme_queue struct that holds the CDQ state
* Added an xarray to each nvme_ctrl that holds references to all of the
  controller's CDQs.
* Added a new ioctl (NVME_IOCTL_ADMIN_CDQ) and argument struct
  (nvme_cdq_cmd) for CDQ creation
* Added helpers for consuming CDQs: nvme_cdq_{next,send_feature,traverse}
* Added helpers for CDQ admin: nvme_cdq_{free,alloc,create,delete}

In summary, this series implements creation, consumption, and cleanup of
Controller Data Queues, providing a file-descriptor based interface for
user space to read CDQ entries.

CDQ life cycle
==============
To create a CDQ, user space defines the number of entries, the entry
size, the location of the phase tag (8.1.6.2 NVME base spec), the MOS
field (5.1.4 NVME base spec) and, if necessary, the CQS field (5.1.4.1.1
NVME base spec). All of these are passed through the
NVME_IOCTL_ADMIN_CDQ ioctl, which allocates the CDQ memory, connects the
controller to it, and returns the CDQ ID (defined by the controller) and
a CDQ file descriptor (CDQ FD).

The CDQ FD is used to consume entries through the read system call.
Each "read" copies all available (new) entries from the internal kernel
CDQ buffer to the user space buffer.
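
To illustrate how "new" entries can be told apart from already consumed
ones, here is a minimal user space model of phase-tag based consumption.
It is a sketch and not the code from this series: struct cdq_view and
cdq_consume() are hypothetical names, and it assumes that the phase tag
is bit 0 of the byte at phase_off, that queue memory starts zeroed, and
that the controller writes tag 1 on the first pass and inverts it on
every wrap (the convention used by NVMe completion queues).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical consumer-side view of a CDQ; not a struct from the series. */
struct cdq_view {
	const uint8_t *buf;  /* queue memory written by the controller */
	size_t nr_entries;   /* number of entries in the queue */
	size_t entry_size;   /* size of one entry in bytes */
	size_t phase_off;    /* assumed: byte offset of the phase tag in an entry */
	size_t head;         /* index of the next entry to consume */
	uint8_t phase;       /* tag value that marks an entry as new (0 or 1) */
};

/*
 * Copy every new entry into dst, which must have room for max_entries
 * entries. Returns the number of entries copied.
 */
static size_t cdq_consume(struct cdq_view *q, uint8_t *dst, size_t max_entries)
{
	size_t n = 0;

	assert(q->nr_entries > 0 && q->phase_off < q->entry_size);

	while (n < max_entries) {
		const uint8_t *e = q->buf + q->head * q->entry_size;

		/* Tag mismatch: the controller has not written this slot yet. */
		if ((e[q->phase_off] & 1) != q->phase)
			break;

		memcpy(dst + n * q->entry_size, e, q->entry_size);
		n++;

		/* On wrap, the expected tag flips along with the controller's. */
		if (++q->head == q->nr_entries) {
			q->head = 0;
			q->phase ^= 1;
		}
	}
	return n;
}
```

A real implementation would also need the appropriate memory barriers
when reading DMA memory and would have to tell the controller how far
the head has advanced; both are left out of this model.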

The CDQ ID, on the other hand, is meant for interactions that are
outside CDQ creation and consumption. In these cases the caller is
expected to send NVME commands down through one of the already available
mechanisms (like the NVME_IOCTL_ADMIN_CMD ioctl).

CDQ data structures and memory are cleaned up when the release file
operation is called on the FD, which usually happens on the close system
call or when the user process is killed.
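
From the user space side, the consume step of the life cycle above could
look roughly like the sketch below. The helper names (cdq_read_entries,
cdq_entry_fn, cdq_sum_first_byte) are hypothetical, and the sketch
assumes that each read on the CDQ FD returns whole entries of
entry_size bytes. CDQ creation through NVME_IOCTL_ADMIN_CDQ is left out
on purpose because the layout of the argument struct is not shown in
this cover letter.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical per-entry callback type. */
typedef void (*cdq_entry_fn)(const uint8_t *entry, size_t entry_size,
			     void *ctx);

/* Example callback: adds the first byte of each entry to an unsigned sum. */
static void cdq_sum_first_byte(const uint8_t *entry, size_t entry_size,
			       void *ctx)
{
	(void)entry_size;
	*(unsigned *)ctx += entry[0];
}

/*
 * Issue one read() on cdq_fd and pass each complete entry to fn.
 * Returns the number of entries handled, or -1 with errno set.
 */
static ssize_t cdq_read_entries(int cdq_fd, uint8_t *buf, size_t buf_len,
				size_t entry_size, cdq_entry_fn fn, void *ctx)
{
	ssize_t got;
	size_t n, i;

	assert(entry_size > 0 && buf_len >= entry_size);

	/* Only ask for a whole number of entries; retry if interrupted. */
	do {
		got = read(cdq_fd, buf, buf_len - buf_len % entry_size);
	} while (got < 0 && errno == EINTR);
	if (got < 0)
		return -1;

	n = (size_t)got / entry_size;
	for (i = 0; i < n; i++)
		fn(buf + i * entry_size, entry_size, ctx);
	return (ssize_t)n;
}
```

A consumer would call cdq_read_entries() in a loop on the FD returned
by the ioctl and call close() on it when done, which triggers the
release operation described above.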

Testing
=======
The User Data Migration Queue (5.1.4.1.1 NVME base spec) implemented in
the QEMU NVME device [1] is used for testing purposes. CDQ creation,
consumption and deletion are shown by calling a CDQ example in libvfn [2]
(a low level NVME/PCIe library) from within QEMU. For brevity, I have
*not* included any of the testing commands; but I can provide them if
needed.

Questions
=========

Here are some questions that were on my mind.

1. I have used an ioctl for the CDQ creation. Any better alternatives?
2. The deletion is handled by closing the file descriptor. Should this
   be handled by the ioctl?

Any feedback, questions or comments are greatly appreciated.

Best

[1] https://github.com/SamsungDS/qemu/tree/nvme.tp4159
[2] https://github.com/Joelgranados/libvfn/blob/jag/cdq/examples/cdq.c

Signed-off-by: Joel Granados <joel.granados@kernel.org>
---
Joel Granados (8):
      nvme: Add CDQ command definitions for contiguous PRPs
      nvme: Add cdq data structure to nvme_ctrl
      nvme: Add file descriptor to read CDQs
      nvme: Add function to create a CDQ
      nvme: Add function to delete CDQ
      nvme: Add a release ops to cdq file ops
      nvme: Add Controller Data Queue (CDQ) ioctl command
      nvme: Connect CDQ ioctl to nvme driver

 drivers/nvme/host/core.c        | 253 ++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/ioctl.c       |  47 +++++++-
 drivers/nvme/host/nvme.h        |  20 ++++
 include/linux/nvme.h            |  30 +++++
 include/uapi/linux/nvme_ioctl.h |  12 ++
 5 files changed, 361 insertions(+), 1 deletion(-)
---
base-commit: 0ff41df1cb268fc69e703a08a57ee14ae967d0ca
change-id: 20250624-jag-cdq-691ed7e68c1c

Best regards,
-- 
Joel Granados <joel.granados@kernel.org>
Re: [PATCH RFC 0/8] nvme: Add Controller Data Queue to the nvme driver
Posted by Christoph Hellwig 2 months, 3 weeks ago
On Mon, Jul 14, 2025 at 11:15:31AM +0200, Joel Granados wrote:
> Motivation
> ==========
> The main motivation is to enable Controller Data Queues as described in
> the 2.2 revision of the NVME base specification. This series places the
> kernel as an intermediary between the NVME controller producing CDQ
> entries and the user space process consuming them. It is general enough
> to encompass different use cases that require controller initiated
> communication delivered outside the regular I/O traffic streams (like
> LBA tracking for example).

That's rather blurbish.  The only use case for CDQs in NVMe 2.2 is
tracking of dirty LBAs for live migration, and the live migration
feature in 2.2 is completely broken because the hyperscalers wanted
to win a point.  So for CDQs to be useful in Linux we'll need the
proper live migration still under heavy development.  With that I'd
very much expect the kernel to manage the CDQs just like any other
queue, and not a random user ioctl.  So what would be the use case for
a user controlled CDQ?
Re: [PATCH RFC 0/8] nvme: Add Controller Data Queue to the nvme driver
Posted by Joel Granados 2 months, 2 weeks ago
On Mon, Jul 14, 2025 at 03:02:31PM +0200, Christoph Hellwig wrote:
> On Mon, Jul 14, 2025 at 11:15:31AM +0200, Joel Granados wrote:
> > Motivation
> > ==========
> > The main motivation is to enable Controller Data Queues as described in
> > the 2.2 revision of the NVME base specification. This series places the
> > kernel as an intermediary between the NVME controller producing CDQ
> > entries and the user space process consuming them. It is general enough
> > to encompass different use cases that require controller initiated
> > communication delivered outside the regular I/O traffic streams (like
> > LBA tracking for example).

Thx for the feedback. Much appreciated.

> 
> That's rather blurbish.  The only use case for CDQs in NVMe 2.2 is
> tracking of dirty LBAs for live migration, and the live migration
Yes, that is my understanding of nvme 2.2 as well.

> feature in 2.2 is completely broken because the hyperscalers wanted
> to win a point.  So for CDQs to be useful in Linux we'll need the
> proper live migration still under heavy development.  With that I'd
Do you mean in the specification body or patch series in the mailing
lists?

> very much expect the kernel to manage the CDQs just like any other
> queue, and not a random user ioctl.
This is a great segue to a question: If CDQ is like any other queue,
what is the best way of handling the lack of CDQ submission queues?
Something like snooping all submissions for these CDQs and triggering a
CDQ consume on every submission?

I went with the ioctl as the faster way to get it to work; I might
explore what having it as just another queue would look like.

> So what would be the use case for a user controlled CDQ?
Do you mean a hypothetical list besides LM in NVME 2.2?

Best

-- 

Joel Granados
Re: [PATCH RFC 0/8] nvme: Add Controller Data Queue to the nvme driver
Posted by Christoph Hellwig 2 months, 2 weeks ago
On Fri, Jul 18, 2025 at 01:33:34PM +0200, Joel Granados wrote:
> > to win a point.  So for CDQs to be useful in Linux we'll need the
> > proper live migration still under heavy development.  With that I'd
> Do you mean in the specification body or patch series in the mailing
> lists?

Actual code.  As I said I very much expect CDQ creation and usage
to be kernel driven for live migration.

> > very much expect the kernel to manage the CDQs just like any other
> > queue, and not a random user ioctl.
> This is a great segue to a question: If CDQ is like any other queue,
> what is the best way of handling the lack of CDQ submission queues?
> Something like snooping all submissions for these CDQs and triggering a
> CDQ consume on every submission?

I don't understand this question and proposed answer at all.

> I went with the ioctl as the faster way to get it to work;

Get _what_ to work?

> I might
> explore what having it as just another queue would look like.
> 
> > So what would be the use case for a user controlled CDQ?
> Do you mean a hypothetical list besides LM in NVME 2.2?

As outlined in the last two mails I don't see how live migration would
work with user controlled CDQs.  Maybe I'm wrong, but nothing in this
thread seems to even try to explain how that would work.