This series introduces support for Controller Data Queues (CDQs) in the
NVMe driver. CDQs allow an NVMe controller to post information to the
host through a single completion queue. This series adds the data
structures, helpers, and user interface required to create, read, and
delete CDQs.

Motivation
==========
The main motivation is to enable Controller Data Queues as described in
the 2.2 revision of the NVMe base specification. This series places the
kernel as an intermediary between the NVMe controller producing CDQ
entries and the user space process consuming them. It is general enough
to encompass different use cases that require controller-initiated
communication delivered outside the regular I/O traffic streams (like
LBA tracking, for example).

What is done
============
* Added nvme_admin_cdq opcode and NVME_FEAT_CDQ feature flag
* Defined a new struct nvme_cdq command for create/delete operations
* Added a cdq_nvme_queue struct that holds the CDQ state
* Added an xarray to each nvme_ctrl that holds a reference to all
  controller CDQs
* Added a new ioctl (NVME_IOCTL_ADMIN_CDQ) and argument struct
  (nvme_cdq_cmd) for CDQ creation
* Added helpers for consuming CDQs: nvme_cdq_{next,send_feature,traverse}
* Added helpers for CDQ admin: nvme_cdq_{free,alloc,create,delete}

In summary, this series implements creation, consumption, and cleanup of
Controller Data Queues, providing a file-descriptor based interface for
user space to read CDQ entries.

CDQ life cycle
==============
To create a CDQ, user space defines the number of entries, the entry
size, the location of the phase tag (8.1.6.2 NVMe base spec), the MOS
field (5.1.4 NVMe base spec) and, if necessary, the CQS field (5.1.4.1.1
NVMe base spec). All of these are passed through the
NVME_IOCTL_ADMIN_CDQ ioctl, which allocates CDQ memory, connects the
controller to it, and returns the CDQ ID (defined by the controller)
and a CDQ file descriptor (CDQ FD).

The CDQ FD is used to consume entries through the read system call. On
every "read", all available (new) entries are copied from the internal
kernel CDQ buffer to the user space buffer. The CDQ ID, on the other
hand, is meant for interactions outside CDQ creation and consumption.
In these cases the caller is expected to send NVMe commands down through
one of the already available mechanisms (like the NVME_IOCTL_ADMIN_CMD
ioctl).

CDQ data structures and memory are cleaned up when the release file
operation is called on the FD, which usually happens when the FD is
closed or the user process is killed. (A user space sketch of this flow
is included after the questions below.)

Testing
=======
The User Data Migration Queue (5.1.4.1.1 NVMe base spec) implemented in
the QEMU NVMe device [1] is used for testing purposes. CDQ creation,
consumption and deletion are exercised by calling a CDQ example in
libvfn [2] (a low-level NVMe/PCIe library) from within QEMU. For
brevity, I have *not* included any of the testing commands, but I can
provide them if needed.

Questions
=========
Here are some questions that were on my mind:

1. I have used an ioctl for the CDQ creation. Any better alternatives?
2. The deletion is handled by closing the file descriptor. Should this
   be handled by the ioctl?
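User space sketch
=================
For illustration, here is a minimal, hypothetical user space sketch of
the life cycle described above. Only the ioctl name
(NVME_IOCTL_ADMIN_CDQ), the argument struct name (nvme_cdq_cmd) and the
create/read/close flow come from this series; every field name, the way
the CDQ FD and CDQ ID are returned, and the device path are assumptions
made for the example, not the actual ABI.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>	/* NVME_IOCTL_ADMIN_CDQ only exists with this series applied */

int main(void)
{
	uint8_t buf[4096];
	int ctrl_fd, cdq_fd;
	ssize_t n;

	/* All field names below are assumed for illustration only. */
	struct nvme_cdq_cmd cmd = {
		.num_entries      = 128,	/* queue depth chosen by user space */
		.entry_size       = 16,		/* bytes per CDQ entry */
		.phase_tag_offset = 0,		/* location of the phase tag (8.1.6.2) */
		.mos              = 0,		/* MOS field (5.1.4) */
		.cqs              = 0,		/* CQS field (5.1.4.1.1), if needed */
	};

	ctrl_fd = open("/dev/nvme0", O_RDWR);
	if (ctrl_fd < 0)
		return 1;

	/*
	 * Create the CDQ. Assumed here: the ioctl returns the CDQ FD and
	 * fills in cmd.cdq_id with the controller-assigned CDQ ID.
	 */
	cdq_fd = ioctl(ctrl_fd, NVME_IOCTL_ADMIN_CDQ, &cmd);
	if (cdq_fd < 0)
		return 1;

	/*
	 * Consume entries: each read() copies all newly available entries
	 * from the internal kernel CDQ buffer into the user space buffer.
	 */
	n = read(cdq_fd, buf, sizeof(buf));
	if (n > 0)
		printf("cdq %u: consumed %zd bytes of entries\n", cmd.cdq_id, n);

	/*
	 * Other interactions would use cmd.cdq_id in admin commands sent
	 * through e.g. NVME_IOCTL_ADMIN_CMD. Closing the CDQ FD triggers
	 * release, which deletes the CDQ and frees its memory.
	 */
	close(cdq_fd);
	close(ctrl_fd);
	return 0;
}

In this sketch, deletion is implicit in close(), matching question 2
above.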
Any feedback, questions or comments are greatly appreciated.

Best

[1] https://github.com/SamsungDS/qemu/tree/nvme.tp4159
[2] https://github.com/Joelgranados/libvfn/blob/jag/cdq/examples/cdq.c

Signed-off-by: Joel Granados <joel.granados@kernel.org>
---
Joel Granados (8):
      nvme: Add CDQ command definitions for contiguous PRPs
      nvme: Add cdq data structure to nvme_ctrl
      nvme: Add file descriptor to read CDQs
      nvme: Add function to create a CDQ
      nvme: Add function to delete CDQ
      nvme: Add a release ops to cdq file ops
      nvme: Add Controller Data Queue (CDQ) ioctl command
      nvme: Connect CDQ ioctl to nvme driver

 drivers/nvme/host/core.c        | 253 ++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/ioctl.c       |  47 +++++++-
 drivers/nvme/host/nvme.h        |  20 ++++
 include/linux/nvme.h            |  30 +++++
 include/uapi/linux/nvme_ioctl.h |  12 ++
 5 files changed, 361 insertions(+), 1 deletion(-)
---
base-commit: 0ff41df1cb268fc69e703a08a57ee14ae967d0ca
change-id: 20250624-jag-cdq-691ed7e68c1c

Best regards,
--
Joel Granados <joel.granados@kernel.org>
On Mon, Jul 14, 2025 at 11:15:31AM +0200, Joel Granados wrote:
> Motivation
> ==========
> The main motivation is to enable Controller Data Queues as described in
> the 2.2 revision of the NVMe base specification. This series places the
> kernel as an intermediary between the NVMe controller producing CDQ
> entries and the user space process consuming them. It is general enough
> to encompass different use cases that require controller-initiated
> communication delivered outside the regular I/O traffic streams (like
> LBA tracking, for example).

That's rather blurbish. The only use case for CDQs in NVMe 2.2 is
tracking of dirty LBAs for live migration, and the live migration
feature in 2.2 is completely broken because the hyperscalers wanted to
win a point. So for CDQs to be useful in Linux we'll need the proper
live migration that is still under heavy development. With that I'd
very much expect the kernel to manage the CDQs just like any other
queue, and not a random user ioctl.

So what would be the use case for a user controlled CDQ?
On Mon, Jul 14, 2025 at 03:02:31PM +0200, Christoph Hellwig wrote:
> On Mon, Jul 14, 2025 at 11:15:31AM +0200, Joel Granados wrote:
> > Motivation
> > ==========
> > The main motivation is to enable Controller Data Queues as described in
> > the 2.2 revision of the NVMe base specification. This series places the
> > kernel as an intermediary between the NVMe controller producing CDQ
> > entries and the user space process consuming them. It is general enough
> > to encompass different use cases that require controller-initiated
> > communication delivered outside the regular I/O traffic streams (like
> > LBA tracking, for example).

Thx for the feedback. Much appreciated.

> That's rather blurbish. The only use case for CDQs in NVMe 2.2 is
> tracking of dirty LBAs for live migration, and the live migration
Yes, that is my understanding of NVMe 2.2 as well.

> feature in 2.2 is completely broken because the hyperscalers wanted to
> win a point. So for CDQs to be useful in Linux we'll need the proper
> live migration that is still under heavy development. With that I'd
Do you mean in the specification body or patch series in the mailing
lists?

> very much expect the kernel to manage the CDQs just like any other
> queue, and not a random user ioctl.
This is a great segue to a question: if CDQs are like any other queue,
what is the best way of handling the lack of CDQ submission queues?
Something like snooping all submissions for these CDQs and triggering a
CDQ consume on every submission?

I went with the ioctl as the faster way to get it to work; I might
explore what having it as just another queue would look like.

> So what would be the use case for a user controlled CDQ?
Do you mean a hypothetical list besides LM in NVMe 2.2?

Best
--
Joel Granados
On Fri, Jul 18, 2025 at 01:33:34PM +0200, Joel Granados wrote:
> > win a point. So for CDQs to be useful in Linux we'll need the proper
> > live migration that is still under heavy development. With that I'd
> Do you mean in the specification body or patch series in the mailing
> lists?

Actual code. As I said, I very much expect CDQ creation and usage to be
kernel driven for live migration.

> > very much expect the kernel to manage the CDQs just like any other
> > queue, and not a random user ioctl.
> This is a great segue to a question: if CDQs are like any other queue,
> what is the best way of handling the lack of CDQ submission queues?
> Something like snooping all submissions for these CDQs and triggering a
> CDQ consume on every submission?

I don't understand this question and proposed answer at all.

> I went with the ioctl as the faster way to get it to work;

Get _what_ to work?

> I might
> explore what having it as just another queue would look like.
>
> > So what would be the use case for a user controlled CDQ?
> Do you mean a hypothetical list besides LM in NVMe 2.2?

As outlined in the last two mails, I don't see how live migration would
work with user controlled CDQs. Maybe I'm wrong, but nothing in this
thread seems to even try to explain how that would work.