From: Abhishek <abhishekbapat@google.com>
The NVMe driver initializes max_hw_sectors_kb by issuing an NVMe
Identify Controller command and reading the MDTS (Max Data Transfer
Size) field from the result. Commit 3710e2b056cb ("nvme-pci: clamp
max_hw_sectors based on DMA optimized limitation") introduced a
limitation on the value of max_hw_sectors_kb, restricting it to 128KiB
(MDTS = 5). This restriction was implemented to mitigate lockups
encountered in high-core count AMD servers.
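
For reference, the NVMe specification defines MDTS as a power of two in
units of the controller's minimum memory page size (CAP.MPSMIN, itself
2^(12 + MPSMIN) bytes), so MDTS = 5 with a 4KiB minimum page size works
out to 128KiB. A minimal userspace sketch of that conversion follows;
mdts_to_bytes() is purely illustrative and not a kernel helper:

#include <stdint.h>
#include <stdio.h>

/* Illustrative only: translate an Identify Controller MDTS value into bytes. */
static uint64_t mdts_to_bytes(uint8_t mdts, uint8_t mpsmin)
{
	uint64_t min_page = 1ULL << (12 + mpsmin);	/* CAP.MPSMIN page size */

	return mdts ? min_page << mdts : 0;		/* MDTS == 0 means "no limit" */
}

int main(void)
{
	/* MDTS = 5 with a 4KiB minimum page size: 32 * 4096 = 131072 bytes (128KiB). */
	printf("%llu\n", (unsigned long long)mdts_to_bytes(5, 0));
	return 0;
}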
Currently, user space applications have two options for obtaining the
max_hw_sectors_kb value to determine the payload size of the NVMe
command they wish to issue: they can either execute the Identify
Controller command themselves or query the kernel. If the underlying
NVMe device supports MDTS > 5 (128KiB) and the application fetches the
MDTS value through the Identify Controller command, it can build an
NVMe command with a payload larger than 128KiB. Such a command fails
with an Invalid Argument error (-EINVAL), preventing the application
from issuing it through any of the kernel-supported I/O APIs.
Presently, the kernel exposes the max_hw_sectors_kb value through a
queue sysfs file. However, that file only exists for an NVMe device if
a namespace has been created on it, so a namespace must exist before
max_hw_sectors_kb can be queried. This dependency is semantically
incorrect, as MDTS is a controller-associated field (section 5.1.13,
NVMe specification 2.1) and should be accessible regardless of whether
a namespace is present on the NVMe device.
Expose max_hw_sectors_kb through the controller's NVMe sysfs directory
to remove the requirement of having a namespace on the device before
the value can be read.
Signed-off-by: Abhishek <abhishekbapat@google.com>
---
drivers/nvme/host/sysfs.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index b68a9e5f1ea3..1af2b2cf1a6c 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -546,6 +546,17 @@ static ssize_t dctype_show(struct device *dev,
}
static DEVICE_ATTR_RO(dctype);
+static ssize_t max_hw_sectors_kb_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+ u32 max_hw_sectors_kb = ctrl->max_hw_sectors >> 1;
+
+ return sysfs_emit(buf, "%u\n", max_hw_sectors_kb);
+}
+static DEVICE_ATTR_RO(max_hw_sectors_kb);
+
#ifdef CONFIG_NVME_HOST_AUTH
static ssize_t nvme_ctrl_dhchap_secret_show(struct device *dev,
struct device_attribute *attr, char *buf)
@@ -687,6 +698,7 @@ static struct attribute *nvme_dev_attrs[] = {
&dev_attr_kato.attr,
&dev_attr_cntrltype.attr,
&dev_attr_dctype.attr,
+ &dev_attr_max_hw_sectors_kb.attr,
#ifdef CONFIG_NVME_HOST_AUTH
&dev_attr_dhchap_secret.attr,
&dev_attr_dhchap_ctrl_secret.attr,
--
2.47.0.rc1.288.g06298d1525-goog
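
If the attribute lands as proposed, userspace can read the controller-wide
limit without any namespace being present. A hedged usage sketch, assuming
controller nvme0 and the sysfs path implied by the patch above:

#include <stdio.h>

int main(void)
{
	/* Assumes the proposed attribute exists for controller nvme0. */
	const char *path = "/sys/class/nvme/nvme0/max_hw_sectors_kb";
	unsigned int kb;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%u", &kb) != 1) {
		fclose(f);
		return 1;
	}
	fclose(f);
	printf("max data transfer per command: %u KiB\n", kb);
	return 0;
}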
On Wed, Oct 16, 2024 at 09:31:08PM +0000, Abhishek Bapat wrote:
> max_hw_sectors based on DMA optimized limitation") introduced a
> limitation on the value of max_hw_sectors_kb, restricting it to 128KiB
> (MDTS = 5). This restriction was implemented to mitigate lockups
> encountered in high-core count AMD servers.

There are other limits that can constrain transfer sizes below the
device's MDTS. For example, the driver can only preallocate so much
space for DMA and SGL descriptors, so 8MB is the current max transfer
size the driver can support, and a device's MDTS can be much bigger
than that.

Anyway, yeah, I guess having a controller generic way to export this
sounds like a good idea, but I wonder if the nvme driver is the right
place to do it. The request_queue has all the limits you need to know
about, but these are only exported if a gendisk is attached to it.
Maybe we can create a queue subdirectory to the char dev too.
On Thu, Oct 17, 2024 at 10:40:36AM -0600, Keith Busch wrote:
> On Wed, Oct 16, 2024 at 09:31:08PM +0000, Abhishek Bapat wrote:
> > max_hw_sectors based on DMA optimized limitation") introduced a
> > limitation on the value of max_hw_sectors_kb, restricting it to 128KiB
> > (MDTS = 5). This restriction was implemented to mitigate lockups
> > encountered in high-core count AMD servers.
>
> There are other limits that can constrain transfer sizes below the
> device's MDTS. For example, the driver can only preallocate so much
> space for DMA and SGL descriptors, so 8MB is the current max transfer
> size the driver can support, and a device's MDTS can be much bigger
> than that.

Yes. Plus the virt boundary for PRPs, and for non-PCIe transfers
there's also plenty of other hardware limits due to e.g. the FC HBA
and the RDMA HCA limit. There's also been some talk of a new PCIe
SGL variant with hard limits.

So I agree that exposing limits on I/O would be very useful, but it's
also kinda non-trivial.

> Anyway, yeah, I guess having a controller generic way to export this
> sounds like a good idea, but I wonder if the nvme driver is the right
> place to do it. The request_queue has all the limits you need to know
> about, but these are only exported if a gendisk is attached to it.
> Maybe we can create a queue subdirectory to the char dev too.

If we want it controller wide to e.g. include the admin queue the
gendisk won't really help unfortunately.
On 18/10/2024 8:14, Christoph Hellwig wrote:
> On Thu, Oct 17, 2024 at 10:40:36AM -0600, Keith Busch wrote:
>> On Wed, Oct 16, 2024 at 09:31:08PM +0000, Abhishek Bapat wrote:
>>> max_hw_sectors based on DMA optimized limitation") introduced a
>>> limitation on the value of max_hw_sectors_kb, restricting it to 128KiB
>>> (MDTS = 5). This restriction was implemented to mitigate lockups
>>> encountered in high-core count AMD servers.
>> There are other limits that can constrain transfer sizes below the
>> device's MDTS. For example, the driver can only preallocate so much
>> space for DMA and SGL descriptors, so 8MB is the current max transfer
>> size the driver can support, and a device's MDTS can be much bigger
>> than that.
> Yes. Plus the virt boundary for PRPs, and for non-PCIe transfers
> there's also plenty of other hardware limits due to e.g. the FC HBA
> and the RDMA HCA limit. There's also been some talk of a new PCIe
> SGL variant with hard limits.
>
> So I agree that exposing limits on I/O would be very useful, but it's
> also kinda non-trivial.

I think the ctrl misc device attributes are fine to expose this and
other types of attributes (like we already do today).
On Thu, Oct 17, 2024 at 9:40 AM Keith Busch <kbusch@kernel.org> wrote:
>
> On Wed, Oct 16, 2024 at 09:31:08PM +0000, Abhishek Bapat wrote:
> > max_hw_sectors based on DMA optimized limitation") introduced a
> > limitation on the value of max_hw_sectors_kb, restricting it to 128KiB
> > (MDTS = 5). This restriction was implemented to mitigate lockups
> > encountered in high-core count AMD servers.
>
> There are other limits that can constrain transfer sizes below the
> device's MDTS. For example, the driver can only preallocate so much
> space for DMA and SGL descriptors, so 8MB is the current max transfer
> size the driver can support, and a device's MDTS can be much bigger
> than that.
>
> Anyway, yeah, I guess having a controller generic way to export this
> sounds like a good idea, but I wonder if the nvme driver is the right
> place to do it. The request_queue has all the limits you need to know
> about, but these are only exported if a gendisk is attached to it.
> Maybe we can create a queue subdirectory to the char dev too.

Are you suggesting that all the files from the queue subdirectory should
be included in the char dev (/sys/class/nvme/nvmeX/queue/)? Or that
just the max_hw_sectors_kb value should be shared within the queue
subdirectory? And if not the nvme driver, where else can this be done
from?
On Thu, Oct 17, 2024 at 02:32:18PM -0700, Abhishek Bapat wrote:
> On Thu, Oct 17, 2024 at 9:40 AM Keith Busch <kbusch@kernel.org> wrote:
> >
> > On Wed, Oct 16, 2024 at 09:31:08PM +0000, Abhishek Bapat wrote:
> > > max_hw_sectors based on DMA optimized limitation") introduced a
> > > limitation on the value of max_hw_sectors_kb, restricting it to 128KiB
> > > (MDTS = 5). This restriction was implemented to mitigate lockups
> > > encountered in high-core count AMD servers.
> >
> > There are other limits that can constrain transfer sizes below the
> > device's MDTS. For example, the driver can only preallocate so much
> > space for DMA and SGL descriptors, so 8MB is the current max transfer
> > size the driver can support, and a device's MDTS can be much bigger
> > than that.
> >
> > Anyway, yeah, I guess having a controller generic way to export this
> > sounds like a good idea, but I wonder if the nvme driver is the right
> > place to do it. The request_queue has all the limits you need to know
> > about, but these are only exported if a gendisk is attached to it.
> > Maybe we can create a queue subdirectory to the char dev too.
>
> Are you suggesting that all the files from the queue subdirectory should
> be included in the char dev (/sys/class/nvme/nvmeX/queue/)? Or that
> just the max_hw_sectors_kb value should be shared within the queue
> subdirectory? And if not the nvme driver, where else can this be done
> from?

You may want to know max_sectors_kb, dma_alignment, nr_requests,
virt_boundary_mask. Maybe some others.

The request_queue is owned by the block layer, so that seems like an
okay place to export it, but attached to some other device's sysfs
directory instead of a gendisk.

I'm just suggesting this because it doesn't sound like this is an nvme
specific problem.
On Tue, Oct 22, 2024 at 08:53:47AM -0600, Keith Busch wrote:
> You may want to know max_sectors_kb, dma_alignment, nr_requests,
> virt_boundary_mask. Maybe some others.
>
> The request_queue is owned by the block layer, so that seems like an
> okay place to export it, but attached to some other device's sysfs
> directory instead of a gendisk.
>
> I'm just suggesting this because it doesn't sound like this is an nvme
> specific problem.

Well, it's a problem specific to passthrough without a gendisk, which is
the NVMe admin queue and the /dev/sg device. So it's common-ish :)

Note that for programs using passthrough, sysfs isn't actually a very
good interface, as finding the right directory is a pain, as is opening,
reading and parsing one ASCII file per limit.

One thing I've been wanting to do also for mkfs tools and similar is a
generic extensible ioctl to dump all the queue limits. That's a lot
easier and faster for the tools and would work very well here.

Note that we could still be adding new limits at any point of time
(although I have a hard time thinking what limit we don't have yet),
so we still can't guarantee that non-trivial I/O will always work.
On 23/10/2024 8:24, Christoph Hellwig wrote:
> On Tue, Oct 22, 2024 at 08:53:47AM -0600, Keith Busch wrote:
>> You may want to know max_sectors_kb, dma_alignment, nr_requests,
>> virt_boundary_mask. Maybe some others.
>>
>> The request_queue is owned by the block layer, so that seems like an
>> okay place to export it, but attached to some other device's sysfs
>> directory instead of a gendisk.
>>
>> I'm just suggesting this because it doesn't sound like this is an nvme
>> specific problem.
> Well, it's a problem specific to passthrough without a gendisk, which is
> the NVMe admin queue and the /dev/sg device. So it's common-ish :)
>
> Note that for programs using passthrough, sysfs isn't actually a very
> good interface, as finding the right directory is a pain, as is opening,
> reading and parsing one ASCII file per limit.
>
> One thing I've been wanting to do also for mkfs tools and similar is a
> generic extensible ioctl to dump all the queue limits. That's a lot
> easier and faster for the tools and would work very well here.
>
> Note that we could still be adding new limits at any point of time
> (although I have a hard time thinking what limit we don't have yet),
> so we still can't guarantee that non-trivial I/O will always work.

Makes sense to me. Although people would still like to be able to see
this value outside of an application context. We can probably extend
nvme-cli to display this info...
On 22/10/2024 17:53, Keith Busch wrote:
> On Thu, Oct 17, 2024 at 02:32:18PM -0700, Abhishek Bapat wrote:
>> On Thu, Oct 17, 2024 at 9:40 AM Keith Busch <kbusch@kernel.org> wrote:
>>> On Wed, Oct 16, 2024 at 09:31:08PM +0000, Abhishek Bapat wrote:
>>>> max_hw_sectors based on DMA optimized limitation") introduced a
>>>> limitation on the value of max_hw_sectors_kb, restricting it to 128KiB
>>>> (MDTS = 5). This restriction was implemented to mitigate lockups
>>>> encountered in high-core count AMD servers.
>>> There are other limits that can constrain transfer sizes below the
>>> device's MDTS. For example, the driver can only preallocate so much
>>> space for DMA and SGL descriptors, so 8MB is the current max transfer
>>> size the driver can support, and a device's MDTS can be much bigger
>>> than that.
>>>
>>> Anyway, yeah, I guess having a controller generic way to export this
>>> sounds like a good idea, but I wonder if the nvme driver is the right
>>> place to do it. The request_queue has all the limits you need to know
>>> about, but these are only exported if a gendisk is attached to it.
>>> Maybe we can create a queue subdirectory to the char dev too.
>> Are you suggesting that all the files from the queue subdirectory should
>> be included in the char dev (/sys/class/nvme/nvmeX/queue/)? Or that
>> just the max_hw_sectors_kb value should be shared within the queue
>> subdirectory? And if not the nvme driver, where else can this be done
>> from?
> You may want to know max_sectors_kb, dma_alignment, nr_requests,
> virt_boundary_mask. Maybe some others.
>
> The request_queue is owned by the block layer, so that seems like an
> okay place to export it, but attached to some other device's sysfs
> directory instead of a gendisk.
>
> I'm just suggesting this because it doesn't sound like this is an nvme
> specific problem.

Won't it be confusing to find a queue/ directory in the controller nvmeX
sysfs entry?
On Tue, Oct 22, 2024 at 06:35:11PM +0300, Sagi Grimberg wrote:
> On 22/10/2024 17:53, Keith Busch wrote:
> > On Thu, Oct 17, 2024 at 02:32:18PM -0700, Abhishek Bapat wrote:
> >
> > The request_queue is owned by the block layer, so that seems like an
> > okay place to export it, but attached to some other device's sysfs
> > directory instead of a gendisk.
> >
> > I'm just suggesting this because it doesn't sound like this is an nvme
> > specific problem.
>
> Won't it be confusing to find a queue/ directory in the controller nvmeX
> sysfs entry?

It's the attributes of the request queue associated with that
controller, so I think a queue/ directory under it makes sense. That's
how it looks for gendisks, so why not for disk-less queues?

Many queue attributes only make sense for gendisks, though, so maybe we
need to tweak visibility if we decide to do it like this.
On 22/10/2024 18:51, Keith Busch wrote:
> On Tue, Oct 22, 2024 at 06:35:11PM +0300, Sagi Grimberg wrote:
>> On 22/10/2024 17:53, Keith Busch wrote:
>>> On Thu, Oct 17, 2024 at 02:32:18PM -0700, Abhishek Bapat wrote:
>>>
>>> The request_queue is owned by the block layer, so that seems like an
>>> okay place to export it, but attached to some other device's sysfs
>>> directory instead of a gendisk.
>>>
>>> I'm just suggesting this because it doesn't sound like this is an nvme
>>> specific problem.
>> Won't it be confusing to find a queue/ directory in the controller nvmeX
>> sysfs entry?
> It's the attributes of the request queue associated with that
> controller, so I think a queue/ directory under it makes sense. That's
> how it looks for gendisks, so why not for disk-less queues?
>
> Many queue attributes only make sense for gendisks, though, so maybe we
> need to tweak visibility if we decide to do it like this.

It'd be good to name it "admin_queue" to clarify a bit (although it's
still confusing).
I agree it would be convenient for the kernel to expose a generic "max
data size" limit for the NVMe controller. For example, nvme-cli currently
pessimistically assumes a controller's maximum data transfer size is 4 KB
when sending Get Log Page commands:
https://github.com/linux-nvme/libnvme/blob/8cdd746b324bd84a0666e7a265aa253dbda9d932/src/nvme/ioctl.c#L330.
Fetching large log pages results in a lot of Get Log Page commands. If
nvme-cli could tell that the controller and kernel support larger data
transfers, it could fetch the entire log page (or a much larger chunk) in
a single Get Log Page command (a short sketch of the effect follows this
message).

Best,
Caleb

On Thu, Oct 17, 2024 at 9:46 AM Keith Busch <kbusch@kernel.org> wrote:
>
> On Wed, Oct 16, 2024 at 09:31:08PM +0000, Abhishek Bapat wrote:
> > max_hw_sectors based on DMA optimized limitation") introduced a
> > limitation on the value of max_hw_sectors_kb, restricting it to 128KiB
> > (MDTS = 5). This restriction was implemented to mitigate lockups
> > encountered in high-core count AMD servers.
>
> There are other limits that can constrain transfer sizes below the
> device's MDTS. For example, the driver can only preallocate so much
> space for DMA and SGL descriptors, so 8MB is the current max transfer
> size the driver can support, and a device's MDTS can be much bigger
> than that.
>
> Anyway, yeah, I guess having a controller generic way to export this
> sounds like a good idea, but I wonder if the nvme driver is the right
> place to do it. The request_queue has all the limits you need to know
> about, but these are only exported if a gendisk is attached to it.
> Maybe we can create a queue subdirectory to the char dev too.
>
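
A small sketch of the effect Caleb describes above, using a hypothetical
1MiB log page: a tool that pessimistically chunks at 4KiB needs 256 Get
Log Page commands, while one that knows a 128KiB limit needs only 8. Only
the chunking arithmetic is shown; the sizes are illustrative assumptions:

#include <stddef.h>
#include <stdio.h>

/* Number of Get Log Page commands needed to fetch log_len bytes in chunks. */
static size_t commands_needed(size_t log_len, size_t chunk)
{
	return (log_len + chunk - 1) / chunk;
}

int main(void)
{
	size_t log_len = 1 << 20;	/* hypothetical 1MiB log page */

	printf("4KiB chunks:   %zu commands\n", commands_needed(log_len, 4 << 10));
	printf("128KiB chunks: %zu commands\n", commands_needed(log_len, 128 << 10));
	return 0;
}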
Hi Abhishek,

On Wed, 16 Oct 2024 at 14:31, Abhishek Bapat <abhishekbapat@google.com> wrote:
>
> From: Abhishek <abhishekbapat@google.com>

Here and in the S-o-b line: Please use your full legal name.

> diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
> index b68a9e5f1ea3..1af2b2cf1a6c 100644
> --- a/drivers/nvme/host/sysfs.c
> +++ b/drivers/nvme/host/sysfs.c
> @@ -546,6 +546,17 @@ static ssize_t dctype_show(struct device *dev,
> }
> static DEVICE_ATTR_RO(dctype);
>
> +static ssize_t max_hw_sectors_kb_show(struct device *dev,
> +                                      struct device_attribute *attr,
> +                                      char *buf)
> +{
> +       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
> +       u32 max_hw_sectors_kb = ctrl->max_hw_sectors >> 1;

In what unit is max_hw_sectors stored? If it's "number of sectors", is this
conversion to size correct, or should SECTOR_SHIFT be used?

Thanks,
--
-Prashant
On Wed, Oct 16, 2024 at 2:54 PM Prashant Malani <pmalani@google.com> wrote:
>
> Hi Abhishek,
>
> On Wed, 16 Oct 2024 at 14:31, Abhishek Bapat <abhishekbapat@google.com> wrote:
> >
> > From: Abhishek <abhishekbapat@google.com>
>
> Here and in the S-o-b line: Please use your full legal name.

Acknowledged.

> > diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
> > index b68a9e5f1ea3..1af2b2cf1a6c 100644
> > --- a/drivers/nvme/host/sysfs.c
> > +++ b/drivers/nvme/host/sysfs.c
> > @@ -546,6 +546,17 @@ static ssize_t dctype_show(struct device *dev,
> > }
> > static DEVICE_ATTR_RO(dctype);
> >
> > +static ssize_t max_hw_sectors_kb_show(struct device *dev,
> > +                                      struct device_attribute *attr,
> > +                                      char *buf)
> > +{
> > +       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
> > +       u32 max_hw_sectors_kb = ctrl->max_hw_sectors >> 1;
>
> In what unit is max_hw_sectors stored? If it's "number of sectors", is this
> conversion to size correct, or should SECTOR_SHIFT be used?

The unit for max_hw_sectors is sectors. Left shifting by SECTOR_SHIFT
would change the unit to bytes, and we would then have to right shift by
10 to convert it into KiB. The net effect is a right shift by one, since
the value of SECTOR_SHIFT is 9. The same pattern is used within the
block layer code as well.
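
A quick check of the arithmetic above, assuming 512-byte sectors
(SECTOR_SHIFT == 9), shows the two conversions agree:

#include <assert.h>
#include <stdio.h>

int main(void)
{
	unsigned int sectors = 256;			/* 256 sectors * 512 B = 128KiB */
	unsigned int kb = (sectors << 9) >> 10;		/* sectors -> bytes -> KiB */

	assert(kb == sectors >> 1);			/* same as a single right shift */
	printf("%u sectors = %u KiB\n", sectors, kb);
	return 0;
}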