From: Abhishek <abhishekbapat@google.com>
The NVMe driver initializes max_hw_sectors_kb by issuing an NVMe
Identify Controller command and reading the MDTS (Max Data Transfer
Size) field from the result. Commit 3710e2b056cb ("nvme-pci: clamp
max_hw_sectors based on DMA optimized limitation") introduced a
limitation on the value of max_hw_sectors_kb, restricting it to 128KiB
(MDTS = 5). This restriction was implemented to mitigate lockups
encountered in high-core count AMD servers.
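
For reference, the NVMe specification defines MDTS as a power of two in
units of the controller's minimum memory page size (CAP.MPSMIN, itself
2^(12 + MPSMIN) bytes), so MDTS = 5 with a 4KiB minimum page size works
out to 128KiB. A minimal userspace sketch of that conversion follows;
mdts_to_bytes() is purely illustrative and not a kernel helper:

#include <stdint.h>
#include <stdio.h>

/* Illustrative only: translate an Identify Controller MDTS value into bytes. */
static uint64_t mdts_to_bytes(uint8_t mdts, uint8_t mpsmin)
{
	uint64_t min_page = 1ULL << (12 + mpsmin);	/* CAP.MPSMIN page size */

	return mdts ? min_page << mdts : 0;		/* MDTS == 0 means "no limit" */
}

int main(void)
{
	/* MDTS = 5 with a 4KiB minimum page size: 32 * 4096 = 131072 bytes (128KiB). */
	printf("%llu\n", (unsigned long long)mdts_to_bytes(5, 0));
	return 0;
}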
Currently, user space applications have two options for obtaining the
max_hw_sectors_kb value to determine the payload size of the NVMe
command they wish to issue: they can either execute the Identify
Controller command themselves or query the kernel. If the underlying
NVMe device supports MDTS > 5 (128KiB) and the application fetches the
MDTS value through the Identify Controller command, it can build an
NVMe command with a payload larger than 128KiB. Such a command fails
with an Invalid Argument error (-EINVAL), preventing the application
from issuing it through any of the kernel-supported I/O APIs.
Presently, the kernel exposes the max_hw_sectors_kb value through a
queue sysfs file. However, that file only exists for an NVMe device if
a namespace has been created on it, so a namespace must exist before
max_hw_sectors_kb can be queried. This dependency is semantically
incorrect, as MDTS is a controller-associated field (section 5.1.13,
NVMe specification 2.1) and should be accessible regardless of whether
a namespace is present on the NVMe device.
Expose max_hw_sectors_kb through the controller's NVMe sysfs directory
to remove the requirement of having a namespace on the device before
the value can be read.
Signed-off-by: Abhishek <abhishekbapat@google.com>
---
drivers/nvme/host/sysfs.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index b68a9e5f1ea3..1af2b2cf1a6c 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -546,6 +546,17 @@ static ssize_t dctype_show(struct device *dev,
}
static DEVICE_ATTR_RO(dctype);
+static ssize_t max_hw_sectors_kb_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+ u32 max_hw_sectors_kb = ctrl->max_hw_sectors >> 1;
+
+ return sysfs_emit(buf, "%u\n", max_hw_sectors_kb);
+}
+static DEVICE_ATTR_RO(max_hw_sectors_kb);
+
#ifdef CONFIG_NVME_HOST_AUTH
static ssize_t nvme_ctrl_dhchap_secret_show(struct device *dev,
struct device_attribute *attr, char *buf)
@@ -687,6 +698,7 @@ static struct attribute *nvme_dev_attrs[] = {
&dev_attr_kato.attr,
&dev_attr_cntrltype.attr,
&dev_attr_dctype.attr,
+ &dev_attr_max_hw_sectors_kb.attr,
#ifdef CONFIG_NVME_HOST_AUTH
&dev_attr_dhchap_secret.attr,
&dev_attr_dhchap_ctrl_secret.attr,
--
2.47.0.rc1.288.g06298d1525-goog
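
If the attribute lands as proposed, userspace can read the controller-wide
limit without any namespace being present. A hedged usage sketch, assuming
controller nvme0 and the sysfs path implied by the patch above:

#include <stdio.h>

int main(void)
{
	/* Assumes the proposed attribute exists for controller nvme0. */
	const char *path = "/sys/class/nvme/nvme0/max_hw_sectors_kb";
	unsigned int kb;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%u", &kb) != 1) {
		fclose(f);
		return 1;
	}
	fclose(f);
	printf("max data transfer per command: %u KiB\n", kb);
	return 0;
}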
On Wed, Oct 16, 2024 at 09:31:08PM +0000, Abhishek Bapat wrote:
> max_hw_sectors based on DMA optimized limitation") introduced a
> limitation on the value of max_hw_sectors_kb, restricting it to 128KiB
> (MDTS = 5). This restriction was implemented to mitigate lockups
> encountered in high-core count AMD servers.

There are other limits that can constrain transfer sizes below the
device's MDTS. For example, the driver can only preallocate so much
space for DMA and SGL descriptors, so 8MB is the current max transfer
size the driver can support, and a device's MDTS can be much bigger
than that.

Anyway, yeah, I guess having a controller generic way to export this
sounds like a good idea, but I wonder if the nvme driver is the right
place to do it. The request_queue has all the limits you need to know
about, but these are only exported if a gendisk is attached to it.
Maybe we can create a queue subdirectory to the char dev too.
On Thu, Oct 17, 2024 at 10:40:36AM -0600, Keith Busch wrote:
> On Wed, Oct 16, 2024 at 09:31:08PM +0000, Abhishek Bapat wrote:
> > max_hw_sectors based on DMA optimized limitation") introduced a
> > limitation on the value of max_hw_sectors_kb, restricting it to 128KiB
> > (MDTS = 5). This restriction was implemented to mitigate lockups
> > encountered in high-core count AMD servers.
>
> There are other limits that can constrain transfer sizes below the
> device's MDTS. For example, the driver can only preallocate so much
> space for DMA and SGL descriptors, so 8MB is the current max transfer
> size the driver can support, and a device's MDTS can be much bigger
> than that.

Yes. Plus the virt boundary for PRPs, and for non-PCIe transfers
there's also plenty of other hardware limits due to e.g. the FC HBA
and the RDMA HCA limit. There's also been some talk of a new PCIe
SGL variant with hard limits.

So I agree that exposing limits on I/O would be very useful, but it's
also kinda non-trivial.

> Anyway, yeah, I guess having a controller generic way to export this
> sounds like a good idea, but I wonder if the nvme driver is the right
> place to do it. The request_queue has all the limits you need to know
> about, but these are only exported if a gendisk is attached to it.
> Maybe we can create a queue subdirectory to the char dev too.

If we want it controller wide to e.g. include the admin queue the
gendisk won't really help unfortunately.
On 18/10/2024 8:14, Christoph Hellwig wrote:
> On Thu, Oct 17, 2024 at 10:40:36AM -0600, Keith Busch wrote:
>> On Wed, Oct 16, 2024 at 09:31:08PM +0000, Abhishek Bapat wrote:
>>> max_hw_sectors based on DMA optimized limitation") introduced a
>>> limitation on the value of max_hw_sectors_kb, restricting it to 128KiB
>>> (MDTS = 5). This restriction was implemented to mitigate lockups
>>> encountered in high-core count AMD servers.
>> There are other limits that can constrain transfer sizes below the
>> device's MDTS. For example, the driver can only preallocate so much
>> space for DMA and SGL descriptors, so 8MB is the current max transfer
>> size the driver can support, and a device's MDTS can be much bigger
>> than that.
> Yes. Plus the virt boundary for PRPs, and for non-PCIe transfers
> there's also plenty of other hardware limits due to e.g. the FC HBA
> and the RDMA HCA limit. There's also been some talk of a new PCIe
> SGL variant with hard limits.
>
> So I agree that exposing limits on I/O would be very useful, but it's
> also kinda non-trivial.

I think the ctrl misc device attributes are fine to expose this and
other types of attributes (like we already do today).
On Thu, Oct 17, 2024 at 9:40 AM Keith Busch <kbusch@kernel.org> wrote:
>
> On Wed, Oct 16, 2024 at 09:31:08PM +0000, Abhishek Bapat wrote:
> > max_hw_sectors based on DMA optimized limitation") introduced a
> > limitation on the value of max_hw_sectors_kb, restricting it to 128KiB
> > (MDTS = 5). This restriction was implemented to mitigate lockups
> > encountered in high-core count AMD servers.
>
> There are other limits that can constrain transfer sizes below the
> device's MDTS. For example, the driver can only preallocate so much
> space for DMA and SGL descriptors, so 8MB is the current max transfer
> size the driver can support, and a device's MDTS can be much bigger
> than that.
>
> Anyway, yeah, I guess having a controller generic way to export this
> sounds like a good idea, but I wonder if the nvme driver is the right
> place to do it. The request_queue has all the limits you need to know
> about, but these are only exported if a gendisk is attached to it.
> Maybe we can create a queue subdirectory to the char dev too.

Are you suggesting that all the files from the queue subdirectory should
be included in the char dev (/sys/class/nvme/nvmeX/queue/)? Or that
just the max_hw_sectors_kb value should be shared within the queue
subdirectory? And if not the nvme driver, where else can this be done
from?
On Thu, Oct 17, 2024 at 02:32:18PM -0700, Abhishek Bapat wrote:
> On Thu, Oct 17, 2024 at 9:40 AM Keith Busch <kbusch@kernel.org> wrote:
> >
> > On Wed, Oct 16, 2024 at 09:31:08PM +0000, Abhishek Bapat wrote:
> > > max_hw_sectors based on DMA optimized limitation") introduced a
> > > limitation on the value of max_hw_sectors_kb, restricting it to 128KiB
> > > (MDTS = 5). This restriction was implemented to mitigate lockups
> > > encountered in high-core count AMD servers.
> >
> > There are other limits that can constrain transfer sizes below the
> > device's MDTS. For example, the driver can only preallocate so much
> > space for DMA and SGL descriptors, so 8MB is the current max transfer
> > size the driver can support, and a device's MDTS can be much bigger
> > than that.
> >
> > Anyway, yeah, I guess having a controller generic way to export this
> > sounds like a good idea, but I wonder if the nvme driver is the right
> > place to do it. The request_queue has all the limits you need to know
> > about, but these are only exported if a gendisk is attached to it.
> > Maybe we can create a queue subdirectory to the char dev too.
>
> Are you suggesting that all the files from the queue subdirectory should
> be included in the char dev (/sys/class/nvme/nvmeX/queue/)? Or that
> just the max_hw_sectors_kb value should be shared within the queue
> subdirectory? And if not the nvme driver, where else can this be done
> from?

You may want to know max_sectors_kb, dma_alignment, nr_requests,
virt_boundary_mask. Maybe some others.

The request_queue is owned by the block layer, so that seems like an
okay place to export it, but attached to some other device's sysfs
directory instead of a gendisk.

I'm just suggesting this because it doesn't sound like this is an nvme
specific problem.
On Tue, Oct 22, 2024 at 08:53:47AM -0600, Keith Busch wrote:
> You may want to know max_sectors_kb, dma_alignment, nr_requests,
> virt_boundary_mask. Maybe some others.
>
> The request_queue is owned by the block layer, so that seems like an
> okay place to export it, but attached to some other device's sysfs
> directory instead of a gendisk.
>
> I'm just suggesting this because it doesn't sound like this is an nvme
> specific problem.

Well, it's a problem specific to passthrough without a gendisk, which is
the NVMe admin queue and the /dev/sg device. So it's common-ish :)

Note that for programs using passthrough, sysfs isn't actually a very
good interface, as finding the right directory is a pain, as is opening,
reading and parsing one ASCII file per limit.

One thing I've been wanting to do also for mkfs tools and similar is a
generic extensible ioctl to dump all the queue limits. That's a lot
easier and faster for the tools and would work very well here.

Note that we could still be adding new limits at any point of time
(although I have a hard time thinking what limit we don't have yet),
so we still can't guarantee that non-trivial I/O will always work.
On 23/10/2024 8:24, Christoph Hellwig wrote:
> On Tue, Oct 22, 2024 at 08:53:47AM -0600, Keith Busch wrote:
>> You may want to know max_sectors_kb, dma_alignment, nr_requests,
>> virt_boundary_mask. Maybe some others.
>>
>> The request_queue is owned by the block layer, so that seems like an
>> okay place to export it, but attached to some other device's sysfs
>> directory instead of a gendisk.
>>
>> I'm just suggesting this because it doesn't sound like this is an nvme
>> specific problem.
> Well, it's a problem specific to passthrough without a gendisk, which is
> the NVMe admin queue and the /dev/sg device. So it's common-ish :)
>
> Note that for programs using passthrough, sysfs isn't actually a very
> good interface, as finding the right directory is a pain, as is opening,
> reading and parsing one ASCII file per limit.
>
> One thing I've been wanting to do also for mkfs tools and similar is a
> generic extensible ioctl to dump all the queue limits. That's a lot
> easier and faster for the tools and would work very well here.
>
> Note that we could still be adding new limits at any point of time
> (although I have a hard time thinking what limit we don't have yet),
> so we still can't guarantee that non-trivial I/O will always work.

Makes sense to me. Although people would still like to be able to see
this value outside of an application context. We can probably extend
nvme-cli to display this info...
On 22/10/2024 17:53, Keith Busch wrote:
> On Thu, Oct 17, 2024 at 02:32:18PM -0700, Abhishek Bapat wrote:
>> On Thu, Oct 17, 2024 at 9:40 AM Keith Busch <kbusch@kernel.org> wrote:
>>> On Wed, Oct 16, 2024 at 09:31:08PM +0000, Abhishek Bapat wrote:
>>>> max_hw_sectors based on DMA optimized limitation") introduced a
>>>> limitation on the value of max_hw_sectors_kb, restricting it to 128KiB
>>>> (MDTS = 5). This restriction was implemented to mitigate lockups
>>>> encountered in high-core count AMD servers.
>>> There are other limits that can constrain transfer sizes below the
>>> device's MDTS. For example, the driver can only preallocate so much
>>> space for DMA and SGL descriptors, so 8MB is the current max transfer
>>> size the driver can support, and a device's MDTS can be much bigger
>>> than that.
>>>
>>> Anyway, yeah, I guess having a controller generic way to export this
>>> sounds like a good idea, but I wonder if the nvme driver is the right
>>> place to do it. The request_queue has all the limits you need to know
>>> about, but these are only exported if a gendisk is attached to it.
>>> Maybe we can create a queue subdirectory to the char dev too.
>> Are you suggesting that all the files from the queue subdirectory should
>> be included in the char dev (/sys/class/nvme/nvmeX/queue/)? Or that
>> just the max_hw_sectors_kb value should be shared within the queue
>> subdirectory? And if not the nvme driver, where else can this be done
>> from?
> You may want to know max_sectors_kb, dma_alignment, nr_requests,
> virt_boundary_mask. Maybe some others.
>
> The request_queue is owned by the block layer, so that seems like an
> okay place to export it, but attached to some other device's sysfs
> directory instead of a gendisk.
>
> I'm just suggesting this because it doesn't sound like this is an nvme
> specific problem.

Won't it be confusing to find a queue/ directory in the controller nvmeX
sysfs entry?
On Tue, Oct 22, 2024 at 06:35:11PM +0300, Sagi Grimberg wrote:
> On 22/10/2024 17:53, Keith Busch wrote:
> > On Thu, Oct 17, 2024 at 02:32:18PM -0700, Abhishek Bapat wrote:
> >
> > The request_queue is owned by the block layer, so that seems like an
> > okay place to export it, but attached to some other device's sysfs
> > directory instead of a gendisk.
> >
> > I'm just suggesting this because it doesn't sound like this is an nvme
> > specific problem.
>
> Won't it be confusing to find a queue/ directory in the controller nvmeX
> sysfs entry?

It's the attributes of the request queue associated with that
controller, so I think a queue/ directory under it makes sense. That's
how it looks for gendisks, so why not for disk-less queues?

Many queue attributes only make sense for gendisks, though, so maybe we
need to tweak visibility if we decide to do it like this.
On 22/10/2024 18:51, Keith Busch wrote:
> On Tue, Oct 22, 2024 at 06:35:11PM +0300, Sagi Grimberg wrote:
>> On 22/10/2024 17:53, Keith Busch wrote:
>>> On Thu, Oct 17, 2024 at 02:32:18PM -0700, Abhishek Bapat wrote:
>>>
>>> The request_queue is owned by the block layer, so that seems like an
>>> okay place to export it, but attached to some other device's sysfs
>>> directory instead of a gendisk.
>>>
>>> I'm just suggesting this because it doesn't sound like this is an nvme
>>> specific problem.
>> Won't it be confusing to find a queue/ directory in the controller nvmeX
>> sysfs entry?
> It's the attributes of the request queue associated with that
> controller, so I think a queue/ directory under it makes sense. That's
> how it looks for gendisks, so why not for disk-less queues?
>
> Many queue attributes only make sense for gendisks, though, so maybe we
> need to tweak visibility if we decide to do it like this.

It'd be good to name it "admin_queue" to clarify a bit (although it's
still confusing).
I agree it would be convenient for the kernel to expose a generic "max
data size" limit for the NVMe controller. For example, nvme-cli currently
pessimistically assumes a controller's maximum data transfer size is 4 KB
when sending Get Log Page commands:
https://github.com/linux-nvme/libnvme/blob/8cdd746b324bd84a0666e7a265aa253dbda9d932/src/nvme/ioctl.c#L330.
Fetching large log pages results in a lot of Get Log Page commands. If
nvme-cli could tell that the controller and kernel support larger data
transfers, it could fetch the entire log page (or a much larger chunk) in
a single Get Log Page command (a short sketch of the effect follows this
message).

Best,
Caleb

On Thu, Oct 17, 2024 at 9:46 AM Keith Busch <kbusch@kernel.org> wrote:
>
> On Wed, Oct 16, 2024 at 09:31:08PM +0000, Abhishek Bapat wrote:
> > max_hw_sectors based on DMA optimized limitation") introduced a
> > limitation on the value of max_hw_sectors_kb, restricting it to 128KiB
> > (MDTS = 5). This restriction was implemented to mitigate lockups
> > encountered in high-core count AMD servers.
>
> There are other limits that can constrain transfer sizes below the
> device's MDTS. For example, the driver can only preallocate so much
> space for DMA and SGL descriptors, so 8MB is the current max transfer
> size the driver can support, and a device's MDTS can be much bigger
> than that.
>
> Anyway, yeah, I guess having a controller generic way to export this
> sounds like a good idea, but I wonder if the nvme driver is the right
> place to do it. The request_queue has all the limits you need to know
> about, but these are only exported if a gendisk is attached to it.
> Maybe we can create a queue subdirectory to the char dev too.
>
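
A small sketch of the effect Caleb describes above, using a hypothetical
1MiB log page: a tool that pessimistically chunks at 4KiB needs 256 Get
Log Page commands, while one that knows a 128KiB limit needs only 8. Only
the chunking arithmetic is shown; the sizes are illustrative assumptions:

#include <stddef.h>
#include <stdio.h>

/* Number of Get Log Page commands needed to fetch log_len bytes in chunks. */
static size_t commands_needed(size_t log_len, size_t chunk)
{
	return (log_len + chunk - 1) / chunk;
}

int main(void)
{
	size_t log_len = 1 << 20;	/* hypothetical 1MiB log page */

	printf("4KiB chunks:   %zu commands\n", commands_needed(log_len, 4 << 10));
	printf("128KiB chunks: %zu commands\n", commands_needed(log_len, 128 << 10));
	return 0;
}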
Hi Abhishek,

On Wed, 16 Oct 2024 at 14:31, Abhishek Bapat <abhishekbapat@google.com> wrote:
>
> From: Abhishek <abhishekbapat@google.com>

Here and in the S-o-b line: Please use your full legal name.

> diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
> index b68a9e5f1ea3..1af2b2cf1a6c 100644
> --- a/drivers/nvme/host/sysfs.c
> +++ b/drivers/nvme/host/sysfs.c
> @@ -546,6 +546,17 @@ static ssize_t dctype_show(struct device *dev,
> }
> static DEVICE_ATTR_RO(dctype);
>
> +static ssize_t max_hw_sectors_kb_show(struct device *dev,
> +                                      struct device_attribute *attr,
> +                                      char *buf)
> +{
> +       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
> +       u32 max_hw_sectors_kb = ctrl->max_hw_sectors >> 1;

In what unit is max_hw_sectors stored? If it's "number of sectors", is this
conversion to size correct, or should SECTOR_SHIFT be used?

Thanks,
--
-Prashant
On Wed, Oct 16, 2024 at 2:54 PM Prashant Malani <pmalani@google.com> wrote:
>
> Hi Abhishek,
>
> On Wed, 16 Oct 2024 at 14:31, Abhishek Bapat <abhishekbapat@google.com> wrote:
> >
> > From: Abhishek <abhishekbapat@google.com>
>
> Here and in the S-o-b line: Please use your full legal name.

Acknowledged.

> > diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
> > index b68a9e5f1ea3..1af2b2cf1a6c 100644
> > --- a/drivers/nvme/host/sysfs.c
> > +++ b/drivers/nvme/host/sysfs.c
> > @@ -546,6 +546,17 @@ static ssize_t dctype_show(struct device *dev,
> > }
> > static DEVICE_ATTR_RO(dctype);
> >
> > +static ssize_t max_hw_sectors_kb_show(struct device *dev,
> > +                                      struct device_attribute *attr,
> > +                                      char *buf)
> > +{
> > +       struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
> > +       u32 max_hw_sectors_kb = ctrl->max_hw_sectors >> 1;
>
> In what unit is max_hw_sectors stored? If it's "number of sectors", is this
> conversion to size correct, or should SECTOR_SHIFT be used?

The unit for max_hw_sectors is sectors. Left shifting by SECTOR_SHIFT
would change the unit to bytes, and we would then have to right shift by
10 to convert it into KiB. The net effect is a right shift by one, since
the value of SECTOR_SHIFT is 9. The same pattern is used within the
block layer code as well.
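
A quick check of the arithmetic above, assuming 512-byte sectors
(SECTOR_SHIFT == 9), shows the two conversions agree:

#include <assert.h>
#include <stdio.h>

int main(void)
{
	unsigned int sectors = 256;			/* 256 sectors * 512 B = 128KiB */
	unsigned int kb = (sectors << 9) >> 10;		/* sectors -> bytes -> KiB */

	assert(kb == sectors >> 1);			/* same as a single right shift */
	printf("%u sectors = %u KiB\n", sectors, kb);
	return 0;
}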