[RFC PATCH] nvme: add an opt-in to use AWUPF

John Garry posted 1 patch 1 month, 2 weeks ago
[RFC PATCH] nvme: add an opt-in to use AWUPF
Posted by John Garry 1 month, 2 weeks ago
As described at [0], many parts of the atomic write specification are
lacking.

For now, there is nothing which we can do in software about the lack of
a dedicated NVMe write atomic command.

As for reading the atomic write limits, it is felt that the per-namespace
values are mostly properly specified and it is assumed that they are
properly implemented.

The specification of NAWUPF is quite clear. However the specification of
NABSPF is less clear. The lack of clarity in NABSPF comes from deciding
whether NABSPF applies when NSABP is 0 - it is assumed that NSABPF does
not apply when NSABP is 0.

As for the per-controller AWUPF, how this value applies to shared
namespaces is missing in the specification. Furthermore, the value is in
terms of logical blocks, which is an NS entity.

It would be preferred to stop honouring AWUPF altogether, but this may
needlessly disable atomic write support for many "good" devices which
only specify AWUPF. Currently all validation of controller-related
atomics limits is dropped.

As a compromise, allow users of such devices to use atomic writes at
their own risk by adding a per-subsystem sysfs file to enable reading of
AWUPF.

A udev rule like the following could be used to auto-enable:
SUBSYSTEM=="nvme-subsystem", ACTION=="add", RUN+="/bin/sh -c 'echo 1 > /sys/$devpath/use_awupf'"

Such a rule would need to run before any filesystem is mounted on the
device.

Having a per-controller file could also work as an alternative, but it
is unlikely that a system will have a mix of controllers which we would
want to selectively enable reading AWUPF.

Note that AWUPF not only effects atomic write support, but also the
physical block size reported for the device.

[0] https://lore.kernel.org/linux-nvme/20250707141834.GA30198@lst.de/

Signed-off-by: John Garry <john.g.garry@oracle.com>

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 812c1565114fd..e8c78e000d9d7 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2040,7 +2040,7 @@ static u32 nvme_configure_atomic_write(struct nvme_ns *ns,
 		atomic_bs = (1 + le16_to_cpu(id->nawupf)) * bs;
 		if (id->nabspf)
 			boundary = (le16_to_cpu(id->nabspf) + 1) * bs;
-	} else {
+	} else if (ns->ctrl->subsys->use_awupf) {
 		/*
 		 * Use the controller wide atomic write unit.  This sucks
 		 * because the limit is defined in terms of logical blocks while
@@ -2049,6 +2049,8 @@ static u32 nvme_configure_atomic_write(struct nvme_ns *ns,
 		 * values for different controllers in the subsystem.
 		 */
 		atomic_bs = (1 + ns->ctrl->subsys->awupf) * bs;
+	} else {
+		atomic_bs = bs;
 	}
 
 	lim->atomic_write_hw_max = atomic_bs;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index cfd2b5b90b915..d75a82851b684 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -447,6 +447,7 @@ struct nvme_subsystem {
 #ifdef CONFIG_NVME_MULTIPATH
 	enum nvme_iopolicy	iopolicy;
 #endif
+	bool			use_awupf;
 };
 
 /*
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index 29430949ce2f0..ca2e4b5937289 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -894,6 +894,75 @@ static ssize_t nvme_subsys_show_type(struct device *dev,
 }
 static SUBSYS_ATTR_RO(subsystype, S_IRUGO, nvme_subsys_show_type);
 
+static ssize_t nvme_subsys_use_awupf_show(struct device *dev,
+				    struct device_attribute *attr,
+				    char *buf)
+{
+	struct nvme_subsystem *subsys =
+		container_of(dev, struct nvme_subsystem, dev);
+
+	return sysfs_emit(buf, "%d\n", subsys->use_awupf);
+}
+
+static ssize_t nvme_subsys_use_awupf_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct nvme_subsystem *subsys =
+		container_of(dev, struct nvme_subsystem, dev);
+	struct nvme_ns_head *h;
+	struct nvme_ctrl *tmp;
+	bool enabled;
+	int err;
+
+	err = kstrtobool(buf, &enabled);
+	if (err)
+		return -EINVAL;
+
+	mutex_lock(&nvme_subsystems_lock);
+	if (subsys->use_awupf == enabled)
+		goto out_unlock;
+
+	list_for_each_entry(h, &subsys->nsheads, entry) {
+		struct queue_limits lim;
+		unsigned int memflags;
+
+		if (!nvme_ns_head_multipath(h))
+			continue;
+
+		/*
+		 * The head NS queue atomic limits are stacked. We need to zero
+		 * to stack all the limits per-controller afresh.
+		 */
+		lim = queue_limits_start_update(h->disk->queue);
+		memflags = blk_mq_freeze_queue(h->disk->queue);
+
+		lim.atomic_write_hw_max = 0;
+		lim.atomic_write_hw_unit_min = 0;
+		lim.atomic_write_hw_unit_max = 0;
+		lim.atomic_write_hw_boundary = 0;
+
+		err = queue_limits_commit_update(h->disk->queue, &lim);
+		blk_mq_unfreeze_queue(h->disk->queue, memflags);
+		if (err) {
+			count = err;
+			goto out_unlock;
+		}
+	}
+
+	list_for_each_entry(tmp, &subsys->ctrls, subsys_entry)
+		nvme_queue_scan(tmp);
+
+	subsys->use_awupf = enabled;
+out_unlock:
+	mutex_unlock(&nvme_subsystems_lock);
+
+	return count;
+}
+
+static struct device_attribute subsys_attr_use_awupf = \
+	__ATTR(use_awupf, S_IRUGO | S_IWUSR, \
+	nvme_subsys_use_awupf_show, nvme_subsys_use_awupf_store);
+
 #define nvme_subsys_show_str_function(field)				\
 static ssize_t subsys_##field##_show(struct device *dev,		\
 			    struct device_attribute *attr, char *buf)	\
@@ -918,6 +987,7 @@ static struct attribute *nvme_subsys_attrs[] = {
 #ifdef CONFIG_NVME_MULTIPATH
 	&subsys_attr_iopolicy.attr,
 #endif
+	&subsys_attr_use_awupf.attr,
 	NULL,
 };
 
-- 
2.43.5
Re: [RFC PATCH] nvme: add an opt-in to use AWUPF
Posted by Christoph Hellwig 1 month, 1 week ago
On Wed, Aug 20, 2025 at 03:02:20PM +0000, John Garry wrote:
> As described at [0], many parts of the atomic write specification are
> lacking.

I like your british understatement.

> +	list_for_each_entry(tmp, &subsys->ctrls, subsys_entry)
> +		nvme_queue_scan(tmp);

queueing a full rescan here seems expensive.  What about just keeping
the awupf value in our internal data structures and always use it
for the physical block size calculation, but only apply it to the
atomic limits based on a flag?
Re: [RFC PATCH] nvme: add an opt-in to use AWUPF
Posted by John Garry 1 month, 1 week ago
On 21/08/2025 09:35, Christoph Hellwig wrote:
>> +	list_for_each_entry(tmp, &subsys->ctrls, subsys_entry)
>> +		nvme_queue_scan(tmp);
> queueing a full rescan here seems expensive.  What about just keeping
> the awupf value in our internal data structures and always use it
> for the physical block size calculation,

is that even ok? physical block size implies atomicity also.

> but only apply it to the
> atomic limits based on a flag?

We have flag BLK_FEAT_ATOMIC_WRITES, which could be used, but in 
blk_validate_atomic_write_limits() we zero the hw limits if that flag is 
unset. So we need to fill in the hw limits like in this patch.
Re: [RFC PATCH] nvme: add an opt-in to use AWUPF
Posted by Keith Busch 1 month, 2 weeks ago
On Wed, Aug 20, 2025 at 03:02:20PM +0000, John Garry wrote:
> It would be preferred to stop honouring AWUPF altogether, but this may
> needlessly disable atomic write support for many "good" devices which
> only specify AWUPF. Currently all validation of controller-related
> atomics limits is dropped.

These "good" devices that only report AWUPF, is there some set of
characteristics that generally applies to all of them? I tried to list
out conditions for when I think the value could be counted on here:

  https://lore.kernel.org/linux-nvme/aGvuRS8VmC0JXAR3@kbusch-mbp/

I just don't know if you know of any devices where that criteria doesn't
git. If not, maybe we can work with that without introducing more user
knobs.
Re: [RFC PATCH] nvme: add an opt-in to use AWUPF
Posted by John Garry 1 month, 1 week ago
On 20/08/2025 22:51, Keith Busch wrote:
> On Wed, Aug 20, 2025 at 03:02:20PM +0000, John Garry wrote:
>> It would be preferred to stop honouring AWUPF altogether, but this may
>> needlessly disable atomic write support for many "good" devices which
>> only specify AWUPF. Currently all validation of controller-related
>> atomics limits is dropped.
> 
> These "good" devices that only report AWUPF, is there some set of
> characteristics that generally applies to all of them? I tried to list
> out conditions for when I think the value could be counted on here:
> 
>    https://lore.kernel.org/linux-nvme/aGvuRS8VmC0JXAR3@kbusch-mbp/
> 
> I just don't know if you know of any devices where that criteria doesn't
> git. If not, maybe we can work with that without introducing more user
> knobs.

About the rules,

   1. CMIC == 0; and
   2. OACS.NMS == 0; and
   3.
     a. FNA.FNS == 1; or
     b. NN == 1

I have access to two controllers and they both set OACS.NMS and neither 
set FNA.FNS. I wonder how common these rules would pass to be useful. 
Then having 1x namespace is quite limiting also.

Thanks,
John
Re: [RFC PATCH] nvme: add an opt-in to use AWUPF
Posted by Christoph Hellwig 1 month, 1 week ago
On Wed, Aug 20, 2025 at 03:51:02PM -0600, Keith Busch wrote:
> On Wed, Aug 20, 2025 at 03:02:20PM +0000, John Garry wrote:
> > It would be preferred to stop honouring AWUPF altogether, but this may
> > needlessly disable atomic write support for many "good" devices which
> > only specify AWUPF. Currently all validation of controller-related
> > atomics limits is dropped.
> 
> These "good" devices that only report AWUPF, is there some set of
> characteristics that generally applies to all of them? I tried to list
> out conditions for when I think the value could be counted on here:
> 
>   https://lore.kernel.org/linux-nvme/aGvuRS8VmC0JXAR3@kbusch-mbp/
> 
> I just don't know if you know of any devices where that criteria doesn't
> git. If not, maybe we can work with that without introducing more user
> knobs.

Given how broken the field is I'd rather not make any general exception.
If you want to use a device claiming that you use the override and you
are responsible.  NAWUPF has been added in Linux 1.2, and the brokenness
of AWUPF should have been obvious to firmware engineers ever since they
had to deal with multiple LBA formats.  There's really no excuse for not
supporting the sane version that is a trivial firmware addition.
Re: [RFC PATCH] nvme: add an opt-in to use AWUPF
Posted by Nilay Shroff 1 month, 1 week ago

On 8/21/25 3:21 AM, Keith Busch wrote:
> On Wed, Aug 20, 2025 at 03:02:20PM +0000, John Garry wrote:
>> It would be preferred to stop honouring AWUPF altogether, but this may
>> needlessly disable atomic write support for many "good" devices which
>> only specify AWUPF. Currently all validation of controller-related
>> atomics limits is dropped.
> 
> These "good" devices that only report AWUPF, is there some set of
> characteristics that generally applies to all of them? 
Yes, I know of such “good” devices. Typically, they report a consistent
AWUPF value across all controllers, and that value does not change when
the namespaces attached to those controllers are reformatted.
As we know, though, validating this in code is difficult. A past attempt
to implement such validation failed because there are disks in the field
that do change the AWUPF value when reformatting namespaces.

> I tried to list out conditions for when I think the value could be counted on here:
> 
>   https://lore.kernel.org/linux-nvme/aGvuRS8VmC0JXAR3@kbusch-mbp/
> 
Regarding the checks you listed, they seem too restrictive. They only support
AWUPF for single-controller devices and for disks that either have only one
namespace or support formatting all namespaces together (i.e., do not support
formatting individual namespaces).

> I just don't know if you know of any devices where that criteria doesn't
> git. If not, maybe we can work with that without introducing more user
> knobs.
Given this, IMO, an opt-in approach should be considered, letting users
decide whether they want to enable atomic write support or not.

Thanks,
--Nilay
Re: [RFC PATCH] nvme: add an opt-in to use AWUPF
Posted by Keith Busch 1 month, 1 week ago
On Thu, Aug 21, 2025 at 10:55:31AM +0530, Nilay Shroff wrote:
> On 8/21/25 3:21 AM, Keith Busch wrote:
> > I tried to list out conditions for when I think the value could be counted on here:
> > 
> >   https://lore.kernel.org/linux-nvme/aGvuRS8VmC0JXAR3@kbusch-mbp/
> > 
> Regarding the checks you listed, they seem too restrictive. They only support
> AWUPF for single-controller devices and for disks that either have only one
> namespace or support formatting all namespaces together (i.e., do not support
> formatting individual namespaces).

Being restrictive was the point. It constrains to 1.0 type behavior
where things were much simpler. I am just shocked vendors went through
the trouble to implement the complicated features but couldn't be
bothered to do the trivial one that makes it useful.