As described at [0], many parts of the atomic write specification are
lacking.
For now, there is nothing which we can do in software about the lack of
a dedicated NVMe write atomic command.
As for reading the atomic write limits, it is felt that the per-namespace
values are mostly properly specified and it is assumed that they are
properly implemented.
The specification of NAWUPF is quite clear. However, the specification
of NABSPF is less so: it is unclear whether NABSPF applies when NSABP
is 0. Here it is assumed that NABSPF does not apply when NSABP is 0.
As for the per-controller AWUPF, the specification does not say how this
value applies to shared namespaces. Furthermore, the value is given in
logical blocks, which is a per-namespace quantity.
It would be preferable to stop honouring AWUPF altogether, but this may
needlessly disable atomic write support for many "good" devices which
only specify AWUPF. Currently, all validation of controller-related
atomic limits is dropped.
As a compromise, allow users of such devices to use atomic writes at
their own risk by adding a per-subsystem sysfs file to enable reading of
AWUPF.
A udev rule like the following could be used to auto-enable:
SUBSYSTEM=="nvme-subsystem", ACTION=="add", RUN+="/bin/sh -c 'echo 1 > /sys/$devpath/use_awupf'"
Such a rule would need to run before any filesystem is mounted on the
device.
Having a per-controller file could also work as an alternative, but it
is unlikely that a system will have a mix of controllers which we would
want to selectively enable reading AWUPF.
Note that AWUPF not only affects atomic write support, but also the
physical block size reported for the device.
[0] https://lore.kernel.org/linux-nvme/20250707141834.GA30198@lst.de/
Signed-off-by: John Garry <john.g.garry@oracle.com>
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 812c1565114fd..e8c78e000d9d7 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2040,7 +2040,7 @@ static u32 nvme_configure_atomic_write(struct nvme_ns *ns,
atomic_bs = (1 + le16_to_cpu(id->nawupf)) * bs;
if (id->nabspf)
boundary = (le16_to_cpu(id->nabspf) + 1) * bs;
- } else {
+ } else if (ns->ctrl->subsys->use_awupf) {
/*
* Use the controller wide atomic write unit. This sucks
* because the limit is defined in terms of logical blocks while
@@ -2049,6 +2049,8 @@ static u32 nvme_configure_atomic_write(struct nvme_ns *ns,
* values for different controllers in the subsystem.
*/
atomic_bs = (1 + ns->ctrl->subsys->awupf) * bs;
+ } else {
+ atomic_bs = bs;
}
lim->atomic_write_hw_max = atomic_bs;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index cfd2b5b90b915..d75a82851b684 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -447,6 +447,7 @@ struct nvme_subsystem {
#ifdef CONFIG_NVME_MULTIPATH
enum nvme_iopolicy iopolicy;
#endif
+ bool use_awupf;
};
/*
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index 29430949ce2f0..ca2e4b5937289 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -894,6 +894,75 @@ static ssize_t nvme_subsys_show_type(struct device *dev,
}
static SUBSYS_ATTR_RO(subsystype, S_IRUGO, nvme_subsys_show_type);
+static ssize_t nvme_subsys_use_awupf_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct nvme_subsystem *subsys =
+ container_of(dev, struct nvme_subsystem, dev);
+
+ return sysfs_emit(buf, "%d\n", subsys->use_awupf);
+}
+
+static ssize_t nvme_subsys_use_awupf_store(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t count)
+{
+ struct nvme_subsystem *subsys =
+ container_of(dev, struct nvme_subsystem, dev);
+ struct nvme_ns_head *h;
+ struct nvme_ctrl *tmp;
+ bool enabled;
+ int err;
+
+ err = kstrtobool(buf, &enabled);
+ if (err)
+ return -EINVAL;
+
+ mutex_lock(&nvme_subsystems_lock);
+ if (subsys->use_awupf == enabled)
+ goto out_unlock;
+
+ list_for_each_entry(h, &subsys->nsheads, entry) {
+ struct queue_limits lim;
+ unsigned int memflags;
+
+ if (!nvme_ns_head_multipath(h))
+ continue;
+
+ /*
+ * The head NS queue atomic limits are stacked. We need to zero
+ * to stack all the limits per-controller afresh.
+ */
+ lim = queue_limits_start_update(h->disk->queue);
+ memflags = blk_mq_freeze_queue(h->disk->queue);
+
+ lim.atomic_write_hw_max = 0;
+ lim.atomic_write_hw_unit_min = 0;
+ lim.atomic_write_hw_unit_max = 0;
+ lim.atomic_write_hw_boundary = 0;
+
+ err = queue_limits_commit_update(h->disk->queue, &lim);
+ blk_mq_unfreeze_queue(h->disk->queue, memflags);
+ if (err) {
+ count = err;
+ goto out_unlock;
+ }
+ }
+
+ list_for_each_entry(tmp, &subsys->ctrls, subsys_entry)
+ nvme_queue_scan(tmp);
+
+ subsys->use_awupf = enabled;
+out_unlock:
+ mutex_unlock(&nvme_subsystems_lock);
+
+ return count;
+}
+
+static struct device_attribute subsys_attr_use_awupf = \
+ __ATTR(use_awupf, S_IRUGO | S_IWUSR, \
+ nvme_subsys_use_awupf_show, nvme_subsys_use_awupf_store);
+
#define nvme_subsys_show_str_function(field) \
static ssize_t subsys_##field##_show(struct device *dev, \
struct device_attribute *attr, char *buf) \
@@ -918,6 +987,7 @@ static struct attribute *nvme_subsys_attrs[] = {
#ifdef CONFIG_NVME_MULTIPATH
&subsys_attr_iopolicy.attr,
#endif
+ &subsys_attr_use_awupf.attr,
NULL,
};
--
2.43.5
On Wed, Aug 20, 2025 at 03:02:20PM +0000, John Garry wrote:
> As described at [0], many parts of the atomic write specification are
> lacking.

I like your British understatement.

> +	list_for_each_entry(tmp, &subsys->ctrls, subsys_entry)
> +		nvme_queue_scan(tmp);

Queueing a full rescan here seems expensive. What about just keeping
the awupf value in our internal data structures and always using it
for the physical block size calculation, but only applying it to the
atomic limits based on a flag?
On 21/08/2025 09:35, Christoph Hellwig wrote:
>> +	list_for_each_entry(tmp, &subsys->ctrls, subsys_entry)
>> +		nvme_queue_scan(tmp);
> queueing a full rescan here seems expensive. What about just keeping
> the awupf value in our internal data structures and always using it
> for the physical block size calculation,

Is that even OK? Physical block size implies atomicity also.

> but only applying it to the
> atomic limits based on a flag?

We have the flag BLK_FEAT_ATOMIC_WRITES, which could be used, but in
blk_validate_atomic_write_limits() we zero the hw limits if that flag
is unset. So we still need to fill in the hw limits, like in this
patch.
On Wed, Aug 20, 2025 at 03:02:20PM +0000, John Garry wrote:
> It would be preferred to stop honouring AWUPF altogether, but this may
> needlessly disable atomic write support for many "good" devices which
> only specify AWUPF. Currently all validation of controller-related
> atomics limits is dropped.

These "good" devices that only report AWUPF, is there some set of
characteristics that generally applies to all of them? I tried to list
out conditions for when I think the value could be counted on here:

  https://lore.kernel.org/linux-nvme/aGvuRS8VmC0JXAR3@kbusch-mbp/

I just don't know if you know of any devices where those criteria don't
fit. If not, maybe we can work with that without introducing more user
knobs.
On 20/08/2025 22:51, Keith Busch wrote:
> On Wed, Aug 20, 2025 at 03:02:20PM +0000, John Garry wrote:
>> It would be preferred to stop honouring AWUPF altogether, but this may
>> needlessly disable atomic write support for many "good" devices which
>> only specify AWUPF. Currently all validation of controller-related
>> atomics limits is dropped.
>
> These "good" devices that only report AWUPF, is there some set of
> characteristics that generally applies to all of them? I tried to list
> out conditions for when I think the value could be counted on here:
>
> https://lore.kernel.org/linux-nvme/aGvuRS8VmC0JXAR3@kbusch-mbp/
>
> I just don't know if you know of any devices where those criteria don't
> fit. If not, maybe we can work with that without introducing more user
> knobs.

About the rules:

1. CMIC == 0; and
2. OACS.NMS == 0; and
3. a. FNA.FNS == 1; or
   b. NN == 1

I have access to two controllers; both set OACS.NMS and neither sets
FNA.FNS. I wonder how often these rules would pass in practice for them
to be useful. Requiring a single namespace is quite limiting as well.

Thanks,
John
On Wed, Aug 20, 2025 at 03:51:02PM -0600, Keith Busch wrote:
> On Wed, Aug 20, 2025 at 03:02:20PM +0000, John Garry wrote:
>> It would be preferred to stop honouring AWUPF altogether, but this may
>> needlessly disable atomic write support for many "good" devices which
>> only specify AWUPF. Currently all validation of controller-related
>> atomics limits is dropped.
>
> These "good" devices that only report AWUPF, is there some set of
> characteristics that generally applies to all of them? I tried to list
> out conditions for when I think the value could be counted on here:
>
> https://lore.kernel.org/linux-nvme/aGvuRS8VmC0JXAR3@kbusch-mbp/
>
> I just don't know if you know of any devices where those criteria don't
> fit. If not, maybe we can work with that without introducing more user
> knobs.

Given how broken the field is, I'd rather not make any general
exception. If you want to use a device claiming that, you use the
override and you are responsible.

NAWUPF was added in NVMe 1.2, and the brokenness of AWUPF should have
been obvious to firmware engineers ever since they had to deal with
multiple LBA formats. There's really no excuse for not supporting the
sane version, which is a trivial firmware addition.
On 8/21/25 3:21 AM, Keith Busch wrote:
> On Wed, Aug 20, 2025 at 03:02:20PM +0000, John Garry wrote:
>> It would be preferred to stop honouring AWUPF altogether, but this may
>> needlessly disable atomic write support for many "good" devices which
>> only specify AWUPF. Currently all validation of controller-related
>> atomics limits is dropped.
>
> These "good" devices that only report AWUPF, is there some set of
> characteristics that generally applies to all of them?

Yes, I know of such "good" devices. Typically, they report a consistent
AWUPF value across all controllers, and that value does not change when
the namespaces attached to those controllers are reformatted. As we
know, though, validating this in code is difficult. A past attempt to
implement such validation failed because there are disks in the field
that do change the AWUPF value when reformatting namespaces.

> I tried to list out conditions for when I think the value could be
> counted on here:
>
> https://lore.kernel.org/linux-nvme/aGvuRS8VmC0JXAR3@kbusch-mbp/

Regarding the checks you listed, they seem too restrictive. They only
support AWUPF for single-controller devices and for disks that either
have only one namespace or support formatting all namespaces together
(i.e., do not support formatting individual namespaces).

> I just don't know if you know of any devices where those criteria don't
> fit. If not, maybe we can work with that without introducing more user
> knobs.

Given this, IMO, an opt-in approach should be considered, letting users
decide whether they want to enable atomic write support or not.

Thanks,
--Nilay
On Thu, Aug 21, 2025 at 10:55:31AM +0530, Nilay Shroff wrote:
> On 8/21/25 3:21 AM, Keith Busch wrote:
>> I tried to list out conditions for when I think the value could be
>> counted on here:
>>
>> https://lore.kernel.org/linux-nvme/aGvuRS8VmC0JXAR3@kbusch-mbp/
>
> Regarding the checks you listed, they seem too restrictive. They only
> support AWUPF for single-controller devices and for disks that either
> have only one namespace or support formatting all namespaces together
> (i.e., do not support formatting individual namespaces).

Being restrictive was the point. It constrains to 1.0-type behavior,
where things were much simpler. I am just shocked vendors went through
the trouble to implement the complicated features but couldn't be
bothered to do the trivial one that makes it useful.