[RFC PATCH -next v3 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features

Zhang Yi posted 10 patches 9 months ago
There is a newer version of this series
Posted by Zhang Yi 9 months ago
From: Zhang Yi <yi.zhang@huawei.com>

Currently, disks primarily implement the write zeroes command (aka
REQ_OP_WRITE_ZEROES) through two mechanisms: the first involves
physically writing zeros to the disk media (e.g., HDDs), while the
second performs an unmap operation on the logical blocks, effectively
putting them into a deallocated state (e.g., SSDs). The first method is
generally slow, while the second method is typically very fast.

For example, on certain NVMe SSDs that support NVME_NS_DEAC, submitting
REQ_OP_WRITE_ZEROES requests with the NVME_WZ_DEAC bit can accelerate
the write zeroes operation by placing disk blocks into a deallocated
state. However, it is difficult to ascertain whether the storage device
supports unmap write zeroes. We cannot determine this solely by querying
bdev_limits(bdev)->max_write_zeroes_sectors.

Therefore, add a new queue limit feature, BLK_FEAT_WRITE_ZEROES_UNMAP,
and the corresponding sysfs entry, to indicate whether the block device
explicitly supports the unmap write zeroes command. Each device
driver should set this bit if it is certain that the attached disk
supports this command. If the bit is not set, the disk either does not
support it, or its support status is unknown.

For stacked devices, BLK_FEAT_WRITE_ZEROES_UNMAP should be
supported by both the stacking driver and all underlying devices.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 Documentation/ABI/stable/sysfs-block | 14 ++++++++++++++
 block/blk-settings.c                 |  6 ++++++
 block/blk-sysfs.c                    |  3 +++
 include/linux/blkdev.h               |  3 +++
 4 files changed, 26 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 890cde28bf90..67513c0d9233 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -742,6 +742,20 @@ Description:
 		0, write zeroes is not supported by the device.
 
 
+What:		/sys/block/<disk>/queue/write_zeroes_unmap
+Date:		January 2025
+Contact:	Zhang Yi <yi.zhang@huawei.com>
+Description:
+		[RO] Indicates whether the device explicitly supports the unmap
+		write zeroes operation, in which a single write zeroes request
+		with the unmap bit set zeroes out a range of contiguous blocks
+		by freeing the blocks rather than writing physical zeroes to
+		the media. If write_zeroes_unmap is 1, the device explicitly
+		supports the unmap write zeroes command. Otherwise, the device
+		either does not support it, or its support status is
+		unknown.
+
+
 What:		/sys/block/<disk>/queue/zone_append_max_bytes
 Date:		May 2020
 Contact:	linux-block@vger.kernel.org
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 6b2dbe645d23..3331d07bd5d9 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -697,6 +697,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 		t->features &= ~BLK_FEAT_NOWAIT;
 	if (!(b->features & BLK_FEAT_POLL))
 		t->features &= ~BLK_FEAT_POLL;
+	if (!(b->features & BLK_FEAT_WRITE_ZEROES_UNMAP))
+		t->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
 
 	t->flags |= (b->flags & BLK_FLAG_MISALIGNED);
 
@@ -819,6 +821,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 		t->zone_write_granularity = 0;
 		t->max_zone_append_sectors = 0;
 	}
+
+	if (!t->max_write_zeroes_sectors)
+		t->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
+
 	blk_stack_atomic_writes_limits(t, b, start);
 
 	return ret;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index d584461a1d84..6f00e9a8f8b6 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -261,6 +261,7 @@ static ssize_t queue_##_name##_show(struct gendisk *disk, char *page)	\
 
 QUEUE_SYSFS_FEATURE_SHOW(fua, BLK_FEAT_FUA);
 QUEUE_SYSFS_FEATURE_SHOW(dax, BLK_FEAT_DAX);
+QUEUE_SYSFS_FEATURE_SHOW(write_zeroes_unmap, BLK_FEAT_WRITE_ZEROES_UNMAP);
 
 static ssize_t queue_poll_show(struct gendisk *disk, char *page)
 {
@@ -510,6 +511,7 @@ QUEUE_LIM_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min_bytes");
 
 QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
 QUEUE_LIM_RO_ENTRY(queue_max_write_zeroes_sectors, "write_zeroes_max_bytes");
+QUEUE_LIM_RO_ENTRY(queue_write_zeroes_unmap, "write_zeroes_unmap");
 QUEUE_LIM_RO_ENTRY(queue_max_zone_append_sectors, "zone_append_max_bytes");
 QUEUE_LIM_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
 
@@ -656,6 +658,7 @@ static struct attribute *queue_attrs[] = {
 	&queue_atomic_write_unit_min_entry.attr,
 	&queue_atomic_write_unit_max_entry.attr,
 	&queue_max_write_zeroes_sectors_entry.attr,
+	&queue_write_zeroes_unmap_entry.attr,
 	&queue_max_zone_append_sectors_entry.attr,
 	&queue_zone_write_granularity_entry.attr,
 	&queue_rotational_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e39c45bc0a97..5d280c7fba65 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -342,6 +342,9 @@ typedef unsigned int __bitwise blk_features_t;
 #define BLK_FEAT_ATOMIC_WRITES \
 	((__force blk_features_t)(1u << 16))
 
+/* supports unmap write zeroes command */
+#define BLK_FEAT_WRITE_ZEROES_UNMAP	((__force blk_features_t)(1u << 17))
+
 /*
  * Flags automatically inherited when stacking limits.
  */
-- 
2.46.1
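
The stacking rule added to blk_stack_limits() above can be modelled outside the kernel. The following is a minimal user-space sketch, not the kernel code: struct limits_model, FEAT_WRITE_ZEROES_UNMAP and stack_write_zeroes_unmap() are hypothetical stand-ins, and only the bit position (1u << 17) and the two clearing conditions are taken from the patch. The handling of max_write_zeroes_sectors is simplified to a plain minimum.

```c
#include <stdint.h>

/* Hypothetical stand-in for BLK_FEAT_WRITE_ZEROES_UNMAP (bit 17 in the patch). */
#define FEAT_WRITE_ZEROES_UNMAP (1u << 17)

/* Hypothetical, heavily reduced model of struct queue_limits. */
struct limits_model {
	uint32_t features;
	uint32_t max_write_zeroes_sectors;
};

/* Fold the bottom device's limits (b) into the stacked device's limits (t). */
static void stack_write_zeroes_unmap(struct limits_model *t,
				     const struct limits_model *b)
{
	/* The feature survives only if the bottom device also advertises it. */
	if (!(b->features & FEAT_WRITE_ZEROES_UNMAP))
		t->features &= ~FEAT_WRITE_ZEROES_UNMAP;

	/* Simplified assumption: the stacked limit is the smaller of the two. */
	if (b->max_write_zeroes_sectors < t->max_write_zeroes_sectors)
		t->max_write_zeroes_sectors = b->max_write_zeroes_sectors;

	/* Without any write zeroes support, the unmap variant cannot apply. */
	if (!t->max_write_zeroes_sectors)
		t->features &= ~FEAT_WRITE_ZEROES_UNMAP;
}
```

Calling stack_write_zeroes_unmap() once per underlying device means a single member without the feature, or with a zero write zeroes limit, is enough to clear the bit on the stacked device.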
Re: [RFC PATCH -next v3 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
Posted by Christoph Hellwig 8 months, 1 week ago
On Tue, Mar 18, 2025 at 03:35:36PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Currently, disks primarily implement the write zeroes command (aka
> REQ_OP_WRITE_ZEROES) through two mechanisms: the first involves
> physically writing zeros to the disk media (e.g., HDDs), while the
> second performs an unmap operation on the logical blocks, effectively
> putting them into a deallocated state (e.g., SSDs). The first method is
> generally slow, while the second method is typically very fast.
> 
> For example, on certain NVMe SSDs that support NVME_NS_DEAC, submitting
> REQ_OP_WRITE_ZEROES requests with the NVME_WZ_DEAC bit can accelerate
> the write zeros operation by placing disk blocks into

Note that this is a can, not a must.  The NVMe definition of Write
Zeroes is unfortunately pretty stupid.

> +		[RO] Devices that explicitly support the unmap write zeroes
> +		operation in which a single write zeroes request with the unmap
> +		bit set to zero out the range of contiguous blocks on storage
> +		by freeing blocks, rather than writing physical zeroes to the
> +		media.

This is not actually guaranteed for nvme or scsi.
Re: [RFC PATCH -next v3 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
Posted by Zhang Yi 8 months, 1 week ago
On 2025/4/9 18:31, Christoph Hellwig wrote:
> On Tue, Mar 18, 2025 at 03:35:36PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> Currently, disks primarily implement the write zeroes command (aka
>> REQ_OP_WRITE_ZEROES) through two mechanisms: the first involves
>> physically writing zeros to the disk media (e.g., HDDs), while the
>> second performs an unmap operation on the logical blocks, effectively
>> putting them into a deallocated state (e.g., SSDs). The first method is
>> generally slow, while the second method is typically very fast.
>>
>> For example, on certain NVMe SSDs that support NVME_NS_DEAC, submitting
>> REQ_OP_WRITE_ZEROES requests with the NVME_WZ_DEAC bit can accelerate
>> the write zeros operation by placing disk blocks into
> 
> Note that this is a can, not a must.  The NVMe definition of Write
> Zeroes is unfortunately pretty stupid.
> 
>> +		[RO] Devices that explicitly support the unmap write zeroes
>> +		operation in which a single write zeroes request with the unmap
>> +		bit set to zero out the range of contiguous blocks on storage
>> +		by freeing blocks, rather than writing physical zeroes to the
>> +		media.
> 
> This is not actually guaranteed for nvme or scsi.

Thank you for your review and comments. However, I'm not sure I fully
understand your points. Could you please provide more details?

AFAIK, the NVMe protocol has the following description in the latest
NVM Command Set Specification Figure 82 and Figure 114:

===
Deallocate (DEAC): If this bit is set to ‘1’, then the host is
requesting that the controller deallocate the specified logical blocks.
If this bit is cleared to ‘0’, then the host is not requesting that
the controller deallocate the specified logical blocks...

DLFEAT:
Write Zeroes Deallocation Support (WZDS): If this bit is set to ‘1’,
then the controller supports the Deallocate bit in the Write Zeroes
command for this namespace...
Deallocation Read Behavior (DRB): This field indicates the deallocated
logical block read behavior. For a logical block that is deallocated,
this field indicates the values read from that deallocated logical block
and its metadata (excluding protection information)...

  Value  Definition
  001b   A deallocated logical block returns all bytes cleared to 0h
===

At the same time, the current kernel decides whether to set the
unmap bit when submitting the write zeroes command based on the above
protocol. So I think the rules should be clear now.

Are you saying that what is described in this protocol is not a
mandatory requirement? That is, disks that claim to support the
UNMAP write zeroes command (WZDS=1, DRB=1) may in fact still
write actual zeroes to the storage media? Or are you referring
to some irregular disks that do not obey the protocol and mislead
users?

Thanks,
Yi.
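
The WZDS/DRB condition described in the message above can be sketched as a small standalone check. This is an illustration only, not the NVMe driver code: the macro names and dlfeat_allows_deac() are hypothetical, and the bit layout (DRB in bits 2:0 of DLFEAT, WZDS in bit 3) is assumed from the NVM Command Set Specification figures quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed DLFEAT layout; the names are hypothetical. */
#define DLFEAT_DRB_MASK   0x7u       /* Deallocation Read Behavior, bits 2:0 */
#define DLFEAT_DRB_ZEROES 0x1u       /* 001b: deallocated blocks read as 0h */
#define DLFEAT_WZDS       (1u << 3)  /* Write Zeroes Deallocation Support */

/*
 * True if the namespace reports both that Write Zeroes honours the
 * Deallocate bit (WZDS=1) and that deallocated blocks read back as
 * zeroes (DRB=001b). As the replies note, this only says the request
 * may be made; it does not guarantee the controller deallocates.
 */
static bool dlfeat_allows_deac(uint8_t dlfeat)
{
	return (dlfeat & DLFEAT_DRB_MASK) == DLFEAT_DRB_ZEROES &&
	       (dlfeat & DLFEAT_WZDS) != 0;
}
```

For example, a DLFEAT value of 0x09 (WZDS=1, DRB=001b) passes the check, while 0x01 (DRB only) and 0x08 (WZDS only) do not.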

Re: [RFC PATCH -next v3 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
Posted by Christoph Hellwig 8 months, 1 week ago
On Thu, Apr 10, 2025 at 11:52:17AM +0800, Zhang Yi wrote:
> 
> Thank you for your review and comments. However, I'm not sure I fully
> understand your points. Could you please provide more details?
> 
> AFAIK, the NVMe protocol has the following description in the latest
> NVM Command Set Specification Figure 82 and Figure 114:
> 
> ===
> Deallocate (DEAC): If this bit is set to ‘1’, then the host is
> requesting that the controller deallocate the specified logical blocks.
> If this bit is cleared to ‘0’, then the host is not requesting that
> the controller deallocate the specified logical blocks...
> 
> DLFEAT:
> Write Zeroes Deallocation Support (WZDS): If this bit is set to ‘1’,
> then the controller supports the Deallocate bit in the Write Zeroes
> command for this namespace...

Yes.  The host is requesting, not the controller shall.  It's not
guaranteed behavior and the controller might as well actually write
zeroes to the media.  That is rather stupid, but still.

Also note that some write zeroes implementations in consumer devices
are really slow even when deallocation is requested so that we had
to blacklist them.

> Were you saying that what is described in this protocol is not a
> mandatory requirement? Which means the disks that claiming to support
> the UNMAP write zeroes command(WZDS=1,DRB=1), but in fact, they still
> write actual zeroes data to the storage media? Or were you referring
> to some irregular disks that do not obey the protocol and mislead
> users?

They are at least allowed to.

Re: [RFC PATCH -next v3 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
Posted by Zhang Yi 8 months, 1 week ago
On 2025/4/10 15:15, Christoph Hellwig wrote:
> On Thu, Apr 10, 2025 at 11:52:17AM +0800, Zhang Yi wrote:
>>
>> Thank you for your review and comments. However, I'm not sure I fully
>> understand your points. Could you please provide more details?
>>
>> AFAIK, the NVMe protocol has the following description in the latest
>> NVM Command Set Specification Figure 82 and Figure 114:
>>
>> ===
>> Deallocate (DEAC): If this bit is set to ‘1’, then the host is
>> requesting that the controller deallocate the specified logical blocks.
>> If this bit is cleared to ‘0’, then the host is not requesting that
>> the controller deallocate the specified logical blocks...
>>
>> DLFEAT:
>> Write Zeroes Deallocation Support (WZDS): If this bit is set to ‘1’,
>> then the controller supports the Deallocate bit in the Write Zeroes
>> command for this namespace...
> 
> Yes.  The host is requesting, not the controller shall.  It's not
> guaranteed behavior and the controller might as well actually write
> zeroes to the media.  That is rather stupid, but still.

IIUC, DEAC is requested by the host, but the WZDS and DRB bits in
DLFEAT are returned by the controller (no?). The host only issues
a DEAC request when both WZDS and DRB are satisfied. So I think that
if the disk controller returns WZDS=1 and DRB=1, the kernel can only
trust it according to the protocol and set the
BLK_FEAT_WRITE_ZEROES_UNMAP flag; the kernel cannot, and does not
need to, identify those irregular disks.

> 
> Also note that some write zeroes implementations in consumer devices
> are really slow even when deallocation is requested so that we had
> to blacklist them.

Yes, indeed. For now, the kernel can only detect this through protocol
specifications, and there seems to be no better way to distinguish
the specific behavior of the disk. Perhaps we should emphasize in the
documentation that this write_zeroes_unmap flag is not equivalent to
disk support for 'fast' write zeroes.

Thanks.
Yi.

Re: [RFC PATCH -next v3 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
Posted by Keith Busch 8 months, 1 week ago
On Thu, Apr 10, 2025 at 09:15:59AM +0200, Christoph Hellwig wrote:
> On Thu, Apr 10, 2025 at 11:52:17AM +0800, Zhang Yi wrote:
> > 
> > Thank you for your review and comments. However, I'm not sure I fully
> > understand your points. Could you please provide more details?
> > 
> > AFAIK, the NVMe protocol has the following description in the latest
> > NVM Command Set Specification Figure 82 and Figure 114:
> > 
> > ===
> > Deallocate (DEAC): If this bit is set to ‘1’, then the host is
> > requesting that the controller deallocate the specified logical blocks.
> > If this bit is cleared to ‘0’, then the host is not requesting that
> > the controller deallocate the specified logical blocks...
> > 
> > DLFEAT:
> > Write Zeroes Deallocation Support (WZDS): If this bit is set to ‘1’,
> > then the controller supports the Deallocate bit in the Write Zeroes
> > command for this namespace...
> 
> Yes.  The host is requesting, not the controller shall.  It's not
> guaranteed behavior and the controller might as well actually write
> zeroes to the media.  That is rather stupid, but still.

I guess some controllers _really_ want specific alignments to
successfully do a proper discard. While still not guaranteed in spec, I
think it is safe to assume a proper deallocation will occur if you align
to NPDA and NPDG. Otherwise, the controller may do a read-modify-write
to ensure zeroes are returned for the requested LBA range on anything
that straddles an implementation specific boundary.
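
The alignment condition mentioned above could be checked along these lines. This is a hedged sketch under stated assumptions: range_is_dealloc_aligned() is a hypothetical helper, and its align_blocks and gran_blocks parameters are assumed to be the NPDA and NPDG values already converted from their 0's based encoding into a count of logical blocks. The thread does not spell out the exact rule, so "start on an alignment boundary, length a multiple of the granularity" is one plausible reading.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if the range [slba, slba + nlb) starts on a
 * preferred deallocate alignment boundary and covers a whole number of
 * preferred deallocate granules. All values are in logical blocks.
 */
static bool range_is_dealloc_aligned(uint64_t slba, uint64_t nlb,
				     uint32_t align_blocks,
				     uint32_t gran_blocks)
{
	/* Reject empty ranges and unset hints instead of dividing by zero. */
	if (!nlb || !align_blocks || !gran_blocks)
		return false;

	return slba % align_blocks == 0 && nlb % gran_blocks == 0;
}
```

A range that fails this check may still be zeroed correctly; per the message above, the controller may simply fall back to a read-modify-write for the parts that straddle a boundary.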
Re: [RFC PATCH -next v3 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
Posted by Zhang Yi 8 months, 1 week ago
On 2025/4/10 16:20, Keith Busch wrote:
> On Thu, Apr 10, 2025 at 09:15:59AM +0200, Christoph Hellwig wrote:
>> On Thu, Apr 10, 2025 at 11:52:17AM +0800, Zhang Yi wrote:
>>>
>>> Thank you for your review and comments. However, I'm not sure I fully
>>> understand your points. Could you please provide more details?
>>>
>>> AFAIK, the NVMe protocol has the following description in the latest
>>> NVM Command Set Specification Figure 82 and Figure 114:
>>>
>>> ===
>>> Deallocate (DEAC): If this bit is set to ‘1’, then the host is
>>> requesting that the controller deallocate the specified logical blocks.
>>> If this bit is cleared to ‘0’, then the host is not requesting that
>>> the controller deallocate the specified logical blocks...
>>>
>>> DLFEAT:
>>> Write Zeroes Deallocation Support (WZDS): If this bit is set to ‘1’,
>>> then the controller supports the Deallocate bit in the Write Zeroes
>>> command for this namespace...
>>
>> Yes.  The host is requesting, not the controller shall.  It's not
>> guaranteed behavior and the controller might as well actually write
>> zeroes to the media.  That is rather stupid, but still.
> 
> I guess some controllers _really_ want specific alignments to
> successfully do a proper discard. While still not guaranteed in spec, I
> think it is safe to assume a proper deallocation will occur if you align
> to NPDA and NPDG. Otherwise, the controller may do a read-modify-write
> to ensure zeroes are returned for the requested LBA range on anything
> that straddles an implementation specific boundary.
> 

I understand. A proper deallocation has certain constraints, but I
guess it should be useful for most scenarios. Thank you for
the explanation.

Thanks,
Yi.