[RFC PATCH v2 0/8] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag

Zhang Yi posted 8 patches 11 months ago
There is a newer version of this series
Documentation/ABI/stable/sysfs-block | 14 +++++++
block/blk-settings.c                 |  6 +++
block/blk-sysfs.c                    |  3 ++
block/fops.c                         | 37 +++++++++--------
drivers/md/dm-table.c                |  3 +-
drivers/nvme/host/core.c             | 21 +++++-----
drivers/scsi/sd.c                    |  5 +++
fs/ext4/extents.c                    | 59 ++++++++++++++++++++++------
fs/open.c                            |  1 +
include/linux/blkdev.h               |  3 ++
include/linux/falloc.h               |  3 +-
include/trace/events/ext4.h          |  3 +-
include/uapi/linux/falloc.h          | 18 +++++++++
13 files changed, 134 insertions(+), 42 deletions(-)
[RFC PATCH v2 0/8] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag
Posted by Zhang Yi 11 months ago
From: Zhang Yi <yi.zhang@huawei.com>

Changes since v1:
 - Switch to add a new write zeroes operation, FALLOC_FL_WRITE_ZEROES,
   in fallocate, instead of just adding a supported flag to
   FALLOC_FL_ZERO_RANGE.
 - Introduce a new flag BLK_FEAT_WRITE_ZEROES_UNMAP to the block
   device's queue limit features, and implement it on SCSI sd driver,
   NVMe SSD driver and dm driver.
 - Implement FALLOC_FL_WRITE_ZEROES on both the ext4 filesystem and
   block device (bdev).

v1: https://lore.kernel.org/linux-fsdevel/20241228014522.2395187-1-yi.zhang@huaweicloud.com/

Currently, we can use the fallocate command to quickly create a
pre-allocated file. However, on most filesystems, such as ext4 and XFS,
fallocate create pre-allocation blocks in an unwritten state, and the
FALLOC_FL_ZERO_RANGE flag also behaves similarly. The extent state must
be converted to a written state when the user writes data into this
range later, which can trigger numerous metadata changes and consequent
journal I/O. This may leads to significant write amplification and
performance degradation in synchronous write mode. Therefore, we need a
method to create a pre-allocated file with written extents that can be
used for pure overwriting. At the monent, the only method available is
to create an empty file and write zero data into it (for example, using
'dd' with a large block size). However, this method is slow and consumes
a considerable amount of disk bandwidth, we must pre-allocate files in
advance but cannot add pre-allocated files while user business services
are running.

Fortunately, with the development and more and more widely used of
flash-based storage devices, we can efficiently write zeros to SSDs
using the unmap write zeroes command if the devices do not write
physical zeroes to the media. For example, if SCSI SSDs support the
UMMAP bit or NVMe SSDs support the DEAC bit[1], the write zeroes command
does not write actual data to the device, instead, NVMe converts the
zeroed range to a deallocated state, which works fast and consumes
almost no disk write bandwidth. Consequently, this feature can provide
us with a faster method for creating pre-allocated files with written
extents and zeroed data.

This series aims to implement this by:
1. Introduce a new feature BLK_FEAT_WRITE_ZEROES_UNMAP to the block
   device queue limit features, which indicates whether the storage is
   device explicitly supports the unmapped write zeroes command. This
   flag should be set to 1 by the driver it the attached disk supports
   this command. Users can check this flag by querying:

       /sys/block/<disk>/queue/write_zeroes_unmap

2. Introduce a new flag FALLOC_FL_FORCE_ZERO into the fallocate,
   filesystems with this operaion should allocate written extents and
   issuing zeroes to the range of the device. If the device supports
   unmap write zeroes command, the zeroing can be accelerated, if not,
   we currently still allow to fall back to submit zeroes data. Users
   can verify if the device supports the unmap write zeroes command and
   then decide whether to use it.

I initially implemented the BLK_FEAT_WRITE_ZEROES_UNMAP flag for SCSI
and NVMe drivers, and I also added the FALLOC_FL_FORCE_ZERO flag for
ext4 and block devices. Any comments are welcome. Once the kernel
changes are finalized, I will do comprehensive tests, and update the
man page documentation, as well as the corresponding user-mode tools.

NOTE: this series is based on my ext4 fallocate refactor series[2] which
      hasn't been merged to the mainline yet.

I've briefly modified xfs_io and fallocate tool in util-linux[3], and
tested performance with this series on ext4 filesystem on my machine
with an Intel Xeon Gold 6248R CPU, a 7TB KCD61LUL7T68 NVMe SSD which
supports unmap write zeroes command with the Deallocated state and the
DEAC bit. Feel free to give it a try.

0. Ensure the NVMe device supports WRITE_ZERO command.

 $ cat /sys/block/nvme5n1/queue/write_zeroes_max_bytes
   8388608
 $ nvme id-ns -H /dev/nvme5n1 | grep -i -A 3 "dlfeat"
   dlfeat  : 25
   [4:4] : 0x1   Guard Field of Deallocated Logical Blocks is set to CRC
                 of The Value Read
   [3:3] : 0x1   Deallocate Bit in the Write Zeroes Command is Supported
   [2:0] : 0x1   Bytes Read From a Deallocated Logical Block and its
                 Metadata are 0x00

1. Compare 'dd' and fallocate with force zero range, the zero range is
   significantly faster than 'dd'.

 a) Create a 1GB zeroed file.
  $ dd if=/dev/zero of=foo bs=2M count=512 oflag=direct
    512+0 records in
    512+0 records out
    1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.504496 s, 2.1 GB/s

  $ time fallocate -Z -l 1G bar  # -Z is a new option to do actual zero
    real    0m0.171s
    user    0m0.001s
    sys     0m0.003s

 b) Create a 10GB zeroed file.
  $ dd if=/dev/zero of=foo bs=2M count=5120 oflag=direct  
    5120+0 records in
    5120+0 records out
    10737418240 bytes (11 GB, 10 GiB) copied, 5.04009 s, 2.1 GB/s

  $ time fallocate -Z -l 10G bar
    real    0m1.724s
    user    0m0.000s
    sys     0m0.024s

2. Run fio overwrite and fallocate with force zero range simultaneously,
   fallocate has little impact on write bandwidth and only slightly
   affects write latency.

 a) Test bandwidth costs.
  $ fio -directory=/test -direct=1 -iodepth=10 -fsync=0 -rw=write \
        -numjobs=10 -bs=2M -ioengine=libaio -size=20G -runtime=20 \
        -fallocate=none -overwrite=1 -group_reportin -name=bw_test

   Without background zero range:
    bw (MiB/s): min= 2068, max= 2280, per=100.00%, avg=2186.40

   With background zero range:
    bw (MiB/s): min= 2056, max= 2308, per=100.00%, avg=2186.20

 b) Test write latency costs.
  $ fio -filename=/test/foo -direct=1 -iodepth=1 -fsync=0 -rw=write \
        -numjobs=1 -bs=4k -ioengine=psync -size=5G -runtime=20 \
        -fallocate=none -overwrite=1 -group_reportin -name=lat_test

   Without background zero range:
   lat (nsec): min=9269, max=71635, avg=9840.65

   With a background zero range:
   lat (usec): min=9, max=982, avg=11.03

3. Compare overwriting in a pre-allocated unwritten file and a written
   file in O_DSYNC mode. Write to a file with written extents is much
   faster.

  # First mkfs and create a test file according to below three cases,
  # and then run fio.

  $ fio -filename=/test/foo -direct=1 -iodepth=1 -fdatasync=1 \
        -rw=write -numjobs=1 -bs=4k -ioengine=psync -size=5G \
        -runtime=20 -fallocate=none -group_reportin -name=test

   unwritten file:                 IOPS=20.1k, BW=78.7MiB/s
   unwritten file + fast_commit:   IOPS=42.9k, BW=167MiB/s
   written file:                   IOPS=98.8k, BW=386MiB/s

Thanks,
Yi.

---

[1] https://nvmexpress.org/specifications/
    NVM Command Set Specification, section 3.2.8
[2] https://lore.kernel.org/linux-ext4/20241220011637.1157197-1-yi.zhang@huaweicloud.com/
[3] Here is a simple support of xfs_io and fallocate tool in util-linux.
    Feel free to give it a try.

1. util-linux
diff --git a/sys-utils/fallocate.c b/sys-utils/fallocate.c
index ac7c687f2..a2bfa8d39 100644
--- a/sys-utils/fallocate.c
+++ b/sys-utils/fallocate.c
@@ -66,6 +66,10 @@
 # define FALLOC_FL_INSERT_RANGE		0x20
 #endif
 
+#ifndef FALLOC_FL_WRITE_ZEROES
+# define FALLOC_FL_WRITE_ZEROES		0x80
+#endif
+
 #include "nls.h"
 #include "strutils.h"
 #include "c.h"
@@ -95,6 +99,7 @@ static void __attribute__((__noreturn__)) usage(void)
 	fputs(_(" -o, --offset <num>   offset for range operations, in bytes\n"), out);
 	fputs(_(" -p, --punch-hole     replace a range with a hole (implies -n)\n"), out);
 	fputs(_(" -z, --zero-range     zero and ensure allocation of a range\n"), out);
+	fputs(_(" -w, --write-zeroes   write zeroes and ensure allocation of a range\n"), out);
 #ifdef HAVE_POSIX_FALLOCATE
 	fputs(_(" -x, --posix          use posix_fallocate(3) instead of fallocate(2)\n"), out);
 #endif
@@ -305,6 +310,7 @@ int main(int argc, char **argv)
 	    { "dig-holes",      no_argument,       NULL, 'd' },
 	    { "insert-range",   no_argument,       NULL, 'i' },
 	    { "zero-range",     no_argument,       NULL, 'z' },
+	    { "write-zeroes",   no_argument,       NULL, 'w' },
 	    { "offset",         required_argument, NULL, 'o' },
 	    { "length",         required_argument, NULL, 'l' },
 	    { "posix",          no_argument,       NULL, 'x' },
@@ -313,9 +319,10 @@ int main(int argc, char **argv)
 	};
 
 	static const ul_excl_t excl[] = {	/* rows and cols in ASCII order */
-		{ 'c', 'd', 'p', 'z' },
+		{ 'c', 'd', 'p', 'z', 'w' },
 		{ 'c', 'n' },
-		{ 'x', 'c', 'd', 'i', 'n', 'p', 'z'},
+		{ 'w', 'n' },
+		{ 'x', 'c', 'd', 'i', 'n', 'p', 'z', 'w'},
 		{ 0 }
 	};
 	int excl_st[ARRAY_SIZE(excl)] = UL_EXCL_STATUS_INIT;
@@ -325,7 +332,7 @@ int main(int argc, char **argv)
 	textdomain(PACKAGE);
 	close_stdout_atexit();
 
-	while ((c = getopt_long(argc, argv, "hvVncpdizxl:o:", longopts, NULL))
+	while ((c = getopt_long(argc, argv, "hvVncpdizwxl:o:", longopts, NULL))
 			!= -1) {
 
 		err_exclusive_options(c, longopts, excl, excl_st);
@@ -355,6 +362,9 @@ int main(int argc, char **argv)
 		case 'z':
 			mode |= FALLOC_FL_ZERO_RANGE;
 			break;
+		case 'w':
+			mode |= FALLOC_FL_WRITE_ZEROES;
+			break;
 		case 'x':
 #ifdef HAVE_POSIX_FALLOCATE
 			posix = 1;

2. xfs_io
diff --git a/io/prealloc.c b/io/prealloc.c
index 8e968c9f..96daf1a1 100644
--- a/io/prealloc.c
+++ b/io/prealloc.c
@@ -30,6 +30,10 @@
 #define FALLOC_FL_UNSHARE_RANGE 0x40
 #endif
 
+#ifndef FALLOC_FL_WRITE_ZEROES
+#define FALLOC_FL_WRITE_ZEROES 0x80
+#endif
+
 static cmdinfo_t allocsp_cmd;
 static cmdinfo_t freesp_cmd;
 static cmdinfo_t resvsp_cmd;
@@ -377,6 +381,28 @@ funshare_f(
 	return 0;
 }
 
+static int
+fwrite_zeroes_f(
+	int		argc,
+	char		**argv)
+{
+	xfs_flock64_t	segment;
+	int		mode = FALLOC_FL_WRITE_ZEROES;
+
+	if (!offset_length(argv[1], argv[2], &segment)) {
+		exitcode = 1;
+		return 0;
+	}
+
+	if (fallocate(file->fd, mode,
+			segment.l_start, segment.l_len)) {
+		perror("fallocate");
+		exitcode = 1;
+		return 0;
+	}
+	return 0;
+}
+
 void
 prealloc_init(void)
 {
@@ -489,4 +515,14 @@ prealloc_init(void)
 	funshare_cmd.oneline =
 	_("unshares shared blocks within the range");
 	add_command(&funshare_cmd);
+
+	funshare_cmd.name = "fwrite_zeroes";
+	funshare_cmd.cfunc = fwrite_zeroes_f;
+	funshare_cmd.argmin = 2;
+	funshare_cmd.argmax = 2;
+	funshare_cmd.flags = CMD_NOMAP_OK | CMD_FOREIGN_OK;
+	funshare_cmd.args = _("off len");
+	funshare_cmd.oneline =
+	_("zeroes space and eliminates holes by allocating and writing zeroes");
+	add_command(&funshare_cmd);
 }


Zhang Yi (8):
  block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
  nvme: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports DEAC bit
  scsi: sd: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports unmap
    zeroing mode
  dm: add BLK_FEAT_WRITE_ZEROES_UNMAP support
  fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate
  block: add FALLOC_FL_WRITE_ZEROES support
  block: factor out common part in blkdev_fallocate()
  ext4: add FALLOC_FL_WRITE_ZEROES support

 Documentation/ABI/stable/sysfs-block | 14 +++++++
 block/blk-settings.c                 |  6 +++
 block/blk-sysfs.c                    |  3 ++
 block/fops.c                         | 37 +++++++++--------
 drivers/md/dm-table.c                |  3 +-
 drivers/nvme/host/core.c             | 21 +++++-----
 drivers/scsi/sd.c                    |  5 +++
 fs/ext4/extents.c                    | 59 ++++++++++++++++++++++------
 fs/open.c                            |  1 +
 include/linux/blkdev.h               |  3 ++
 include/linux/falloc.h               |  3 +-
 include/trace/events/ext4.h          |  3 +-
 include/uapi/linux/falloc.h          | 18 +++++++++
 13 files changed, 134 insertions(+), 42 deletions(-)

-- 
2.39.2
Re: [RFC PATCH v2 0/8] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag
Posted by Chaitanya Kulkarni 11 months ago
On 1/15/25 03:46, Zhang Yi wrote:
> Currently, we can use the fallocate command to quickly create a
> pre-allocated file. However, on most filesystems, such as ext4 and XFS,
> fallocate create pre-allocation blocks in an unwritten state, and the
> FALLOC_FL_ZERO_RANGE flag also behaves similarly. The extent state must
> be converted to a written state when the user writes data into this
> range later, which can trigger numerous metadata changes and consequent
> journal I/O. This may leads to significant write amplification and
> performance degradation in synchronous write mode. Therefore, we need a
> method to create a pre-allocated file with written extents that can be
> used for pure overwriting. At the monent, the only method available is
> to create an empty file and write zero data into it (for example, using
> 'dd' with a large block size). However, this method is slow and consumes
> a considerable amount of disk bandwidth, we must pre-allocate files in
> advance but cannot add pre-allocated files while user business services
> are running.

it will be very useful if we can get some blktests for scsi/nvme/dm.
Please note that this not a blocker to get this path series to be merged,
but this will help everyone including regular tests runs we do to ensure
the stability of new interface.

if you do please CC and Shinichiro (added to CC list) to we can help those
tests review and potentially also can provide tested by tag tht can help
this work to move forward.

-ck


Re: [RFC PATCH v2 0/8] fallocate: introduce FALLOC_FL_WRITE_ZEROES flag
Posted by Zhang Yi 11 months ago
On 2025/1/16 5:07, Chaitanya Kulkarni wrote:
> On 1/15/25 03:46, Zhang Yi wrote:
>> Currently, we can use the fallocate command to quickly create a
>> pre-allocated file. However, on most filesystems, such as ext4 and XFS,
>> fallocate create pre-allocation blocks in an unwritten state, and the
>> FALLOC_FL_ZERO_RANGE flag also behaves similarly. The extent state must
>> be converted to a written state when the user writes data into this
>> range later, which can trigger numerous metadata changes and consequent
>> journal I/O. This may leads to significant write amplification and
>> performance degradation in synchronous write mode. Therefore, we need a
>> method to create a pre-allocated file with written extents that can be
>> used for pure overwriting. At the monent, the only method available is
>> to create an empty file and write zero data into it (for example, using
>> 'dd' with a large block size). However, this method is slow and consumes
>> a considerable amount of disk bandwidth, we must pre-allocate files in
>> advance but cannot add pre-allocated files while user business services
>> are running.
> 
> it will be very useful if we can get some blktests for scsi/nvme/dm.
> Please note that this not a blocker to get this path series to be merged,
> but this will help everyone including regular tests runs we do to ensure
> the stability of new interface.

Hello, Chaitanya,

Thanks for your feedback! Yeah, the proposal for this series is still under
discussion, I will add counterpart tests to both blktests and fstests once
the solution is determined.

> 
> if you do please CC and Shinichiro (added to CC list) to we can help those
> tests review and potentially also can provide tested by tag tht can help
> this work to move forward.
> 
Sure, this will be very helpful.

Thanks,
Yi.
[RFC PATCH v2 1/8] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
Posted by Zhang Yi 11 months ago
From: Zhang Yi <yi.zhang@huawei.com>

Currently, it's hard to know whether the storage device supports unmap
write zeroes. We cannot determine it only by checking if the disk
supports the write zeroes command, as for some HDDs that do submit
actual zeros to the disk media even if they claim to support the write
zeroes command, but that should be very slow.

Therefor, add a new queue limit feature, BLK_FEAT_WRITE_ZEROES_UNMAP and
the corresponding sysfs entry, to indicate whether the block device
explicitly supports the unmapped write zeroes command. Each device
driver should set this bit if it is certain that the attached disk
supports this command. If the bit is not set, the disk either does not
support it, or its support status is unknown.

For the stacked devices cases, the BLK_FEAT_WRITE_ZEROES_UNMAP should be
supported both by the stacking driver and all underlying devices.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 Documentation/ABI/stable/sysfs-block | 14 ++++++++++++++
 block/blk-settings.c                 |  6 ++++++
 block/blk-sysfs.c                    |  3 +++
 include/linux/blkdev.h               |  3 +++
 4 files changed, 26 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 0cceb2badc83..ab4117cefd9a 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -722,6 +722,20 @@ Description:
 		0, write zeroes is not supported by the device.
 
 
+What:		/sys/block/<disk>/queue/write_zeroes_unmap
+Date:		January 2025
+Contact:	Zhang Yi <yi.zhang@huawei.com>
+Description:
+		[RO] Devices that explicitly support the unmap write zeroes
+		operation in which a single write zeroes request with the unmap
+		bit set to zero out the range of contiguous blocks on storage
+		by freeing blocks, rather than writing physical zeroes to the
+		media. If write_zeroes_unmap is 1, this indicates that the
+		device explicitly supports the write zero command. Otherwise,
+		the device either does not support it, or its support status is
+		unknown.
+
+
 What:		/sys/block/<disk>/queue/zone_append_max_bytes
 Date:		May 2020
 Contact:	linux-block@vger.kernel.org
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 8f09e33f41f6..a8bf2f8f0634 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -652,6 +652,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 		t->features &= ~BLK_FEAT_NOWAIT;
 	if (!(b->features & BLK_FEAT_POLL))
 		t->features &= ~BLK_FEAT_POLL;
+	if (!(b->features & BLK_FEAT_WRITE_ZEROES_UNMAP))
+		t->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
 
 	t->flags |= (b->flags & BLK_FLAG_MISALIGNED);
 
@@ -774,6 +776,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 		t->zone_write_granularity = 0;
 		t->max_zone_append_sectors = 0;
 	}
+
+	if (!t->max_write_zeroes_sectors)
+		t->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
+
 	blk_stack_atomic_writes_limits(t, b);
 
 	return ret;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 767598e719ab..13f22bee19d2 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -248,6 +248,7 @@ static ssize_t queue_##_name##_show(struct gendisk *disk, char *page)	\
 QUEUE_SYSFS_FEATURE_SHOW(poll, BLK_FEAT_POLL);
 QUEUE_SYSFS_FEATURE_SHOW(fua, BLK_FEAT_FUA);
 QUEUE_SYSFS_FEATURE_SHOW(dax, BLK_FEAT_DAX);
+QUEUE_SYSFS_FEATURE_SHOW(write_zeroes_unmap, BLK_FEAT_WRITE_ZEROES_UNMAP);
 
 static ssize_t queue_zoned_show(struct gendisk *disk, char *page)
 {
@@ -468,6 +469,7 @@ QUEUE_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min_bytes");
 
 QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
 QUEUE_RO_ENTRY(queue_max_write_zeroes_sectors, "write_zeroes_max_bytes");
+QUEUE_RO_ENTRY(queue_write_zeroes_unmap, "write_zeroes_unmap");
 QUEUE_RO_ENTRY(queue_max_zone_append_sectors, "zone_append_max_bytes");
 QUEUE_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
 
@@ -615,6 +617,7 @@ static struct attribute *queue_attrs[] = {
 	&queue_poll_delay_entry.attr,
 	&queue_virt_boundary_mask_entry.attr,
 	&queue_dma_alignment_entry.attr,
+	&queue_write_zeroes_unmap_entry.attr,
 	NULL,
 };
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 378d3a1a22fc..14ba1e2709bb 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -335,6 +335,9 @@ typedef unsigned int __bitwise blk_features_t;
 #define BLK_FEAT_ATOMIC_WRITES_STACKED \
 	((__force blk_features_t)(1u << 16))
 
+/* supports unmap write zeroes command */
+#define BLK_FEAT_WRITE_ZEROES_UNMAP	((__force blk_features_t)(1u << 17))
+
 /*
  * Flags automatically inherited when stacking limits.
  */
-- 
2.39.2
Re: [RFC PATCH v2 1/8] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
Posted by John Garry 10 months, 2 weeks ago
On 15/01/2025 11:46, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Currently, it's hard to know whether the storage device supports unmap
> write zeroes. We cannot determine it only by checking if the disk
> supports the write zeroes command, as for some HDDs that do submit
> actual zeros to the disk media even if they claim to support the write
> zeroes command, but that should be very slow.

This second sentence is too long, such that your meaning is hard to 
understand.

> 
> Therefor, add a new queue limit feature, BLK_FEAT_WRITE_ZEROES_UNMAP and

Therefore?

> the corresponding sysfs entry, to indicate whether the block device
> explicitly supports the unmapped write zeroes command. Each device
> driver should set this bit if it is certain that the attached disk
> supports this command. 

How can they be certain? You already wrote that some claim to support 
it, yet don't really. Well, I think that is what you meant.

> If the bit is not set, the disk either does not
> support it, or its support status is unknown.
> 
> For the stacked devices cases, the BLK_FEAT_WRITE_ZEROES_UNMAP should be
> supported both by the stacking driver and all underlying devices.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>   Documentation/ABI/stable/sysfs-block | 14 ++++++++++++++
>   block/blk-settings.c                 |  6 ++++++
>   block/blk-sysfs.c                    |  3 +++
>   include/linux/blkdev.h               |  3 +++
>   4 files changed, 26 insertions(+)
> 
> diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
> index 0cceb2badc83..ab4117cefd9a 100644
> --- a/Documentation/ABI/stable/sysfs-block
> +++ b/Documentation/ABI/stable/sysfs-block
> @@ -722,6 +722,20 @@ Description:
>   		0, write zeroes is not supported by the device.
>   
>   
> +What:		/sys/block/<disk>/queue/write_zeroes_unmap
> +Date:		January 2025
> +Contact:	Zhang Yi <yi.zhang@huawei.com>
> +Description:
> +		[RO] Devices that explicitly support the unmap write zeroes
> +		operation in which a single write zeroes request with the unmap
> +		bit set to zero out the range of contiguous blocks on storage

which bit are you referring to?

> +		by freeing blocks, rather than writing physical zeroes to the
> +		media. If write_zeroes_unmap is 1, this indicates that the
> +		device explicitly supports the write zero command. Otherwise,
> +		the device either does not support it, or its support status is
> +		unknown.

I am struggling to understand the full meaning of a value of '0'.

Does it means that either:
a. it does not support write zero
b. it does support write zero, yet just did not set 
BLK_FEAT_WRITE_ZEROES_UNMAP


> +
> +
>   What:		/sys/block/<disk>/queue/zone_append_max_bytes
>   Date:		May 2020
>   Contact:	linux-block@vger.kernel.org
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 8f09e33f41f6..a8bf2f8f0634 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -652,6 +652,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>   		t->features &= ~BLK_FEAT_NOWAIT;
>   	if (!(b->features & BLK_FEAT_POLL))
>   		t->features &= ~BLK_FEAT_POLL;
> +	if (!(b->features & BLK_FEAT_WRITE_ZEROES_UNMAP))
> +		t->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;

Why not just set this in BLK_FEAT_INHERIT_MASK? It's seems like a 
sensible thing to do...

>   
>   	t->flags |= (b->flags & BLK_FLAG_MISALIGNED);
>   
> @@ -774,6 +776,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>   		t->zone_write_granularity = 0;
>   		t->max_zone_append_sectors = 0;
>   	}
> +
> +	if (!t->max_write_zeroes_sectors)
> +		t->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
> +
>   	blk_stack_atomic_writes_limits(t, b);
>   
>   	return ret;
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 767598e719ab..13f22bee19d2 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -248,6 +248,7 @@ static ssize_t queue_##_name##_show(struct gendisk *disk, char *page)	\
>   QUEUE_SYSFS_FEATURE_SHOW(poll, BLK_FEAT_POLL);
>   QUEUE_SYSFS_FEATURE_SHOW(fua, BLK_FEAT_FUA);
>   QUEUE_SYSFS_FEATURE_SHOW(dax, BLK_FEAT_DAX);
> +QUEUE_SYSFS_FEATURE_SHOW(write_zeroes_unmap, BLK_FEAT_WRITE_ZEROES_UNMAP);
>   
>   static ssize_t queue_zoned_show(struct gendisk *disk, char *page)
>   {
> @@ -468,6 +469,7 @@ QUEUE_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min_bytes");
>   
>   QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
>   QUEUE_RO_ENTRY(queue_max_write_zeroes_sectors, "write_zeroes_max_bytes");
> +QUEUE_RO_ENTRY(queue_write_zeroes_unmap, "write_zeroes_unmap");
>   QUEUE_RO_ENTRY(queue_max_zone_append_sectors, "zone_append_max_bytes");
>   QUEUE_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
>   
> @@ -615,6 +617,7 @@ static struct attribute *queue_attrs[] = {
>   	&queue_poll_delay_entry.attr,
>   	&queue_virt_boundary_mask_entry.attr,
>   	&queue_dma_alignment_entry.attr,
> +	&queue_write_zeroes_unmap_entry.attr,
>   	NULL,
>   };
>   
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 378d3a1a22fc..14ba1e2709bb 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -335,6 +335,9 @@ typedef unsigned int __bitwise blk_features_t;
>   #define BLK_FEAT_ATOMIC_WRITES_STACKED \
>   	((__force blk_features_t)(1u << 16))
>   
> +/* supports unmap write zeroes command */
> +#define BLK_FEAT_WRITE_ZEROES_UNMAP	((__force blk_features_t)(1u << 17))

Is this flag ever checked within the kernel?

If not, I assume your idea is that the user checks this flag via sysfs 
for the block device which the fs is mounted on just to know if 
FALLOC_FL_WRITE_ZEROES is definitely fast, right?

> +
>   /*
>    * Flags automatically inherited when stacking limits.
>    */
Re: [RFC PATCH v2 1/8] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
Posted by Zhang Yi 10 months, 1 week ago
On 2025/1/29 0:46, John Garry wrote:
> On 15/01/2025 11:46, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> Currently, it's hard to know whether the storage device supports unmap
>> write zeroes. We cannot determine it only by checking if the disk
>> supports the write zeroes command, as for some HDDs that do submit
>> actual zeros to the disk media even if they claim to support the write
>> zeroes command, but that should be very slow.
> 
> This second sentence is too long, such that your meaning is hard to understand.
> 
>>
>> Therefor, add a new queue limit feature, BLK_FEAT_WRITE_ZEROES_UNMAP and
> 
> Therefore?
> 
>> the corresponding sysfs entry, to indicate whether the block device
>> explicitly supports the unmapped write zeroes command. Each device
>> driver should set this bit if it is certain that the attached disk
>> supports this command. 
> 
> How can they be certain? You already wrote that some claim to support it, yet don't really. Well, I think that is what you meant.
> 

Hi, John. thanks for your reply!

Sorry for the late and not make it clear enough earlier. Currently, there
are four situations of write zeroes command (aka REQ_OP_WRITE_ZEROES)
supported by various disks and backend storage devices.

A. Devices that do not support the write zeroes command
   These devices have bdev_limits(bdev)->max_write_zeroes_sectors set to
   zero.
B. Devices that support the write zeroes command
   These devices have bdev_limits(bdev)->max_write_zeroes_sectors set to a
   non-zero value. They can be further categorized into three
   sub-situations:
B.1. Devices that write physical zeroes to the media
     These devices perform the write zeroes operation by physically writing
     zeroes to the storage media, which can be very slow (e.g., HDDs).
B.2. Devices that support unmap write zeroes
     These devices can offload the write zeroes operation by unmapping the
     logical blocks, effectively putting them into a deallocated state
     (e.g., SSDs). This operation is typically very fast, allowing
     filesystems to use this command to quickly create zeroed files. NVMe
     and SCSI disk drivers already support this and can query the attached
     disks to determine whether they support unmap write zeroes (please see
     patches 2 and 3 for details).
B.3. The implementation of write zeroes on disks are unknown
     This category includes non-standard disks and some network storage
     devices where the exact implementation of the write zeroes command is
     unclear.

Currently, users can only distinguish A and B through querying

   /sys/block/<disk>/queue/write_zeroes_unmap

However, we cannot differentiate between the three sub-situations within B.
At the very least, we should aim to clearly distinguish between B.1 and B.2
on the standard HDDs and SSDs. After this series, we should set
BLK_FEAT_WRITE_ZEROES_UNMAP for devices in B.2, while leaving it unset for
the other three cases.

>> If the bit is not set, the disk either does not
>> support it, or its support status is unknown.
>>
>> For the stacked devices cases, the BLK_FEAT_WRITE_ZEROES_UNMAP should be
>> supported both by the stacking driver and all underlying devices.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>>   Documentation/ABI/stable/sysfs-block | 14 ++++++++++++++
>>   block/blk-settings.c                 |  6 ++++++
>>   block/blk-sysfs.c                    |  3 +++
>>   include/linux/blkdev.h               |  3 +++
>>   4 files changed, 26 insertions(+)
>>
>> diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
>> index 0cceb2badc83..ab4117cefd9a 100644
>> --- a/Documentation/ABI/stable/sysfs-block
>> +++ b/Documentation/ABI/stable/sysfs-block
>> @@ -722,6 +722,20 @@ Description:
>>           0, write zeroes is not supported by the device.
>>     +What:        /sys/block/<disk>/queue/write_zeroes_unmap
>> +Date:        January 2025
>> +Contact:    Zhang Yi <yi.zhang@huawei.com>
>> +Description:
>> +        [RO] Devices that explicitly support the unmap write zeroes
>> +        operation in which a single write zeroes request with the unmap
>> +        bit set to zero out the range of contiguous blocks on storage
> 
> which bit are you referring to?

Different disks and drivers have different control bit. For NVMe devices,
the control bit is NVME_WZ_DEAC, while for SCSI SSDs, it is set via
"cmd->cmnd[1] = 0x8", please see nvme_setup_write_zeroes() and
sd_setup_write_same[16|10]_cmnd() for details.

> 
>> +        by freeing blocks, rather than writing physical zeroes to the
>> +        media. If write_zeroes_unmap is 1, this indicates that the
>> +        device explicitly supports the write zero command. Otherwise,
>> +        the device either does not support it, or its support status is
>> +        unknown.
> 
> I am struggling to understand the full meaning of a value of '0'.
> 
> Does it means that either:
> a. it does not support write zero
> b. it does support write zero, yet just did not set BLK_FEAT_WRITE_ZEROES_UNMAP
> 

Yes, if it is set to 0, it should be the cases A, B.1 and B.3 I
mentioned above.

>> +
>> +
>>   What:        /sys/block/<disk>/queue/zone_append_max_bytes
>>   Date:        May 2020
>>   Contact:    linux-block@vger.kernel.org
>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>> index 8f09e33f41f6..a8bf2f8f0634 100644
>> --- a/block/blk-settings.c
>> +++ b/block/blk-settings.c
>> @@ -652,6 +652,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>>           t->features &= ~BLK_FEAT_NOWAIT;
>>       if (!(b->features & BLK_FEAT_POLL))
>>           t->features &= ~BLK_FEAT_POLL;
>> +    if (!(b->features & BLK_FEAT_WRITE_ZEROES_UNMAP))
>> +        t->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
> 
> Why not just set this in BLK_FEAT_INHERIT_MASK? It's seems like a sensible thing to do...
> 
No, we can't inherit BLK_FEAT_WRITE_ZEROES_UNMAP if any of the stacking
driver and underlying device not support unmap write zeroes. The
stacking driver should set it by default, and drop it if one unsupported
device attached, please see patch 4 for the device-mapper case.

>>         t->flags |= (b->flags & BLK_FLAG_MISALIGNED);
>>   @@ -774,6 +776,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>>           t->zone_write_granularity = 0;
>>           t->max_zone_append_sectors = 0;
>>       }
>> +
>> +    if (!t->max_write_zeroes_sectors)
>> +        t->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
>> +
>>       blk_stack_atomic_writes_limits(t, b);
>>         return ret;
>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>> index 767598e719ab..13f22bee19d2 100644
>> --- a/block/blk-sysfs.c
>> +++ b/block/blk-sysfs.c
>> @@ -248,6 +248,7 @@ static ssize_t queue_##_name##_show(struct gendisk *disk, char *page)    \
>>   QUEUE_SYSFS_FEATURE_SHOW(poll, BLK_FEAT_POLL);
>>   QUEUE_SYSFS_FEATURE_SHOW(fua, BLK_FEAT_FUA);
>>   QUEUE_SYSFS_FEATURE_SHOW(dax, BLK_FEAT_DAX);
>> +QUEUE_SYSFS_FEATURE_SHOW(write_zeroes_unmap, BLK_FEAT_WRITE_ZEROES_UNMAP);
>>     static ssize_t queue_zoned_show(struct gendisk *disk, char *page)
>>   {
>> @@ -468,6 +469,7 @@ QUEUE_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min_bytes");
>>     QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
>>   QUEUE_RO_ENTRY(queue_max_write_zeroes_sectors, "write_zeroes_max_bytes");
>> +QUEUE_RO_ENTRY(queue_write_zeroes_unmap, "write_zeroes_unmap");
>>   QUEUE_RO_ENTRY(queue_max_zone_append_sectors, "zone_append_max_bytes");
>>   QUEUE_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
>>   @@ -615,6 +617,7 @@ static struct attribute *queue_attrs[] = {
>>       &queue_poll_delay_entry.attr,
>>       &queue_virt_boundary_mask_entry.attr,
>>       &queue_dma_alignment_entry.attr,
>> +    &queue_write_zeroes_unmap_entry.attr,
>>       NULL,
>>   };
>>   diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 378d3a1a22fc..14ba1e2709bb 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -335,6 +335,9 @@ typedef unsigned int __bitwise blk_features_t;
>>   #define BLK_FEAT_ATOMIC_WRITES_STACKED \
>>       ((__force blk_features_t)(1u << 16))
>>   +/* supports unmap write zeroes command */
>> +#define BLK_FEAT_WRITE_ZEROES_UNMAP    ((__force blk_features_t)(1u << 17))
> 
> Is this flag ever checked within the kernel?
> 
> If not, I assume your idea is that the user checks this flag via sysfs for the block device which the fs is mounted on just to know if FALLOC_FL_WRITE_ZEROES is definitely fast, right?

You are right.

Thanks,
Yi.

Re: [RFC PATCH v2 1/8] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
Posted by Zhang Yi 10 months, 1 week ago
On 2025/2/7 20:22, Zhang Yi wrote:
> On 2025/1/29 0:46, John Garry wrote:
>> On 15/01/2025 11:46, Zhang Yi wrote:
>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>
>>> Currently, it's hard to know whether the storage device supports unmap
>>> write zeroes. We cannot determine it only by checking if the disk
>>> supports the write zeroes command, as for some HDDs that do submit
>>> actual zeros to the disk media even if they claim to support the write
>>> zeroes command, but that should be very slow.
>>
>> This second sentence is too long, such that your meaning is hard to understand.
>>
>>>
>>> Therefor, add a new queue limit feature, BLK_FEAT_WRITE_ZEROES_UNMAP and
>>
>> Therefore?
>>
>>> the corresponding sysfs entry, to indicate whether the block device
>>> explicitly supports the unmapped write zeroes command. Each device
>>> driver should set this bit if it is certain that the attached disk
>>> supports this command. 
>>
>> How can they be certain? You already wrote that some claim to support it, yet don't really. Well, I think that is what you meant.
>>
> 
> Hi, John. thanks for your reply!
> 
> Sorry for the late and not make it clear enough earlier. Currently, there
> are four situations of write zeroes command (aka REQ_OP_WRITE_ZEROES)
> supported by various disks and backend storage devices.
> 
> A. Devices that do not support the write zeroes command
>    These devices have bdev_limits(bdev)->max_write_zeroes_sectors set to
>    zero.
> B. Devices that support the write zeroes command
>    These devices have bdev_limits(bdev)->max_write_zeroes_sectors set to a
>    non-zero value. They can be further categorized into three
>    sub-situations:
> B.1. Devices that write physical zeroes to the media
>      These devices perform the write zeroes operation by physically writing
>      zeroes to the storage media, which can be very slow (e.g., HDDs).
> B.2. Devices that support unmap write zeroes
>      These devices can offload the write zeroes operation by unmapping the
>      logical blocks, effectively putting them into a deallocated state
>      (e.g., SSDs). This operation is typically very fast, allowing
>      filesystems to use this command to quickly create zeroed files. NVMe
>      and SCSI disk drivers already support this and can query the attached
>      disks to determine whether they support unmap write zeroes (please see
>      patches 2 and 3 for details).
> B.3. The implementation of write zeroes on disks are unknown
>      This category includes non-standard disks and some network storage
>      devices where the exact implementation of the write zeroes command is
>      unclear.
> 
> Currently, users can only distinguish A and B through querying
> 
>    /sys/block/<disk>/queue/write_zeroes_unmap
                             ^^^^^^^^^^^^^^^^^^
Oh, sorry, it should be 'write_zeroes_max_bytes'

     /sys/block/<disk>/queue/write_zeroes_max_bytes

Thanks,
Yi.
Re: [RFC PATCH v2 1/8] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
Posted by Christoph Hellwig 10 months, 2 weeks ago
Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
[RFC PATCH v2 2/8] nvme: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports DEAC bit
Posted by Zhang Yi 11 months ago
From: Zhang Yi <yi.zhang@huawei.com>

When the device supports the Write Zeroes command and the DEAC bit, it
indicates that the deallocate bit in the Write Zeroes command is
supported, and the bytes read from a deallocated logical block are
zeroes. This means the device supports unmap Write Zeroes, so set the
BLK_FEAT_WRITE_ZEROES_UNMAP feature to the device's queue limit.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 drivers/nvme/host/core.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a970168a3014..8a359370154d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2210,22 +2210,25 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
 	if (!nvme_init_integrity(ns->head, &lim, info))
 		capacity = 0;
 
-	ret = queue_limits_commit_update(ns->disk->queue, &lim);
-	if (ret) {
-		blk_mq_unfreeze_queue(ns->disk->queue);
-		goto out;
-	}
-
-	set_capacity_and_notify(ns->disk, capacity);
-
 	/*
 	 * Only set the DEAC bit if the device guarantees that reads from
 	 * deallocated data return zeroes.  While the DEAC bit does not
 	 * require that, it must be a no-op if reads from deallocated data
 	 * do not return zeroes.
 	 */
-	if ((id->dlfeat & 0x7) == 0x1 && (id->dlfeat & (1 << 3)))
+	if ((id->dlfeat & 0x7) == 0x1 && (id->dlfeat & (1 << 3))) {
 		ns->head->features |= NVME_NS_DEAC;
+		if (lim.max_write_zeroes_sectors)
+			lim.features |= BLK_FEAT_WRITE_ZEROES_UNMAP;
+	}
+
+	ret = queue_limits_commit_update(ns->disk->queue, &lim);
+	if (ret) {
+		blk_mq_unfreeze_queue(ns->disk->queue);
+		goto out;
+	}
+
+	set_capacity_and_notify(ns->disk, capacity);
 	set_disk_ro(ns->disk, nvme_ns_is_readonly(ns, info));
 	set_bit(NVME_NS_READY, &ns->flags);
 	blk_mq_unfreeze_queue(ns->disk->queue);
-- 
2.39.2
Re: [RFC PATCH v2 2/8] nvme: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports DEAC bit
Posted by Christoph Hellwig 10 months, 2 weeks ago
I think you also need to add BLK_FEAT_WRITE_ZEROES_UNMAP to the list
of features supported by the nvme-mpath stacking driver in
nvme_mpath_alloc_disk, so that this gets propagated to multipathed
devices.

Otherwise this looks good to me.
Re: [RFC PATCH v2 2/8] nvme: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports DEAC bit
Posted by Zhang Yi 10 months, 1 week ago
On 2025/1/28 14:46, Christoph Hellwig wrote:
> I think you also need to add BLK_FEAT_WRITE_ZEROES_UNMAP to the list
> of features supported by the nvme-mpath stacking driver in
> nvme_mpath_alloc_disk, so that this gets propagated to multipathed
> devices.
> 
> Otherwise this looks good to me.

Ha, thanks for pointing this out, I missed this case, will add.

Thanks,
Yi.
[RFC PATCH v2 3/8] scsi: sd: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports unmap zeroing mode
Posted by Zhang Yi 11 months ago
From: Zhang Yi <yi.zhang@huawei.com>

When the device supports the Write Zeroes command and the zeroing mode
is set to SD_ZERO_WS16_UNMAP or SD_ZERO_WS10_UNMAP, this means that the
device supports unmap Write Zeroes, so set the corresponding
BLK_FEAT_WRITE_ZEROES_UNMAP feature to the device's queue limit.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 drivers/scsi/sd.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 8947dab132d7..95e115c69286 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1122,6 +1122,11 @@ static void sd_config_write_same(struct scsi_disk *sdkp,
 	else
 		sdkp->zeroing_mode = SD_ZERO_WRITE;
 
+	if (sdkp->max_ws_blocks &&
+	    (sdkp->zeroing_mode == SD_ZERO_WS16_UNMAP ||
+	     sdkp->zeroing_mode == SD_ZERO_WS10_UNMAP))
+		lim->features |= BLK_FEAT_WRITE_ZEROES_UNMAP;
+
 	if (sdkp->max_ws_blocks &&
 	    sdkp->physical_block_size > logical_block_size) {
 		/*
-- 
2.39.2
Re: [RFC PATCH v2 3/8] scsi: sd: set BLK_FEAT_WRITE_ZEROES_UNMAP if device supports unmap zeroing mode
Posted by Christoph Hellwig 10 months, 2 weeks ago
On Wed, Jan 15, 2025 at 07:46:32PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> When the device supports the Write Zeroes command and the zeroing mode
> is set to SD_ZERO_WS16_UNMAP or SD_ZERO_WS10_UNMAP, this means that the
> device supports unmap Write Zeroes, so set the corresponding
> BLK_FEAT_WRITE_ZEROES_UNMAP feature to the device's queue limit.

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
[RFC PATCH v2 4/8] dm: add BLK_FEAT_WRITE_ZEROES_UNMAP support
Posted by Zhang Yi 11 months ago
From: Zhang Yi <yi.zhang@huawei.com>

Set the BLK_FEAT_WRITE_ZEROES_UNMAP feature on stacking queue limits by
default. This feature shall be disabled if any underlying device does
not support it.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 drivers/md/dm-table.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index bd8b796ae683..58cce31bcc1e 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -598,7 +598,8 @@ int dm_split_args(int *argc, char ***argvp, char *input)
 static void dm_set_stacking_limits(struct queue_limits *limits)
 {
 	blk_set_stacking_limits(limits);
-	limits->features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT | BLK_FEAT_POLL;
+	limits->features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT | BLK_FEAT_POLL |
+			    BLK_FEAT_WRITE_ZEROES_UNMAP;
 }
 
 /*
-- 
2.39.2
Re: [RFC PATCH v2 4/8] dm: add BLK_FEAT_WRITE_ZEROES_UNMAP support
Posted by Benjamin Marzinski 10 months, 1 week ago
On Wed, Jan 15, 2025 at 07:46:33PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Set the BLK_FEAT_WRITE_ZEROES_UNMAP feature on stacking queue limits by
> default. This feature shall be disabled if any underlying device does
> not support it.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  drivers/md/dm-table.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index bd8b796ae683..58cce31bcc1e 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -598,7 +598,8 @@ int dm_split_args(int *argc, char ***argvp, char *input)
>  static void dm_set_stacking_limits(struct queue_limits *limits)
>  {
>  	blk_set_stacking_limits(limits);
> -	limits->features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT | BLK_FEAT_POLL;
> +	limits->features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT | BLK_FEAT_POLL |
> +			    BLK_FEAT_WRITE_ZEROES_UNMAP;
>  }
>  

dm_table_set_restrictions() can set limits->max_write_zeroes_sectors to
0, and it's called after dm_calculate_queue_limits(), which calls
blk_stack_limits(). Just to avoid having the BLK_FEAT_WRITE_ZEROES_UNMAP
still set while a device's max_write_zeroes_sectors is 0, it seems like
you would want to clear it as well if dm_table_set_restrictions() sets
limits->max_write_zeroes_sectors to 0.

-Ben

>  /*
> -- 
> 2.39.2
>
Re: [RFC PATCH v2 4/8] dm: add BLK_FEAT_WRITE_ZEROES_UNMAP support
Posted by Zhang Yi 10 months, 1 week ago
On 2025/2/8 6:14, Benjamin Marzinski wrote:
> On Wed, Jan 15, 2025 at 07:46:33PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> Set the BLK_FEAT_WRITE_ZEROES_UNMAP feature on stacking queue limits by
>> default. This feature shall be disabled if any underlying device does
>> not support it.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>>  drivers/md/dm-table.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
>> index bd8b796ae683..58cce31bcc1e 100644
>> --- a/drivers/md/dm-table.c
>> +++ b/drivers/md/dm-table.c
>> @@ -598,7 +598,8 @@ int dm_split_args(int *argc, char ***argvp, char *input)
>>  static void dm_set_stacking_limits(struct queue_limits *limits)
>>  {
>>  	blk_set_stacking_limits(limits);
>> -	limits->features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT | BLK_FEAT_POLL;
>> +	limits->features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT | BLK_FEAT_POLL |
>> +			    BLK_FEAT_WRITE_ZEROES_UNMAP;
>>  }
>>  
> 
> dm_table_set_restrictions() can set limits->max_write_zeroes_sectors to
> 0, and it's called after dm_calculate_queue_limits(), which calls
> blk_stack_limits(). Just to avoid having the BLK_FEAT_WRITE_ZEROES_UNMAP
> still set while a device's max_write_zeroes_sectors is 0, it seems like
> you would want to clear it as well if dm_table_set_restrictions() sets
> limits->max_write_zeroes_sectors to 0.
> 

Hi, Ben!

Yeah, right. Thanks for pointing this out, and I also checked other
instances in dm where max_write_zeroes_sectors is set to 0, and it seems
we should also clear BLK_FEAT_WRITE_ZEROES_UNMAP in
disable_write_zeroes() as well.

Thanks,
Yi.
Re: [RFC PATCH v2 4/8] dm: add BLK_FEAT_WRITE_ZEROES_UNMAP support
Posted by Benjamin Marzinski 10 months ago
On Sat, Feb 08, 2025 at 11:12:57AM +0800, Zhang Yi wrote:
> On 2025/2/8 6:14, Benjamin Marzinski wrote:
> > On Wed, Jan 15, 2025 at 07:46:33PM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> Set the BLK_FEAT_WRITE_ZEROES_UNMAP feature on stacking queue limits by
> >> default. This feature shall be disabled if any underlying device does
> >> not support it.
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >> ---
> >>  drivers/md/dm-table.c | 3 ++-
> >>  1 file changed, 2 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> >> index bd8b796ae683..58cce31bcc1e 100644
> >> --- a/drivers/md/dm-table.c
> >> +++ b/drivers/md/dm-table.c
> >> @@ -598,7 +598,8 @@ int dm_split_args(int *argc, char ***argvp, char *input)
> >>  static void dm_set_stacking_limits(struct queue_limits *limits)
> >>  {
> >>  	blk_set_stacking_limits(limits);
> >> -	limits->features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT | BLK_FEAT_POLL;
> >> +	limits->features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT | BLK_FEAT_POLL |
> >> +			    BLK_FEAT_WRITE_ZEROES_UNMAP;
> >>  }
> >>  
> > 
> > dm_table_set_restrictions() can set limits->max_write_zeroes_sectors to
> > 0, and it's called after dm_calculate_queue_limits(), which calls
> > blk_stack_limits(). Just to avoid having the BLK_FEAT_WRITE_ZEROES_UNMAP
> > still set while a device's max_write_zeroes_sectors is 0, it seems like
> > you would want to clear it as well if dm_table_set_restrictions() sets
> > limits->max_write_zeroes_sectors to 0.
> > 
> 
> Hi, Ben!
> 
> Yeah, right. Thanks for pointing this out, and I also checked other
> instances in dm where max_write_zeroes_sectors is set to 0, and it seems
> we should also clear BLK_FEAT_WRITE_ZEROES_UNMAP in
> disable_write_zeroes() as well.

Yep. Makes sense.

Thanks
-Ben

> 
> Thanks,
> Yi.
[RFC PATCH v2 5/8] fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate
Posted by Zhang Yi 11 months ago
From: Zhang Yi <yi.zhang@huawei.com>

With the development of flash-based storage devices, we can quickly
write zeros to SSDs using the WRITE_ZERO command if the devices do not
actually write physical zeroes to the media. Therefore, we can use this
command to quickly preallocate a real all-zero file with written
extents. This approach should be beneficial for subsequent pure
overwriting within this file, as it can save on block allocation and,
consequently, significant metadata changes, which should greatly improve
overwrite performance on certain filesystems.

Therefore, introduce a new operation FALLOC_FL_WRITE_ZEROES to
fallocate. This flag is used to convert a specified range of a file to
zeros by issuing a zeroing operation. Blocks should be allocated for the
regions that span holes in the file, and the entire range is converted
to written extents. If the underlying device supports the actual offload
write zeroes command, the process of zeroing out operation can be
accelerated. If it does not, we currently don't prevent the file system
from writing actual zeros to the device. This provides users with a new
method to quickly generate a zeroed file, users no longer need to write
zero data to create a file with written extents.

Users can check the disk support of unmap write zeroes command by
querying:

    /sys/block/<disk>/queue/write_zeroes_unmap

Finally, this flag should not be used in conjunction with the
FALLOC_FL_KEEP_SIZE since allocating written extents beyond file EOF is
not permitted, and filesystems that always require out-of-place writes
should not support this flag since they still need to allocated new
blocks during subsequent overwrites.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/open.c                   |  1 +
 include/linux/falloc.h      |  3 ++-
 include/uapi/linux/falloc.h | 18 ++++++++++++++++++
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/fs/open.c b/fs/open.c
index e6911101fe71..49be16d07168 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -265,6 +265,7 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		if (!(mode & FALLOC_FL_KEEP_SIZE))
 			return -EOPNOTSUPP;
 		break;
+	case FALLOC_FL_WRITE_ZEROES:
 	case FALLOC_FL_COLLAPSE_RANGE:
 	case FALLOC_FL_INSERT_RANGE:
 		if (mode & FALLOC_FL_KEEP_SIZE)
diff --git a/include/linux/falloc.h b/include/linux/falloc.h
index 3f49f3df6af5..7c38c6b76b60 100644
--- a/include/linux/falloc.h
+++ b/include/linux/falloc.h
@@ -36,7 +36,8 @@ struct space_resv {
 				 FALLOC_FL_COLLAPSE_RANGE |	\
 				 FALLOC_FL_ZERO_RANGE |		\
 				 FALLOC_FL_INSERT_RANGE |	\
-				 FALLOC_FL_UNSHARE_RANGE)
+				 FALLOC_FL_UNSHARE_RANGE |	\
+				 FALLOC_FL_WRITE_ZEROES)
 
 /* on ia32 l_start is on a 32-bit boundary */
 #if defined(CONFIG_X86_64)
diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
index 5810371ed72b..a4ddb166dbd2 100644
--- a/include/uapi/linux/falloc.h
+++ b/include/uapi/linux/falloc.h
@@ -78,4 +78,22 @@
  */
 #define FALLOC_FL_UNSHARE_RANGE		0x40
 
+/*
+ * FALLOC_FL_WRITE_ZEROES is used to convert a specified range of a file to
+ * zeros by issuing a zeroing operation. Blocks should be allocated for the
+ * regions that span holes in the file, and the entire range is converted to
+ * written extents. This flag is beneficial for subsequent pure overwriting
+ * within this range, as it can save on block allocation and, consequently,
+ * significant metadata changes. Therefore, filesystems that always require
+ * out-of-place writes should not support this flag.
+ *
+ * Different filesystems may implement different limitations on the
+ * granularity of the zeroing operation. Most will preferably be accelerated
+ * by submitting write zeroes command if the backing storage supports, which
+ * may not physically write zeros to the media.
+ *
+ * This flag cannot be used in conjunction with the FALLOC_FL_KEEP_SIZE.
+ */
+#define FALLOC_FL_WRITE_ZEROES		0x80
+
 #endif /* _UAPI_FALLOC_H_ */
-- 
2.39.2
[RFC PATCH v2 6/8] block: add FALLOC_FL_WRITE_ZEROES support
Posted by Zhang Yi 11 months ago
From: Zhang Yi <yi.zhang@huawei.com>

Add support for FALLOC_FL_WRITE_ZEROES. It directly calls
blkdev_issue_zeroout() with flags set to 0. The underlying process will
attempt to use the fastest method for issuing zeroes. First, the block
layer will try to issue a write zeroes command if the storage device
supports it; if not, it will fall back to issuing zeroed data. Then, the
storage device driver may attempt to submit an unmap write zero command
if the device supports it; if not, the driver may fall back to
submitting a no-unmap write zeroes command.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 block/fops.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/block/fops.c b/block/fops.c
index 13a67940d040..56486de0960b 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -777,7 +777,7 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 
 #define	BLKDEV_FALLOC_FL_SUPPORTED					\
 		(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |		\
-		 FALLOC_FL_ZERO_RANGE)
+		 FALLOC_FL_ZERO_RANGE | FALLOC_FL_WRITE_ZEROES)
 
 static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 			     loff_t len)
@@ -836,6 +836,15 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 					     len >> SECTOR_SHIFT, GFP_KERNEL,
 					     BLKDEV_ZERO_NOFALLBACK);
 		break;
+	case FALLOC_FL_WRITE_ZEROES:
+		error = truncate_bdev_range(bdev, file_to_blk_mode(file), start, end);
+		if (error)
+			goto fail;
+
+		error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
+					     len >> SECTOR_SHIFT, GFP_KERNEL,
+					     0);
+		break;
 	default:
 		error = -EOPNOTSUPP;
 	}
-- 
2.39.2
[RFC PATCH v2 7/8] block: factor out common part in blkdev_fallocate()
Posted by Zhang Yi 11 months ago
From: Zhang Yi <yi.zhang@huawei.com>

Only the flags passed to blkdev_issue_zeroout() differ among the three
zeroing branches in blkdev_fallocate(). Therefore, do cleanup by
factoring them out.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 block/fops.c | 40 +++++++++++++++-------------------------
 1 file changed, 15 insertions(+), 25 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index 56486de0960b..22eda3a43908 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -786,6 +786,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 	struct block_device *bdev = I_BDEV(inode);
 	loff_t end = start + len - 1;
 	loff_t isize;
+	unsigned int flags;
 	int error;
 
 	/* Fail if we don't recognize the flags. */
@@ -812,43 +813,32 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 
 	filemap_invalidate_lock(inode->i_mapping);
 
-	/*
-	 * Invalidate the page cache, including dirty pages, for valid
-	 * de-allocate mode calls to fallocate().
-	 */
 	switch (mode) {
 	case FALLOC_FL_ZERO_RANGE:
 	case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
-		error = truncate_bdev_range(bdev, file_to_blk_mode(file), start, end);
-		if (error)
-			goto fail;
-
-		error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
-					     len >> SECTOR_SHIFT, GFP_KERNEL,
-					     BLKDEV_ZERO_NOUNMAP);
+		flags = BLKDEV_ZERO_NOUNMAP;
 		break;
 	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
-		error = truncate_bdev_range(bdev, file_to_blk_mode(file), start, end);
-		if (error)
-			goto fail;
-
-		error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
-					     len >> SECTOR_SHIFT, GFP_KERNEL,
-					     BLKDEV_ZERO_NOFALLBACK);
+		flags = BLKDEV_ZERO_NOFALLBACK;
 		break;
 	case FALLOC_FL_WRITE_ZEROES:
-		error = truncate_bdev_range(bdev, file_to_blk_mode(file), start, end);
-		if (error)
-			goto fail;
-
-		error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
-					     len >> SECTOR_SHIFT, GFP_KERNEL,
-					     0);
+		flags = 0;
 		break;
 	default:
 		error = -EOPNOTSUPP;
+		goto fail;
 	}
 
+	/*
+	 * Invalidate the page cache, including dirty pages, for valid
+	 * de-allocate mode calls to fallocate().
+	 */
+	error = truncate_bdev_range(bdev, file_to_blk_mode(file), start, end);
+	if (error)
+		goto fail;
+
+	error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
+				     len >> SECTOR_SHIFT, GFP_KERNEL, flags);
  fail:
 	filemap_invalidate_unlock(inode->i_mapping);
 	return error;
-- 
2.39.2
[RFC PATCH v2 8/8] ext4: add FALLOC_FL_WRITE_ZEROES support
Posted by Zhang Yi 11 months ago
From: Zhang Yi <yi.zhang@huawei.com>

Add support for FALLOC_FL_WRITE_ZEROES. This first allocates blocks as
unwritten, then issues a zero command outside of the running journal
handle, and finally converts them to a written state.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents.c           | 59 ++++++++++++++++++++++++++++++-------
 include/trace/events/ext4.h |  3 +-
 2 files changed, 50 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 1b028be19193..e937a714085c 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4483,6 +4483,8 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 	struct ext4_map_blocks map;
 	unsigned int credits;
 	loff_t epos, old_size = i_size_read(inode);
+	unsigned int blkbits = inode->i_blkbits;
+	bool alloc_zero = false;
 
 	BUG_ON(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS));
 	map.m_lblk = offset;
@@ -4495,6 +4497,17 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 	if (len <= EXT_UNWRITTEN_MAX_LEN)
 		flags |= EXT4_GET_BLOCKS_NO_NORMALIZE;
 
+	/*
+	 * Do the actual write zero during a running journal transaction
+	 * costs a lot. First allocate an unwritten extent and then
+	 * convert it to written after zeroing it out.
+	 */
+	if (flags & EXT4_GET_BLOCKS_ZERO) {
+		flags &= ~EXT4_GET_BLOCKS_ZERO;
+		flags |= EXT4_GET_BLOCKS_UNWRIT_EXT;
+		alloc_zero = true;
+	}
+
 	/*
 	 * credits to insert 1 extent into extent tree
 	 */
@@ -4531,9 +4544,7 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 		 * allow a full retry cycle for any remaining allocations
 		 */
 		retries = 0;
-		map.m_lblk += ret;
-		map.m_len = len = len - ret;
-		epos = (loff_t)map.m_lblk << inode->i_blkbits;
+		epos = (loff_t)(map.m_lblk + ret) << blkbits;
 		inode_set_ctime_current(inode);
 		if (new_size) {
 			if (epos > new_size)
@@ -4553,6 +4564,21 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 		ret2 = ret3 ? ret3 : ret2;
 		if (unlikely(ret2))
 			break;
+
+		if (alloc_zero &&
+		    (map.m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN))) {
+			ret2 = ext4_issue_zeroout(inode, map.m_lblk, map.m_pblk,
+						  map.m_len);
+			if (likely(!ret2))
+				ret2 = ext4_convert_unwritten_extents(NULL,
+					inode, (loff_t)map.m_lblk << blkbits,
+					(loff_t)map.m_len << blkbits);
+			if (ret2)
+				break;
+		}
+
+		map.m_lblk += ret;
+		map.m_len = len = len - ret;
 	}
 	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
 		goto retry;
@@ -4618,7 +4644,11 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 	if (end_lblk > start_lblk) {
 		ext4_lblk_t zero_blks = end_lblk - start_lblk;
 
-		flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN | EXT4_EX_NOCACHE);
+		if (mode & FALLOC_FL_WRITE_ZEROES)
+			flags = EXT4_GET_BLOCKS_CREATE_ZERO | EXT4_EX_NOCACHE;
+		else
+			flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN |
+				  EXT4_EX_NOCACHE);
 		ret = ext4_alloc_file_blocks(file, start_lblk, zero_blks,
 					     new_size, flags);
 		if (ret)
@@ -4730,8 +4760,8 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 
 	/* Return error if mode is not supported */
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
-		     FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |
-		     FALLOC_FL_INSERT_RANGE))
+		     FALLOC_FL_ZERO_RANGE | FALLOC_FL_COLLAPSE_RANGE |
+		     FALLOC_FL_INSERT_RANGE | FALLOC_FL_WRITE_ZEROES))
 		return -EOPNOTSUPP;
 
 	inode_lock(inode);
@@ -4762,16 +4792,23 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	if (ret)
 		goto out_invalidate_lock;
 
-	if (mode & FALLOC_FL_PUNCH_HOLE)
+	switch (mode & FALLOC_FL_MODE_MASK) {
+	case FALLOC_FL_PUNCH_HOLE:
 		ret = ext4_punch_hole(file, offset, len);
-	else if (mode & FALLOC_FL_COLLAPSE_RANGE)
+		break;
+	case FALLOC_FL_COLLAPSE_RANGE:
 		ret = ext4_collapse_range(file, offset, len);
-	else if (mode & FALLOC_FL_INSERT_RANGE)
+		break;
+	case FALLOC_FL_INSERT_RANGE:
 		ret = ext4_insert_range(file, offset, len);
-	else if (mode & FALLOC_FL_ZERO_RANGE)
+		break;
+	case FALLOC_FL_ZERO_RANGE:
+	case FALLOC_FL_WRITE_ZEROES:
 		ret = ext4_zero_range(file, offset, len, mode);
-	else
+		break;
+	default:
 		ret = -EOPNOTSUPP;
+	}
 
 out_invalidate_lock:
 	filemap_invalidate_unlock(mapping);
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 156908641e68..6f9cf2811733 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -92,7 +92,8 @@ TRACE_DEFINE_ENUM(ES_REFERENCED_B);
 	{ FALLOC_FL_KEEP_SIZE,		"KEEP_SIZE"},		\
 	{ FALLOC_FL_PUNCH_HOLE,		"PUNCH_HOLE"},		\
 	{ FALLOC_FL_COLLAPSE_RANGE,	"COLLAPSE_RANGE"},	\
-	{ FALLOC_FL_ZERO_RANGE,		"ZERO_RANGE"})
+	{ FALLOC_FL_ZERO_RANGE,		"ZERO_RANGE"},		\
+	{ FALLOC_FL_WRITE_ZEROES,	"WRITE_ZEROES"})
 
 TRACE_DEFINE_ENUM(EXT4_FC_REASON_XATTR);
 TRACE_DEFINE_ENUM(EXT4_FC_REASON_CROSS_RENAME);
-- 
2.39.2