[PATCH v5 4/9] xfs: convert delayed extents to unwritten when zeroing post eof blocks

Zhang Yi posted 9 patches 1 year, 7 months ago
[PATCH v5 4/9] xfs: convert delayed extents to unwritten when zeroing post eof blocks
Posted by Zhang Yi 1 year, 7 months ago
From: Zhang Yi <yi.zhang@huawei.com>

Current clone operation could be non-atomic if the destination of a file
is beyond EOF, user could get a file with corrupted (zeroed) data on
crash.

The problem is about preallocations. If you write some data into a file:

	[A...B)

and XFS decides to preallocate some post-eof blocks, then it can create
a delayed allocation reservation:

	[A.........D)

The writeback path tries to convert delayed extents to real ones by
allocating blocks. If there aren't enough contiguous free space, we can
end up with two extents, the first real and the second still delalloc:

	[A....C)[C.D)

After that, both the in-memory and the on-disk file sizes are still B.
If we clone into the range [E...F) from another file:

	[A....C)[C.D)      [E...F)

then xfs_reflink_zero_posteof() calls iomap_zero_range() to zero out the
range [B, E) beyond EOF and flush it. Since [C, D) is still a delalloc
extent, its pagecache will be zeroed and both the in-memory and on-disk
size will be updated to D after flushing but before cloning. This is
wrong, because the user can see the size change and read the zeroes
while the clone operation is ongoing.

We need to keep the in-memory and on-disk size before the clone
operation starts, so instead of writing zeroes through the page cache
for delayed ranges beyond EOF, we convert these ranges to unwritten and
invalidate any cached data over that range beyond EOF.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
Changes since v4:

Move the delalloc converting hunk before searching the COW fork. Because
if the file has been reflinked and copied on write,
xfs_bmap_extsize_align() aligned the range of COW delalloc extent, after
the writeback, there might be some unwritten extents left over in the
COW fork that overlaps the delalloc extent we found in data fork.

  data fork  ...wwww|dddddddddd...
  cow fork          |uuuuuuuuuu...
                    ^
                  i_size

In my v4, we search the COW fork before checking the delalloc extent,
goto found_cow tag and return unconverted delalloc srcmap in the above
case, so the delayed extent in the data fork will have no chance to
convert to unwritten, it will lead to delalloc extent residue and break
generic/522 after merging patch 6.

 fs/xfs/xfs_iomap.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 236ee78aa75b..2857ef1b0272 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1022,6 +1022,24 @@ xfs_buffered_write_iomap_begin(
 		goto out_unlock;
 	}
 
+	/*
+	 * For zeroing, trim a delalloc extent that extends beyond the EOF
+	 * block.  If it starts beyond the EOF block, convert it to an
+	 * unwritten extent.
+	 */
+	if ((flags & IOMAP_ZERO) && imap.br_startoff <= offset_fsb &&
+	    isnullstartblock(imap.br_startblock)) {
+		xfs_fileoff_t eof_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
+
+		if (offset_fsb >= eof_fsb)
+			goto convert_delay;
+		if (end_fsb > eof_fsb) {
+			end_fsb = eof_fsb;
+			xfs_trim_extent(&imap, offset_fsb,
+					end_fsb - offset_fsb);
+		}
+	}
+
 	/*
 	 * Search the COW fork extent list even if we did not find a data fork
 	 * extent.  This serves two purposes: first this implements the
@@ -1167,6 +1185,17 @@ xfs_buffered_write_iomap_begin(
 	xfs_iunlock(ip, lockmode);
 	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
 
+convert_delay:
+	xfs_iunlock(ip, lockmode);
+	truncate_pagecache(inode, offset);
+	error = xfs_bmapi_convert_delalloc(ip, XFS_DATA_FORK, offset,
+					   iomap, NULL);
+	if (error)
+		return error;
+
+	trace_xfs_iomap_alloc(ip, offset, count, XFS_DATA_FORK, &imap);
+	return 0;
+
 found_cow:
 	seq = xfs_iomap_inode_sequence(ip, 0);
 	if (imap.br_startoff <= offset_fsb) {
-- 
2.39.2
Re: [PATCH v5 4/9] xfs: convert delayed extents to unwritten when zeroing post eof blocks
Posted by Lai, Yi 8 months ago
Hi Zhang Yi,

Greetings!

I used Syzkaller and found that there is WARNING in xfs_bmapi_convert_one_delalloc in linux v6.15-rc2.

After bisection and the first bad commit is:
"
5ce5674187c3 xfs: convert delayed extents to unwritten when zeroing post eof blocks
"

All detailed into can be found at:
https://github.com/laifryiee/syzkaller_logs/tree/main/250413_025108_xfs_bmapi_convert_one_delalloc
Syzkaller repro code:
https://github.com/laifryiee/syzkaller_logs/tree/main/250413_025108_xfs_bmapi_convert_one_delalloc/repro.c
Syzkaller repro syscall steps:
https://github.com/laifryiee/syzkaller_logs/tree/main/250413_025108_xfs_bmapi_convert_one_delalloc/repro.prog
Syzkaller report:
https://github.com/laifryiee/syzkaller_logs/tree/main/250413_025108_xfs_bmapi_convert_one_delalloc/repro.report
Kconfig(make olddefconfig):
https://github.com/laifryiee/syzkaller_logs/tree/main/250413_025108_xfs_bmapi_convert_one_delalloc/kconfig_origin
Bisect info:
https://github.com/laifryiee/syzkaller_logs/tree/main/250413_025108_xfs_bmapi_convert_one_delalloc/bisect_info.log
bzImage:
https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/250413_025108_xfs_bmapi_convert_one_delalloc/bzImage_0af2f6be1b4281385b618cb86ad946eded089ac8
Issue dmesg:
https://github.com/laifryiee/syzkaller_logs/blob/main/250413_025108_xfs_bmapi_convert_one_delalloc/0af2f6be1b4281385b618cb86ad946eded089ac8_dmesg.log

"
[   21.631771] ------------[ cut here ]------------
[   21.632173] WARNING: CPU: 1 PID: 760 at fs/xfs/libxfs/xfs_bmap.c:4401 xfs_bmapi_convert_one_delalloc+0x520/0xca0
[   21.633001] Modules linked in:
[   21.633292] CPU: 1 UID: 0 PID: 760 Comm: repro Tainted: G        W           6.15.0-rc2-v6.15-rc2 #1 PREEMPT(voluntary) 
[   21.634151] Tainted: [W]=WARN
[   21.634401] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   21.635476] RIP: 0010:xfs_bmapi_convert_one_delalloc+0x520/0xca0
[   21.635970] Code: fc ff ff e8 32 76 eb fe 44 89 fe bf 02 00 00 00 41 bc f5 ff ff ff e8 cf 70 eb fe 41 83 ff 02 0f 84 03 ff ff ff e8 10 76 eb fe <0f> 0b e9 f7 fe ff ff e8 04 76 eb fe 48 c7 c0 c0 ca ff 89 48 ba 00
[   21.637426] RSP: 0018:ff1100001cc37228 EFLAGS: 00010293
[   21.637855] RAX: 0000000000000000 RBX: ff1100001cc373e8 RCX: ffffffff829c44b1
[   21.638430] RDX: ff110000108b2bc0 RSI: ffffffff829c44c0 RDI: 0000000000000005
[   21.639001] RBP: ff1100001cc37410 R08: ff1100001cc37318 R09: fffffbfff0fdc374
[   21.639614] R10: 0000000000000000 R11: ff110000108b3a18 R12: 00000000fffffff5
[   21.640180] R13: ff1100001d471680 R14: ff1100000ec74000 R15: 0000000000000000
[   21.640751] FS:  00007f4c1b99a640(0000) GS:ff110000e3a84000(0000) knlGS:0000000000000000
[   21.641394] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   21.641862] CR2: 00007f4c14cfd000 CR3: 000000000d4f4006 CR4: 0000000000771ef0
[   21.642439] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   21.643007] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[   21.643620] PKRU: 55555554
[   21.643848] Call Trace:
[   21.644060]  <TASK>
[   21.644256]  ? __pfx_xfs_bmapi_convert_one_delalloc+0x10/0x10
[   21.644738]  ? __up_read+0x21e/0x810
[   21.645063]  ? unmap_mapping_range+0x118/0x2c0
[   21.645436]  ? lock_release+0x14f/0x2c0
[   21.645761]  ? __pfx_unmap_mapping_range+0x10/0x10
[   21.646164]  xfs_bmapi_convert_delalloc+0xb0/0x100
[   21.646568]  xfs_buffered_write_iomap_begin+0x153d/0x1890
[   21.647022]  ? __pfx_xfs_buffered_write_iomap_begin+0x10/0x10
[   21.647535]  ? filemap_range_has_writeback+0x318/0x460
[   21.647977]  ? filemap_range_has_writeback+0xf1/0x460
[   21.648395]  ? __pfx_xfs_buffered_write_iomap_begin+0x10/0x10
[   21.648870]  iomap_iter+0x58b/0xd30
[   21.649178]  iomap_zero_range+0x547/0x740
[   21.649526]  ? __pfx_iomap_zero_range+0x10/0x10
[   21.649926]  ? __this_cpu_preempt_check+0x21/0x30
[   21.650317]  ? lock_is_held_type+0xef/0x150
[   21.650673]  xfs_zero_range+0xaf/0x100
[   21.650990]  xfs_file_write_checks+0x639/0xa00
[   21.651385]  xfs_file_buffered_write+0x183/0x9a0
[   21.651806]  ? __pfx_xfs_file_buffered_write+0x10/0x10
[   21.652225]  ? __lock_acquire+0x410/0x2260
[   21.652580]  ? __lruvec_stat_mod_folio+0x1aa/0x3e0
[   21.652985]  xfs_file_write_iter+0x52a/0xc00
[   21.653350]  do_iter_readv_writev+0x6b1/0x9c0
[   21.653723]  ? __pfx_do_iter_readv_writev+0x10/0x10
[   21.654127]  ? __this_cpu_preempt_check+0x21/0x30
[   21.654526]  ? lock_is_held_type+0xef/0x150
[   21.654879]  vfs_writev+0x311/0xd70
[   21.655204]  ? __lock_acquire+0x410/0x2260
[   21.655574]  ? __pfx_vfs_writev+0x10/0x10
[   21.655922]  ? __fget_files+0x1fa/0x3b0
[   21.656251]  ? __this_cpu_preempt_check+0x21/0x30
[   21.656643]  ? lock_release+0x14f/0x2c0
[   21.656973]  ? __fget_files+0x204/0x3b0
[   21.657306]  do_pwritev+0x1c9/0x280
[   21.657612]  ? do_pwritev+0x1c9/0x280
[   21.657927]  ? __pfx_do_pwritev+0x10/0x10
[   21.658269]  ? __audit_syscall_entry+0x39c/0x500
[   21.658661]  __x64_sys_pwritev+0xa3/0x100
[   21.659000]  ? syscall_trace_enter+0x14d/0x280
[   21.659387]  x64_sys_call+0x1bbc/0x2150
[   21.659741]  do_syscall_64+0x6d/0x150
[   21.660055]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   21.660473] RIP: 0033:0x7f4c1b63ee5d
[   21.660774] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
[   21.662222] RSP: 002b:00007f4c1b999df8 EFLAGS: 00000202 ORIG_RAX: 0000000000000128
[   21.662830] RAX: ffffffffffffffda RBX: 00007f4c1b99a640 RCX: 00007f4c1b63ee5d
[   21.663432] RDX: 0000000000000001 RSI: 0000000020000300 RDI: 0000000000000004
[   21.664001] RBP: 00007f4c1b999e20 R08: 0000000000000000 R09: 0000000000000000
[   21.664569] R10: 0000000002fffffd R11: 0000000000000202 R12: 00007f4c1b99a640
[   21.665137] R13: 000000000000006e R14: 00007f4c1b69f560 R15: 0000000000000000
[   21.665726]  </TASK>
"

Hope this cound be insightful to you.

Regards,
Yi Lai

---

If you don't need the following environment to reproduce the problem or if you
already have one reproduced environment, please ignore the following information.

How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64 and I used v7.1.0
  // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
  // You could change the bzImage_xxx as you want
  // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
You could use below command to log in, there is no password for root.
ssh -p 10023 root@localhost

After login vm(virtual machine) successfully, you could transfer reproduced
binary to the vm by below way, and reproduce the problem in vm:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/

Get the bzImage for target kernel:
Please use target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage           //x should equal or less than cpu num your pc has

Fill the bzImage file into above start3.sh to load the target kernel in vm.


Tips:
If you already have qemu-system-x86_64, please ignore below info.
If you want to install qemu v7.1.0 version:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install 

On Thu, Apr 25, 2024 at 09:13:30PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Current clone operation could be non-atomic if the destination of a file
> is beyond EOF, user could get a file with corrupted (zeroed) data on
> crash.
> 
> The problem is about preallocations. If you write some data into a file:
> 
> 	[A...B)
> 
> and XFS decides to preallocate some post-eof blocks, then it can create
> a delayed allocation reservation:
> 
> 	[A.........D)
> 
> The writeback path tries to convert delayed extents to real ones by
> allocating blocks. If there aren't enough contiguous free space, we can
> end up with two extents, the first real and the second still delalloc:
> 
> 	[A....C)[C.D)
> 
> After that, both the in-memory and the on-disk file sizes are still B.
> If we clone into the range [E...F) from another file:
> 
> 	[A....C)[C.D)      [E...F)
> 
> then xfs_reflink_zero_posteof() calls iomap_zero_range() to zero out the
> range [B, E) beyond EOF and flush it. Since [C, D) is still a delalloc
> extent, its pagecache will be zeroed and both the in-memory and on-disk
> size will be updated to D after flushing but before cloning. This is
> wrong, because the user can see the size change and read the zeroes
> while the clone operation is ongoing.
> 
> We need to keep the in-memory and on-disk size before the clone
> operation starts, so instead of writing zeroes through the page cache
> for delayed ranges beyond EOF, we convert these ranges to unwritten and
> invalidate any cached data over that range beyond EOF.
> 
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
> Changes since v4:
> 
> Move the delalloc converting hunk before searching the COW fork. Because
> if the file has been reflinked and copied on write,
> xfs_bmap_extsize_align() aligned the range of COW delalloc extent, after
> the writeback, there might be some unwritten extents left over in the
> COW fork that overlaps the delalloc extent we found in data fork.
> 
>   data fork  ...wwww|dddddddddd...
>   cow fork          |uuuuuuuuuu...
>                     ^
>                   i_size
> 
> In my v4, we search the COW fork before checking the delalloc extent,
> goto found_cow tag and return unconverted delalloc srcmap in the above
> case, so the delayed extent in the data fork will have no chance to
> convert to unwritten, it will lead to delalloc extent residue and break
> generic/522 after merging patch 6.
> 
>  fs/xfs/xfs_iomap.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 236ee78aa75b..2857ef1b0272 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1022,6 +1022,24 @@ xfs_buffered_write_iomap_begin(
>  		goto out_unlock;
>  	}
>  
> +	/*
> +	 * For zeroing, trim a delalloc extent that extends beyond the EOF
> +	 * block.  If it starts beyond the EOF block, convert it to an
> +	 * unwritten extent.
> +	 */
> +	if ((flags & IOMAP_ZERO) && imap.br_startoff <= offset_fsb &&
> +	    isnullstartblock(imap.br_startblock)) {
> +		xfs_fileoff_t eof_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
> +
> +		if (offset_fsb >= eof_fsb)
> +			goto convert_delay;
> +		if (end_fsb > eof_fsb) {
> +			end_fsb = eof_fsb;
> +			xfs_trim_extent(&imap, offset_fsb,
> +					end_fsb - offset_fsb);
> +		}
> +	}
> +
>  	/*
>  	 * Search the COW fork extent list even if we did not find a data fork
>  	 * extent.  This serves two purposes: first this implements the
> @@ -1167,6 +1185,17 @@ xfs_buffered_write_iomap_begin(
>  	xfs_iunlock(ip, lockmode);
>  	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
>  
> +convert_delay:
> +	xfs_iunlock(ip, lockmode);
> +	truncate_pagecache(inode, offset);
> +	error = xfs_bmapi_convert_delalloc(ip, XFS_DATA_FORK, offset,
> +					   iomap, NULL);
> +	if (error)
> +		return error;
> +
> +	trace_xfs_iomap_alloc(ip, offset, count, XFS_DATA_FORK, &imap);
> +	return 0;
> +
>  found_cow:
>  	seq = xfs_iomap_inode_sequence(ip, 0);
>  	if (imap.br_startoff <= offset_fsb) {
> -- 
> 2.39.2
>
Re: [PATCH v5 4/9] xfs: convert delayed extents to unwritten when zeroing post eof blocks
Posted by Brian Foster 7 months, 2 weeks ago
On Wed, Apr 16, 2025 at 03:20:21PM +0800, Lai, Yi wrote:
> Hi Zhang Yi,
> 
> Greetings!
> 
> I used Syzkaller and found that there is WARNING in xfs_bmapi_convert_one_delalloc in linux v6.15-rc2.
> 
> After bisection and the first bad commit is:
> "
> 5ce5674187c3 xfs: convert delayed extents to unwritten when zeroing post eof blocks
> "
> 

I took a look at this and it seems like the issue is a race between a
failed writeback and zero range via write extension. The sequence is
generally some file is written to with delalloc across the EOF folio. An
fsync occurs on said file. The underlying fs is (I'm assuming
deliberately) corrupted such that physical block allocation fails, so
the writeback block map fails and drops the ilock.

An extending buffered write comes in and discovers the post-EOF delalloc
range needs to be zeroed and then unlocks to fall into delalloc
conversion. However, the writeback side was blocked waiting on the ilock
in xfs_discard_folio(), which now punches out the blocks in the same
range and thus causes the warning down in the conversion code on the
buffered write / delalloc convert side where it finds a hole (and warns
because it doesn't like that this is not the COW fork remap case).

There's probably a few ways to address this. The most immediate that
comes to mind for me is probably to just drop the warning, document how
this can happen, and probably also tweak the error logic for the data
fork case so the operation proceeds for remaining blocks beyond the
folio that failed writeback. Another option could be to similarly return
EAGAIN and trigger a retry in the iomap_begin() path. Can't say I'm in
love with either one of those tbh, nor the whole convert blocks on zero
range thing in the first place, but it's also a rare error case and
perhaps not worth too much complexity.

That said, I've also wondered in the past if a higher level
iomap_map_blocks() call that effectively implements a more generic
->map_blocks() would be a more appropriate way to do this. Other
thoughts?

Brian

> All detailed into can be found at:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250413_025108_xfs_bmapi_convert_one_delalloc
> Syzkaller repro code:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250413_025108_xfs_bmapi_convert_one_delalloc/repro.c
> Syzkaller repro syscall steps:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250413_025108_xfs_bmapi_convert_one_delalloc/repro.prog
> Syzkaller report:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250413_025108_xfs_bmapi_convert_one_delalloc/repro.report
> Kconfig(make olddefconfig):
> https://github.com/laifryiee/syzkaller_logs/tree/main/250413_025108_xfs_bmapi_convert_one_delalloc/kconfig_origin
> Bisect info:
> https://github.com/laifryiee/syzkaller_logs/tree/main/250413_025108_xfs_bmapi_convert_one_delalloc/bisect_info.log
> bzImage:
> https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/250413_025108_xfs_bmapi_convert_one_delalloc/bzImage_0af2f6be1b4281385b618cb86ad946eded089ac8
> Issue dmesg:
> https://github.com/laifryiee/syzkaller_logs/blob/main/250413_025108_xfs_bmapi_convert_one_delalloc/0af2f6be1b4281385b618cb86ad946eded089ac8_dmesg.log
> 
> "
> [   21.631771] ------------[ cut here ]------------
> [   21.632173] WARNING: CPU: 1 PID: 760 at fs/xfs/libxfs/xfs_bmap.c:4401 xfs_bmapi_convert_one_delalloc+0x520/0xca0
> [   21.633001] Modules linked in:
> [   21.633292] CPU: 1 UID: 0 PID: 760 Comm: repro Tainted: G        W           6.15.0-rc2-v6.15-rc2 #1 PREEMPT(voluntary) 
> [   21.634151] Tainted: [W]=WARN
> [   21.634401] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [   21.635476] RIP: 0010:xfs_bmapi_convert_one_delalloc+0x520/0xca0
> [   21.635970] Code: fc ff ff e8 32 76 eb fe 44 89 fe bf 02 00 00 00 41 bc f5 ff ff ff e8 cf 70 eb fe 41 83 ff 02 0f 84 03 ff ff ff e8 10 76 eb fe <0f> 0b e9 f7 fe ff ff e8 04 76 eb fe 48 c7 c0 c0 ca ff 89 48 ba 00
> [   21.637426] RSP: 0018:ff1100001cc37228 EFLAGS: 00010293
> [   21.637855] RAX: 0000000000000000 RBX: ff1100001cc373e8 RCX: ffffffff829c44b1
> [   21.638430] RDX: ff110000108b2bc0 RSI: ffffffff829c44c0 RDI: 0000000000000005
> [   21.639001] RBP: ff1100001cc37410 R08: ff1100001cc37318 R09: fffffbfff0fdc374
> [   21.639614] R10: 0000000000000000 R11: ff110000108b3a18 R12: 00000000fffffff5
> [   21.640180] R13: ff1100001d471680 R14: ff1100000ec74000 R15: 0000000000000000
> [   21.640751] FS:  00007f4c1b99a640(0000) GS:ff110000e3a84000(0000) knlGS:0000000000000000
> [   21.641394] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   21.641862] CR2: 00007f4c14cfd000 CR3: 000000000d4f4006 CR4: 0000000000771ef0
> [   21.642439] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   21.643007] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> [   21.643620] PKRU: 55555554
> [   21.643848] Call Trace:
> [   21.644060]  <TASK>
> [   21.644256]  ? __pfx_xfs_bmapi_convert_one_delalloc+0x10/0x10
> [   21.644738]  ? __up_read+0x21e/0x810
> [   21.645063]  ? unmap_mapping_range+0x118/0x2c0
> [   21.645436]  ? lock_release+0x14f/0x2c0
> [   21.645761]  ? __pfx_unmap_mapping_range+0x10/0x10
> [   21.646164]  xfs_bmapi_convert_delalloc+0xb0/0x100
> [   21.646568]  xfs_buffered_write_iomap_begin+0x153d/0x1890
> [   21.647022]  ? __pfx_xfs_buffered_write_iomap_begin+0x10/0x10
> [   21.647535]  ? filemap_range_has_writeback+0x318/0x460
> [   21.647977]  ? filemap_range_has_writeback+0xf1/0x460
> [   21.648395]  ? __pfx_xfs_buffered_write_iomap_begin+0x10/0x10
> [   21.648870]  iomap_iter+0x58b/0xd30
> [   21.649178]  iomap_zero_range+0x547/0x740
> [   21.649526]  ? __pfx_iomap_zero_range+0x10/0x10
> [   21.649926]  ? __this_cpu_preempt_check+0x21/0x30
> [   21.650317]  ? lock_is_held_type+0xef/0x150
> [   21.650673]  xfs_zero_range+0xaf/0x100
> [   21.650990]  xfs_file_write_checks+0x639/0xa00
> [   21.651385]  xfs_file_buffered_write+0x183/0x9a0
> [   21.651806]  ? __pfx_xfs_file_buffered_write+0x10/0x10
> [   21.652225]  ? __lock_acquire+0x410/0x2260
> [   21.652580]  ? __lruvec_stat_mod_folio+0x1aa/0x3e0
> [   21.652985]  xfs_file_write_iter+0x52a/0xc00
> [   21.653350]  do_iter_readv_writev+0x6b1/0x9c0
> [   21.653723]  ? __pfx_do_iter_readv_writev+0x10/0x10
> [   21.654127]  ? __this_cpu_preempt_check+0x21/0x30
> [   21.654526]  ? lock_is_held_type+0xef/0x150
> [   21.654879]  vfs_writev+0x311/0xd70
> [   21.655204]  ? __lock_acquire+0x410/0x2260
> [   21.655574]  ? __pfx_vfs_writev+0x10/0x10
> [   21.655922]  ? __fget_files+0x1fa/0x3b0
> [   21.656251]  ? __this_cpu_preempt_check+0x21/0x30
> [   21.656643]  ? lock_release+0x14f/0x2c0
> [   21.656973]  ? __fget_files+0x204/0x3b0
> [   21.657306]  do_pwritev+0x1c9/0x280
> [   21.657612]  ? do_pwritev+0x1c9/0x280
> [   21.657927]  ? __pfx_do_pwritev+0x10/0x10
> [   21.658269]  ? __audit_syscall_entry+0x39c/0x500
> [   21.658661]  __x64_sys_pwritev+0xa3/0x100
> [   21.659000]  ? syscall_trace_enter+0x14d/0x280
> [   21.659387]  x64_sys_call+0x1bbc/0x2150
> [   21.659741]  do_syscall_64+0x6d/0x150
> [   21.660055]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   21.660473] RIP: 0033:0x7f4c1b63ee5d
> [   21.660774] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
> [   21.662222] RSP: 002b:00007f4c1b999df8 EFLAGS: 00000202 ORIG_RAX: 0000000000000128
> [   21.662830] RAX: ffffffffffffffda RBX: 00007f4c1b99a640 RCX: 00007f4c1b63ee5d
> [   21.663432] RDX: 0000000000000001 RSI: 0000000020000300 RDI: 0000000000000004
> [   21.664001] RBP: 00007f4c1b999e20 R08: 0000000000000000 R09: 0000000000000000
> [   21.664569] R10: 0000000002fffffd R11: 0000000000000202 R12: 00007f4c1b99a640
> [   21.665137] R13: 000000000000006e R14: 00007f4c1b69f560 R15: 0000000000000000
> [   21.665726]  </TASK>
> "
> 
> Hope this cound be insightful to you.
> 
> Regards,
> Yi Lai
> 
> ---
> 
> If you don't need the following environment to reproduce the problem or if you
> already have one reproduced environment, please ignore the following information.
> 
> How to reproduce:
> git clone https://gitlab.com/xupengfe/repro_vm_env.git
> cd repro_vm_env
> tar -xvf repro_vm_env.tar.gz
> cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64 and I used v7.1.0
>   // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
>   // You could change the bzImage_xxx as you want
>   // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
> You could use below command to log in, there is no password for root.
> ssh -p 10023 root@localhost
> 
> After login vm(virtual machine) successfully, you could transfer reproduced
> binary to the vm by below way, and reproduce the problem in vm:
> gcc -pthread -o repro repro.c
> scp -P 10023 repro root@localhost:/root/
> 
> Get the bzImage for target kernel:
> Please use target kconfig and copy it to kernel_src/.config
> make olddefconfig
> make -jx bzImage           //x should equal or less than cpu num your pc has
> 
> Fill the bzImage file into above start3.sh to load the target kernel in vm.
> 
> 
> Tips:
> If you already have qemu-system-x86_64, please ignore below info.
> If you want to install qemu v7.1.0 version:
> git clone https://github.com/qemu/qemu.git
> cd qemu
> git checkout -f v7.1.0
> mkdir build
> cd build
> yum install -y ninja-build.x86_64
> yum -y install libslirp-devel.x86_64
> ../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
> make
> make install 
> 
> On Thu, Apr 25, 2024 at 09:13:30PM +0800, Zhang Yi wrote:
> > From: Zhang Yi <yi.zhang@huawei.com>
> > 
> > Current clone operation could be non-atomic if the destination of a file
> > is beyond EOF, user could get a file with corrupted (zeroed) data on
> > crash.
> > 
> > The problem is about preallocations. If you write some data into a file:
> > 
> > 	[A...B)
> > 
> > and XFS decides to preallocate some post-eof blocks, then it can create
> > a delayed allocation reservation:
> > 
> > 	[A.........D)
> > 
> > The writeback path tries to convert delayed extents to real ones by
> > allocating blocks. If there aren't enough contiguous free space, we can
> > end up with two extents, the first real and the second still delalloc:
> > 
> > 	[A....C)[C.D)
> > 
> > After that, both the in-memory and the on-disk file sizes are still B.
> > If we clone into the range [E...F) from another file:
> > 
> > 	[A....C)[C.D)      [E...F)
> > 
> > then xfs_reflink_zero_posteof() calls iomap_zero_range() to zero out the
> > range [B, E) beyond EOF and flush it. Since [C, D) is still a delalloc
> > extent, its pagecache will be zeroed and both the in-memory and on-disk
> > size will be updated to D after flushing but before cloning. This is
> > wrong, because the user can see the size change and read the zeroes
> > while the clone operation is ongoing.
> > 
> > We need to keep the in-memory and on-disk size before the clone
> > operation starts, so instead of writing zeroes through the page cache
> > for delayed ranges beyond EOF, we convert these ranges to unwritten and
> > invalidate any cached data over that range beyond EOF.
> > 
> > Suggested-by: Dave Chinner <david@fromorbit.com>
> > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > ---
> > Changes since v4:
> > 
> > Move the delalloc converting hunk before searching the COW fork. Because
> > if the file has been reflinked and copied on write,
> > xfs_bmap_extsize_align() aligned the range of COW delalloc extent, after
> > the writeback, there might be some unwritten extents left over in the
> > COW fork that overlaps the delalloc extent we found in data fork.
> > 
> >   data fork  ...wwww|dddddddddd...
> >   cow fork          |uuuuuuuuuu...
> >                     ^
> >                   i_size
> > 
> > In my v4, we search the COW fork before checking the delalloc extent,
> > goto found_cow tag and return unconverted delalloc srcmap in the above
> > case, so the delayed extent in the data fork will have no chance to
> > convert to unwritten, it will lead to delalloc extent residue and break
> > generic/522 after merging patch 6.
> > 
> >  fs/xfs/xfs_iomap.c | 29 +++++++++++++++++++++++++++++
> >  1 file changed, 29 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > index 236ee78aa75b..2857ef1b0272 100644
> > --- a/fs/xfs/xfs_iomap.c
> > +++ b/fs/xfs/xfs_iomap.c
> > @@ -1022,6 +1022,24 @@ xfs_buffered_write_iomap_begin(
> >  		goto out_unlock;
> >  	}
> >  
> > +	/*
> > +	 * For zeroing, trim a delalloc extent that extends beyond the EOF
> > +	 * block.  If it starts beyond the EOF block, convert it to an
> > +	 * unwritten extent.
> > +	 */
> > +	if ((flags & IOMAP_ZERO) && imap.br_startoff <= offset_fsb &&
> > +	    isnullstartblock(imap.br_startblock)) {
> > +		xfs_fileoff_t eof_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
> > +
> > +		if (offset_fsb >= eof_fsb)
> > +			goto convert_delay;
> > +		if (end_fsb > eof_fsb) {
> > +			end_fsb = eof_fsb;
> > +			xfs_trim_extent(&imap, offset_fsb,
> > +					end_fsb - offset_fsb);
> > +		}
> > +	}
> > +
> >  	/*
> >  	 * Search the COW fork extent list even if we did not find a data fork
> >  	 * extent.  This serves two purposes: first this implements the
> > @@ -1167,6 +1185,17 @@ xfs_buffered_write_iomap_begin(
> >  	xfs_iunlock(ip, lockmode);
> >  	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
> >  
> > +convert_delay:
> > +	xfs_iunlock(ip, lockmode);
> > +	truncate_pagecache(inode, offset);
> > +	error = xfs_bmapi_convert_delalloc(ip, XFS_DATA_FORK, offset,
> > +					   iomap, NULL);
> > +	if (error)
> > +		return error;
> > +
> > +	trace_xfs_iomap_alloc(ip, offset, count, XFS_DATA_FORK, &imap);
> > +	return 0;
> > +
> >  found_cow:
> >  	seq = xfs_iomap_inode_sequence(ip, 0);
> >  	if (imap.br_startoff <= offset_fsb) {
> > -- 
> > 2.39.2
> > 
>
Re: [PATCH v5 4/9] xfs: convert delayed extents to unwritten when zeroing post eof blocks
Posted by Darrick J. Wong 1 year, 7 months ago
On Thu, Apr 25, 2024 at 09:13:30PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Current clone operation could be non-atomic if the destination of a file
> is beyond EOF, user could get a file with corrupted (zeroed) data on
> crash.
> 
> The problem is about preallocations. If you write some data into a file:
> 
> 	[A...B)
> 
> and XFS decides to preallocate some post-eof blocks, then it can create
> a delayed allocation reservation:
> 
> 	[A.........D)
> 
> The writeback path tries to convert delayed extents to real ones by
> allocating blocks. If there aren't enough contiguous free space, we can
> end up with two extents, the first real and the second still delalloc:
> 
> 	[A....C)[C.D)
> 
> After that, both the in-memory and the on-disk file sizes are still B.
> If we clone into the range [E...F) from another file:
> 
> 	[A....C)[C.D)      [E...F)
> 
> then xfs_reflink_zero_posteof() calls iomap_zero_range() to zero out the
> range [B, E) beyond EOF and flush it. Since [C, D) is still a delalloc
> extent, its pagecache will be zeroed and both the in-memory and on-disk
> size will be updated to D after flushing but before cloning. This is
> wrong, because the user can see the size change and read the zeroes
> while the clone operation is ongoing.
> 
> We need to keep the in-memory and on-disk size before the clone
> operation starts, so instead of writing zeroes through the page cache
> for delayed ranges beyond EOF, we convert these ranges to unwritten and
> invalidate any cached data over that range beyond EOF.
> 
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
> Changes since v4:
> 
> Move the delalloc converting hunk before searching the COW fork. Because
> if the file has been reflinked and copied on write,
> xfs_bmap_extsize_align() aligned the range of COW delalloc extent, after
> the writeback, there might be some unwritten extents left over in the
> COW fork that overlaps the delalloc extent we found in data fork.
> 
>   data fork  ...wwww|dddddddddd...
>   cow fork          |uuuuuuuuuu...
>                     ^
>                   i_size
> 
> In my v4, we search the COW fork before checking the delalloc extent,
> goto found_cow tag and return unconverted delalloc srcmap in the above
> case, so the delayed extent in the data fork will have no chance to
> convert to unwritten, it will lead to delalloc extent residue and break
> generic/522 after merging patch 6.

Hmmm.  I suppose that works, but it feels a little funny to convert the
delalloc mapping in the data fork to unwritten /while/ there's unwritten
extents in the cow fork too.  Would it make more sense to remap the cow
fork extents here?

OTOH unwritten extents in the cow fork get changed to written ones by
all the cow remapping functions.  Soooo maybe we don't want to go
digging /that/ deep into the system.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> 
>  fs/xfs/xfs_iomap.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 236ee78aa75b..2857ef1b0272 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1022,6 +1022,24 @@ xfs_buffered_write_iomap_begin(
>  		goto out_unlock;
>  	}
>  
> +	/*
> +	 * For zeroing, trim a delalloc extent that extends beyond the EOF
> +	 * block.  If it starts beyond the EOF block, convert it to an
> +	 * unwritten extent.
> +	 */
> +	if ((flags & IOMAP_ZERO) && imap.br_startoff <= offset_fsb &&
> +	    isnullstartblock(imap.br_startblock)) {
> +		xfs_fileoff_t eof_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
> +
> +		if (offset_fsb >= eof_fsb)
> +			goto convert_delay;
> +		if (end_fsb > eof_fsb) {
> +			end_fsb = eof_fsb;
> +			xfs_trim_extent(&imap, offset_fsb,
> +					end_fsb - offset_fsb);
> +		}
> +	}
> +
>  	/*
>  	 * Search the COW fork extent list even if we did not find a data fork
>  	 * extent.  This serves two purposes: first this implements the
> @@ -1167,6 +1185,17 @@ xfs_buffered_write_iomap_begin(
>  	xfs_iunlock(ip, lockmode);
>  	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
>  
> +convert_delay:
> +	xfs_iunlock(ip, lockmode);
> +	truncate_pagecache(inode, offset);
> +	error = xfs_bmapi_convert_delalloc(ip, XFS_DATA_FORK, offset,
> +					   iomap, NULL);
> +	if (error)
> +		return error;
> +
> +	trace_xfs_iomap_alloc(ip, offset, count, XFS_DATA_FORK, &imap);
> +	return 0;
> +
>  found_cow:
>  	seq = xfs_iomap_inode_sequence(ip, 0);
>  	if (imap.br_startoff <= offset_fsb) {
> -- 
> 2.39.2
> 
>
Re: [PATCH v5 4/9] xfs: convert delayed extents to unwritten when zeroing post eof blocks
Posted by Zhang Yi 1 year, 7 months ago
On 2024/4/26 2:29, Darrick J. Wong wrote:
> On Thu, Apr 25, 2024 at 09:13:30PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> Current clone operation could be non-atomic if the destination of a file
>> is beyond EOF, user could get a file with corrupted (zeroed) data on
>> crash.
>>
>> The problem is about preallocations. If you write some data into a file:
>>
>> 	[A...B)
>>
>> and XFS decides to preallocate some post-eof blocks, then it can create
>> a delayed allocation reservation:
>>
>> 	[A.........D)
>>
>> The writeback path tries to convert delayed extents to real ones by
>> allocating blocks. If there aren't enough contiguous free space, we can
>> end up with two extents, the first real and the second still delalloc:
>>
>> 	[A....C)[C.D)
>>
>> After that, both the in-memory and the on-disk file sizes are still B.
>> If we clone into the range [E...F) from another file:
>>
>> 	[A....C)[C.D)      [E...F)
>>
>> then xfs_reflink_zero_posteof() calls iomap_zero_range() to zero out the
>> range [B, E) beyond EOF and flush it. Since [C, D) is still a delalloc
>> extent, its pagecache will be zeroed and both the in-memory and on-disk
>> size will be updated to D after flushing but before cloning. This is
>> wrong, because the user can see the size change and read the zeroes
>> while the clone operation is ongoing.
>>
>> We need to keep the in-memory and on-disk size before the clone
>> operation starts, so instead of writing zeroes through the page cache
>> for delayed ranges beyond EOF, we convert these ranges to unwritten and
>> invalidate any cached data over that range beyond EOF.
>>
>> Suggested-by: Dave Chinner <david@fromorbit.com>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>> Changes since v4:
>>
>> Move the delalloc converting hunk before searching the COW fork. Because
>> if the file has been reflinked and copied on write,
>> xfs_bmap_extsize_align() aligned the range of COW delalloc extent, after
>> the writeback, there might be some unwritten extents left over in the
>> COW fork that overlaps the delalloc extent we found in data fork.
>>
>>   data fork  ...wwww|dddddddddd...
>>   cow fork          |uuuuuuuuuu...
>>                     ^
>>                   i_size
>>
>> In my v4, we search the COW fork before checking the delalloc extent,
>> goto found_cow tag and return unconverted delalloc srcmap in the above
>> case, so the delayed extent in the data fork will have no chance to
>> convert to unwritten, it will lead to delalloc extent residue and break
>> generic/522 after merging patch 6.
> 
> Hmmm.  I suppose that works, but it feels a little funny to convert the
> delalloc mapping in the data fork to unwritten /while/ there's unwritten
> extents in the cow fork too.  Would it make more sense to remap the cow
> fork extents here?
> 

Yeah, it looks more reasonable. But from the original scene, the
xfs_bmap_extsize_align() aligned the new extent that added to the cow fork
could overlaps the unreflinked range, IIUC, I guess that spare range is
useless exactly, is there any situation that would use it?

> OTOH unwritten extents in the cow fork get changed to written ones by
> all the cow remapping functions.  Soooo maybe we don't want to go
> digging /that/ deep into the system.
> 

Yeah, I think it's okay now unless there's some strong claims.

> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 
> --D
> 
>>
>>  fs/xfs/xfs_iomap.c | 29 +++++++++++++++++++++++++++++
>>  1 file changed, 29 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 236ee78aa75b..2857ef1b0272 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -1022,6 +1022,24 @@ xfs_buffered_write_iomap_begin(
>>  		goto out_unlock;
>>  	}
>>  
>> +	/*
>> +	 * For zeroing, trim a delalloc extent that extends beyond the EOF
>> +	 * block.  If it starts beyond the EOF block, convert it to an
>> +	 * unwritten extent.
>> +	 */
>> +	if ((flags & IOMAP_ZERO) && imap.br_startoff <= offset_fsb &&
>> +	    isnullstartblock(imap.br_startblock)) {
>> +		xfs_fileoff_t eof_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
>> +
>> +		if (offset_fsb >= eof_fsb)
>> +			goto convert_delay;
>> +		if (end_fsb > eof_fsb) {
>> +			end_fsb = eof_fsb;
>> +			xfs_trim_extent(&imap, offset_fsb,
>> +					end_fsb - offset_fsb);
>> +		}
>> +	}
>> +
>>  	/*
>>  	 * Search the COW fork extent list even if we did not find a data fork
>>  	 * extent.  This serves two purposes: first this implements the
>> @@ -1167,6 +1185,17 @@ xfs_buffered_write_iomap_begin(
>>  	xfs_iunlock(ip, lockmode);
>>  	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
>>  
>> +convert_delay:
>> +	xfs_iunlock(ip, lockmode);
>> +	truncate_pagecache(inode, offset);
>> +	error = xfs_bmapi_convert_delalloc(ip, XFS_DATA_FORK, offset,
>> +					   iomap, NULL);
>> +	if (error)
>> +		return error;
>> +
>> +	trace_xfs_iomap_alloc(ip, offset, count, XFS_DATA_FORK, &imap);
>> +	return 0;
>> +
>>  found_cow:
>>  	seq = xfs_iomap_inode_sequence(ip, 0);
>>  	if (imap.br_startoff <= offset_fsb) {
>> -- 
>> 2.39.2
>>
>>
Re: [PATCH v5 4/9] xfs: convert delayed extents to unwritten when zeroing post eof blocks
Posted by Christoph Hellwig 1 year, 7 months ago
On Fri, Apr 26, 2024 at 02:24:19PM +0800, Zhang Yi wrote:
> Yeah, it looks more reasonable. But from the original scene, the
> xfs_bmap_extsize_align() aligned the new extent that added to the cow fork
> could overlaps the unreflinked range, IIUC, I guess that spare range is
> useless exactly, is there any situation that would use it?

I've just started staring at this (again) half an hour ago, and I fail
to understand the (pre-existing) logic in xfs_reflink_zero_posteof.

We obviously need to ensure data between i_size and the end of the
block that i_size sits in is zeroed (but IIRC we already do that
in write and truncate anyway).  But what is the point of zeroing
any speculative preallocation beyond the last block that actually
contains data?  Just truncating the preallocation and freeing
the delalloc and unwritten blocks seems like it would be way
more efficient.
Re: [PATCH v5 4/9] xfs: convert delayed extents to unwritten when zeroing post eof blocks
Posted by Zhang Yi 1 year, 7 months ago
On 2024/4/26 14:33, Christoph Hellwig wrote:
> On Fri, Apr 26, 2024 at 02:24:19PM +0800, Zhang Yi wrote:
>> Yeah, it looks more reasonable. But from the original scene, the
>> xfs_bmap_extsize_align() aligned the new extent that added to the cow fork
>> could overlaps the unreflinked range, IIUC, I guess that spare range is
>> useless exactly, is there any situation that would use it?
> 
> I've just started staring at this (again) half an hour ago, and I fail
> to understand the (pre-existing) logic in xfs_reflink_zero_posteof.
> 
> We obviously need to ensure data between i_size and the end of the
> block that i_size sits in is zeroed (but IIRC we already do that
> in write and truncate anyway).  But what is the point of zeroing
> any speculative preallocation beyond the last block that actually
> contains data?  Just truncating the preallocation and freeing
> the delalloc and unwritten blocks seems like it would be way
> more efficient.
> 

I've had the same idea before, I asked Dave and he explained that Linux
could leak data beyond EOF page for some cases, e.g. mmap() can write to
the EOF page beyond EOF without failing, and the data in that EOF page
could be non-zeroed by mmap(), so the zeroing is still needed now.

OTOH, if we free the delalloc and unwritten blocks beyond EOF blocks, he
said it could lead to some performance problems and make thinks
complicated to deal with the trimming of EOF block. Please see [1]
for details and maybe Dave could explain more.

[1] https://lore.kernel.org/linux-xfs/ZeERAob9Imwh01bG@dread.disaster.area/

Thanks,
Yi.
Re: [PATCH v5 4/9] xfs: convert delayed extents to unwritten when zeroing post eof blocks
Posted by Christoph Hellwig 1 year, 7 months ago
On Fri, Apr 26, 2024 at 03:18:17PM +0800, Zhang Yi wrote:
> I've had the same idea before, I asked Dave and he explained that Linux
> could leak data beyond EOF page for some cases, e.g. mmap() can write to
> the EOF page beyond EOF without failing, and the data in that EOF page
> could be non-zeroed by mmap(), so the zeroing is still needed now.
> 
> OTOH, if we free the delalloc and unwritten blocks beyond EOF blocks, he
> said it could lead to some performance problems and make thinks
> complicated to deal with the trimming of EOF block. Please see [1]
> for details and maybe Dave could explain more.

Oh well.  Given that we're full in on the speculative allocations
we might as well deal with it.
Re: [PATCH v5 4/9] xfs: convert delayed extents to unwritten when zeroing post eof blocks
Posted by Zhang Yi 1 year, 7 months ago
On 2024/4/27 14:59, Christoph Hellwig wrote:
> On Fri, Apr 26, 2024 at 03:18:17PM +0800, Zhang Yi wrote:
>> I've had the same idea before, I asked Dave and he explained that Linux
>> could leak data beyond EOF page for some cases, e.g. mmap() can write to
>> the EOF page beyond EOF without failing, and the data in that EOF page
>> could be non-zeroed by mmap(), so the zeroing is still needed now.
>>
>> OTOH, if we free the delalloc and unwritten blocks beyond EOF blocks, he
>> said it could lead to some performance problems and make thinks
>> complicated to deal with the trimming of EOF block. Please see [1]
>> for details and maybe Dave could explain more.
> 
> Oh well.  Given that we're full in on the speculative allocations
> we might as well deal with it.
> 

Let me confirm, so you also think the preallocations in the COW fork that
overlaps the unreflinked range is useless, we should avoid allocating
this range, is that right? If so, I suppose we can do this improvement in
another patch(set), this one works fine now.

Thanks,
Yi.
Re: [PATCH v5 4/9] xfs: convert delayed extents to unwritten when zeroing post eof blocks
Posted by Christoph Hellwig 1 year, 7 months ago
On Sun, Apr 28, 2024 at 11:26:00AM +0800, Zhang Yi wrote:
> > 
> > Oh well.  Given that we're full in on the speculative allocations
> > we might as well deal with it.
> > 
> 
> Let me confirm, so you also think the preallocations in the COW fork that
> overlaps the unreflinked range is useless, we should avoid allocating
> this range, is that right? If so, I suppose we can do this improvement in
> another patch(set), this one works fine now.

Well, not stop allocating it, but not actually convert it to a real
allocation when we're just truncating it and replacing the blocks with
reflinked blocks.

But yes, this is a bigger project.

For now for this patch to go ahead:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Re: [PATCH v5 4/9] xfs: convert delayed extents to unwritten when zeroing post eof blocks
Posted by Zhang Yi 1 year, 7 months ago
On 2024/4/29 12:41, Christoph Hellwig wrote:
> On Sun, Apr 28, 2024 at 11:26:00AM +0800, Zhang Yi wrote:
>>>
>>> Oh well.  Given that we're full in on the speculative allocations
>>> we might as well deal with it.
>>>
>>
>> Let me confirm, so you also think the preallocations in the COW fork that
>> overlaps the unreflinked range is useless, we should avoid allocating
>> this range, is that right? If so, I suppose we can do this improvement in
>> another patch(set), this one works fine now.
> 
> Well, not stop allocating it, but not actually convert it to a real
> allocation when we're just truncating it and replacing the blocks with
> reflinked blocks.

OK, it looks fine, too.

Thanks,
Yi.

> 
> But yes, this is a bigger project.>
> For now for this patch to go ahead:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
>