fs/btrfs/inode.c | 5 +++++ 1 file changed, 5 insertions(+)
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
NOCOW writes do not generate stripe_extent entries in the RAID stripe
tree, as the RAID stripe-tree feature initially was designed with a
zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW
writes. But the RAID stripe-tree feature is independent from the zoned
feature, so we must also allow NOCOW writes for zoned filesystems.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
fs/btrfs/inode.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index edac499fd83d..c6e4b58c334c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3111,6 +3111,11 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
ret = btrfs_update_inode_fallback(trans, inode);
if (ret) /* -ENOMEM or corruption */
btrfs_abort_transaction(trans, ret);
+
+ ret = btrfs_insert_raid_extent(trans, ordered_extent);
+ if (ret)
+ btrfs_abort_transaction(trans, ret);
+
goto out;
}
--
2.43.0
On Mon, Sep 23, 2024 at 08:45:47AM GMT, Johannes Thumshirn wrote: > From: Johannes Thumshirn <johannes.thumshirn@wdc.com> > > NOCOW writes do not generate stripe_extent entries in the RAID stripe > tree, as the RAID stripe-tree feature initially was designed with a > zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW > writes. But the RAID stripe-tree feature is independent from the zoned > feature, so we must also allow NOCOW writes for zoned filesystems. > > Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> The NCOW case itself looks fine for me. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> But, looking around the context, since there are some duplication among regular and NOCOW case, it would be good time to merge them. Apparently, btrfs_insert_raid_extent() in the regular case (current code) should abort the transaction as same as this patch. With that fixed, code will become really similar. > --- > fs/btrfs/inode.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index edac499fd83d..c6e4b58c334c 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -3111,6 +3111,11 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent) > ret = btrfs_update_inode_fallback(trans, inode); > if (ret) /* -ENOMEM or corruption */ > btrfs_abort_transaction(trans, ret); > + > + ret = btrfs_insert_raid_extent(trans, ordered_extent); > + if (ret) > + btrfs_abort_transaction(trans, ret); > + > goto out; > } > > -- > 2.43.0 >
在 2024/9/23 16:15, Johannes Thumshirn 写道: > From: Johannes Thumshirn <johannes.thumshirn@wdc.com> > > NOCOW writes do not generate stripe_extent entries in the RAID stripe > tree, as the RAID stripe-tree feature initially was designed with a > zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW > writes. But the RAID stripe-tree feature is independent from the zoned > feature, so we must also allow NOCOW writes for zoned filesystems. > > Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Sorry I'm going to repeat myself again, I still believe if we insert an RST entry at falloc() time, it will be more consistent with the non-RST code. Yes, I known preallocated space will not need any read nor search RST entry, and we just fill the page cache with zero at read time. But the point of proper (not just dummy) RST entry for the whole preallocated space is, we do not need to touch the RST entry anymore for NOCOW/PREALLOCATED write at all. This makes the RST NOCOW/PREALLOC writes behavior to align with the non-RST code, which doesn't update any extent item, but only modify the file extent for PREALLOC writes. Thanks, Qu > --- > fs/btrfs/inode.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index edac499fd83d..c6e4b58c334c 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -3111,6 +3111,11 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent) > ret = btrfs_update_inode_fallback(trans, inode); > if (ret) /* -ENOMEM or corruption */ > btrfs_abort_transaction(trans, ret); > + > + ret = btrfs_insert_raid_extent(trans, ordered_extent); > + if (ret) > + btrfs_abort_transaction(trans, ret); > + > goto out; > } >
On Mon, Sep 23, 2024 at 04:58:34PM +0930, Qu Wenruo wrote: > > > 在 2024/9/23 16:15, Johannes Thumshirn 写道: > > From: Johannes Thumshirn <johannes.thumshirn@wdc.com> > > > > NOCOW writes do not generate stripe_extent entries in the RAID stripe > > tree, as the RAID stripe-tree feature initially was designed with a > > zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW > > writes. But the RAID stripe-tree feature is independent from the zoned > > feature, so we must also allow NOCOW writes for zoned filesystems. > > > > Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> > > Sorry I'm going to repeat myself again, I still believe if we insert an > RST entry at falloc() time, it will be more consistent with the non-RST > code. > > Yes, I known preallocated space will not need any read nor search RST > entry, and we just fill the page cache with zero at read time. > > But the point of proper (not just dummy) RST entry for the whole > preallocated space is, we do not need to touch the RST entry anymore for > NOCOW/PREALLOCATED write at all. > > This makes the RST NOCOW/PREALLOC writes behavior to align with the > non-RST code, which doesn't update any extent item, but only modify the > file extent for PREALLOC writes. I see what you're getting at here, but it creates a huge amount of problems later down the line. I prealloc an extent, I map that logical extent to a physical extent and then I insert a RST entry for that mapping. Now I rip out one of my disks, and now I have to update RST entries for extents I'm not going to rewrite because they're prealloc. RST is a logical->physical mapping. We do not need to update or insert anything until we create that logical->physical mapping. Keeping the rules consistent across the different layers will make it easier to reason about and easier to maintain. Adding an index at endio time for NOCOW makes sense, we now have created a thing on disk that we need to have a mapping for. The same goes for prealloc, adding an entry at prealloc time doesn't make logical sense as we haven't yet instantiated that space on disk. Thanks, Josef
在 2024/9/24 00:50, Josef Bacik 写道: > On Mon, Sep 23, 2024 at 04:58:34PM +0930, Qu Wenruo wrote: >> >> >> 在 2024/9/23 16:15, Johannes Thumshirn 写道: >>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com> >>> >>> NOCOW writes do not generate stripe_extent entries in the RAID stripe >>> tree, as the RAID stripe-tree feature initially was designed with a >>> zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW >>> writes. But the RAID stripe-tree feature is independent from the zoned >>> feature, so we must also allow NOCOW writes for zoned filesystems. >>> >>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> >> >> Sorry I'm going to repeat myself again, I still believe if we insert an >> RST entry at falloc() time, it will be more consistent with the non-RST >> code. >> >> Yes, I known preallocated space will not need any read nor search RST >> entry, and we just fill the page cache with zero at read time. >> >> But the point of proper (not just dummy) RST entry for the whole >> preallocated space is, we do not need to touch the RST entry anymore for >> NOCOW/PREALLOCATED write at all. >> >> This makes the RST NOCOW/PREALLOC writes behavior to align with the >> non-RST code, which doesn't update any extent item, but only modify the >> file extent for PREALLOC writes. > > I see what you're getting at here, but it creates a huge amount of problems > later down the line. > > I prealloc an extent, I map that logical extent to a physical extent and then I > insert a RST entry for that mapping. Now I rip out one of my disks, and now I > have to update RST entries for extents I'm not going to rewrite because they're > prealloc. Why do we even need to do anything update the RST entries? RST is just an extra layer for logical bytenr mapping, and did you see the non-RST btrfs do relocation just because one device went missing? Can you explain more on the "have to update RST entries" part? That mismatches from my understanding of RST. > > RST is a logical->physical mapping. We do not need to update or insert anything > until we create that logical->physical mapping. Just consider the fallocate of non-RST as an example. We DO allocate real data extents, they have real location on the disk. Then add the RST layer. Now preallocated extent suddenly do not have RST mapping, but still have extents allocated for them. I do not think this is any more consistent. > Keeping the rules consistent > across the different layers will make it easier to reason about and easier to > maintain. I think all data extents should have RST mapping, that's way more consistent than two different handling for different data extents. Just like we do not bother if a data extent is preallocated or not in scrub. > Adding an index at endio time for NOCOW makes sense, we now have > created a thing on disk that we need to have a mapping for. The same goes for > prealloc, adding an entry at prealloc time doesn't make logical sense as we > haven't yet instantiated that space on disk. Thanks, But we allocated data extents. Even if we won't really utilize them for now, we should have RST for it. In fact, I do not even think it's correct to insert/update RST at endio/ordered extent time. It will be way more consistent to update/insert RST entries at data extent allocation time. Thanks, Qu > > Josef >
在 2024/9/24 08:02, Qu Wenruo 写道: > > > 在 2024/9/24 00:50, Josef Bacik 写道: >> On Mon, Sep 23, 2024 at 04:58:34PM +0930, Qu Wenruo wrote: >>> >>> >>> 在 2024/9/23 16:15, Johannes Thumshirn 写道: >>>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com> >>>> >>>> NOCOW writes do not generate stripe_extent entries in the RAID stripe >>>> tree, as the RAID stripe-tree feature initially was designed with a >>>> zoned filesystem in mind and on a zoned filesystem, we do not allow >>>> NOCOW >>>> writes. But the RAID stripe-tree feature is independent from the zoned >>>> feature, so we must also allow NOCOW writes for zoned filesystems. >>>> >>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> >>> >>> Sorry I'm going to repeat myself again, I still believe if we insert an >>> RST entry at falloc() time, it will be more consistent with the non-RST >>> code. >>> >>> Yes, I known preallocated space will not need any read nor search RST >>> entry, and we just fill the page cache with zero at read time. >>> >>> But the point of proper (not just dummy) RST entry for the whole >>> preallocated space is, we do not need to touch the RST entry anymore for >>> NOCOW/PREALLOCATED write at all. >>> >>> This makes the RST NOCOW/PREALLOC writes behavior to align with the >>> non-RST code, which doesn't update any extent item, but only modify the >>> file extent for PREALLOC writes. >> >> I see what you're getting at here, but it creates a huge amount of >> problems >> later down the line. >> >> I prealloc an extent, I map that logical extent to a physical extent >> and then I >> insert a RST entry for that mapping. Now I rip out one of my disks, >> and now I >> have to update RST entries for extents I'm not going to rewrite >> because they're >> prealloc. > > Why do we even need to do anything update the RST entries? > > RST is just an extra layer for logical bytenr mapping, and did you see > the non-RST btrfs do relocation just because one device went missing? > > Can you explain more on the "have to update RST entries" part? > That mismatches from my understanding of RST. > >> >> RST is a logical->physical mapping. We do not need to update or >> insert anything >> until we create that logical->physical mapping. > > Just consider the fallocate of non-RST as an example. > > We DO allocate real data extents, they have real location on the disk. > > Then add the RST layer. Now preallocated extent suddenly do not have RST > mapping, but still have extents allocated for them. > > I do not think this is any more consistent. > >> Keeping the rules consistent >> across the different layers will make it easier to reason about and >> easier to >> maintain. > > I think all data extents should have RST mapping, that's way more > consistent than two different handling for different data extents. > > Just like we do not bother if a data extent is preallocated or not in > scrub. > > >> Adding an index at endio time for NOCOW makes sense, we now have >> created a thing on disk that we need to have a mapping for. The same >> goes for >> prealloc, adding an entry at prealloc time doesn't make logical sense >> as we >> haven't yet instantiated that space on disk. Thanks, > > But we allocated data extents. Even if we won't really utilize them for > now, we should have RST for it. > > In fact, I do not even think it's correct to insert/update RST at > endio/ordered extent time. > > It will be way more consistent to update/insert RST entries at data > extent allocation time. My bad, this is impossible for zoned devices, and that's why RST is here to help the situation. So this indeed means we have to do the work at endio/finish_ordered_io() time. And in that case, it looks like we really have to handle preallocated extents differently (thanks to these zoned devices) Thanks, Qu > > Thanks, > Qu > >> >> Josef >> > >
On 23.09.24 09:28, Qu Wenruo wrote: > > > 在 2024/9/23 16:15, Johannes Thumshirn 写道: >> From: Johannes Thumshirn <johannes.thumshirn@wdc.com> >> >> NOCOW writes do not generate stripe_extent entries in the RAID stripe >> tree, as the RAID stripe-tree feature initially was designed with a >> zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW >> writes. But the RAID stripe-tree feature is independent from the zoned >> feature, so we must also allow NOCOW writes for zoned filesystems. >> >> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> > > Sorry I'm going to repeat myself again, I still believe if we insert an > RST entry at falloc() time, it will be more consistent with the non-RST > code. > > Yes, I known preallocated space will not need any read nor search RST > entry, and we just fill the page cache with zero at read time. > > But the point of proper (not just dummy) RST entry for the whole > preallocated space is, we do not need to touch the RST entry anymore for > NOCOW/PREALLOCATED write at all. > > This makes the RST NOCOW/PREALLOC writes behavior to align with the > non-RST code, which doesn't update any extent item, but only modify the > file extent for PREALLOC writes. Please re-read the patch. This is not a dummy RST entry but a real RST entry for NOCOW writes.
在 2024/9/23 17:10, Johannes Thumshirn 写道: > On 23.09.24 09:28, Qu Wenruo wrote: >> >> >> 在 2024/9/23 16:15, Johannes Thumshirn 写道: >>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com> >>> >>> NOCOW writes do not generate stripe_extent entries in the RAID stripe >>> tree, as the RAID stripe-tree feature initially was designed with a >>> zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW >>> writes. But the RAID stripe-tree feature is independent from the zoned >>> feature, so we must also allow NOCOW writes for zoned filesystems. >>> >>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> >> >> Sorry I'm going to repeat myself again, I still believe if we insert an >> RST entry at falloc() time, it will be more consistent with the non-RST >> code. >> >> Yes, I known preallocated space will not need any read nor search RST >> entry, and we just fill the page cache with zero at read time. >> >> But the point of proper (not just dummy) RST entry for the whole >> preallocated space is, we do not need to touch the RST entry anymore for >> NOCOW/PREALLOCATED write at all. >> >> This makes the RST NOCOW/PREALLOC writes behavior to align with the >> non-RST code, which doesn't update any extent item, but only modify the >> file extent for PREALLOC writes. > > Please re-read the patch. This is not a dummy RST entry but a real RST > entry for NOCOW writes. > I know, but my point is, if the RST entry for preallocated range is already a regular one, you won't even need to insert/update the RST tree at all. Just like we do not need to update the extent tree for NOCOW/PREALLOCATED writes. Thanks, Qu
On 23.09.24 09:56, Qu Wenruo wrote: >>> 在 2024/9/23 16:15, Johannes Thumshirn 写道: >>>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com> >>>> >>>> NOCOW writes do not generate stripe_extent entries in the RAID stripe >>>> tree, as the RAID stripe-tree feature initially was designed with a >>>> zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW >>>> writes. But the RAID stripe-tree feature is independent from the zoned >>>> feature, so we must also allow NOCOW writes for zoned filesystems. >>>> >>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> >>> >>> Sorry I'm going to repeat myself again, I still believe if we insert an >>> RST entry at falloc() time, it will be more consistent with the non-RST >>> code. >>> >>> Yes, I known preallocated space will not need any read nor search RST >>> entry, and we just fill the page cache with zero at read time. >>> >>> But the point of proper (not just dummy) RST entry for the whole >>> preallocated space is, we do not need to touch the RST entry anymore for >>> NOCOW/PREALLOCATED write at all. >>> >>> This makes the RST NOCOW/PREALLOC writes behavior to align with the >>> non-RST code, which doesn't update any extent item, but only modify the >>> file extent for PREALLOC writes. >> >> Please re-read the patch. This is not a dummy RST entry but a real RST >> entry for NOCOW writes. >> > I know, but my point is, if the RST entry for preallocated range is > already a regular one, you won't even need to insert/update the RST tree > at all. > > Just like we do not need to update the extent tree for > NOCOW/PREALLOCATED writes. But as long as there is no data on disk there is no point of having a logical to N-disk physical mapping. We haven't even called btrfs_map_block() before, so where do we know which disks will get the blocks at which address? Look at this example: Fallocate [0, 1M] virtme-scsi:/mnt # xfs_io -fc "falloc 0 1M" -c sync test virtme-scsi:/mnt # btrfs ins dump-tree -t extent /dev/sda | \ grep -A 1 EXTENT_ITEM item 13 key (298844160 EXTENT_ITEM 1048576) itemoff 15828 itemsize 53 refs 1 gen 8 flags DATA virtme-scsi:/mnt # btrfs ins dump-tree -t fs /dev/sda | \ grep -A 3 EXTENT_DATA item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53 generation 8 type 2 (prealloc) prealloc data disk byte 298844160 nr 1048576 [1] prealloc data offset 0 nr 1048576 Write at [64k, 128k] virtme-scsi:/mnt # xfs_io -fc "pwrite 64k 64k" test wrote 65536/65536 bytes at offset 65536 64 KiB, 16 ops; 0.0003 sec (183.284 MiB/sec and 46920.8211 ops/sec) virtme-scsi:/mnt # btrfs ins dump-tree -t fs /dev/sda | grep -A 3 EXTENT_DATA item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53 generation 9 type 2 (prealloc) prealloc data disk byte 298844160 nr 1048576 prealloc data offset 0 nr 65536 item 7 key (257 EXTENT_DATA 65536) itemoff 15763 itemsize 53 generation 9 type 1 (regular) extent data disk byte 298844160 nr 1048576 [2] extent data offset 65536 nr 65536 ram 1048576 -- item 8 key (257 EXTENT_DATA 131072) itemoff 15710 itemsize 53 generation 9 type 2 (prealloc) prealloc data disk byte 298844160 nr 1048576 prealloc data offset 131072 nr 917504 virtme-scsi:/mnt # btrfs ins dump-tree -t raid-stripe /dev/sda btrfs-progs v6.9 raid stripe tree key (RAID_STRIPE_TREE ROOT_ITEM 0) leaf 30638080 items 1 free space 16226 generation 9 owner RAID_STRIPE_TREE leaf 30638080 flags 0x1(WRITTEN) backref revision 1 checksum stored 9ed94b4d checksum calced 9ed94b4d fs uuid bebf1755-9379-4ac8-a623-ad0dc52641cf chunk uuid 463f7b1d-c0b8-4373-8334-52f5bf83475e item 0 key (298909696 RAID_STRIPE 65536) itemoff 16251 itemsize 32 stripe 0 devid 1 physical 298909696 [3] stripe 1 devid 2 physical 277938176 [1] we preallocate the data for [0, 1M] @ 298844160 [2] we have the actual written data for [64k, 128k] @ 298844160 What should I do to insert the RST entry there as we get: [3] the physical copies starting at 298909696 on devid 1 and 277938176 on devid 2
在 2024/9/23 17:45, Johannes Thumshirn 写道: > On 23.09.24 09:56, Qu Wenruo wrote: >>>> 在 2024/9/23 16:15, Johannes Thumshirn 写道: >>>>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com> >>>>> >>>>> NOCOW writes do not generate stripe_extent entries in the RAID stripe >>>>> tree, as the RAID stripe-tree feature initially was designed with a >>>>> zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW >>>>> writes. But the RAID stripe-tree feature is independent from the zoned >>>>> feature, so we must also allow NOCOW writes for zoned filesystems. >>>>> >>>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> >>>> >>>> Sorry I'm going to repeat myself again, I still believe if we insert an >>>> RST entry at falloc() time, it will be more consistent with the non-RST >>>> code. >>>> >>>> Yes, I known preallocated space will not need any read nor search RST >>>> entry, and we just fill the page cache with zero at read time. >>>> >>>> But the point of proper (not just dummy) RST entry for the whole >>>> preallocated space is, we do not need to touch the RST entry anymore for >>>> NOCOW/PREALLOCATED write at all. >>>> >>>> This makes the RST NOCOW/PREALLOC writes behavior to align with the >>>> non-RST code, which doesn't update any extent item, but only modify the >>>> file extent for PREALLOC writes. >>> >>> Please re-read the patch. This is not a dummy RST entry but a real RST >>> entry for NOCOW writes. >>> >> I know, but my point is, if the RST entry for preallocated range is >> already a regular one, you won't even need to insert/update the RST tree >> at all. >> >> Just like we do not need to update the extent tree for >> NOCOW/PREALLOCATED writes. > > But as long as there is no data on disk there is no point of having a > logical to N-disk physical mapping. We haven't even called > btrfs_map_block() before, so where do we know which disks will get the > blocks at which address? Look at this example: > > Fallocate [0, 1M] > virtme-scsi:/mnt # xfs_io -fc "falloc 0 1M" -c sync test [...] > > > [1] we preallocate the data for [0, 1M] @ 298844160 > [2] we have the actual written data for [64k, 128k] @ 298844160 > > What should I do to insert the RST entry there as we get: Do the needed write map and insert the RST entries, just as if we're doing a write, but without any actual IO. Currently the handling of RST is not consistent with non-RST, thus that's the reason causing problems with scrub on preallocated extents. I known preallocated range won't trigger any read thus it makes no sense to do the proper RST setup. But let's also take the example of non-RST scrub: Do we spend our time checking if a data extent is preallocated or not? No, we do not. We just read the content, and check against its csum. For preallocated extents, it just has no csum. You can argue that this is wasting IO, but I can also counter-argue that we need to make sure the read on that device range is fine, even if we know we will not really read the content (but that's only for now). Furthermore, from the logical aspect, at least to me, non-RST case is just a special case where logical address is 1:1 mapped already. This also means, even for preallocated extents, they should have RST entries. Finally, I do not think it's a good idea to insert RST entries for NOCOW. If a file is set NOCOW, it means we'll doing a lot of overwrite for it. Then why waste our time updating the RST entries again and again? Isn't such behavior going to cause more write amplification? Meanwhile for non-RST cases, NOCOW should cause the least amount of write amplification. So again, YES, I know doing proper write map and inserting RST entries for preallocated range makes no sense for READ. But preallocation and NOCOW is utilized for contents we frequently over-writes, and such RST entries save us more for those frequently over-writes. Thanks, Qu > > [3] the physical copies starting at 298909696 on devid 1 and 277938176 > on devid 2 > >
On 23.09.24 10:54, Qu Wenruo wrote: > > > 在 2024/9/23 17:45, Johannes Thumshirn 写道: >> On 23.09.24 09:56, Qu Wenruo wrote: >>>>> 在 2024/9/23 16:15, Johannes Thumshirn 写道: >>>>>> From: Johannes Thumshirn <johannes.thumshirn@wdc.com> >>>>>> >>>>>> NOCOW writes do not generate stripe_extent entries in the RAID stripe >>>>>> tree, as the RAID stripe-tree feature initially was designed with a >>>>>> zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW >>>>>> writes. But the RAID stripe-tree feature is independent from the zoned >>>>>> feature, so we must also allow NOCOW writes for zoned filesystems. >>>>>> >>>>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> >>>>> >>>>> Sorry I'm going to repeat myself again, I still believe if we insert an >>>>> RST entry at falloc() time, it will be more consistent with the non-RST >>>>> code. >>>>> >>>>> Yes, I known preallocated space will not need any read nor search RST >>>>> entry, and we just fill the page cache with zero at read time. >>>>> >>>>> But the point of proper (not just dummy) RST entry for the whole >>>>> preallocated space is, we do not need to touch the RST entry anymore for >>>>> NOCOW/PREALLOCATED write at all. >>>>> >>>>> This makes the RST NOCOW/PREALLOC writes behavior to align with the >>>>> non-RST code, which doesn't update any extent item, but only modify the >>>>> file extent for PREALLOC writes. >>>> >>>> Please re-read the patch. This is not a dummy RST entry but a real RST >>>> entry for NOCOW writes. >>>> >>> I know, but my point is, if the RST entry for preallocated range is >>> already a regular one, you won't even need to insert/update the RST tree >>> at all. >>> >>> Just like we do not need to update the extent tree for >>> NOCOW/PREALLOCATED writes. >> >> But as long as there is no data on disk there is no point of having a >> logical to N-disk physical mapping. We haven't even called >> btrfs_map_block() before, so where do we know which disks will get the >> blocks at which address? Look at this example: >> >> Fallocate [0, 1M] >> virtme-scsi:/mnt # xfs_io -fc "falloc 0 1M" -c sync test > [...] >> >> >> [1] we preallocate the data for [0, 1M] @ 298844160 >> [2] we have the actual written data for [64k, 128k] @ 298844160 >> >> What should I do to insert the RST entry there as we get: > > Do the needed write map and insert the RST entries, just as if we're > doing a write, but without any actual IO. > > > Currently the handling of RST is not consistent with non-RST, thus > that's the reason causing problems with scrub on preallocated extents. > > I known preallocated range won't trigger any read thus it makes no sense > to do the proper RST setup. > But let's also take the example of non-RST scrub: > > Do we spend our time checking if a data extent is preallocated or not? > No, we do not. We just read the content, and check against its csum. > For preallocated extents, it just has no csum. > > You can argue that this is wasting IO, but I can also counter-argue that > we need to make sure the read on that device range is fine, even if we > know we will not really read the content (but that's only for now). > > > Furthermore, from the logical aspect, at least to me, non-RST case is > just a special case where logical address is 1:1 mapped already. > > This also means, even for preallocated extents, they should have RST > entries. > > > Finally, I do not think it's a good idea to insert RST entries for NOCOW. > If a file is set NOCOW, it means we'll doing a lot of overwrite for it. > Then why waste our time updating the RST entries again and again? > > Isn't such behavior going to cause more write amplification? Meanwhile > for non-RST cases, NOCOW should cause the least amount of write > amplification. The whole idea behind the RST was to write the RST entries _after_ the data has been persisted to disk. Otherwise we're back at the write hole problem. See for example this imaginary sequence: Preallocate a range. This will then also preallocate the RST entries with the mapping as you describe. Write to it and while you write you have a powerloss. The copy/stripe to disk 1 is correctly written but disk 2 didn't report back before the power loss happened. After we have power again, a read to disk 2 comes in, as we have a RST entry, the read will be directed to the broken entry and garbage is returned. And this is the good case, as we can repair it. If it was an overwrite of a block and the same happens, we have a RST entry pointing to a good and a bad copy. Once we're adding the RST entries after both writes succeed the problem isn't there. So for preallocated extents it is even harmful to add a RST entry.
在 2024/9/24 00:11, Johannes Thumshirn 写道: > On 23.09.24 10:54, Qu Wenruo wrote: >> >> [...] >> Finally, I do not think it's a good idea to insert RST entries for NOCOW. >> If a file is set NOCOW, it means we'll doing a lot of overwrite for it. >> Then why waste our time updating the RST entries again and again? >> >> Isn't such behavior going to cause more write amplification? Meanwhile >> for non-RST cases, NOCOW should cause the least amount of write >> amplification. > > The whole idea behind the RST was to write the RST entries _after_ the > data has been persisted to disk. Otherwise we're back at the write hole > problem. See for example this imaginary sequence: > > Preallocate a range. This will then also preallocate the RST entries > with the mapping as you describe. Write to it and while you write you > have a powerloss. The copy/stripe to disk 1 is correctly written but > disk 2 didn't report back before the power loss happened. > After we have > power again, a read to disk 2 comes in, as we have a RST entry, the read > will be directed to the broken entry and garbage is returned. And this > is the good case, as we can repair it. > If it was an overwrite of a block and the same happens, we have a RST > entry pointing to a good and a bad copy. Nope, that will not happen. Because our metadata is still COW protected, after such powerloss, the file extent is still showing that range is PREALLOCATED, we won't even trigger a read. And this is exactly the same as the non-RST PREALLOCATED write. > > Once we're adding the RST entries after both writes succeed the problem > isn't there. So for preallocated extents it is even harmful to add a RST > entry. You just forgot the metadata part, which prevents the problem from happening in the very beginning. Thanks, Qu
© 2016 - 2024 Red Hat, Inc.