[PATCH v3 1/2] btrfs: remove extent buffer's redundant `len` member field

Posted by Daniel Vacek 9 months, 1 week ago
Even super block nowadays uses nodesize for eb->len. This is since commits

551561c34663 ("btrfs: don't pass nodesize to __alloc_extent_buffer()")
da17066c4047 ("btrfs: pull node/sector/stripe sizes out of root and into fs_info")
ce3e69847e3e ("btrfs: sink parameter len to alloc_extent_buffer")
a83fffb75d09 ("btrfs: sink blocksize parameter to btrfs_find_create_tree_block")

With these the eb->len is not really useful anymore. Let's use the nodesize
directly where applicable.

Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
---

v3 changes:
 - use local variable nodesize directly in btree_csum_one_bio(), spotted by Boris
 - include nodesize in the warning logged by report_eb_range(), suggested by Boris

v2 changes:
 - https://lore.kernel.org/linux-btrfs/20250502133725.1210587-2-neelx@suse.com/
 - rebased to this week's for-next
 - use plain eb->fs_info->nodesize instead of a helper function as suggested
   by Filipe and Wenruo
 - constify local nodesize variables as suggested by Wenruo

v1: https://lore.kernel.org/linux-btrfs/20250429151800.649010-1-neelx@suse.com/

---
 fs/btrfs/accessors.c             |  4 +--
 fs/btrfs/disk-io.c               | 11 +++---
 fs/btrfs/extent-tree.c           | 28 ++++++++-------
 fs/btrfs/extent_io.c             | 59 ++++++++++++++------------------
 fs/btrfs/extent_io.h             |  6 ++--
 fs/btrfs/ioctl.c                 |  2 +-
 fs/btrfs/relocation.c            |  2 +-
 fs/btrfs/subpage.c               | 12 ++++---
 fs/btrfs/tests/extent-io-tests.c | 12 +++----
 fs/btrfs/zoned.c                 |  2 +-
 10 files changed, 69 insertions(+), 69 deletions(-)

diff --git a/fs/btrfs/accessors.c b/fs/btrfs/accessors.c
index e3716516ca387..6839251b09a18 100644
--- a/fs/btrfs/accessors.c
+++ b/fs/btrfs/accessors.c
@@ -14,10 +14,10 @@ static bool check_setget_bounds(const struct extent_buffer *eb,
 {
 	const unsigned long member_offset = (unsigned long)ptr + off;
 
-	if (unlikely(member_offset + size > eb->len)) {
+	if (unlikely(member_offset + size > eb->fs_info->nodesize)) {
 		btrfs_warn(eb->fs_info,
 		"bad eb member %s: ptr 0x%lx start %llu member offset %lu size %d",
-			(member_offset > eb->len ? "start" : "end"),
+			(member_offset > eb->fs_info->nodesize ? "start" : "end"),
 			(unsigned long)ptr, eb->start, member_offset, size);
 		return false;
 	}
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5bcf11246ba66..2f819446cef1c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -190,7 +190,7 @@ static int btrfs_repair_eb_io_failure(const struct extent_buffer *eb,
 	for (int i = 0; i < num_extent_folios(eb); i++) {
 		struct folio *folio = eb->folios[i];
 		u64 start = max_t(u64, eb->start, folio_pos(folio));
-		u64 end = min_t(u64, eb->start + eb->len,
+		u64 end = min_t(u64, eb->start + fs_info->nodesize,
 				folio_pos(folio) + eb->folio_size);
 		u32 len = end - start;
 		phys_addr_t paddr = PFN_PHYS(folio_pfn(folio)) +
@@ -230,7 +230,7 @@ int btrfs_read_extent_buffer(struct extent_buffer *eb,
 			break;
 
 		num_copies = btrfs_num_copies(fs_info,
-					      eb->start, eb->len);
+					      eb->start, fs_info->nodesize);
 		if (num_copies == 1)
 			break;
 
@@ -260,6 +260,7 @@ int btree_csum_one_bio(struct btrfs_bio *bbio)
 {
 	struct extent_buffer *eb = bbio->private;
 	struct btrfs_fs_info *fs_info = eb->fs_info;
+	const u32 nodesize = fs_info->nodesize;
 	u64 found_start = btrfs_header_bytenr(eb);
 	u64 last_trans;
 	u8 result[BTRFS_CSUM_SIZE];
@@ -268,7 +269,7 @@ int btree_csum_one_bio(struct btrfs_bio *bbio)
 	/* Btree blocks are always contiguous on disk. */
 	if (WARN_ON_ONCE(bbio->file_offset != eb->start))
 		return -EIO;
-	if (WARN_ON_ONCE(bbio->bio.bi_iter.bi_size != eb->len))
+	if (WARN_ON_ONCE(bbio->bio.bi_iter.bi_size != nodesize))
 		return -EIO;
 
 	/*
@@ -277,7 +278,7 @@ int btree_csum_one_bio(struct btrfs_bio *bbio)
 	 * ordering of I/O without unnecessarily writing out data.
 	 */
 	if (test_bit(EXTENT_BUFFER_ZONED_ZEROOUT, &eb->bflags)) {
-		memzero_extent_buffer(eb, 0, eb->len);
+		memzero_extent_buffer(eb, 0, nodesize);
 		return 0;
 	}
 
@@ -881,7 +882,7 @@ struct btrfs_root *btrfs_create_tree(struct btrfs_trans_handle *trans,
 	btrfs_set_root_generation(&root->root_item, trans->transid);
 	btrfs_set_root_level(&root->root_item, 0);
 	btrfs_set_root_refs(&root->root_item, 1);
-	btrfs_set_root_used(&root->root_item, leaf->len);
+	btrfs_set_root_used(&root->root_item, fs_info->nodesize);
 	btrfs_set_root_last_snapshot(&root->root_item, 0);
 	btrfs_set_root_dirid(&root->root_item, 0);
 	if (is_fstree(objectid))
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 64e8c653ae8f3..e1cce5b011993 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2186,7 +2186,7 @@ int btrfs_set_disk_extent_flags(struct btrfs_trans_handle *trans,
 	extent_op->update_flags = true;
 	extent_op->update_key = false;
 
-	ret = btrfs_add_delayed_extent_op(trans, eb->start, eb->len,
+	ret = btrfs_add_delayed_extent_op(trans, eb->start, eb->fs_info->nodesize,
 					  btrfs_header_level(eb), extent_op);
 	if (ret)
 		btrfs_free_delayed_extent_op(extent_op);
@@ -2635,10 +2635,10 @@ int btrfs_pin_extent_for_log_replay(struct btrfs_trans_handle *trans,
 	if (ret)
 		goto out;
 
-	pin_down_extent(trans, cache, eb->start, eb->len, 0);
+	pin_down_extent(trans, cache, eb->start, eb->fs_info->nodesize, 0);
 
 	/* remove us from the free space cache (if we're there at all) */
-	ret = btrfs_remove_free_space(cache, eb->start, eb->len);
+	ret = btrfs_remove_free_space(cache, eb->start, eb->fs_info->nodesize);
 out:
 	btrfs_put_block_group(cache);
 	return ret;
@@ -3434,13 +3434,14 @@ int btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 	struct btrfs_block_group *bg;
+	const u32 nodesize = fs_info->nodesize;
 	int ret;
 
 	if (root_id != BTRFS_TREE_LOG_OBJECTID) {
 		struct btrfs_ref generic_ref = {
 			.action = BTRFS_DROP_DELAYED_REF,
 			.bytenr = buf->start,
-			.num_bytes = buf->len,
+			.num_bytes = nodesize,
 			.parent = parent,
 			.owning_root = btrfs_header_owner(buf),
 			.ref_root = root_id,
@@ -3476,7 +3477,7 @@ int btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 	bg = btrfs_lookup_block_group(fs_info, buf->start);
 
 	if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
-		pin_down_extent(trans, bg, buf->start, buf->len, 1);
+		pin_down_extent(trans, bg, buf->start, nodesize, 1);
 		btrfs_put_block_group(bg);
 		goto out;
 	}
@@ -3500,17 +3501,17 @@ int btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 
 	if (test_bit(BTRFS_FS_TREE_MOD_LOG_USERS, &fs_info->flags)
 		     || btrfs_is_zoned(fs_info)) {
-		pin_down_extent(trans, bg, buf->start, buf->len, 1);
+		pin_down_extent(trans, bg, buf->start, nodesize, 1);
 		btrfs_put_block_group(bg);
 		goto out;
 	}
 
 	WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->bflags));
 
-	btrfs_add_free_space(bg, buf->start, buf->len);
-	btrfs_free_reserved_bytes(bg, buf->len, 0);
+	btrfs_add_free_space(bg, buf->start, nodesize);
+	btrfs_free_reserved_bytes(bg, nodesize, 0);
 	btrfs_put_block_group(bg);
-	trace_btrfs_reserved_extent_free(fs_info, buf->start, buf->len);
+	trace_btrfs_reserved_extent_free(fs_info, buf->start, nodesize);
 
 out:
 	return 0;
@@ -4766,7 +4767,7 @@ int btrfs_pin_reserved_extent(struct btrfs_trans_handle *trans,
 		return -ENOSPC;
 	}
 
-	ret = pin_down_extent(trans, cache, eb->start, eb->len, 1);
+	ret = pin_down_extent(trans, cache, eb->start, eb->fs_info->nodesize, 1);
 	btrfs_put_block_group(cache);
 	return ret;
 }
@@ -5048,6 +5049,7 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct extent_buffer *buf;
 	u64 lockdep_owner = owner;
+	const u32 nodesize = fs_info->nodesize;
 
 	buf = btrfs_find_create_tree_block(fs_info, bytenr, owner, level);
 	if (IS_ERR(buf))
@@ -5105,16 +5107,16 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 		 */
 		if (buf->log_index == 0)
 			btrfs_set_extent_bit(&root->dirty_log_pages, buf->start,
-					     buf->start + buf->len - 1,
+					     buf->start + nodesize - 1,
 					     EXTENT_DIRTY, NULL);
 		else
 			btrfs_set_extent_bit(&root->dirty_log_pages, buf->start,
-					     buf->start + buf->len - 1,
+					     buf->start + nodesize - 1,
 					     EXTENT_NEW, NULL);
 	} else {
 		buf->log_index = -1;
 		btrfs_set_extent_bit(&trans->transaction->dirty_pages, buf->start,
-				     buf->start + buf->len - 1, EXTENT_DIRTY, NULL);
+				     buf->start + nodesize - 1, EXTENT_DIRTY, NULL);
 	}
 	/* this returns a buffer locked for blocking */
 	return buf;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index fa757f3da64fa..0df5534a834ad 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -76,8 +76,8 @@ void btrfs_extent_buffer_leak_debug_check(struct btrfs_fs_info *fs_info)
 		eb = list_first_entry(&fs_info->allocated_ebs,
 				      struct extent_buffer, leak_list);
 		pr_err(
-	"BTRFS: buffer leak start %llu len %u refs %d bflags %lu owner %llu\n",
-		       eb->start, eb->len, atomic_read(&eb->refs), eb->bflags,
+	"BTRFS: buffer leak start %llu refs %d bflags %lu owner %llu\n",
+		       eb->start, atomic_read(&eb->refs), eb->bflags,
 		       btrfs_header_owner(eb));
 		list_del(&eb->leak_list);
 		WARN_ON_ONCE(1);
@@ -1788,8 +1788,8 @@ static noinline_for_stack bool lock_extent_buffer_for_io(struct extent_buffer *e
 
 		btrfs_set_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN);
 		percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
-					 -eb->len,
-					 fs_info->dirty_metadata_batch);
+					 -fs_info->nodesize,
+					  fs_info->dirty_metadata_batch);
 		ret = true;
 	} else {
 		spin_unlock(&eb->refs_lock);
@@ -1986,7 +1986,7 @@ static unsigned int buffer_tree_get_ebs_tag(struct btrfs_fs_info *fs_info,
 	rcu_read_lock();
 	while ((eb = find_get_eb(&xas, end, tag)) != NULL) {
 		if (!eb_batch_add(batch, eb)) {
-			*start = (eb->start + eb->len) >> fs_info->sectorsize_bits;
+			*start = (eb->start + fs_info->nodesize) >> fs_info->sectorsize_bits;
 			goto out;
 		}
 	}
@@ -2050,7 +2050,7 @@ static void prepare_eb_write(struct extent_buffer *eb)
 	nritems = btrfs_header_nritems(eb);
 	if (btrfs_header_level(eb) > 0) {
 		end = btrfs_node_key_ptr_offset(eb, nritems);
-		memzero_extent_buffer(eb, end, eb->len - end);
+		memzero_extent_buffer(eb, end, eb->fs_info->nodesize - end);
 	} else {
 		/*
 		 * Leaf:
@@ -2086,7 +2086,7 @@ static noinline_for_stack void write_one_eb(struct extent_buffer *eb,
 		struct folio *folio = eb->folios[i];
 		u64 range_start = max_t(u64, eb->start, folio_pos(folio));
 		u32 range_len = min_t(u64, folio_pos(folio) + folio_size(folio),
-				      eb->start + eb->len) - range_start;
+				      eb->start + fs_info->nodesize) - range_start;
 
 		folio_lock(folio);
 		btrfs_meta_folio_clear_dirty(folio, eb);
@@ -2200,7 +2200,7 @@ int btree_write_cache_pages(struct address_space *mapping,
 			if (ctx.zoned_bg) {
 				/* Mark the last eb in the block group. */
 				btrfs_schedule_zone_finish_bg(ctx.zoned_bg, eb);
-				ctx.zoned_bg->meta_write_pointer += eb->len;
+				ctx.zoned_bg->meta_write_pointer += fs_info->nodesize;
 			}
 			write_one_eb(eb, wbc);
 		}
@@ -2836,7 +2836,6 @@ static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info
 
 	eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS|__GFP_NOFAIL);
 	eb->start = start;
-	eb->len = fs_info->nodesize;
 	eb->fs_info = fs_info;
 	init_rwsem(&eb->lock);
 
@@ -2845,8 +2844,6 @@ static struct extent_buffer *__alloc_extent_buffer(struct btrfs_fs_info *fs_info
 	spin_lock_init(&eb->refs_lock);
 	atomic_set(&eb->refs, 1);
 
-	ASSERT(eb->len <= BTRFS_MAX_METADATA_BLOCKSIZE);
-
 	return eb;
 }
 
@@ -3558,7 +3555,7 @@ void btrfs_clear_buffer_dirty(struct btrfs_trans_handle *trans,
 		return;
 
 	buffer_tree_clear_mark(eb, PAGECACHE_TAG_DIRTY);
-	percpu_counter_add_batch(&fs_info->dirty_metadata_bytes, -eb->len,
+	percpu_counter_add_batch(&fs_info->dirty_metadata_bytes, -fs_info->nodesize,
 				 fs_info->dirty_metadata_batch);
 
 	for (int i = 0; i < num_extent_folios(eb); i++) {
@@ -3589,7 +3586,8 @@ void set_extent_buffer_dirty(struct extent_buffer *eb)
 	WARN_ON(test_bit(EXTENT_BUFFER_ZONED_ZEROOUT, &eb->bflags));
 
 	if (!was_dirty) {
-		bool subpage = btrfs_meta_is_subpage(eb->fs_info);
+		struct btrfs_fs_info *fs_info = eb->fs_info;
+		bool subpage = btrfs_meta_is_subpage(fs_info);
 
 		/*
 		 * For subpage case, we can have other extent buffers in the
@@ -3609,9 +3607,9 @@ void set_extent_buffer_dirty(struct extent_buffer *eb)
 		buffer_tree_set_mark(eb, PAGECACHE_TAG_DIRTY);
 		if (subpage)
 			folio_unlock(eb->folios[0]);
-		percpu_counter_add_batch(&eb->fs_info->dirty_metadata_bytes,
-					 eb->len,
-					 eb->fs_info->dirty_metadata_batch);
+		percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
+					  fs_info->nodesize,
+					  fs_info->dirty_metadata_batch);
 	}
 #ifdef CONFIG_BTRFS_DEBUG
 	for (int i = 0; i < num_extent_folios(eb); i++)
@@ -3723,7 +3721,7 @@ int read_extent_buffer_pages_nowait(struct extent_buffer *eb, int mirror_num,
 		struct folio *folio = eb->folios[i];
 		u64 range_start = max_t(u64, eb->start, folio_pos(folio));
 		u32 range_len = min_t(u64, folio_pos(folio) + folio_size(folio),
-				      eb->start + eb->len) - range_start;
+				      eb->start + eb->fs_info->nodesize) - range_start;
 
 		bio_add_folio_nofail(&bbio->bio, folio, range_len,
 				     offset_in_folio(folio, range_start));
@@ -3751,8 +3749,8 @@ static bool report_eb_range(const struct extent_buffer *eb, unsigned long start,
 			    unsigned long len)
 {
 	btrfs_warn(eb->fs_info,
-		"access to eb bytenr %llu len %u out of range start %lu len %lu",
-		eb->start, eb->len, start, len);
+		"access to eb bytenr %llu nodesize %u out of range start %lu len %lu",
+		eb->start, eb->fs_info->nodesize, start, len);
 	DEBUG_WARN();
 
 	return true;
@@ -3770,8 +3768,8 @@ static inline int check_eb_range(const struct extent_buffer *eb,
 {
 	unsigned long offset;
 
-	/* start, start + len should not go beyond eb->len nor overflow */
-	if (unlikely(check_add_overflow(start, len, &offset) || offset > eb->len))
+	/* start, start + len should not go beyond nodesize nor overflow */
+	if (unlikely(check_add_overflow(start, len, &offset) || offset > eb->fs_info->nodesize))
 		return report_eb_range(eb, start, len);
 
 	return false;
@@ -3827,8 +3825,8 @@ int read_extent_buffer_to_user_nofault(const struct extent_buffer *eb,
 	unsigned long i = get_eb_folio_index(eb, start);
 	int ret = 0;
 
-	WARN_ON(start > eb->len);
-	WARN_ON(start + len > eb->start + eb->len);
+	WARN_ON(start > eb->fs_info->nodesize);
+	WARN_ON(start + len > eb->start + eb->fs_info->nodesize);
 
 	if (eb->addr) {
 		if (copy_to_user_nofault(dstv, eb->addr + start, len))
@@ -3919,8 +3917,8 @@ static void assert_eb_folio_uptodate(const struct extent_buffer *eb, int i)
 		folio = eb->folios[0];
 		ASSERT(i == 0);
 		if (WARN_ON(!btrfs_subpage_test_uptodate(fs_info, folio,
-							 eb->start, eb->len)))
-			btrfs_subpage_dump_bitmap(fs_info, folio, eb->start, eb->len);
+							 eb->start, fs_info->nodesize)))
+			btrfs_subpage_dump_bitmap(fs_info, folio, eb->start, fs_info->nodesize);
 	} else {
 		WARN_ON(!folio_test_uptodate(folio));
 	}
@@ -4013,12 +4011,10 @@ void copy_extent_buffer_full(const struct extent_buffer *dst,
 	const int unit_size = src->folio_size;
 	unsigned long cur = 0;
 
-	ASSERT(dst->len == src->len);
-
-	while (cur < src->len) {
+	while (cur < src->fs_info->nodesize) {
 		unsigned long index = get_eb_folio_index(src, cur);
 		unsigned long offset = get_eb_offset_in_folio(src, cur);
-		unsigned long cur_len = min(src->len, unit_size - offset);
+		unsigned long cur_len = min(src->fs_info->nodesize, unit_size - offset);
 		void *addr = folio_address(src->folios[index]) + offset;
 
 		write_extent_buffer(dst, addr, cur, cur_len);
@@ -4033,7 +4029,6 @@ void copy_extent_buffer(const struct extent_buffer *dst,
 			unsigned long len)
 {
 	const int unit_size = dst->folio_size;
-	u64 dst_len = dst->len;
 	size_t cur;
 	size_t offset;
 	char *kaddr;
@@ -4043,8 +4038,6 @@ void copy_extent_buffer(const struct extent_buffer *dst,
 	    check_eb_range(src, src_offset, len))
 		return;
 
-	WARN_ON(src->len != dst_len);
-
 	offset = get_eb_offset_in_folio(dst, dst_offset);
 
 	while (len > 0) {
@@ -4315,7 +4308,7 @@ static int try_release_subpage_extent_buffer(struct folio *folio)
 			xa_unlock_irq(&fs_info->buffer_tree);
 			break;
 		}
-		cur = eb->start + eb->len;
+		cur = eb->start + fs_info->nodesize;
 
 		/*
 		 * The same as try_release_extent_buffer(), to ensure the eb
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index e36e8d6a00bc5..7a8451c11630a 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -16,6 +16,7 @@
 #include "messages.h"
 #include "ulist.h"
 #include "misc.h"
+#include "fs.h"
 
 struct page;
 struct file;
@@ -86,7 +87,6 @@ void __cold extent_buffer_free_cachep(void);
 #define INLINE_EXTENT_BUFFER_PAGES     (BTRFS_MAX_METADATA_BLOCKSIZE / PAGE_SIZE)
 struct extent_buffer {
 	u64 start;
-	u32 len;
 	u32 folio_size;
 	unsigned long bflags;
 	struct btrfs_fs_info *fs_info;
@@ -274,12 +274,12 @@ static inline int __pure num_extent_pages(const struct extent_buffer *eb)
 {
 	/*
 	 * For sectorsize == PAGE_SIZE case, since nodesize is always aligned to
-	 * sectorsize, it's just eb->len >> PAGE_SHIFT.
+	 * sectorsize, it's just nodesize >> PAGE_SHIFT.
 	 *
 	 * For sectorsize < PAGE_SIZE case, we could have nodesize < PAGE_SIZE,
 	 * thus have to ensure we get at least one page.
 	 */
-	return (eb->len >> PAGE_SHIFT) ?: 1;
+	return (eb->fs_info->nodesize >> PAGE_SHIFT) ?: 1;
 }
 
 /*
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a498fe524c907..d06008ff63de9 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -598,7 +598,7 @@ static noinline int create_subvol(struct mnt_idmap *idmap,
 	btrfs_set_root_generation(root_item, trans->transid);
 	btrfs_set_root_level(root_item, 0);
 	btrfs_set_root_refs(root_item, 1);
-	btrfs_set_root_used(root_item, leaf->len);
+	btrfs_set_root_used(root_item, fs_info->nodesize);
 	btrfs_set_root_last_snapshot(root_item, 0);
 
 	btrfs_set_root_generation_v2(root_item,
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 02086191630d0..fb194790406ff 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4353,7 +4353,7 @@ int btrfs_reloc_cow_block(struct btrfs_trans_handle *trans,
 			mark_block_processed(rc, node);
 
 		if (first_cow && level > 0)
-			rc->nodes_relocated += buf->len;
+			rc->nodes_relocated += fs_info->nodesize;
 	}
 
 	if (level == 0 && first_cow && rc->stage == UPDATE_DATA_PTRS)
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index d4f0192334936..64c212f76ff12 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -631,7 +631,8 @@ void btrfs_meta_folio_set_##name(struct folio *folio, const struct extent_buffer
 		folio_set_func(folio);					\
 		return;							\
 	}								\
-	btrfs_subpage_set_##name(eb->fs_info, folio, eb->start, eb->len); \
+	btrfs_subpage_set_##name(eb->fs_info, folio, eb->start,		\
+				 eb->fs_info->nodesize);		\
 }									\
 void btrfs_meta_folio_clear_##name(struct folio *folio, const struct extent_buffer *eb) \
 {									\
@@ -639,13 +640,15 @@ void btrfs_meta_folio_clear_##name(struct folio *folio, const struct extent_buff
 		folio_clear_func(folio);				\
 		return;							\
 	}								\
-	btrfs_subpage_clear_##name(eb->fs_info, folio, eb->start, eb->len); \
+	btrfs_subpage_clear_##name(eb->fs_info, folio, eb->start,	\
+				   eb->fs_info->nodesize);		\
 }									\
 bool btrfs_meta_folio_test_##name(struct folio *folio, const struct extent_buffer *eb) \
 {									\
 	if (!btrfs_meta_is_subpage(eb->fs_info))			\
 		return folio_test_func(folio);				\
-	return btrfs_subpage_test_##name(eb->fs_info, folio, eb->start, eb->len); \
+	return btrfs_subpage_test_##name(eb->fs_info, folio, eb->start,	\
+					 eb->fs_info->nodesize);	\
 }
 IMPLEMENT_BTRFS_PAGE_OPS(uptodate, folio_mark_uptodate, folio_clear_uptodate,
 			 folio_test_uptodate);
@@ -765,7 +768,8 @@ bool btrfs_meta_folio_clear_and_test_dirty(struct folio *folio, const struct ext
 		return true;
 	}
 
-	last = btrfs_subpage_clear_and_test_dirty(eb->fs_info, folio, eb->start, eb->len);
+	last = btrfs_subpage_clear_and_test_dirty(eb->fs_info, folio, eb->start,
+						  eb->fs_info->nodesize);
 	if (last) {
 		folio_clear_dirty_for_io(folio);
 		return true;
diff --git a/fs/btrfs/tests/extent-io-tests.c b/fs/btrfs/tests/extent-io-tests.c
index 00da54f0164c9..697d558808103 100644
--- a/fs/btrfs/tests/extent-io-tests.c
+++ b/fs/btrfs/tests/extent-io-tests.c
@@ -342,7 +342,7 @@ static int check_eb_bitmap(unsigned long *bitmap, struct extent_buffer *eb)
 {
 	unsigned long i;
 
-	for (i = 0; i < eb->len * BITS_PER_BYTE; i++) {
+	for (i = 0; i < eb->fs_info->nodesize * BITS_PER_BYTE; i++) {
 		int bit, bit1;
 
 		bit = !!test_bit(i, bitmap);
@@ -411,7 +411,7 @@ static int test_bitmap_clear(const char *name, unsigned long *bitmap,
 static int __test_eb_bitmaps(unsigned long *bitmap, struct extent_buffer *eb)
 {
 	unsigned long i, j;
-	unsigned long byte_len = eb->len;
+	unsigned long byte_len = eb->fs_info->nodesize;
 	u32 x;
 	int ret;
 
@@ -670,7 +670,7 @@ static int test_find_first_clear_extent_bit(void)
 static void dump_eb_and_memory_contents(struct extent_buffer *eb, void *memory,
 					const char *test_name)
 {
-	for (int i = 0; i < eb->len; i++) {
+	for (int i = 0; i < eb->fs_info->nodesize; i++) {
 		struct page *page = folio_page(eb->folios[i >> PAGE_SHIFT], 0);
 		void *addr = page_address(page) + offset_in_page(i);
 
@@ -686,7 +686,7 @@ static void dump_eb_and_memory_contents(struct extent_buffer *eb, void *memory,
 static int verify_eb_and_memory(struct extent_buffer *eb, void *memory,
 				const char *test_name)
 {
-	for (int i = 0; i < (eb->len >> PAGE_SHIFT); i++) {
+	for (int i = 0; i < (eb->fs_info->nodesize >> PAGE_SHIFT); i++) {
 		void *eb_addr = folio_address(eb->folios[i]);
 
 		if (memcmp(memory + (i << PAGE_SHIFT), eb_addr, PAGE_SIZE) != 0) {
@@ -703,8 +703,8 @@ static int verify_eb_and_memory(struct extent_buffer *eb, void *memory,
  */
 static void init_eb_and_memory(struct extent_buffer *eb, void *memory)
 {
-	get_random_bytes(memory, eb->len);
-	write_extent_buffer(eb, memory, 0, eb->len);
+	get_random_bytes(memory, eb->fs_info->nodesize);
+	write_extent_buffer(eb, memory, 0, eb->fs_info->nodesize);
 }
 
 static int test_eb_mem_ops(u32 sectorsize, u32 nodesize)
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 271c958909cd8..0b87f2e2ee75e 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -2416,7 +2416,7 @@ void btrfs_schedule_zone_finish_bg(struct btrfs_block_group *bg,
 				   struct extent_buffer *eb)
 {
 	if (!test_bit(BLOCK_GROUP_FLAG_SEQUENTIAL_ZONE, &bg->runtime_flags) ||
-	    eb->start + eb->len * 2 <= bg->start + bg->zone_capacity)
+	    eb->start + eb->fs_info->nodesize * 2 <= bg->start + bg->zone_capacity)
 		return;
 
 	if (WARN_ON(bg->zone_finish_work.func == btrfs_zone_finish_endio_workfn)) {
-- 
2.47.2
Re: [PATCH v3 1/2] btrfs: remove extent buffer's redundant `len` member field
Posted by David Sterba 9 months, 1 week ago
On Mon, May 05, 2025 at 01:50:54PM +0200, Daniel Vacek wrote:
> Even super block nowadays uses nodesize for eb->len. This is since commits
> 
> 551561c34663 ("btrfs: don't pass nodesize to __alloc_extent_buffer()")
> da17066c4047 ("btrfs: pull node/sector/stripe sizes out of root and into fs_info")
> ce3e69847e3e ("btrfs: sink parameter len to alloc_extent_buffer")
> a83fffb75d09 ("btrfs: sink blocksize parameter to btrfs_find_create_tree_block")
> 
> With these the eb->len is not really useful anymore. Let's use the nodesize
> directly where applicable.

You haven't updated the changelog despite the discussion we had about the
potential drawbacks, so this should be here. But I'm not convinced this
is a good change. The eb size does not change so no better packing in
the slab and the caching of length in the same cacheline is lost.

In the assembly it's clear where the pointer dereference is added, using
an example from btrfs_get_token_8():

  mov    0x8(%rbp),%r9d

vs

  mov    0x18(%rbp),%r10
  mov    0xd38(%r10),%r9d

And there's a new register dependency. The eb operations are most of the
time performed on memory structures and not just in connection with IO,
I think this could be measurable on a fast device with metadata heavy
workload. So I'd rather see some numbers showing this will not decrease
performance.
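
At the C level that is roughly the difference between the following two
helpers (a minimal sketch with made-up names, not the actual accessors):

  /* Before: the length is a member of the eb itself, a single load. */
  static inline u32 eb_len_member(const struct extent_buffer *eb)
  {
  	return eb->len;
  }

  /* After: go through eb->fs_info first, one extra dependent load. */
  static inline u32 eb_len_nodesize(const struct extent_buffer *eb)
  {
  	return eb->fs_info->nodesize;
  }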
Re: [PATCH v3 1/2] btrfs: remove extent buffer's redundant `len` member field
Posted by Daniel Vacek 9 months, 1 week ago
On Mon, 5 May 2025 at 17:18, David Sterba <dsterba@suse.cz> wrote:
>
> On Mon, May 05, 2025 at 01:50:54PM +0200, Daniel Vacek wrote:
> > Even super block nowadays uses nodesize for eb->len. This is since commits
> >
> > 551561c34663 ("btrfs: don't pass nodesize to __alloc_extent_buffer()")
> > da17066c4047 ("btrfs: pull node/sector/stripe sizes out of root and into fs_info")
> > ce3e69847e3e ("btrfs: sink parameter len to alloc_extent_buffer")
> > a83fffb75d09 ("btrfs: sink blocksize parameter to btrfs_find_create_tree_block")
> >
> > With these the eb->len is not really useful anymore. Let's use the nodesize
> > directly where applicable.
>
> You haven't updated the changelog despite the discussion we had about the
> potential drawbacks, so this should be here. But I'm not convinced this

Right. That's because I was not sure we came to any conclusion yet. I
thought this discussion was still ongoing. I understand that so far
this is still all just a theory and any premature conclusions may be
misleading.

But yeah, I may write some kind of a warning or a disclaimer
mentioning the suspicion. Just so that it gets documented and it is
clear it was considered, even though maybe without a clear conclusion.

> is a good change. The eb size does not change so no better packing in
> the slab and the caching of length in the same cacheline is lost.

So to be perfectly clear, what sharing do you mean? Is it the
eb->start and eb->len you talking about? In other words, when getting
`start` you also get `len` for free?

Since the structure is 8 bytes aligned, they may eventually still end
up in two cachelines. Luckily the size of the structure is 0 mod 16 so
just out of luck this never happens and they are always in the
same cache line. But this may break with future changes, so it is not
rock solid the way it is now.
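
Just to illustrate that with a throwaway user-space sketch (assuming
`start` at offset 0 and `len` at offset 8 as in the current layout, 64B
cachelines and objects packed back to back from an aligned slab start):

  #include <stdio.h>

  int main(void)
  {
  	/* Walk all 8-byte aligned object bases within a cacheline. */
  	for (unsigned int base = 0; base < 64; base += 8) {
  		unsigned int start_line = base / 64;      /* u64 start, offset 0 */
  		unsigned int len_line = (base + 8) / 64;  /* u32 len, offset 8 */

  		printf("base %% 64 = %2u: %s\n", base,
  		       start_line == len_line ? "same cacheline" : "SPLIT");
  	}
  	/*
  	 * Only base % 64 == 56 splits. With the object size being a
  	 * multiple of 16 the bases only hit 0/16/32/48, so the split
  	 * never happens -- but a future size change could change that.
  	 */
  	return 0;
  }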

> In the assembly it's clear where the pointer dereference is added, using
> an example from btrfs_get_token_8():
>
>   mov    0x8(%rbp),%r9d
>
> vs
>
>   mov    0x18(%rbp),%r10
>   mov    0xd38(%r10),%r9d

I understand that. Again, this is what I originally considered. Not
all the functions end up like this but there are definitely some. And
as a rule of thumb it's roughly half of them, give or take. That
sounds like a good reason to be concerned.

My reasoning was that the fs_info->nodesize is accessed by many so the
chances are it will be hot in cache. But you're right that this may
not always be the case. It depends. The question remains if that makes
a difference?

Another (IMO valid) point is that I believe the code will dereference
many other pointers before getting here so this may seem like a drop
in the sea. It's not like this was a tight loop scattering over
hundreds of random memory addresses.
For example when this code is reached from a syscall, the syscall
itself will have significantly higher overhead than one additional
dereference. And I think the same applies when reached from an
interrupt.
Otherwise this would be visible on perf profile (which none of us
performed yet).

Still, I'd say reading the assembly is a good indication of possible
issues to be concerned with. But it's not always the best one. An
observed performance regression would be.
I'll be happy to conduct any suggested benchmarks. Though as you
mentioned this may also be picky about the HW used. So even though we
don't see any regressions with our testing, one day someone may find
some if we merged this change. In which case we can always revert.

I have to say I'm pretty happy with the positive feedback from the
other reviewers. So far you're the only one raising this concern.

So how do we conclude this?

If you don't want this cleanup I'd opt at least for rename eb->len
==>> eb->nodesize.

> And there's a new register dependency. The eb operations are most of the
> time performed on memory structures and not just in connection with IO,
> I think this could be measurable on a fast device with metadata heavy
> workload. So I'd rather see some numbers showing this will not decrease
> performance.
Re: [PATCH v3 1/2] btrfs: remove extent buffer's redundant `len` member field
Posted by David Sterba 9 months ago
On Mon, May 05, 2025 at 07:53:16PM +0200, Daniel Vacek wrote:
> On Mon, 5 May 2025 at 17:18, David Sterba <dsterba@suse.cz> wrote:
> >
> > On Mon, May 05, 2025 at 01:50:54PM +0200, Daniel Vacek wrote:
> > > Even super block nowadays uses nodesize for eb->len. This is since commits
> > >
> > > 551561c34663 ("btrfs: don't pass nodesize to __alloc_extent_buffer()")
> > > da17066c4047 ("btrfs: pull node/sector/stripe sizes out of root and into fs_info")
> > > ce3e69847e3e ("btrfs: sink parameter len to alloc_extent_buffer")
> > > a83fffb75d09 ("btrfs: sink blocksize parameter to btrfs_find_create_tree_block")
> > >
> > > With these the eb->len is not really useful anymore. Let's use the nodesize
> > > directly where applicable.
> >
> > You haven't updated the changelog despite the discussion we had about the
> > potential drawbacks, so this should be here. But I'm not convinced this
> 
> Right. That's because I was not sure we came to any conclusion yet. I
> thought this discussion was still ongoing. I understand that so far
> this is still all just a theory and any premature conclusions may be
> misleading.
> 
> But yeah, I may write some kind of a warning or a disclaimer
> mentioning the suspicion. Just so that it gets documented and it is
> clear it was considered, even though maybe without a clear conclusion.
> 
> > is a good change. The eb size does not change so no better packing in
> > the slab and the caching of length in the same cacheline is lost.
> 
> So to be perfectly clear, what sharing do you mean? Is it the
> eb->start and eb->len you talking about? In other words, when getting
> `start` you also get `len` for free?

Yes, basically.

> Since the structure is 8 bytes aligned, they may eventually still end
> up in two cachelines. Luckily the size of the structure is 0 mod 16 so
> just out of luck this never happens and they are always in the
> same cache line. But this may break with future changes, so it is not
> rock solid the way it is now.

Yes, with structures not perfectly aligned to a cacheline (64B) there
will be times when the access needs fetching two cachelines. With locks
it can result in various cacheline bouncing patterns when various
allocated structures are aligned to 8 bytes, giving 8 possible groups of
"misaligned" structure instances.

There's also a big assumption that the CPU cache prefetcher is clever
enough to average out many bad patterns. What I try to do on the source
code level is to be able to reason about the high level access patterns,
like "used at the same time -> close together in the structure".

> > In the assembly it's clear where the pointer dereference is added, using
> > an example from btrfs_get_token_8():
> >
> >   mov    0x8(%rbp),%r9d
> >
> > vs
> >
> >   mov    0x18(%rbp),%r10
> >   mov    0xd38(%r10),%r9d
> 
> I understand that. Again, this is what I originally considered. Not
> all the functions end up like this but there are definitely some. And
> as a rule of thumb it's roughly half of them, give or take. That
> sounds like a good reason to be concerned.
> 
> My reasoning was that the fs_info->nodesize is accessed by many so the
> chances are it will be hot in cache. But you're right that this may
> not always be the case. It depends. The question remains if that makes
> a difference?
> 
> Another (IMO valid) point is that I believe the code will dereference
> many other pointers before getting here so this may seem like a drop
> in the sea. It's not like this was a tight loop scattering over
> hundreds of random memory addresses.
> For example when this code is reached from a syscall, the syscall
> itself will have significantly higher overhead than one additional
> dereference. And I think the same applies when reached from an
> interrupt.
> Otherwise this would be visible on perf profile (which none of us
> performed yet).

Yeah and I don't intend to do so other than to verify your measurements
and claims that this patch does not make things worse at least. The
burden is on the patch submitter.

> Still, I'd say reading the assembly is a good indication of possible
> issues to be concerned with. But it's not always the best one. An
> observed performance regression would be.
> I'll be happy to conduct any suggested benchmarks. Though as you
> mentioned this may also be picky about the HW used.

I'd say any contemporary hardware is good, I don't think the internal
CPU algorithms of prefetching or branch predictions change that often. A
difference that I would consider significant is 5% in either direction.

> So even though we
> don't see any regressions with our testing, one day someone may find
> some if we merged this change. In which case we can always revert.
> 
> I have to say I'm pretty happy with the positive feedback from the
> other reviewers. So far you're the only one raising this concern.

I don't dispute the feedback from others, the patch is not wrong in
itself, it removes a redundant member. However I don't see it as a
simple change because I also spent some time looking into that
particular structure, shrinking size, member ordering and access
patterns, so I am questioning the runtime performance implications.

My local eb->len removal patch is from 2020 and I decided not to send it
because I was not able to prove to myself it'd be for the better. This
is what I base my feedback on.

> So how do we conclude this?

I don't agree to add this patch due to the following main points:

- no struct size change, ie. same number of objects in the slab
- disputed cacheline behaviour, numbers showing improvement or not
  making it worse needed

> If you don't want this cleanup I'd opt at least for rename eb->len
> ==>> eb->nodesize.

This sounds like an unnecessary churn. 'length' related to a structure
is natural, nodesize is a property of the whole filesystem, we're used
to eb->len, I don't see any benefit in increasing the cognitive load.
Re: [PATCH v3 1/2] btrfs: remove extent buffer's redundant `len` member field
Posted by Daniel Vacek 9 months ago
On Tue, 13 May 2025 at 02:32, David Sterba <dsterba@suse.cz> wrote:
>
> On Mon, May 05, 2025 at 07:53:16PM +0200, Daniel Vacek wrote:
> > On Mon, 5 May 2025 at 17:18, David Sterba <dsterba@suse.cz> wrote:
> > >
> > > On Mon, May 05, 2025 at 01:50:54PM +0200, Daniel Vacek wrote:
> > > > Even super block nowadays uses nodesize for eb->len. This is since commits
> > > >
> > > > 551561c34663 ("btrfs: don't pass nodesize to __alloc_extent_buffer()")
> > > > da17066c4047 ("btrfs: pull node/sector/stripe sizes out of root and into fs_info")
> > > > ce3e69847e3e ("btrfs: sink parameter len to alloc_extent_buffer")
> > > > a83fffb75d09 ("btrfs: sink blocksize parameter to btrfs_find_create_tree_block")
> > > >
> > > > With these the eb->len is not really useful anymore. Let's use the nodesize
> > > > directly where applicable.
> > >
> > > You haven't updated the changelog despite the discussion we had about the
> > > potential drawbacks, so this should be here. But I'm not convinced this
> >
> > Right. That's because I was not sure we came to any conclusion yet. I
> > thought this discussion was still ongoing. I understand that so far
> > this is still all just a theory and any premature conclusions may be
> > misleading.
> >
> > But yeah, I may write some kind of a warning or a disclaimer
> > mentioning the suspicion. Just so that it gets documented and it is
> > clear it was considered, even though maybe without a clear conclusion.
> >
> > > is a good change. The eb size does not change so no better packing in
> > > the slab and the caching of length in the same cacheline is lost.
> >
> > So to be perfectly clear, what sharing do you mean? Is it the
> > eb->start and eb->len you talking about? In other words, when getting
> > `start` you also get `len` for free?
>
> Yes, basically.
>
> > Since the structure is 8 bytes aligned, they may eventually still end
> > up in two cachelines. Luckily the size of the structure is 0 mod 16 so
> > just out of luck this never happens and they are always in the
> > same cache line. But this may break with future changes, so it is not
> > rock solid the way it is now.
>
> Yes, with structures not perfectly aligned to a cacheline (64B) there
> will be times when the access needs fetching two cachelines. With locks
> it can result in various cacheline bouncing patterns when various
> allocated structures are aligned to 8 bytes, giving 8 possible groups of
> "misaligned" structure instances.
>
> There's also a big assumption that the CPU cache prefetcher is clever
> enough to average out many bad patterns. What I try to do on the source
> code level is to be able to reason about the high level access patterns,
> like "used at the same time -> close together in the structure".

That makes sense. The closer they are, the higher the chance they end
up in the same cacheline when the object itself has no particular alignment.

> > > In the assembly it's clear where the pointer dereference is added, using
> > > an example from btrfs_get_token_8():
> > >
> > >   mov    0x8(%rbp),%r9d
> > >
> > > vs
> > >
> > >   mov    0x18(%rbp),%r10
> > >   mov    0xd38(%r10),%r9d
> >
> > I understand that. Again, this is what I originally considered. Not
> > all the functions end up like this but there are definitely some. And
> > as a rule of thumb it's roughly half of them, give or take. That
> > sounds like a good reason to be concerned.
> >
> > My reasoning was that the fs_info->nodesize is accessed by many so the
> > chances are it will be hot in cache. But you're right that this may
> > not always be the case. It depends. The question remains if that makes
> > a difference?
> >
> > Another (IMO valid) point is that I believe the code will dereference
> > many other pointers before getting here so this may seem like a drop
> > in the sea. It's not like this was a tight loop scattering over
> > hundreds of random memory addresses.
> > For example when this code is reached from a syscall, the syscall
> > itself will have significantly higher overhead than one additional
> > dereference. And I think the same applies when reached from an
> > interrupt.
> > Otherwise this would be visible on perf profile (which none of us
> > performed yet).
>
> Yeah and I don't intend to do so other than to verify your measurements
> and claims that this patch does not make things worse at least. The
> burden is on the patch submitter.

Totally. I'll run some profiles to see if anything significant is
visible. Not sure what workload to use though? I guess fio would be a
good starting point?

Though we agreed that this also depends on the actual HW used.
Whatever numbers I present, if my CPU is fast enough and my storage is
slow enough, I'll hardly ever find any issue. I think this needs
broader testing than just my own claim that this is OK for me
(provided my results would suggest so).

So how about pushing this to misc-next to get some broader testing and
see if there are any issues in the bigger scope?

> > Still, I'd say reading the assembly is a good indication of possible
> > issues to be concerned with. But it's not always the best one. An
> > observed performance regression would be.
> > I'll be happy to conduct any suggested benchmarks. Though as you
> > mentioned this may also be picky about the HW used.
>
> I'd say any contemporary hardware is good, I don't think the internal
> CPU algorithms of prefetching or branch predictions change that often. A
> difference that I would consider significant is 5% in either direction.

I meant the storage part rather than the CPU. Though the CPUs also
differ in cache sizes between consumer and server parts.

> > So even though we
> > don't see any regressions with our testing, one day someone may find
> > some if we merged this change. In which case we can always revert.
> >
> > I have to say I'm pretty happy with the positive feedback from the
> > other reviewers. So far you're the only one raising this concern.
>
> I don't dispute the feedback from others, the patch is not wrong in
> itself, it removes a redundant member. However I don't see it as a
> simple change because I also spent some time looking into that
> particular structure, shrinking size, member ordering and access
> patterns, so I am questioning the runtime performance implications.

With that experience, would you suggest any other tests than fio where
a performance change could be visible? Something I can focus on?

> My local eb->len removal patch is from 2020 and I decided not to send it
> because I was not able to prove to myself it'd be for the better. This
> is what I base my feedback on.

I understand. Let's see what we can do about it.

> > So how do we conclude this?
>
> I don't agree to add this patch due to the following main points:
>
> - no struct size change, ie. same number of objects in the slab

Right. That makes sense as the original idea was about code cleanup
and not about reducing the struct size in terms of bytes.
I did not think a cleanup needs to be justified by reducing the struct
size. Even without that I believe it's still a good cleanup.
The struct is still reduced by one member (with a new hole now).
Moreover with the followup patch the size is actually reduced at least
for -rt configs by filling the new hole so that the later hole is not
created at all. This wins 8 bytes for -rt. Not much and maybe not
worth it. Or is it?

That said, this holds as long as the removed member was not meant to cache the
fs-wide value for performance reasons. I did not see any documentation
or code comments suggesting that's the case (even you're not able to
tell for sure) and so I thought it may be fine to remove this eb->len
field.
Which brings us to the point below.

> - disputed cacheline behaviour, numbers showing improvement or not
>   making it worse needed

I'll look into this. As I said, any hints are welcome; otherwise I'll
start with fio and I'll compare the perf profiles of some functions
which access the data. Let me know if I can do anything better than
that, please.

> > If you don't want this cleanup I'd opt at least for rename eb->len
> > ==>> eb->nodesize.
>
> This sounds like an unnecessary churn. 'length' related to a structure
> is natural, nodesize is a property of the whole filesystem, we're used
> to eb->len, I don't see any benefit in increasing the cognitive load.

Now, our points of view differ here. For someone new to the codebase,
I need to carry around the additional information that eb->len actually
means the nodesize. In other words, without digging up this detail, the
code just hides that fact.
Since it's always set to nodesize, it does not really mean 'length' as
in any arbitrary length, like 12345 bytes.
And that's what I was trying to clean up. See? Does it make sense?
Also that's why as an alternative I suggested renaming the member if
removing it is not really favorable.

And I thought it can also have implications in some places once you
realize that what's being used in an expression as eb->len really
means the fs-wide nodesize.

After all, with this insight I realized that the buffer_tree is
inefficient as we're wasting slots by using the tree really sparsely
(it is indexed by sector, so with the default 16K nodesize and 4K
sectorsize at most one in four indices can ever hold an eb). Since
there are no shorter ebs, those empty slots can never be filled.

So using the right name actually makes sense, IMO. For code clarity.
But I can also understand your point and I can get used to it. It just
does not feel right to me, but it's not a big deal. I just mentioned
it to know others' POV and explain the intention one way or the other.