[PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group

Posted by Baokun Li 3 months, 2 weeks ago
After we optimized the block group lock, we found another lock
contention issue when running will-it-scale/fallocate2 with multiple
processes. fallocate's block allocation and truncate's block release
were fighting over s_md_lock. The problem is that this lock protects
two entirely different things on those paths: the list of freed data
blocks (s_freed_data_list) when releasing, and where to start looking
for new blocks (s_mb_last_group) when allocating.

Now that we only track s_mb_last_group and no longer track
s_mb_last_start, s_md_lock is no longer needed to keep the two
consistent. Instead, reads of s_mb_last_group are kept up to date by
pairing smp_store_release() with smp_load_acquire().

Besides, s_mb_last_group only needs to hold an ext4_group_t
(i.e., unsigned int), so the unsigned long type was wider than necessary.

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

                   | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
 Disk: 960GB SSD   |-------------------------|-------------------------|
                   | base  |    patched      | base  |    patched      |
-------------------|-------|-----------------|-------|-----------------|
mb_optimize_scan=0 | 4821  | 7612  (+57.8%)  | 15371 | 21647 (+40.8%)  |
mb_optimize_scan=1 | 4784  | 7568  (+58.1%)  | 6101  | 9117  (+49.4%)  |

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/ext4.h    |  2 +-
 fs/ext4/mballoc.c | 17 ++++++-----------
 2 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index cfb60f8fbb63..93f03d8c3dca 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1630,7 +1630,7 @@ struct ext4_sb_info {
 	unsigned int s_mb_group_prealloc;
 	unsigned int s_max_dir_size_kb;
 	/* where last allocation was done - for stream allocation */
-	unsigned long s_mb_last_group;
+	ext4_group_t s_mb_last_group;
 	unsigned int s_mb_prefetch;
 	unsigned int s_mb_prefetch_limit;
 	unsigned int s_mb_best_avail_max_trim_order;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 5cdae3bda072..3f103919868b 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2168,11 +2168,9 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 	ac->ac_buddy_folio = e4b->bd_buddy_folio;
 	folio_get(ac->ac_buddy_folio);
 	/* store last allocated for subsequent stream allocation */
-	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
-		spin_lock(&sbi->s_md_lock);
-		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
-		spin_unlock(&sbi->s_md_lock);
-	}
+	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
+		/* pairs with smp_load_acquire in ext4_mb_regular_allocator() */
+		smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
 	/*
 	 * As we've just preallocated more space than
 	 * user requested originally, we store allocated
@@ -2844,12 +2842,9 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	}
 
 	/* if stream allocation is enabled, use global goal */
-	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
-		/* TBD: may be hot point */
-		spin_lock(&sbi->s_md_lock);
-		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
-		spin_unlock(&sbi->s_md_lock);
-	}
+	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
+		/* pairs with smp_store_release in ext4_mb_use_best_found() */
+		ac->ac_g_ex.fe_group = smp_load_acquire(&sbi->s_mb_last_group);
 
 	/*
 	 * Let's just scan groups to find more-less suitable blocks We
-- 
2.46.1
Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
Posted by kernel test robot 3 months, 1 week ago

Hello,

kernel test robot noticed a 31.1% improvement of stress-ng.fsize.ops_per_sec on:


commit: ad0d50f30d3fe376a99fd0e392867c7ca9b619e3 ("[PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group")
url: https://github.com/intel-lab-lkp/linux/commits/Baokun-Li/ext4-add-ext4_try_lock_group-to-skip-busy-groups/20250623-155451
base: https://git.kernel.org/cgit/linux/kernel/git/tytso/ext4.git dev
patch link: https://lore.kernel.org/all/20250623073304.3275702-4-libaokun1@huawei.com/
patch subject: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group

testcase: stress-ng
config: x86_64-rhel-9.4
compiler: gcc-12
test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory
parameters:

	nr_threads: 100%
	disk: 1HDD
	testtime: 60s
	fs: ext4
	test: fsize
	cpufreq_governor: performance



Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250701/202507010457.3b3d3c33-lkp@intel.com

=========================================================================================
compiler/cpufreq_governor/disk/fs/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
  gcc-12/performance/1HDD/ext4/x86_64-rhel-9.4/100%/debian-12-x86_64-20240206.cgz/lkp-icl-2sp4/fsize/stress-ng/60s

commit: 
  86f92bf2c0 ("ext4: remove unnecessary s_mb_last_start")
  ad0d50f30d ("ext4: remove unnecessary s_md_lock on update s_mb_last_group")

86f92bf2c059852a ad0d50f30d3fe376a99fd0e3928 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
      5042 ±  4%     -10.1%       4532 ±  2%  meminfo.Dirty
    100194 ± 63%     +92.5%     192828 ± 32%  numa-meminfo.node0.Shmem
      5082 ±  3%     +28.1%       6510 ±  5%  vmstat.system.cs
     71089           -17.1%      58900 ±  2%  perf-c2c.DRAM.remote
     44206           -13.4%      38284 ±  2%  perf-c2c.HITM.remote
    131696            -4.1%     126359 ±  2%  perf-c2c.HITM.total
      0.15 ± 18%      +0.2        0.35 ± 14%  mpstat.cpu.all.iowait%
      0.32 ±  7%      -0.0        0.28 ±  4%  mpstat.cpu.all.irq%
      0.05 ±  4%      +0.0        0.07 ±  3%  mpstat.cpu.all.soft%
      0.50 ± 13%      +0.2        0.69 ± 16%  mpstat.cpu.all.usr%
  14478005 ±  2%     +32.7%   19217687 ±  4%  numa-numastat.node0.local_node
  14540770 ±  2%     +32.6%   19285137 ±  4%  numa-numastat.node0.numa_hit
  14722680           +28.8%   18967713        numa-numastat.node1.local_node
  14793059           +28.7%   19032805        numa-numastat.node1.numa_hit
    918392           -38.4%     565297 ± 18%  sched_debug.cpu.avg_idle.avg
    356474 ±  5%     -92.0%      28413 ± 90%  sched_debug.cpu.avg_idle.min
      2362 ±  2%     +18.8%       2806 ±  4%  sched_debug.cpu.nr_switches.avg
      1027           +35.5%       1391 ±  6%  sched_debug.cpu.nr_switches.min
     25263 ± 63%     +91.0%      48258 ± 31%  numa-vmstat.node0.nr_shmem
  14540796 ±  2%     +32.5%   19271949 ±  4%  numa-vmstat.node0.numa_hit
  14478031 ±  2%     +32.6%   19204499 ±  4%  numa-vmstat.node0.numa_local
  14792432           +28.6%   19020203        numa-vmstat.node1.numa_hit
  14722053           +28.8%   18955111        numa-vmstat.node1.numa_local
      3780           +30.9%       4950 ±  2%  stress-ng.fsize.SIGXFSZ_signals_per_sec
    643887           +31.0%     843807 ±  2%  stress-ng.fsize.ops
     10726           +31.1%      14059 ±  2%  stress-ng.fsize.ops_per_sec
    126167 ±  2%      +8.7%     137085 ±  2%  stress-ng.time.involuntary_context_switches
     21.82 ±  2%     +45.1%      31.66 ±  4%  stress-ng.time.user_time
      5144 ± 15%    +704.0%      41366 ± 20%  stress-ng.time.voluntary_context_switches
      1272 ±  4%     -10.8%       1135 ±  2%  proc-vmstat.nr_dirty
     59459            +8.1%      64288        proc-vmstat.nr_slab_reclaimable
      1272 ±  4%     -10.8%       1134 ±  2%  proc-vmstat.nr_zone_write_pending
  29335922           +30.6%   38319823        proc-vmstat.numa_hit
  29202778           +30.8%   38187281        proc-vmstat.numa_local
  35012787           +31.9%   46166245 ±  2%  proc-vmstat.pgalloc_normal
  34753289           +31.9%   45830460 ±  2%  proc-vmstat.pgfree
    120464            +2.3%     123212        proc-vmstat.pgpgout
      0.35 ±  3%      +0.1        0.41 ±  3%  perf-stat.i.branch-miss-rate%
  48059547           +21.7%   58484853        perf-stat.i.branch-misses
     33.69            -1.8       31.91        perf-stat.i.cache-miss-rate%
 1.227e+08           +13.5%  1.392e+08 ±  7%  perf-stat.i.cache-misses
 3.623e+08           +19.9%  4.342e+08 ±  7%  perf-stat.i.cache-references
      4958 ±  3%     +30.4%       6467 ±  4%  perf-stat.i.context-switches
      6.10            -5.2%       5.79 ±  4%  perf-stat.i.cpi
    208.43           +22.0%     254.30 ±  5%  perf-stat.i.cpu-migrations
      3333           -11.4%       2954 ±  7%  perf-stat.i.cycles-between-cache-misses
      0.33            +0.1        0.39 ±  2%  perf-stat.overall.branch-miss-rate%
     33.87            -1.8       32.04        perf-stat.overall.cache-miss-rate%
      6.16            -5.3%       5.83 ±  4%  perf-stat.overall.cpi
      3360           -11.5%       2973 ±  7%  perf-stat.overall.cycles-between-cache-misses
      0.16            +5.8%       0.17 ±  4%  perf-stat.overall.ipc
  47200442           +21.7%   57451126        perf-stat.ps.branch-misses
 1.206e+08           +13.5%  1.369e+08 ±  7%  perf-stat.ps.cache-misses
 3.563e+08           +19.9%  4.271e+08 ±  7%  perf-stat.ps.cache-references
      4873 ±  3%     +30.3%       6351 ±  4%  perf-stat.ps.context-switches
    204.75           +22.0%     249.75 ±  5%  perf-stat.ps.cpu-migrations
 6.583e+10            +5.7%  6.955e+10 ±  4%  perf-stat.ps.instructions
 4.046e+12            +5.5%  4.267e+12 ±  4%  perf-stat.total.instructions
      0.15 ± 24%     +97.6%       0.31 ± 21%  perf-sched.sch_delay.avg.ms.__cond_resched.ext4_free_blocks.ext4_remove_blocks.ext4_ext_rm_leaf.ext4_ext_remove_space
      0.69 ± 34%     -45.3%       0.38 ± 24%  perf-sched.sch_delay.avg.ms.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_poll
      0.04 ±  2%     -11.0%       0.03 ±  7%  perf-sched.sch_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      0.09 ± 18%    +104.1%       0.19 ± 38%  perf-sched.sch_delay.avg.ms.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
      0.32 ± 59%    +284.8%       1.24 ± 71%  perf-sched.sch_delay.max.ms.__cond_resched.__alloc_frozen_pages_noprof.alloc_pages_mpol.folio_alloc_noprof.__filemap_get_folio
     16.34 ± 81%     -81.7%       2.99 ± 34%  perf-sched.sch_delay.max.ms.__cond_resched.__ext4_handle_dirty_metadata.ext4_mb_mark_context.ext4_mb_mark_diskspace_used.ext4_mb_new_blocks
      3.51 ± 11%     +56.2%       5.48 ± 38%  perf-sched.sch_delay.max.ms.__cond_resched.__ext4_mark_inode_dirty.ext4_dirty_inode.__mark_inode_dirty.ext4_setattr
      0.06 ±223%   +1443.8%       0.86 ± 97%  perf-sched.sch_delay.max.ms.__cond_resched.__ext4_mark_inode_dirty.ext4_dirty_inode.__mark_inode_dirty.generic_update_time
      0.47 ± 33%    +337.5%       2.05 ± 67%  perf-sched.sch_delay.max.ms.__cond_resched.__ext4_mark_inode_dirty.ext4_ext_insert_extent.ext4_ext_map_blocks.ext4_map_create_blocks
      0.47 ± 64%    +417.9%       2.43 ± 53%  perf-sched.sch_delay.max.ms.__cond_resched.__ext4_mark_inode_dirty.ext4_truncate.ext4_setattr.notify_change
      7.30 ± 60%     -53.7%       3.38 ± 22%  perf-sched.sch_delay.max.ms.__cond_resched.__find_get_block_slow.find_get_block_common.bdev_getblk.ext4_read_block_bitmap_nowait
      2.72 ± 34%     +59.5%       4.33 ± 20%  perf-sched.sch_delay.max.ms.__cond_resched.down_read.ext4_map_blocks.ext4_alloc_file_blocks.isra
      0.08 ±138%    +382.6%       0.37 ± 24%  perf-sched.sch_delay.max.ms.__cond_resched.down_write.do_truncate.do_ftruncate.do_sys_ftruncate
      1.33 ± 90%    +122.5%       2.96 ± 34%  perf-sched.sch_delay.max.ms.__cond_resched.down_write.ext4_alloc_file_blocks.isra.0
      3.04           +93.7%       5.89 ± 82%  perf-sched.sch_delay.max.ms.__cond_resched.down_write.ext4_setattr.notify_change.do_truncate
      3.66 ± 19%     +52.6%       5.59 ± 31%  perf-sched.sch_delay.max.ms.__cond_resched.down_write.ext4_truncate.ext4_setattr.notify_change
      0.41 ± 26%    +169.4%       1.11 ± 78%  perf-sched.sch_delay.max.ms.__cond_resched.ext4_free_blocks.ext4_remove_blocks.ext4_ext_rm_leaf.ext4_ext_remove_space
      6.93 ± 82%     -65.5%       2.39 ± 49%  perf-sched.sch_delay.max.ms.__cond_resched.ext4_mb_regular_allocator.ext4_mb_new_blocks.ext4_ext_map_blocks.ext4_map_create_blocks
      0.23 ± 68%    +357.9%       1.04 ± 82%  perf-sched.sch_delay.max.ms.__cond_resched.kmem_cache_alloc_noprof.ext4_mb_clear_bb.ext4_remove_blocks.ext4_ext_rm_leaf
      0.26 ± 39%    +205.8%       0.78 ± 73%  perf-sched.sch_delay.max.ms.__cond_resched.mutex_lock.ext4_mb_initialize_context.ext4_mb_new_blocks.ext4_ext_map_blocks
      0.11 ± 93%   +1390.4%       1.60 ± 62%  perf-sched.sch_delay.max.ms.io_schedule.bit_wait_io.__wait_on_bit_lock.out_of_line_wait_on_bit_lock
      0.30 ± 74%   +2467.2%       7.58 ± 60%  perf-sched.sch_delay.max.ms.io_schedule.folio_wait_bit_common.__find_get_block_slow.find_get_block_common
      2.66 ± 18%     +29.4%       3.44 ±  7%  perf-sched.sch_delay.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
      2.64 ± 21%    +197.3%       7.84 ± 53%  perf-sched.sch_delay.max.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
     87.11 ±  2%     -15.3%      73.79 ±  4%  perf-sched.total_wait_and_delay.average.ms
     21561 ±  2%     +18.5%      25553 ±  4%  perf-sched.total_wait_and_delay.count.ms
     86.95 ±  2%     -15.4%      73.60 ±  4%  perf-sched.total_wait_time.average.ms
      0.76 ± 54%    -100.0%       0.00        perf-sched.wait_and_delay.avg.ms.__cond_resched.bdev_getblk.ext4_read_block_bitmap_nowait.ext4_read_block_bitmap.ext4_mb_mark_context
      0.61 ± 47%    -100.0%       0.00        perf-sched.wait_and_delay.avg.ms.__cond_resched.ext4_mb_regular_allocator.ext4_mb_new_blocks.ext4_ext_map_blocks.ext4_map_create_blocks
    168.47 ±  2%     -10.4%     150.98 ±  4%  perf-sched.wait_and_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
    125.33 ± 10%     +72.2%     215.83 ±  8%  perf-sched.wait_and_delay.count.__cond_resched.__ext4_handle_dirty_metadata.ext4_do_update_inode.isra.0
    781.33 ±  3%     -74.6%     198.83 ± 15%  perf-sched.wait_and_delay.count.__cond_resched.__ext4_handle_dirty_metadata.ext4_mb_mark_context.ext4_mb_mark_diskspace_used.ext4_mb_new_blocks
    278.67 ± 13%    +310.9%       1145 ± 20%  perf-sched.wait_and_delay.count.__cond_resched.__ext4_mark_inode_dirty.ext4_dirty_inode.__mark_inode_dirty.ext4_setattr
      1116 ±  3%     -81.5%     206.33 ± 13%  perf-sched.wait_and_delay.count.__cond_resched.__find_get_block_slow.find_get_block_common.bdev_getblk.ext4_read_block_bitmap_nowait
    166.33 ±  8%    -100.0%       0.00        perf-sched.wait_and_delay.count.__cond_resched.bdev_getblk.ext4_read_block_bitmap_nowait.ext4_read_block_bitmap.ext4_mb_mark_context
    115.50 ± 46%    +298.7%     460.50 ± 16%  perf-sched.wait_and_delay.count.__cond_resched.down_read.ext4_map_blocks.ext4_alloc_file_blocks.isra
    138.33 ± 16%    +290.7%     540.50 ± 18%  perf-sched.wait_and_delay.count.__cond_resched.down_write.ext4_setattr.notify_change.do_truncate
    310.17 ± 14%    +263.9%       1128 ± 21%  perf-sched.wait_and_delay.count.__cond_resched.down_write.ext4_truncate.ext4_setattr.notify_change
      1274 ±  2%    -100.0%       0.00        perf-sched.wait_and_delay.count.__cond_resched.ext4_mb_regular_allocator.ext4_mb_new_blocks.ext4_ext_map_blocks.ext4_map_create_blocks
      7148 ±  2%     +11.9%       7998 ±  4%  perf-sched.wait_and_delay.count.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
     32.82 ± 80%     -81.8%       5.99 ± 34%  perf-sched.wait_and_delay.max.ms.__cond_resched.__ext4_handle_dirty_metadata.ext4_mb_mark_context.ext4_mb_mark_diskspace_used.ext4_mb_new_blocks
     12.06 ± 22%    +168.4%      32.36 ± 47%  perf-sched.wait_and_delay.max.ms.__cond_resched.__ext4_mark_inode_dirty.ext4_dirty_inode.__mark_inode_dirty.ext4_setattr
     20.55 ± 82%    -100.0%       0.00        perf-sched.wait_and_delay.max.ms.__cond_resched.bdev_getblk.ext4_read_block_bitmap_nowait.ext4_read_block_bitmap.ext4_mb_mark_context
     27.66 ± 20%     +78.9%      49.49 ± 60%  perf-sched.wait_and_delay.max.ms.__cond_resched.ext4_journal_check_start.__ext4_journal_start_sb.ext4_dirty_inode.__mark_inode_dirty
     16.75 ± 64%    -100.0%       0.00        perf-sched.wait_and_delay.max.ms.__cond_resched.ext4_mb_regular_allocator.ext4_mb_new_blocks.ext4_ext_map_blocks.ext4_map_create_blocks
      0.19 ± 29%    +191.5%       0.55 ± 29%  perf-sched.wait_time.avg.ms.__cond_resched.__ext4_mark_inode_dirty.ext4_truncate.ext4_setattr.notify_change
      0.15 ± 24%     +98.1%       0.31 ± 21%  perf-sched.wait_time.avg.ms.__cond_resched.ext4_free_blocks.ext4_remove_blocks.ext4_ext_rm_leaf.ext4_ext_remove_space
    168.44 ±  2%     -10.4%     150.94 ±  4%  perf-sched.wait_time.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      0.36 ± 40%    +392.9%       1.78 ± 71%  perf-sched.wait_time.max.ms.__cond_resched.__alloc_frozen_pages_noprof.alloc_pages_mpol.folio_alloc_noprof.__filemap_get_folio
     17.42 ± 70%     -82.4%       3.07 ± 34%  perf-sched.wait_time.max.ms.__cond_resched.__ext4_handle_dirty_metadata.ext4_mb_mark_context.ext4_mb_mark_diskspace_used.ext4_mb_new_blocks
     11.49 ± 26%    +180.6%      32.23 ± 48%  perf-sched.wait_time.max.ms.__cond_resched.__ext4_mark_inode_dirty.ext4_dirty_inode.__mark_inode_dirty.ext4_setattr
      0.06 ±223%   +1443.8%       0.86 ± 97%  perf-sched.wait_time.max.ms.__cond_resched.__ext4_mark_inode_dirty.ext4_dirty_inode.__mark_inode_dirty.generic_update_time
      0.47 ± 33%    +411.8%       2.40 ± 56%  perf-sched.wait_time.max.ms.__cond_resched.__ext4_mark_inode_dirty.ext4_ext_insert_extent.ext4_ext_map_blocks.ext4_map_create_blocks
      0.64 ±161%    +244.6%       2.20 ± 61%  perf-sched.wait_time.max.ms.__cond_resched.__ext4_mark_inode_dirty.ext4_setattr.notify_change.do_truncate
      0.47 ± 64%    +968.9%       5.01 ± 83%  perf-sched.wait_time.max.ms.__cond_resched.__ext4_mark_inode_dirty.ext4_truncate.ext4_setattr.notify_change
      0.08 ±138%    +382.6%       0.37 ± 24%  perf-sched.wait_time.max.ms.__cond_resched.down_write.do_truncate.do_ftruncate.do_sys_ftruncate
      0.41 ± 26%    +169.4%       1.11 ± 78%  perf-sched.wait_time.max.ms.__cond_resched.ext4_free_blocks.ext4_remove_blocks.ext4_ext_rm_leaf.ext4_ext_remove_space
     17.67 ± 25%    +110.8%      37.26 ± 35%  perf-sched.wait_time.max.ms.__cond_resched.ext4_journal_check_start.__ext4_journal_start_sb.ext4_dirty_inode.__mark_inode_dirty
      2.23 ± 51%    +360.3%      10.28 ± 71%  perf-sched.wait_time.max.ms.__cond_resched.ext4_journal_check_start.__ext4_journal_start_sb.ext4_ext_remove_space.ext4_ext_truncate
     84.33 ± 14%     -46.9%      44.77 ± 72%  perf-sched.wait_time.max.ms.__cond_resched.ext4_mb_load_buddy_gfp.ext4_process_freed_data.ext4_journal_commit_callback.jbd2_journal_commit_transaction
      0.23 ± 68%    +357.9%       1.04 ± 82%  perf-sched.wait_time.max.ms.__cond_resched.kmem_cache_alloc_noprof.ext4_mb_clear_bb.ext4_remove_blocks.ext4_ext_rm_leaf
      0.26 ± 39%    +205.8%       0.78 ± 73%  perf-sched.wait_time.max.ms.__cond_resched.mutex_lock.ext4_mb_initialize_context.ext4_mb_new_blocks.ext4_ext_map_blocks
    276.82 ± 13%     -22.2%     215.50 ± 13%  perf-sched.wait_time.max.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      0.30 ± 74%   +9637.4%      28.76 ± 48%  perf-sched.wait_time.max.ms.io_schedule.folio_wait_bit_common.__find_get_block_slow.find_get_block_common
      1.44 ± 79%  +11858.3%     172.80 ±219%  perf-sched.wait_time.max.ms.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt.[unknown]




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
Posted by Jan Kara 3 months, 1 week ago
On Mon 23-06-25 15:32:51, Baokun Li wrote:
> After we optimized the block group lock, we found another lock
> contention issue when running will-it-scale/fallocate2 with multiple
> processes. The fallocate's block allocation and the truncate's block
> release were fighting over the s_md_lock. The problem is, this lock
> protects totally different things in those two processes: the list of
> freed data blocks (s_freed_data_list) when releasing, and where to start
> looking for new blocks (mb_last_group) when allocating.
> 
> Now we only need to track s_mb_last_group and no longer need to track
> s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
> two are consistent, and we can ensure that the s_mb_last_group read is up
> to date by using smp_store_release/smp_load_acquire.
> 
> Besides, the s_mb_last_group data type only requires ext4_group_t
> (i.e., unsigned int), rendering unsigned long superfluous.
> 
> Performance test data follows:
> 
> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
> Observation: Average fallocate operations per container per second.
> 
>                    | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
>  Disk: 960GB SSD   |-------------------------|-------------------------|
>                    | base  |    patched      | base  |    patched      |
> -------------------|-------|-----------------|-------|-----------------|
> mb_optimize_scan=0 | 4821  | 7612  (+57.8%)  | 15371 | 21647 (+40.8%)  |
> mb_optimize_scan=1 | 4784  | 7568  (+58.1%)  | 6101  | 9117  (+49.4%)  |
> 
> Signed-off-by: Baokun Li <libaokun1@huawei.com>

...

> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 5cdae3bda072..3f103919868b 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2168,11 +2168,9 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
>  	ac->ac_buddy_folio = e4b->bd_buddy_folio;
>  	folio_get(ac->ac_buddy_folio);
>  	/* store last allocated for subsequent stream allocation */
> -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
> -		spin_lock(&sbi->s_md_lock);
> -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
> -		spin_unlock(&sbi->s_md_lock);
> -	}
> +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
> +		/* pairs with smp_load_acquire in ext4_mb_regular_allocator() */
> +		smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);

Do you really need any kind of barrier (implied by smp_store_release())
here? I mean the store to s_mb_last_group is perfectly fine to be reordered
with other accesses from the thread, isn't it? As such it should be enough
to have WRITE_ONCE() here...

>  	/*
>  	 * As we've just preallocated more space than
>  	 * user requested originally, we store allocated
> @@ -2844,12 +2842,9 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>  	}
>  
>  	/* if stream allocation is enabled, use global goal */
> -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
> -		/* TBD: may be hot point */
> -		spin_lock(&sbi->s_md_lock);
> -		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
> -		spin_unlock(&sbi->s_md_lock);
> -	}
> +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
> +		/* pairs with smp_store_release in ext4_mb_use_best_found() */
> +		ac->ac_g_ex.fe_group = smp_load_acquire(&sbi->s_mb_last_group);

... and READ_ONCE() here.
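
For reference, a minimal sketch of the READ_ONCE()/WRITE_ONCE() variant
being suggested (illustrative only, not a tested patch):

	/* in ext4_mb_use_best_found() */
	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
		WRITE_ONCE(sbi->s_mb_last_group, ac->ac_f_ex.fe_group);

	/* in ext4_mb_regular_allocator() */
	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
		ac->ac_g_ex.fe_group = READ_ONCE(sbi->s_mb_last_group);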

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
Posted by Baokun Li 3 months, 1 week ago
On 2025/6/28 2:19, Jan Kara wrote:
> On Mon 23-06-25 15:32:51, Baokun Li wrote:
>> After we optimized the block group lock, we found another lock
>> contention issue when running will-it-scale/fallocate2 with multiple
>> processes. The fallocate's block allocation and the truncate's block
>> release were fighting over the s_md_lock. The problem is, this lock
>> protects totally different things in those two processes: the list of
>> freed data blocks (s_freed_data_list) when releasing, and where to start
>> looking for new blocks (mb_last_group) when allocating.
>>
>> Now we only need to track s_mb_last_group and no longer need to track
>> s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
>> two are consistent, and we can ensure that the s_mb_last_group read is up
>> to date by using smp_store_release/smp_load_acquire.
>>
>> Besides, the s_mb_last_group data type only requires ext4_group_t
>> (i.e., unsigned int), rendering unsigned long superfluous.
>>
>> Performance test data follows:
>>
>> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
>> Observation: Average fallocate operations per container per second.
>>
>>                     | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
>>   Disk: 960GB SSD   |-------------------------|-------------------------|
>>                     | base  |    patched      | base  |    patched      |
>> -------------------|-------|-----------------|-------|-----------------|
>> mb_optimize_scan=0 | 4821  | 7612  (+57.8%)  | 15371 | 21647 (+40.8%)  |
>> mb_optimize_scan=1 | 4784  | 7568  (+58.1%)  | 6101  | 9117  (+49.4%)  |
>>
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> ...
>
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index 5cdae3bda072..3f103919868b 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -2168,11 +2168,9 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
>>   	ac->ac_buddy_folio = e4b->bd_buddy_folio;
>>   	folio_get(ac->ac_buddy_folio);
>>   	/* store last allocated for subsequent stream allocation */
>> -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
>> -		spin_lock(&sbi->s_md_lock);
>> -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
>> -		spin_unlock(&sbi->s_md_lock);
>> -	}
>> +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
>> +		/* pairs with smp_load_acquire in ext4_mb_regular_allocator() */
>> +		smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
> Do you really need any kind of barrier (implied by smp_store_release())
> here? I mean the store to s_mb_last_group is perfectly fine to be reordered
> with other accesses from the thread, isn't it? As such it should be enough
> to have WRITE_ONCE() here...

WRITE_ONCE()/READ_ONCE() primarily prevent compiler reordering and ensure
that variable reads/writes access values directly from L1/L2 cache rather
than registers.

They do not guarantee that other CPUs see the latest values. Reading stale
values could lead to more useless traversals, which might incur higher
overhead than memory barriers. This is why we use memory barriers to ensure
the latest values are read.

If we could guarantee that each goal is used on only one CPU, we could
switch to the cheaper WRITE_ONCE()/READ_ONCE().
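
As a rough illustration of that last point (hypothetical, not something
this series implements), a per-CPU goal would sidestep the cross-CPU
visibility question entirely:

	/* hypothetical per-CPU stream goal, for illustration only */
	static DEFINE_PER_CPU(ext4_group_t, mb_last_group);

	/* in ext4_mb_use_best_found() */
	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
		this_cpu_write(mb_last_group, ac->ac_f_ex.fe_group);

	/* in ext4_mb_regular_allocator() */
	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
		ac->ac_g_ex.fe_group = this_cpu_read(mb_last_group);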


Regards,
Baokun

>>   	/*
>>   	 * As we've just preallocated more space than
>>   	 * user requested originally, we store allocated
>> @@ -2844,12 +2842,9 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>>   	}
>>   
>>   	/* if stream allocation is enabled, use global goal */
>> -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
>> -		/* TBD: may be hot point */
>> -		spin_lock(&sbi->s_md_lock);
>> -		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
>> -		spin_unlock(&sbi->s_md_lock);
>> -	}
>> +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
>> +		/* pairs with smp_store_release in ext4_mb_use_best_found() */
>> +		ac->ac_g_ex.fe_group = smp_load_acquire(&sbi->s_mb_last_group);
> ... and READ_ONCE() here.
>
> 								Honza
Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
Posted by Jan Kara 3 months, 1 week ago
On Mon 30-06-25 11:48:20, Baokun Li wrote:
> On 2025/6/28 2:19, Jan Kara wrote:
> > On Mon 23-06-25 15:32:51, Baokun Li wrote:
> > > After we optimized the block group lock, we found another lock
> > > contention issue when running will-it-scale/fallocate2 with multiple
> > > processes. The fallocate's block allocation and the truncate's block
> > > release were fighting over the s_md_lock. The problem is, this lock
> > > protects totally different things in those two processes: the list of
> > > freed data blocks (s_freed_data_list) when releasing, and where to start
> > > looking for new blocks (mb_last_group) when allocating.
> > > 
> > > Now we only need to track s_mb_last_group and no longer need to track
> > > s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
> > > two are consistent, and we can ensure that the s_mb_last_group read is up
> > > to date by using smp_store_release/smp_load_acquire.
> > > 
> > > Besides, the s_mb_last_group data type only requires ext4_group_t
> > > (i.e., unsigned int), rendering unsigned long superfluous.
> > > 
> > > Performance test data follows:
> > > 
> > > Test: Running will-it-scale/fallocate2 on CPU-bound containers.
> > > Observation: Average fallocate operations per container per second.
> > > 
> > >                     | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
> > >   Disk: 960GB SSD   |-------------------------|-------------------------|
> > >                     | base  |    patched      | base  |    patched      |
> > > -------------------|-------|-----------------|-------|-----------------|
> > > mb_optimize_scan=0 | 4821  | 7612  (+57.8%)  | 15371 | 21647 (+40.8%)  |
> > > mb_optimize_scan=1 | 4784  | 7568  (+58.1%)  | 6101  | 9117  (+49.4%)  |
> > > 
> > > Signed-off-by: Baokun Li <libaokun1@huawei.com>
> > ...
> > 
> > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > > index 5cdae3bda072..3f103919868b 100644
> > > --- a/fs/ext4/mballoc.c
> > > +++ b/fs/ext4/mballoc.c
> > > @@ -2168,11 +2168,9 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
> > >   	ac->ac_buddy_folio = e4b->bd_buddy_folio;
> > >   	folio_get(ac->ac_buddy_folio);
> > >   	/* store last allocated for subsequent stream allocation */
> > > -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
> > > -		spin_lock(&sbi->s_md_lock);
> > > -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
> > > -		spin_unlock(&sbi->s_md_lock);
> > > -	}
> > > +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
> > > +		/* pairs with smp_load_acquire in ext4_mb_regular_allocator() */
> > > +		smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
> > Do you really need any kind of barrier (implied by smp_store_release())
> > here? I mean the store to s_mb_last_group is perfectly fine to be reordered
> > with other accesses from the thread, isn't it? As such it should be enough
> > to have WRITE_ONCE() here...
> 
> WRITE_ONCE()/READ_ONCE() primarily prevent compiler reordering and ensure
> that variable reads/writes access values directly from L1/L2 cache rather
> than registers.

I agree READ_ONCE() / WRITE_ONCE() are about compiler optimizations - in
particular they force the compiler to read / write the memory location
exactly once instead of reading it potentially multiple times in different
parts of expression and getting inconsistent values, or possibly writing
the value say byte by byte (yes, that would be insane but not contrary to
the C standard).
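
A tiny sketch of the first hazard, using s_mb_last_group purely as an
example (illustrative, not actual mballoc code):

	/* without READ_ONCE() the compiler may reload the field twice... */
	if (sbi->s_mb_last_group < ngroups)
		group = sbi->s_mb_last_group;	/* ...and see a different value here */

	/* with READ_ONCE() the field is fetched exactly once */
	group = READ_ONCE(sbi->s_mb_last_group);
	if (group < ngroups)
		...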

> They do not guarantee that other CPUs see the latest values. Reading stale
> values could lead to more useless traversals, which might incur higher
> overhead than memory barriers. This is why we use memory barriers to ensure
> the latest values are read.

But smp_load_acquire() / smp_store_release() have no guarantee about CPU
seeing latest values either. They are just speculation barriers meaning
they prevent the CPU from reordering accesses in the code after
smp_load_acquire() to be performed before the smp_load_acquire() is
executed and similarly with smp_store_release(). So I dare to say that
these barriers have no (positive) impact on the allocation performance and
just complicate the code - but if you have some data that show otherwise,
I'd be happy to be proven wrong.

> If we could guarantee that each goal is used on only one CPU, we could
> switch to the cheaper WRITE_ONCE()/READ_ONCE().

Well, neither READ_ONCE() / WRITE_ONCE() nor smp_load_acquire() /
smp_store_release() can guarantee that.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
Posted by Baokun Li 3 months, 1 week ago
On 2025/6/30 15:47, Jan Kara wrote:
> On Mon 30-06-25 11:48:20, Baokun Li wrote:
>> On 2025/6/28 2:19, Jan Kara wrote:
>>> On Mon 23-06-25 15:32:51, Baokun Li wrote:
>>>> After we optimized the block group lock, we found another lock
>>>> contention issue when running will-it-scale/fallocate2 with multiple
>>>> processes. The fallocate's block allocation and the truncate's block
>>>> release were fighting over the s_md_lock. The problem is, this lock
>>>> protects totally different things in those two processes: the list of
>>>> freed data blocks (s_freed_data_list) when releasing, and where to start
>>>> looking for new blocks (mb_last_group) when allocating.
>>>>
>>>> Now we only need to track s_mb_last_group and no longer need to track
>>>> s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
>>>> two are consistent, and we can ensure that the s_mb_last_group read is up
>>>> to date by using smp_store_release/smp_load_acquire.
>>>>
>>>> Besides, the s_mb_last_group data type only requires ext4_group_t
>>>> (i.e., unsigned int), rendering unsigned long superfluous.
>>>>
>>>> Performance test data follows:
>>>>
>>>> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
>>>> Observation: Average fallocate operations per container per second.
>>>>
>>>>                      | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
>>>>    Disk: 960GB SSD   |-------------------------|-------------------------|
>>>>                      | base  |    patched      | base  |    patched      |
>>>> -------------------|-------|-----------------|-------|-----------------|
>>>> mb_optimize_scan=0 | 4821  | 7612  (+57.8%)  | 15371 | 21647 (+40.8%)  |
>>>> mb_optimize_scan=1 | 4784  | 7568  (+58.1%)  | 6101  | 9117  (+49.4%)  |
>>>>
>>>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>>> ...
>>>
>>>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>>>> index 5cdae3bda072..3f103919868b 100644
>>>> --- a/fs/ext4/mballoc.c
>>>> +++ b/fs/ext4/mballoc.c
>>>> @@ -2168,11 +2168,9 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
>>>>    	ac->ac_buddy_folio = e4b->bd_buddy_folio;
>>>>    	folio_get(ac->ac_buddy_folio);
>>>>    	/* store last allocated for subsequent stream allocation */
>>>> -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
>>>> -		spin_lock(&sbi->s_md_lock);
>>>> -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
>>>> -		spin_unlock(&sbi->s_md_lock);
>>>> -	}
>>>> +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
>>>> +		/* pairs with smp_load_acquire in ext4_mb_regular_allocator() */
>>>> +		smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
>>> Do you really need any kind of barrier (implied by smp_store_release())
>>> here? I mean the store to s_mb_last_group is perfectly fine to be reordered
>>> with other accesses from the thread, isn't it? As such it should be enough
>>> to have WRITE_ONCE() here...
>> WRITE_ONCE()/READ_ONCE() primarily prevent compiler reordering and ensure
>> that variable reads/writes access values directly from L1/L2 cache rather
>> than registers.
> I agree READ_ONCE() / WRITE_ONCE() are about compiler optimizations - in
> particular they force the compiler to read / write the memory location
> exactly once instead of reading it potentially multiple times in different
> parts of expression and getting inconsistent values, or possibly writing
> the value say byte by byte (yes, that would be insane but not contrary to
> the C standard).
READ_ONCE() and WRITE_ONCE() rely on the volatile keyword, which serves
two main purposes:

1. It tells the compiler that the variable's value can change unexpectedly,
    preventing the compiler from making incorrect optimizations based on
    assumptions about its stability.

2. It ensures the CPU directly reads from or writes to the variable's
    memory address. This means the value will be fetched from cache (L1/L2)
    if available, or from main memory otherwise, rather than using a stale
    value from a CPU register.
>> They do not guarantee that other CPUs see the latest values. Reading stale
>> values could lead to more useless traversals, which might incur higher
>> overhead than memory barriers. This is why we use memory barriers to ensure
>> the latest values are read.
> But smp_load_acquire() / smp_store_release() have no guarantee about CPU
> seeing latest values either. They are just speculation barriers meaning
> they prevent the CPU from reordering accesses in the code after
> smp_load_acquire() to be performed before the smp_load_acquire() is
> executed and similarly with smp_store_release(). So I dare to say that
> these barries have no (positive) impact on the allocation performance and
> just complicate the code - but if you have some data that show otherwise,
> I'd be happy to be proven wrong.
smp_load_acquire() / smp_store_release() guarantee that CPUs read the
latest data.

For example, imagine a variable a = 0, with both CPU0 and CPU1 having
a=0 in their caches.

Without a memory barrier:
When CPU0 executes WRITE_ONCE(a, 1), a=1 is written to the store buffer,
an RFO is broadcast, and CPU0 continues other tasks. After receiving ACKs,
a=1 is written to main memory and becomes visible to other CPUs.
Then, if CPU1 executes READ_ONCE(a), it receives the RFO and adds it to
its invalidation queue. However, it might not process it immediately;
instead, it could perform the read first, potentially still reading a=0
from its cache.

With a memory barrier:
When CPU0 executes smp_store_release(&a, 1), a=1 is not only written to
the store buffer, but data in the store buffer is also written to main
memory. An RFO is then broadcast, and CPU0 waits for ACKs from all CPUs.

When CPU1 executes smp_load_acquire(a), it receives the RFO and adds it
to its invalidation queue. Here, the invalidation queue is flushed, which
invalidates a in CPU1's cache. CPU1 then replies with an ACK, and when it
performs the read, its cache is invalid, so it reads the latest a=1 from
main memory.

This is a general overview. Please let me know if I've missed anything.
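
For what it's worth, the release/acquire pairing itself (the part both
primitives document) can be sketched with userspace C11 atomics as an
analogue of the kernel helpers; this only demonstrates the ordering
guarantee, not any claim about cache freshness:

	#include <stdatomic.h>
	#include <pthread.h>
	#include <stdio.h>

	static int payload;			/* plain data published via the flag */
	static _Atomic int ready;		/* 0 = not published, 1 = published */

	static void *producer(void *arg)
	{
		(void)arg;
		payload = 42;			/* plain store before the release */
		/* release: prior stores are visible to an acquire that sees ready == 1 */
		atomic_store_explicit(&ready, 1, memory_order_release);
		return NULL;
	}

	static void *consumer(void *arg)
	{
		(void)arg;
		/* acquire: once ready == 1 is observed, payload is guaranteed to be 42 */
		while (!atomic_load_explicit(&ready, memory_order_acquire))
			;
		printf("payload = %d\n", payload);
		return NULL;
	}

	int main(void)
	{
		pthread_t p, c;

		pthread_create(&c, NULL, consumer, NULL);
		pthread_create(&p, NULL, producer, NULL);
		pthread_join(p, NULL);
		pthread_join(c, NULL);
		return 0;
	}

Built with e.g. "gcc -pthread", the consumer always prints 42 once it
observes ready == 1.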


Thanks,
Baokun

Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
Posted by Jan Kara 3 months, 1 week ago
On Mon 30-06-25 17:21:48, Baokun Li wrote:
> On 2025/6/30 15:47, Jan Kara wrote:
> > On Mon 30-06-25 11:48:20, Baokun Li wrote:
> > > On 2025/6/28 2:19, Jan Kara wrote:
> > > > On Mon 23-06-25 15:32:51, Baokun Li wrote:
> > > > > After we optimized the block group lock, we found another lock
> > > > > contention issue when running will-it-scale/fallocate2 with multiple
> > > > > processes. The fallocate's block allocation and the truncate's block
> > > > > release were fighting over the s_md_lock. The problem is, this lock
> > > > > protects totally different things in those two processes: the list of
> > > > > freed data blocks (s_freed_data_list) when releasing, and where to start
> > > > > looking for new blocks (mb_last_group) when allocating.
> > > > > 
> > > > > Now we only need to track s_mb_last_group and no longer need to track
> > > > > s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
> > > > > two are consistent, and we can ensure that the s_mb_last_group read is up
> > > > > to date by using smp_store_release/smp_load_acquire.
> > > > > 
> > > > > Besides, the s_mb_last_group data type only requires ext4_group_t
> > > > > (i.e., unsigned int), rendering unsigned long superfluous.
> > > > > 
> > > > > Performance test data follows:
> > > > > 
> > > > > Test: Running will-it-scale/fallocate2 on CPU-bound containers.
> > > > > Observation: Average fallocate operations per container per second.
> > > > > 
> > > > >                      | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
> > > > >    Disk: 960GB SSD   |-------------------------|-------------------------|
> > > > >                      | base  |    patched      | base  |    patched      |
> > > > > -------------------|-------|-----------------|-------|-----------------|
> > > > > mb_optimize_scan=0 | 4821  | 7612  (+57.8%)  | 15371 | 21647 (+40.8%)  |
> > > > > mb_optimize_scan=1 | 4784  | 7568  (+58.1%)  | 6101  | 9117  (+49.4%)  |
> > > > > 
> > > > > Signed-off-by: Baokun Li <libaokun1@huawei.com>
> > > > ...
> > > > 
> > > > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > > > > index 5cdae3bda072..3f103919868b 100644
> > > > > --- a/fs/ext4/mballoc.c
> > > > > +++ b/fs/ext4/mballoc.c
> > > > > @@ -2168,11 +2168,9 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
> > > > >    	ac->ac_buddy_folio = e4b->bd_buddy_folio;
> > > > >    	folio_get(ac->ac_buddy_folio);
> > > > >    	/* store last allocated for subsequent stream allocation */
> > > > > -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
> > > > > -		spin_lock(&sbi->s_md_lock);
> > > > > -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
> > > > > -		spin_unlock(&sbi->s_md_lock);
> > > > > -	}
> > > > > +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
> > > > > +		/* pairs with smp_load_acquire in ext4_mb_regular_allocator() */
> > > > > +		smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
> > > > Do you really need any kind of barrier (implied by smp_store_release())
> > > > here? I mean the store to s_mb_last_group is perfectly fine to be reordered
> > > > with other accesses from the thread, isn't it? As such it should be enough
> > > > to have WRITE_ONCE() here...
> > > WRITE_ONCE()/READ_ONCE() primarily prevent compiler reordering and ensure
> > > that variable reads/writes access values directly from L1/L2 cache rather
> > > than registers.
> > I agree READ_ONCE() / WRITE_ONCE() are about compiler optimizations - in
> > particular they force the compiler to read / write the memory location
> > exactly once instead of reading it potentially multiple times in different
> > parts of expression and getting inconsistent values, or possibly writing
> > the value say byte by byte (yes, that would be insane but not contrary to
> > the C standard).
> READ_ONCE() and WRITE_ONCE() rely on the volatile keyword, which serves
> two main purposes:
> 
> 1. It tells the compiler that the variable's value can change unexpectedly,
>    preventing the compiler from making incorrect optimizations based on
>    assumptions about its stability.
> 
> 2. It ensures the CPU directly reads from or writes to the variable's
>    memory address. This means the value will be fetched from cache (L1/L2)
>    if available, or from main memory otherwise, rather than using a stale
>    value from a CPU register.

Yes, we agree on this.

> > > They do not guarantee that other CPUs see the latest values. Reading stale
> > > values could lead to more useless traversals, which might incur higher
> > > overhead than memory barriers. This is why we use memory barriers to ensure
> > > the latest values are read.
> > But smp_load_acquire() / smp_store_release() have no guarantee about CPU
> > seeing latest values either. They are just speculation barriers meaning
> > they prevent the CPU from reordering accesses in the code after
> > smp_load_acquire() to be performed before the smp_load_acquire() is
> > executed and similarly with smp_store_release(). So I dare to say that
> > these barries have no (positive) impact on the allocation performance and
> > just complicate the code - but if you have some data that show otherwise,
> > I'd be happy to be proven wrong.
> smp_load_acquire() / smp_store_release() guarantee that CPUs read the
> latest data.
> 
> For example, imagine a variable a = 0, with both CPU0 and CPU1 having
> a=0 in their caches.
> 
> Without a memory barrier:
> When CPU0 executes WRITE_ONCE(a, 1), a=1 is written to the store buffer,
> an RFO is broadcast, and CPU0 continues other tasks. After receiving ACKs,
> a=1 is written to main memory and becomes visible to other CPUs.
> Then, if CPU1 executes READ_ONCE(a), it receives the RFO and adds it to
> its invalidation queue. However, it might not process it immediately;
> instead, it could perform the read first, potentially still reading a=0
> from its cache.
> 
> With a memory barrier:
> When CPU0 executes smp_store_release(&a, 1), a=1 is not only written to
> the store buffer, but data in the store buffer is also written to main
> memory. An RFO is then broadcast, and CPU0 waits for ACKs from all CPUs.
> 
> When CPU1 executes smp_load_acquire(a), it receives the RFO and adds it
> to its invalidation queue. Here, the invalidation queue is flushed, which
> invalidates a in CPU1's cache. CPU1 then replies with an ACK, and when it
> performs the read, its cache is invalid, so it reads the latest a=1 from
> main memory.

Well, here I think you assume way more about the CPU architecture than is
generally true (and I didn't find what you write above guaranteed neither
by x86 nor by arm64 CPU documentation). Generally I'm following the
guarantees as defined by Documentation/memory-barriers.txt and there you
can argue only about order of effects as observed by different CPUs but not
really about when content is fetched to / from CPU caches.

BTW on x86 in particular smp_load_acquire() and smp_store_release() aren't
very different from pure READ_ONCE() / WRITE_ONCE:

arch/x86/include/asm/barrier.h:

#define __smp_store_release(p, v)                                       \
do {                                                                    \
        compiletime_assert_atomic_type(*p);                             \
        barrier();                                                      \
        WRITE_ONCE(*p, v);                                              \
} while (0)

#define __smp_load_acquire(p)                                           \
({                                                                      \
        typeof(*p) ___p1 = READ_ONCE(*p);                               \
        compiletime_assert_atomic_type(*p);                             \
        barrier();                                                      \
        ___p1;                                                          \
})

where barrier() is just a compiler barrier - i.e., preventing the compiler
from reordering accesses around this point. This is because x86 is strongly
ordered and the CPU can only reorder loads earlier than previous stores.
TL;DR; on x86 there's no practical difference between using READ_ONCE() /
WRITE_ONCE() and smp_load_acquire() and smp_store_release() in your code.
So I still think using those will be clearer and I'd be curious if you can
see any performance impacts from using READ_ONCE / WRITE_ONCE instead of
smp_load_acquire() / smp_store_release().

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
Posted by Baokun Li 3 months, 1 week ago
On 2025/7/1 0:32, Jan Kara wrote:
> On Mon 30-06-25 17:21:48, Baokun Li wrote:
>> On 2025/6/30 15:47, Jan Kara wrote:
>>> On Mon 30-06-25 11:48:20, Baokun Li wrote:
>>>> On 2025/6/28 2:19, Jan Kara wrote:
>>>>> On Mon 23-06-25 15:32:51, Baokun Li wrote:
>>>>>> After we optimized the block group lock, we found another lock
>>>>>> contention issue when running will-it-scale/fallocate2 with multiple
>>>>>> processes. The fallocate's block allocation and the truncate's block
>>>>>> release were fighting over the s_md_lock. The problem is, this lock
>>>>>> protects totally different things in those two processes: the list of
>>>>>> freed data blocks (s_freed_data_list) when releasing, and where to start
>>>>>> looking for new blocks (mb_last_group) when allocating.
>>>>>>
>>>>>> Now we only need to track s_mb_last_group and no longer need to track
>>>>>> s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
>>>>>> two are consistent, and we can ensure that the s_mb_last_group read is up
>>>>>> to date by using smp_store_release/smp_load_acquire.
>>>>>>
>>>>>> Besides, the s_mb_last_group data type only requires ext4_group_t
>>>>>> (i.e., unsigned int), rendering unsigned long superfluous.
>>>>>>
>>>>>> Performance test data follows:
>>>>>>
>>>>>> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
>>>>>> Observation: Average fallocate operations per container per second.
>>>>>>
>>>>>>                       | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
>>>>>>     Disk: 960GB SSD   |-------------------------|-------------------------|
>>>>>>                       | base  |    patched      | base  |    patched      |
>>>>>> -------------------|-------|-----------------|-------|-----------------|
>>>>>> mb_optimize_scan=0 | 4821  | 7612  (+57.8%)  | 15371 | 21647 (+40.8%)  |
>>>>>> mb_optimize_scan=1 | 4784  | 7568  (+58.1%)  | 6101  | 9117  (+49.4%)  |
>>>>>>
>>>>>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>>>>> ...
>>>>>
>>>>>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>>>>>> index 5cdae3bda072..3f103919868b 100644
>>>>>> --- a/fs/ext4/mballoc.c
>>>>>> +++ b/fs/ext4/mballoc.c
>>>>>> @@ -2168,11 +2168,9 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
>>>>>>     	ac->ac_buddy_folio = e4b->bd_buddy_folio;
>>>>>>     	folio_get(ac->ac_buddy_folio);
>>>>>>     	/* store last allocated for subsequent stream allocation */
>>>>>> -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
>>>>>> -		spin_lock(&sbi->s_md_lock);
>>>>>> -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
>>>>>> -		spin_unlock(&sbi->s_md_lock);
>>>>>> -	}
>>>>>> +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
>>>>>> +		/* pairs with smp_load_acquire in ext4_mb_regular_allocator() */
>>>>>> +		smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
>>>>> Do you really need any kind of barrier (implied by smp_store_release())
>>>>> here? I mean the store to s_mb_last_group is perfectly fine to be reordered
>>>>> with other accesses from the thread, isn't it? As such it should be enough
>>>>> to have WRITE_ONCE() here...
>>>> WRITE_ONCE()/READ_ONCE() primarily prevent compiler reordering and ensure
>>>> that variable reads/writes access values directly from L1/L2 cache rather
>>>> than registers.
>>> I agree READ_ONCE() / WRITE_ONCE() are about compiler optimizations - in
>>> particular they force the compiler to read / write the memory location
>>> exactly once instead of reading it potentially multiple times in different
>>> parts of expression and getting inconsistent values, or possibly writing
>>> the value say byte by byte (yes, that would be insane but not contrary to
>>> the C standard).
>> READ_ONCE() and WRITE_ONCE() rely on the volatile keyword, which serves
>> two main purposes:
>>
>> 1. It tells the compiler that the variable's value can change unexpectedly,
>>     preventing the compiler from making incorrect optimizations based on
>>     assumptions about its stability.
>>
>> 2. It ensures the CPU directly reads from or writes to the variable's
>>     memory address. This means the value will be fetched from cache (L1/L2)
>>     if available, or from main memory otherwise, rather than using a stale
>>     value from a CPU register.
> Yes, we agree on this.
>
>>>> They do not guarantee that other CPUs see the latest values. Reading stale
>>>> values could lead to more useless traversals, which might incur higher
>>>> overhead than memory barriers. This is why we use memory barriers to ensure
>>>> the latest values are read.
>>> But smp_load_acquire() / smp_store_release() have no guarantee about CPU
>>> seeing latest values either. They are just speculation barriers meaning
>>> they prevent the CPU from reordering accesses in the code after
>>> smp_load_acquire() to be performed before the smp_load_acquire() is
>>> executed and similarly with smp_store_release(). So I dare to say that
>>> these barries have no (positive) impact on the allocation performance and
>>> just complicate the code - but if you have some data that show otherwise,
>>> I'd be happy to be proven wrong.
>> smp_load_acquire() / smp_store_release() guarantee that CPUs read the
>> latest data.
>>
>> For example, imagine a variable a = 0, with both CPU0 and CPU1 having
>> a=0 in their caches.
>>
>> Without a memory barrier:
>> When CPU0 executes WRITE_ONCE(a, 1), a=1 is written to the store buffer,
>> an RFO is broadcast, and CPU0 continues other tasks. After receiving ACKs,
>> a=1 is written to main memory and becomes visible to other CPUs.
>> Then, if CPU1 executes READ_ONCE(a), it receives the RFO and adds it to
>> its invalidation queue. However, it might not process it immediately;
>> instead, it could perform the read first, potentially still reading a=0
>> from its cache.
>>
>> With a memory barrier:
>> When CPU0 executes smp_store_release(&a, 1), a=1 is not only written to
>> the store buffer, but data in the store buffer is also written to main
>> memory. An RFO is then broadcast, and CPU0 waits for ACKs from all CPUs.
>>
>> When CPU1 executes smp_load_acquire(a), it receives the RFO and adds it
>> to its invalidation queue. Here, the invalidation queue is flushed, which
>> invalidates a in CPU1's cache. CPU1 then replies with an ACK, and when it
>> performs the read, its cache is invalid, so it reads the latest a=1 from
>> main memory.
> Well, here I think you assume way more about the CPU architecture than is
> generally true (and I didn't find what you write above guaranteed neither
> by x86 nor by arm64 CPU documentation). Generally I'm following the
> guarantees as defined by Documentation/memory-barriers.txt and there you
> can argue only about order of effects as observed by different CPUs but not
> really about when content is fetched to / from CPU caches.

Explaining why smp_load_acquire() and smp_store_release() guarantee the
latest data is read truly requires delving into their underlying
implementation details.

I suggest you Google "why memory barriers are needed." You might find
introductions to concepts like 'Total Store Order', 'Weak Memory Ordering',
MESI, store buffers, and invalidate queue, along with the stories behind
them.

The Documentation/memory-barriers.txt file does a good job of introducing
memory barrier concepts and guiding their usage (for instance, the
'MULTICOPY ATOMICITY' section covers CPU cache coherence in detail).
However, it skips many of the specific implementation details that are
quite often necessary for a deeper understanding.

>
> BTW on x86 in particular smp_load_acquire() and smp_store_release() aren't
> very different from pure READ_ONCE() / WRITE_ONCE:
>
> arch/x86/include/asm/barrier.h:
>
> #define __smp_store_release(p, v)                                       \
> do {                                                                    \
>          compiletime_assert_atomic_type(*p);                             \
>          barrier();                                                      \
>          WRITE_ONCE(*p, v);                                              \
> } while (0)
>
> #define __smp_load_acquire(p)                                           \
> ({                                                                      \
>          typeof(*p) ___p1 = READ_ONCE(*p);                               \
>          compiletime_assert_atomic_type(*p);                             \
>          barrier();                                                      \
>          ___p1;                                                          \
> })
>
> where barrier() is just a compiler barrier - i.e., preventing the compiler
> from reordering accesses around this point. This is because x86 is strongly
> ordered and the CPU can only reorder loads earlier than previous stores.
> TL;DR; on x86 there's no practical difference between using READ_ONCE() /
> WRITE_ONCE() and smp_load_acquire() and smp_store_release() in your code.
> So I still think using those will be clearer and I'd be curious if you can
> see any performance impacts from using READ_ONCE / WRITE_ONCE instead of
> smp_load_acquire() / smp_store_release().
>
> 								Honza

Yes, x86 is a strongly ordered memory architecture. For x86, we only need
to use READ_ONCE()/WRITE_ONCE() to ensure access to data in the CPU cache,
as x86 guarantees the cache is up-to-date.

However, the Linux kernel doesn't exclusively run on x86 architectures;
we have a large number of arm64 servers. Performance aside, it would be
unreasonable for x86 to consistently see the latest global goal during
block allocation while arm64 does not.
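
For comparison with the x86 definitions quoted above, on arm64 these
helpers are not plain loads and stores: smp_store_release() and
smp_load_acquire() compile to the STLR/LDAR one-way barrier
instructions, while WRITE_ONCE()/READ_ONCE() normally stay ordinary
STR/LDR. Roughly (a sketch simplified to the 32-bit case; the real
macros in arch/arm64/include/asm/barrier.h dispatch on the access
size):

#define __smp_store_release(p, v)					\
	asm volatile("stlr %w1, %0" : "=Q" (*(p)) : "r" (v) : "memory")

#define __smp_load_acquire(p)						\
({									\
	__u32 ___v;							\
	asm volatile("ldar %w0, %1"					\
		     : "=r" (___v) : "Q" (*(p)) : "memory");		\
	___v;								\
})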


Cheers,
Baokun

Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
Posted by Jan Kara 3 months, 1 week ago
On Tue 01-07-25 10:39:53, Baokun Li wrote:
> On 2025/7/1 0:32, Jan Kara wrote:
> > On Mon 30-06-25 17:21:48, Baokun Li wrote:
> > > On 2025/6/30 15:47, Jan Kara wrote:
> > > > On Mon 30-06-25 11:48:20, Baokun Li wrote:
> > > > > On 2025/6/28 2:19, Jan Kara wrote:
> > > > > > On Mon 23-06-25 15:32:51, Baokun Li wrote:
> > > > > > > After we optimized the block group lock, we found another lock
> > > > > > > contention issue when running will-it-scale/fallocate2 with multiple
> > > > > > > processes. The fallocate's block allocation and the truncate's block
> > > > > > > release were fighting over the s_md_lock. The problem is, this lock
> > > > > > > protects totally different things in those two processes: the list of
> > > > > > > freed data blocks (s_freed_data_list) when releasing, and where to start
> > > > > > > looking for new blocks (mb_last_group) when allocating.
> > > > > > > 
> > > > > > > Now we only need to track s_mb_last_group and no longer need to track
> > > > > > > s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
> > > > > > > two are consistent, and we can ensure that the s_mb_last_group read is up
> > > > > > > to date by using smp_store_release/smp_load_acquire.
> > > > > > > 
> > > > > > > Besides, the s_mb_last_group data type only requires ext4_group_t
> > > > > > > (i.e., unsigned int), rendering unsigned long superfluous.
> > > > > > > 
> > > > > > > Performance test data follows:
> > > > > > > 
> > > > > > > Test: Running will-it-scale/fallocate2 on CPU-bound containers.
> > > > > > > Observation: Average fallocate operations per container per second.
> > > > > > > 
> > > > > > >                       | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
> > > > > > >     Disk: 960GB SSD   |-------------------------|-------------------------|
> > > > > > >                       | base  |    patched      | base  |    patched      |
> > > > > > > -------------------|-------|-----------------|-------|-----------------|
> > > > > > > mb_optimize_scan=0 | 4821  | 7612  (+57.8%)  | 15371 | 21647 (+40.8%)  |
> > > > > > > mb_optimize_scan=1 | 4784  | 7568  (+58.1%)  | 6101  | 9117  (+49.4%)  |
> > > > > > > 
> > > > > > > Signed-off-by: Baokun Li <libaokun1@huawei.com>
> > > > > > ...
> > > > > > 
> > > > > > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > > > > > > index 5cdae3bda072..3f103919868b 100644
> > > > > > > --- a/fs/ext4/mballoc.c
> > > > > > > +++ b/fs/ext4/mballoc.c
> > > > > > > @@ -2168,11 +2168,9 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
> > > > > > >     	ac->ac_buddy_folio = e4b->bd_buddy_folio;
> > > > > > >     	folio_get(ac->ac_buddy_folio);
> > > > > > >     	/* store last allocated for subsequent stream allocation */
> > > > > > > -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
> > > > > > > -		spin_lock(&sbi->s_md_lock);
> > > > > > > -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
> > > > > > > -		spin_unlock(&sbi->s_md_lock);
> > > > > > > -	}
> > > > > > > +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
> > > > > > > +		/* pairs with smp_load_acquire in ext4_mb_regular_allocator() */
> > > > > > > +		smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
> > > > > > Do you really need any kind of barrier (implied by smp_store_release())
> > > > > > here? I mean the store to s_mb_last_group is perfectly fine to be reordered
> > > > > > with other accesses from the thread, isn't it? As such it should be enough
> > > > > > to have WRITE_ONCE() here...
> > > > > WRITE_ONCE()/READ_ONCE() primarily prevent compiler reordering and ensure
> > > > > that variable reads/writes access values directly from L1/L2 cache rather
> > > > > than registers.
> > > > I agree READ_ONCE() / WRITE_ONCE() are about compiler optimizations - in
> > > > particular they force the compiler to read / write the memory location
> > > > exactly once instead of reading it potentially multiple times in different
> > > > parts of expression and getting inconsistent values, or possibly writing
> > > > the value say byte by byte (yes, that would be insane but not contrary to
> > > > the C standard).
> > > READ_ONCE() and WRITE_ONCE() rely on the volatile keyword, which serves
> > > two main purposes:
> > > 
> > > 1. It tells the compiler that the variable's value can change unexpectedly,
> > >     preventing the compiler from making incorrect optimizations based on
> > >     assumptions about its stability.
> > > 
> > > 2. It ensures the CPU directly reads from or writes to the variable's
> > >     memory address. This means the value will be fetched from cache (L1/L2)
> > >     if available, or from main memory otherwise, rather than using a stale
> > >     value from a CPU register.
> > Yes, we agree on this.
> > 
> > > > > They do not guarantee that other CPUs see the latest values. Reading stale
> > > > > values could lead to more useless traversals, which might incur higher
> > > > > overhead than memory barriers. This is why we use memory barriers to ensure
> > > > > the latest values are read.
> > > > But smp_load_acquire() / smp_store_release() have no guarantee about CPU
> > > > seeing latest values either. They are just speculation barriers meaning
> > > > they prevent the CPU from reordering accesses in the code after
> > > > smp_load_acquire() to be performed before the smp_load_acquire() is
> > > > executed and similarly with smp_store_release(). So I dare to say that
> > > > > these barriers have no (positive) impact on the allocation performance and
> > > > just complicate the code - but if you have some data that show otherwise,
> > > > I'd be happy to be proven wrong.
> > > smp_load_acquire() / smp_store_release() guarantee that CPUs read the
> > > latest data.
> > > 
> > > For example, imagine a variable a = 0, with both CPU0 and CPU1 having
> > > a=0 in their caches.
> > > 
> > > Without a memory barrier:
> > > When CPU0 executes WRITE_ONCE(a, 1), a=1 is written to the store buffer,
> > > an RFO is broadcast, and CPU0 continues other tasks. After receiving ACKs,
> > > a=1 is written to main memory and becomes visible to other CPUs.
> > > Then, if CPU1 executes READ_ONCE(a), it receives the RFO and adds it to
> > > its invalidation queue. However, it might not process it immediately;
> > > instead, it could perform the read first, potentially still reading a=0
> > > from its cache.
> > > 
> > > With a memory barrier:
> > > When CPU0 executes smp_store_release(&a, 1), a=1 is not only written to
> > > the store buffer, but data in the store buffer is also written to main
> > > memory. An RFO is then broadcast, and CPU0 waits for ACKs from all CPUs.
> > > 
> > > When CPU1 executes smp_load_acquire(a), it receives the RFO and adds it
> > > to its invalidation queue. Here, the invalidation queue is flushed, which
> > > invalidates a in CPU1's cache. CPU1 then replies with an ACK, and when it
> > > performs the read, its cache is invalid, so it reads the latest a=1 from
> > > main memory.
> > Well, here I think you assume way more about the CPU architecture than is
> > generally true (and I didn't find what you write above guaranteed neither
> > by x86 nor by arm64 CPU documentation). Generally I'm following the
> > guarantees as defined by Documentation/memory-barriers.txt and there you
> > can argue only about order of effects as observed by different CPUs but not
> > really about when content is fetched to / from CPU caches.
> 
> Explaining why smp_load_acquire() and smp_store_release() guarantee the
> latest data is read truly requires delving into their underlying
> implementation details.
> 
> I suggest you Google "why memory barriers are needed." You might find
> introductions to concepts like 'Total Store Order', 'Weak Memory Ordering',
> MESI, store buffers, and invalidate queue, along with the stories behind
> them.

Yes, I know these things. Not that I'd be really an expert in them but I'd
call myself familiar enough :). But that is kind of beside the point here.
What I want to point out is that if you have code like:

  some access A
  grp = smp_load_acquire(&sbi->s_mb_last_group)
  some more accesses

then the CPU is fully within its rights to execute them as:

  grp = smp_load_acquire(&sbi->s_mb_last_group)
  some access A
  some more accesses

Now your *particular* ARM64 CPU implementation may never do that,
just as no current x86 CPU does, but some other CPU implementation
may (e.g. an Alpha CPU probably would, as irrelevant as that is these
days :). So using smp_load_acquire() is at best a heuristic that may
happen to provide a fresher value on some CPU models, but it isn't
guaranteed to help for all architectures and all CPU models Linux
supports.
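
Concretely, the alternative I have in mind is just the plain-access
variant of the hunk quoted above, e.g. (untested sketch; the paired
read in ext4_mb_regular_allocator() would use READ_ONCE() the same
way):

	/* store last allocated for subsequent stream allocation */
	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
		WRITE_ONCE(sbi->s_mb_last_group, ac->ac_f_ex.fe_group);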

So can you do me a favor please and do a performance comparison of using
READ_ONCE / WRITE_ONCE vs using smp_load_acquire / smp_store_release on
your Arm64 server for streaming goal management? If smp_load_acquire /
smp_store_release indeed bring any performance benefit for your servers, we
can just stick a comment there explaining why they are used. If they bring
no measurable benefit I'd put READ_ONCE / WRITE_ONCE there for code
simplicity. Do you agree?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
Posted by Baokun Li 3 months ago
On 2025/7/1 20:21, Jan Kara wrote:
> On Tue 01-07-25 10:39:53, Baokun Li wrote:
>> On 2025/7/1 0:32, Jan Kara wrote:
>>> On Mon 30-06-25 17:21:48, Baokun Li wrote:
>>>> On 2025/6/30 15:47, Jan Kara wrote:
>>>>> On Mon 30-06-25 11:48:20, Baokun Li wrote:
>>>>>> On 2025/6/28 2:19, Jan Kara wrote:
>>>>>>> On Mon 23-06-25 15:32:51, Baokun Li wrote:
>>>>>>>> After we optimized the block group lock, we found another lock
>>>>>>>> contention issue when running will-it-scale/fallocate2 with multiple
>>>>>>>> processes. The fallocate's block allocation and the truncate's block
>>>>>>>> release were fighting over the s_md_lock. The problem is, this lock
>>>>>>>> protects totally different things in those two processes: the list of
>>>>>>>> freed data blocks (s_freed_data_list) when releasing, and where to start
>>>>>>>> looking for new blocks (mb_last_group) when allocating.
>>>>>>>>
>>>>>>>> Now we only need to track s_mb_last_group and no longer need to track
>>>>>>>> s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
>>>>>>>> two are consistent, and we can ensure that the s_mb_last_group read is up
>>>>>>>> to date by using smp_store_release/smp_load_acquire.
>>>>>>>>
>>>>>>>> Besides, the s_mb_last_group data type only requires ext4_group_t
>>>>>>>> (i.e., unsigned int), rendering unsigned long superfluous.
>>>>>>>>
>>>>>>>> Performance test data follows:
>>>>>>>>
>>>>>>>> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
>>>>>>>> Observation: Average fallocate operations per container per second.
>>>>>>>>
>>>>>>>>                        | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
>>>>>>>>      Disk: 960GB SSD   |-------------------------|-------------------------|
>>>>>>>>                        | base  |    patched      | base  |    patched      |
>>>>>>>> -------------------|-------|-----------------|-------|-----------------|
>>>>>>>> mb_optimize_scan=0 | 4821  | 7612  (+57.8%)  | 15371 | 21647 (+40.8%)  |
>>>>>>>> mb_optimize_scan=1 | 4784  | 7568  (+58.1%)  | 6101  | 9117  (+49.4%)  |
>>>>>>>>
>>>>>>>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>>>>>>> ...
>>>>>>>
>>>>>>>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>>>>>>>> index 5cdae3bda072..3f103919868b 100644
>>>>>>>> --- a/fs/ext4/mballoc.c
>>>>>>>> +++ b/fs/ext4/mballoc.c
>>>>>>>> @@ -2168,11 +2168,9 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
>>>>>>>>      	ac->ac_buddy_folio = e4b->bd_buddy_folio;
>>>>>>>>      	folio_get(ac->ac_buddy_folio);
>>>>>>>>      	/* store last allocated for subsequent stream allocation */
>>>>>>>> -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
>>>>>>>> -		spin_lock(&sbi->s_md_lock);
>>>>>>>> -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
>>>>>>>> -		spin_unlock(&sbi->s_md_lock);
>>>>>>>> -	}
>>>>>>>> +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
>>>>>>>> +		/* pairs with smp_load_acquire in ext4_mb_regular_allocator() */
>>>>>>>> +		smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
>>>>>>> Do you really need any kind of barrier (implied by smp_store_release())
>>>>>>> here? I mean the store to s_mb_last_group is perfectly fine to be reordered
>>>>>>> with other accesses from the thread, isn't it? As such it should be enough
>>>>>>> to have WRITE_ONCE() here...
>>>>>> WRITE_ONCE()/READ_ONCE() primarily prevent compiler reordering and ensure
>>>>>> that variable reads/writes access values directly from L1/L2 cache rather
>>>>>> than registers.
>>>>> I agree READ_ONCE() / WRITE_ONCE() are about compiler optimizations - in
>>>>> particular they force the compiler to read / write the memory location
>>>>> exactly once instead of reading it potentially multiple times in different
>>>>> parts of expression and getting inconsistent values, or possibly writing
>>>>> the value say byte by byte (yes, that would be insane but not contrary to
>>>>> the C standard).
>>>> READ_ONCE() and WRITE_ONCE() rely on the volatile keyword, which serves
>>>> two main purposes:
>>>>
>>>> 1. It tells the compiler that the variable's value can change unexpectedly,
>>>>      preventing the compiler from making incorrect optimizations based on
>>>>      assumptions about its stability.
>>>>
>>>> 2. It ensures the CPU directly reads from or writes to the variable's
>>>>      memory address. This means the value will be fetched from cache (L1/L2)
>>>>      if available, or from main memory otherwise, rather than using a stale
>>>>      value from a CPU register.
>>> Yes, we agree on this.
>>>
>>>>>> They do not guarantee that other CPUs see the latest values. Reading stale
>>>>>> values could lead to more useless traversals, which might incur higher
>>>>>> overhead than memory barriers. This is why we use memory barriers to ensure
>>>>>> the latest values are read.
>>>>> But smp_load_acquire() / smp_store_release() have no guarantee about CPU
>>>>> seeing latest values either. They are just speculation barriers meaning
>>>>> they prevent the CPU from reordering accesses in the code after
>>>>> smp_load_acquire() to be performed before the smp_load_acquire() is
>>>>> executed and similarly with smp_store_release(). So I dare to say that
>>>>> these barriers have no (positive) impact on the allocation performance and
>>>>> just complicate the code - but if you have some data that show otherwise,
>>>>> I'd be happy to be proven wrong.
>>>> smp_load_acquire() / smp_store_release() guarantee that CPUs read the
>>>> latest data.
>>>>
>>>> For example, imagine a variable a = 0, with both CPU0 and CPU1 having
>>>> a=0 in their caches.
>>>>
>>>> Without a memory barrier:
>>>> When CPU0 executes WRITE_ONCE(a, 1), a=1 is written to the store buffer,
>>>> an RFO is broadcast, and CPU0 continues other tasks. After receiving ACKs,
>>>> a=1 is written to main memory and becomes visible to other CPUs.
>>>> Then, if CPU1 executes READ_ONCE(a), it receives the RFO and adds it to
>>>> its invalidation queue. However, it might not process it immediately;
>>>> instead, it could perform the read first, potentially still reading a=0
>>>> from its cache.
>>>>
>>>> With a memory barrier:
>>>> When CPU0 executes smp_store_release(&a, 1), a=1 is not only written to
>>>> the store buffer, but data in the store buffer is also written to main
>>>> memory. An RFO is then broadcast, and CPU0 waits for ACKs from all CPUs.
>>>>
>>>> When CPU1 executes smp_load_acquire(a), it receives the RFO and adds it
>>>> to its invalidation queue. Here, the invalidation queue is flushed, which
>>>> invalidates a in CPU1's cache. CPU1 then replies with an ACK, and when it
>>>> performs the read, its cache is invalid, so it reads the latest a=1 from
>>>> main memory.
>>> Well, here I think you assume way more about the CPU architecture than is
>>> generally true (and I didn't find what you write above guaranteed neither
>>> by x86 nor by arm64 CPU documentation). Generally I'm following the
>>> guarantees as defined by Documentation/memory-barriers.txt and there you
>>> can argue only about order of effects as observed by different CPUs but not
>>> really about when content is fetched to / from CPU caches.
>> Explaining why smp_load_acquire() and smp_store_release() guarantee the
>> latest data is read truly requires delving into their underlying
>> implementation details.
>>
>> I suggest you Google "why memory barriers are needed." You might find
>> introductions to concepts like 'Total Store Order', 'Weak Memory Ordering',
>> MESI, store buffers, and invalidate queue, along with the stories behind
>> them.
> Yes, I know these things. Not that I'd be really an expert in them but I'd
> call myself familiar enough :). But that is kind of beside the point here.
> What I want to point out is that if you have code like:
>
>    some access A
>    grp = smp_load_acquire(&sbi->s_mb_last_group)
>    some more accesses
>
> then the CPU is fully within its rights to execute them as:
>
>    grp = smp_load_acquire(&sbi->s_mb_last_group)
>    some access A
>    some more accesses
>
> Now your *particular* ARM64 CPU implementation may never do that,
> just as no current x86 CPU does, but some other CPU implementation
> may (e.g. an Alpha CPU probably would, as irrelevant as that is these
> days :). So using smp_load_acquire() is at best a heuristic that may
> happen to provide a fresher value on some CPU models, but it isn't
> guaranteed to help for all architectures and all CPU models Linux
> supports.
>
> So can you do me a favor please and do a performance comparison of using
> READ_ONCE / WRITE_ONCE vs using smp_load_acquire / smp_store_release on
> your Arm64 server for streaming goal management? If smp_load_acquire /
> smp_store_release indeed bring any performance benefit for your servers, we
> can just stick a comment there explaining why they are used. If they bring
> no measurable benefit I'd put READ_ONCE / WRITE_ONCE there for code
> simplicity. Do you agree?
>
> 								Honza

Sorry for getting to this so late – I've been totally overloaded
with stuff recently.

Anyway, back to what we were discussing. I managed to test
the performance difference between READ_ONCE / WRITE_ONCE and
smp_load_acquire / smp_store_release on an ARM64 server.
Here are the results:

CPU: Kunpeng 920
Memory: 512GB
Disk: 960GB SSD (~500M/s)

         | mb_optimize_scan  |       0        |       1        |
         |-------------------|----------------|----------------|
         | Num. containers   |  P80  |   P1   |  P80  |   P1   |
--------|-------------------|-------|--------|-------|--------|
         | acquire/release   | 9899  | 290260 | 5005  | 307361 |
  single | [READ|WRITE]_ONCE | 9636  | 337597 | 4834  | 341440 |
  goal   |-------------------|-------|--------|-------|--------|
         |                   | -2.6% | +16.3% | -3.4% | +11.0% |
--------|-------------------|-------|--------|-------|--------|
         | acquire/release   | 19931 | 290348 | 7365  | 311717 |
  multi  | [READ|WRITE]_ONCE | 19628 | 320885 | 7129  | 321275 |
  goal   |-------------------|-------|--------|-------|--------|
         |                   | -1.5% | +10.5% | -3.2% | +3.0%  |

So, my tests show that READ_ONCE / WRITE_ONCE gives us better
single-threaded performance. That's because it skips the mandatory
CPU-to-CPU syncing. This also helps explain why x86 has double the
disk bandwidth (~1000MB/s) of Arm64, but surprisingly, single
containers run much worse on x86.

However, in multi-threaded scenarios, not consistently reading
the latest goal has these implications:

  * ext4_get_group_info() calls increase, as ext4_mb_good_group_nolock()
    is invoked more often on incorrect groups.

  * ext4_mb_load_buddy() calls increase due to repeated group accesses
    leading to more folio_mark_accessed calls.

  * ext4_mb_prefetch() calls increase with more frequent prefetch_grp
    access. (I suspect the current mb_prefetch mechanism has some inherent
    issues we could optimize later.)

At this point, I believe either approach is acceptable.

What are your thoughts?


Cheers,
Baokun

Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
Posted by Jan Kara 3 months ago
On Tue 08-07-25 21:08:00, Baokun Li wrote:
> Sorry for getting to this so late – I've been totally overloaded
> with stuff recently.
> 
> Anyway, back to what we were discussing. I managed to test
> the performance difference between READ_ONCE / WRITE_ONCE and
> smp_load_acquire / smp_store_release on an ARM64 server.
> Here are the results:
> 
> CPU: Kunpeng 920
> Memory: 512GB
> Disk: 960GB SSD (~500M/s)
> 
>         | mb_optimize_scan  |       0        |       1        |
>         |-------------------|----------------|----------------|
>         | Num. containers   |  P80  |   P1   |  P80  |   P1   |
> --------|-------------------|-------|--------|-------|--------|
>         | acquire/release   | 9899  | 290260 | 5005  | 307361 |
>  single | [READ|WRITE]_ONCE | 9636  | 337597 | 4834  | 341440 |
>  goal   |-------------------|-------|--------|-------|--------|
>         |                   | -2.6% | +16.3% | -3.4% | +11.0% |
> --------|-------------------|-------|--------|-------|--------|
>         | acquire/release   | 19931 | 290348 | 7365  | 311717 |
>  multi  | [READ|WRITE]_ONCE | 19628 | 320885 | 7129  | 321275 |
>  goal   |-------------------|-------|--------|-------|--------|
>         |                   | -1.5% | +10.5% | -3.2% | +3.0%  |
> 
> So, my tests show that READ_ONCE / WRITE_ONCE gives us better
> single-threaded performance. That's because it skips the mandatory
> CPU-to-CPU syncing. This also helps explain why x86 has double the
> disk bandwidth (~1000MB/s) of Arm64, but surprisingly, single
> containers run much worse on x86.

Interesting! Thanks for measuring the data!

> However, in multi-threaded scenarios, not consistently reading
> the latest goal has these implications:
> 
>  * ext4_get_group_info() calls increase, as ext4_mb_good_group_nolock()
>    is invoked more often on incorrect groups.
> 
>  * ext4_mb_load_buddy() calls increase due to repeated group accesses
>    leading to more folio_mark_accessed calls.
> 
>  * ext4_mb_prefetch() calls increase with more frequent prefetch_grp
>    access. (I suspect the current mb_prefetch mechanism has some inherent
>    issues we could optimize later.)
> 
> At this point, I believe either approach is acceptable.
> 
> What are your thoughts?

Yes, apparently both approaches have their pros and cons. I'm actually
surprised the impact of additional barriers on ARM is so big for the
single container case. A 10% gain for the single container case looks
nice, OTOH realistic workloads will have more containers so maybe
that's not worth
optimizing for. Ted, do you have any opinion?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
Posted by Theodore Ts'o 2 months, 3 weeks ago
On Thu, Jul 10, 2025 at 04:38:33PM +0200, Jan Kara wrote:
> 
> Yes, apparently both approaches have their pros and cons. I'm actually
> surprised the impact of additional barriers on ARM is so big for the
> single container case. A 10% gain for the single container case looks
> nice, OTOH realistic workloads will have more containers so maybe
> that's not worth
> optimizing for. Ted, do you have any opinion?

Let me try to summarize; regardless of whether we use
{READ,WRITE}_ONCE or smp_load_acquire / smp_store_release, both are
significantly better than using the spinlock.  The other thing about
the "single-threaded performance" is that the CPU-to-CPU syncing it
adds is not free.  But that synchronization cost only applies when the
single thread is bouncing between CPUs --- if we had a single-threaded
application pinned to a single CPU, the cost of smp_load_acquire
wouldn't be there since the cache line wouldn't be bouncing back and
forth.  Is that correct, or am I missing something?
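
(One way to check that would be to pin the benchmark process before it
starts allocating, e.g. with something along the lines of the sketch
below; the helper name is made up and the benchmark body is omitted.)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling process to CPU 0 so the workload itself never drags
 * the s_mb_last_group cache line between CPUs. */
static int pin_to_cpu0(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);
	if (sched_setaffinity(0, sizeof(set), &set) != 0) {
		perror("sched_setaffinity");
		return -1;
	}
	return 0;
}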

In any case, so long as the single-threaded performance doesn't
regress relative to the current spin_lock implementation, I'm inclined
to prefer the smp_load_acquire approach if it improves
multi-threaded allocation performance on ARM64.

Cheers,

							- Ted
Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
Posted by Baokun Li 2 months, 3 weeks ago
Hello!

On 2025/7/14 11:01, Theodore Ts'o wrote:
> On Thu, Jul 10, 2025 at 04:38:33PM +0200, Jan Kara wrote:
>> Yes, apparently both approaches have their pros and cons. I'm actually
>> surprised the impact of additional barriers on ARM is so big for the
>> single container case. A 10% gain for the single container case looks
>> nice, OTOH realistic workloads will have more containers so maybe
>> that's not worth
>> optimizing for. Ted, do you have any opinion?
> Let me try to summarize; regardless of whether we use
> {READ,WRITE}_ONCE or smp_load_acquire / smp_store_release, both are
> significantly better than using the spinlock.  The other thing about
> the "single-threaded performance" is that the CPU-to-CPU syncing it
> adds is not free.  But that synchronization cost only applies when the
> single thread is bouncing between CPUs --- if we had a single-threaded
> application pinned to a single CPU, the cost of smp_load_acquire
> wouldn't be there since the cache line wouldn't be bouncing back and
> forth.  Is that correct, or am I missing something?
>
> In any case, so long as the single-threaded performance doesn't
> regress relative to the current spin_lock implementation, I'm inclined
> to prefer the smp_load_acquire approach if it improves
> multi-threaded allocation performance on ARM64.
>
> Cheers,
>
> 							- Ted
>
Using {READ,WRITE}_ONCE yielded a very significant improvement in single
container scenarios (10%-16%). Although there was a slight decrease in
multi-container scenarios (-1% to -3%), subsequent optimizations
compensated for this.

To prevent regressions in single-container performance, we ultimately chose
{READ,WRITE}_ONCE for the v3 release last week.

Thank you for your suggestion!


Cheers,
Baokun
Re: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group
Posted by Baokun Li 3 months, 1 week ago
On 2025/7/1 20:21, Jan Kara wrote:
> On Tue 01-07-25 10:39:53, Baokun Li wrote:
>> On 2025/7/1 0:32, Jan Kara wrote:
>>> On Mon 30-06-25 17:21:48, Baokun Li wrote:
>>>> On 2025/6/30 15:47, Jan Kara wrote:
>>>>> On Mon 30-06-25 11:48:20, Baokun Li wrote:
>>>>>> On 2025/6/28 2:19, Jan Kara wrote:
>>>>>>> On Mon 23-06-25 15:32:51, Baokun Li wrote:
>>>>>>>> After we optimized the block group lock, we found another lock
>>>>>>>> contention issue when running will-it-scale/fallocate2 with multiple
>>>>>>>> processes. The fallocate's block allocation and the truncate's block
>>>>>>>> release were fighting over the s_md_lock. The problem is, this lock
>>>>>>>> protects totally different things in those two processes: the list of
>>>>>>>> freed data blocks (s_freed_data_list) when releasing, and where to start
>>>>>>>> looking for new blocks (mb_last_group) when allocating.
>>>>>>>>
>>>>>>>> Now we only need to track s_mb_last_group and no longer need to track
>>>>>>>> s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
>>>>>>>> two are consistent, and we can ensure that the s_mb_last_group read is up
>>>>>>>> to date by using smp_store_release/smp_load_acquire.
>>>>>>>>
>>>>>>>> Besides, the s_mb_last_group data type only requires ext4_group_t
>>>>>>>> (i.e., unsigned int), rendering unsigned long superfluous.
>>>>>>>>
>>>>>>>> Performance test data follows:
>>>>>>>>
>>>>>>>> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
>>>>>>>> Observation: Average fallocate operations per container per second.
>>>>>>>>
>>>>>>>>                        | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
>>>>>>>>      Disk: 960GB SSD   |-------------------------|-------------------------|
>>>>>>>>                        | base  |    patched      | base  |    patched      |
>>>>>>>> -------------------|-------|-----------------|-------|-----------------|
>>>>>>>> mb_optimize_scan=0 | 4821  | 7612  (+57.8%)  | 15371 | 21647 (+40.8%)  |
>>>>>>>> mb_optimize_scan=1 | 4784  | 7568  (+58.1%)  | 6101  | 9117  (+49.4%)  |
>>>>>>>>
>>>>>>>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>>>>>>> ...
>>>>>>>
>>>>>>>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>>>>>>>> index 5cdae3bda072..3f103919868b 100644
>>>>>>>> --- a/fs/ext4/mballoc.c
>>>>>>>> +++ b/fs/ext4/mballoc.c
>>>>>>>> @@ -2168,11 +2168,9 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
>>>>>>>>      	ac->ac_buddy_folio = e4b->bd_buddy_folio;
>>>>>>>>      	folio_get(ac->ac_buddy_folio);
>>>>>>>>      	/* store last allocated for subsequent stream allocation */
>>>>>>>> -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
>>>>>>>> -		spin_lock(&sbi->s_md_lock);
>>>>>>>> -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
>>>>>>>> -		spin_unlock(&sbi->s_md_lock);
>>>>>>>> -	}
>>>>>>>> +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
>>>>>>>> +		/* pairs with smp_load_acquire in ext4_mb_regular_allocator() */
>>>>>>>> +		smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
>>>>>>> Do you really need any kind of barrier (implied by smp_store_release())
>>>>>>> here? I mean the store to s_mb_last_group is perfectly fine to be reordered
>>>>>>> with other accesses from the thread, isn't it? As such it should be enough
>>>>>>> to have WRITE_ONCE() here...
>>>>>> WRITE_ONCE()/READ_ONCE() primarily prevent compiler reordering and ensure
>>>>>> that variable reads/writes access values directly from L1/L2 cache rather
>>>>>> than registers.
>>>>> I agree READ_ONCE() / WRITE_ONCE() are about compiler optimizations - in
>>>>> particular they force the compiler to read / write the memory location
>>>>> exactly once instead of reading it potentially multiple times in different
>>>>> parts of expression and getting inconsistent values, or possibly writing
>>>>> the value say byte by byte (yes, that would be insane but not contrary to
>>>>> the C standard).
>>>> READ_ONCE() and WRITE_ONCE() rely on the volatile keyword, which serves
>>>> two main purposes:
>>>>
>>>> 1. It tells the compiler that the variable's value can change unexpectedly,
>>>>      preventing the compiler from making incorrect optimizations based on
>>>>      assumptions about its stability.
>>>>
>>>> 2. It ensures the CPU directly reads from or writes to the variable's
>>>>      memory address. This means the value will be fetched from cache (L1/L2)
>>>>      if available, or from main memory otherwise, rather than using a stale
>>>>      value from a CPU register.
>>> Yes, we agree on this.
>>>
>>>>>> They do not guarantee that other CPUs see the latest values. Reading stale
>>>>>> values could lead to more useless traversals, which might incur higher
>>>>>> overhead than memory barriers. This is why we use memory barriers to ensure
>>>>>> the latest values are read.
>>>>> But smp_load_acquire() / smp_store_release() have no guarantee about CPU
>>>>> seeing latest values either. They are just speculation barriers meaning
>>>>> they prevent the CPU from reordering accesses in the code after
>>>>> smp_load_acquire() to be performed before the smp_load_acquire() is
>>>>> executed and similarly with smp_store_release(). So I dare to say that
>>>>> these barriers have no (positive) impact on the allocation performance and
>>>>> just complicate the code - but if you have some data that show otherwise,
>>>>> I'd be happy to be proven wrong.
>>>> smp_load_acquire() / smp_store_release() guarantee that CPUs read the
>>>> latest data.
>>>>
>>>> For example, imagine a variable a = 0, with both CPU0 and CPU1 having
>>>> a=0 in their caches.
>>>>
>>>> Without a memory barrier:
>>>> When CPU0 executes WRITE_ONCE(a, 1), a=1 is written to the store buffer,
>>>> an RFO is broadcast, and CPU0 continues other tasks. After receiving ACKs,
>>>> a=1 is written to main memory and becomes visible to other CPUs.
>>>> Then, if CPU1 executes READ_ONCE(a), it receives the RFO and adds it to
>>>> its invalidation queue. However, it might not process it immediately;
>>>> instead, it could perform the read first, potentially still reading a=0
>>>> from its cache.
>>>>
>>>> With a memory barrier:
>>>> When CPU0 executes smp_store_release(&a, 1), a=1 is not only written to
>>>> the store buffer, but data in the store buffer is also written to main
>>>> memory. An RFO is then broadcast, and CPU0 waits for ACKs from all CPUs.
>>>>
>>>> When CPU1 executes smp_load_acquire(a), it receives the RFO and adds it
>>>> to its invalidation queue. Here, the invalidation queue is flushed, which
>>>> invalidates a in CPU1's cache. CPU1 then replies with an ACK, and when it
>>>> performs the read, its cache is invalid, so it reads the latest a=1 from
>>>> main memory.
>>> Well, here I think you assume way more about the CPU architecture than is
>>> generally true (and I didn't find what you write above guaranteed neither
>>> by x86 nor by arm64 CPU documentation). Generally I'm following the
>>> guarantees as defined by Documentation/memory-barriers.txt and there you
>>> can argue only about order of effects as observed by different CPUs but not
>>> really about when content is fetched to / from CPU caches.
>> Explaining why smp_load_acquire() and smp_store_release() guarantee the
>> latest data is read truly requires delving into their underlying
>> implementation details.
>>
>> I suggest you Google "why memory barriers are needed." You might find
>> introductions to concepts like 'Total Store Order', 'Weak Memory Ordering',
>> MESI, store buffers, and invalidate queue, along with the stories behind
>> them.
> Yes, I know these things. Not that I'd be really an expert in them but I'd
> call myself familiar enough :). But that is kind of beside the point here.
> What I want to point out is that if you have code like:
>
>    some access A
>    grp = smp_load_acquire(&sbi->s_mb_last_group)
>    some more accesses
>
> then the CPU is fully within its rights to execute them as:
>
>    grp = smp_load_acquire(&sbi->s_mb_last_group)
>    some access A
>    some more accesses
>
> Now your *particular* ARM64 CPU implementation may never do that,
> just as no current x86 CPU does, but some other CPU implementation
> may (e.g. an Alpha CPU probably would, as irrelevant as that is these
> days :). So using smp_load_acquire() is at best a heuristic that may
> happen to provide a fresher value on some CPU models, but it isn't
> guaranteed to help for all architectures and all CPU models Linux
> supports.
Yes, it's true that the underlying implementation of
smp_load_acquire() can differ somewhat across various
processor architectures.
>
> So can you do me a favor please and do a performance comparison of using
> READ_ONCE / WRITE_ONCE vs using smp_load_acquire / smp_store_release on
> your Arm64 server for streaming goal management? If smp_load_acquire /
> smp_store_release indeed bring any performance benefit for your servers, we
> can just stick a comment there explaining why they are used. If they bring
> no measurable benefit I'd put READ_ONCE / WRITE_ONCE there for code
> simplicity. Do you agree?
>
> 								Honza

Okay, no problem. I'll get an ARM server from the resource pool to test
the difference between the two. If there's no difference, replacing them
with READ_ONCE/WRITE_ONCE would be acceptable.


Cheers,
Baokun