A global counter for the magnitude of memcg stats updates is maintained
on the memcg side to avoid invoking rstat flushes when the pending
updates are not significant. This avoids unnecessary flushes, which are
not cheap even when there are not a lot of stats to flush. It also
avoids unnecessary lock contention on the underlying global rstat lock.
Make this threshold per-memcg. The same scheme is used: percpu (now
also per-memcg) counters are incremented in the update path, and only
propagated to per-memcg atomics when they exceed a certain threshold.
This provides two benefits:
(a) On large machines with a lot of memcgs, the global threshold can be
reached relatively fast, so guarding the underlying lock becomes less
effective. Making the threshold per-memcg avoids this.
(b) Having a global threshold makes it hard to do subtree flushes, as
we cannot reset the global counter except for a full flush. Per-memcg
counters remove this blocker from doing subtree flushes, which helps
avoid unnecessary work when the stats of a small subtree are needed.
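The propagation scheme above can be sketched in plain user-space C as
follows. This is only an illustration, not the kernel code: the struct
layout and the NCPUS/BATCH names are made up for the sketch, plain
arrays stand in for real percpu counters, and C11 atomics stand in for
atomic64_t:

```c
/*
 * Simplified sketch of per-memcg update batching. Illustrative only:
 * NCPUS, BATCH and struct memcg are stand-ins, not kernel definitions.
 */
#include <stdatomic.h>
#include <stdlib.h>

#define NCPUS 4
#define BATCH 64 /* stands in for MEMCG_CHARGE_BATCH */

struct memcg {
	struct memcg *parent;
	unsigned int percpu_updates[NCPUS]; /* pending magnitude per cpu */
	_Atomic long stats_updates;         /* propagated total since flush */
};

int should_flush(struct memcg *memcg)
{
	return atomic_load(&memcg->stats_updates) > (long)BATCH * NCPUS;
}

/* Called on every stat update of magnitude |val| from @cpu. */
void rstat_updated(struct memcg *memcg, int cpu, int val)
{
	for (; memcg; memcg = memcg->parent) {
		unsigned int x = memcg->percpu_updates[cpu] += abs(val);

		if (x < BATCH)
			continue;
		/* Propagate to the per-memcg atomic, reset the percpu part. */
		if (!should_flush(memcg))
			atomic_fetch_add(&memcg->stats_updates, x);
		memcg->percpu_updates[cpu] = 0;
	}
}
```

The key property is that the per-memcg atomic is only touched once per
BATCH percpu updates, and not at all once the memcg is already
considered flush-able.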
Nothing is free, of course. This comes at a cost:
(a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4
bytes. The extra memory usage is insignificant.
(b) More work on the update side, although in the common case it will
only be percpu counter updates. The amount of work scales with the
number of ancestors (i.e. tree depth). This is not a new concept; adding
a cgroup to the rstat tree involves a parent loop, as does charging.
Testing results below show no significant regressions.
(c) The error margin in the stats for the system as a whole increases
from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH *
NR_MEMCGS. This is probably fine because we have a similar per-memcg
error in charges coming from percpu stocks, and we have a periodic
flusher that makes sure we always flush all the stats every 2s anyway.
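As a rough sanity check on the error bound above, the arithmetic can be
written out. The batch size below matches MEMCG_CHARGE_BATCH in current
kernels; the CPU and memcg counts are made-up example values:

```c
/*
 * Back-of-the-envelope for the stats error bound. The constants passed
 * in by callers are examples, not measured values.
 */
long memcg_error_bound(long nr_cpus, long batch)
{
	/* Worst-case pending update magnitude per memcg before a flush. */
	return nr_cpus * batch;
}

long system_error_bound(long nr_cpus, long batch, long nr_memcgs)
{
	/* With per-memcg thresholds, every memcg can hold that much. */
	return memcg_error_bound(nr_cpus, batch) * nr_memcgs;
}
```

For example, with a batch of 64 on a hypothetical 104-CPU machine
running 1000 memcgs, the per-memcg bound is 6656 pending updates and
the system-wide bound is 1000x that.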
This patch was tested to make sure no significant regressions are
introduced on the update path as follows. The following benchmarks were
run in a cgroup that is 4 levels deep (/sys/fs/cgroup/a/b/c/d), which is
deeper than a usual setup:
(a) neper [1] with 1000 flows and 100 threads (single machine). The
values in the table are the average of server and client throughputs in
mbps after 30 iterations, each running for 30s:
                     tcp_rr       tcp_stream
Base                 9504218.56   357366.84
Patched              9656205.68   356978.39
Delta                +1.6%        -0.1%
Standard Deviation   0.95%        1.03%
An increase in the performance of tcp_rr doesn't really make sense, but
it's probably in the noise. The same tests were run with 1 flow and 1
thread, but the throughput was too noisy to make any conclusions (the
averages nonetheless did not show regressions).
Looking at perf for one iteration of the above test, __mod_memcg_state()
(which is where memcg_rstat_updated() is called) does not show up at all
without this patch, but it shows up with this patch as 1.06% for tcp_rr
and 0.36% for tcp_stream.
(b) "stress-ng --vm 0 -t 1m --times --perf". I don't understand
stress-ng very well, so I am not sure that's the best way to test this,
but it spawns 384 workers and spits out a lot of metrics, which looks
nice :) I picked a few that seem relevant to the stats update path. I
also included cache misses, as this patch introduces more atomics that
may bounce between cpu caches:
Metric                  Base            Patched         Delta
Cache Misses            3.394 B/sec     3.433 B/sec     +1.14%
Cache L1D Read          0.148 T/sec     0.154 T/sec     +4.05%
Cache L1D Read Miss     20.430 B/sec    21.820 B/sec    +6.8%
Page Faults Total       4.304 M/sec     4.535 M/sec     +5.4%
Page Faults Minor       4.304 M/sec     4.535 M/sec     +5.4%
Page Faults Major       18.794 /sec     0.000 /sec
Kmalloc                 0.153 M/sec     0.152 M/sec     -0.65%
Kfree                   0.152 M/sec     0.153 M/sec     +0.65%
MM Page Alloc           4.640 M/sec     4.898 M/sec     +5.56%
MM Page Free            4.639 M/sec     4.897 M/sec     +5.56%
Lock Contention Begin   0.362 M/sec     0.479 M/sec     +32.32%
Lock Contention End     0.362 M/sec     0.479 M/sec     +32.32%
page-cache add          238.057 /sec    0.000 /sec
page-cache del          6.265 /sec      6.267 /sec      -0.03%
This is only using a single run in each case. I am not sure what to
make of most of these numbers, but they mostly seem to be in the noise
(some better, some worse). The lock contention numbers are interesting.
I am not sure if higher is better or worse here. Either way, no new
locks or lock sections are introduced by this patch.
Looking at perf, __mod_memcg_state() shows up as 0.00% with and without
this patch. This is suspicious, but I verified while stress-ng was
running that all the threads are in the right cgroup.
(c) will-it-scale page_fault tests. These tests (specifically
per_process_ops in the page_fault3 test) previously detected a 25.9%
regression for a change in the stats update path [2]. These are the
numbers from 30 runs (+ is good):
LABEL                         | MEAN        | MEDIAN      | STDDEV      |
------------------------------+-------------+-------------+-------------
page_fault1_per_process_ops   |             |             |             |
(A) base                      |  265207.738 |  262941.000 |   12112.379 |
(B) patched                   |  249249.191 |  248781.000 |    8767.457 |
                              |      -6.02% |      -5.39% |             |
page_fault1_per_thread_ops    |             |             |             |
(A) base                      |  241618.484 |  240209.000 |   10162.207 |
(B) patched                   |  229820.671 |  229108.000 |    7506.582 |
                              |      -4.88% |      -4.62% |             |
page_fault1_scalability       |             |             |             |
(A) base                      |     0.03545 |    0.035705 |   0.0015837 |
(B) patched                   |    0.029952 |    0.029957 |   0.0013551 |
                              |      -9.29% |      -9.35% |             |
page_fault2_per_process_ops   |             |             |             |
(A) base                      |  203916.148 |  203496.000 |    2908.331 |
(B) patched                   |  186975.419 |  187023.000 |    1991.100 |
                              |      -6.85% |      -6.90% |             |
page_fault2_per_thread_ops    |             |             |             |
(A) base                      |  170604.972 |  170532.000 |    1624.834 |
(B) patched                   |  163100.260 |  163263.000 |    1517.967 |
                              |      -4.40% |      -4.26% |             |
page_fault2_scalability       |             |             |             |
(A) base                      |    0.054603 |    0.054693 |  0.00080196 |
(B) patched                   |    0.044882 |    0.044957 |   0.0011766 |
                              |      -0.05% |      +0.33% |             |
page_fault3_per_process_ops   |             |             |             |
(A) base                      | 1299821.099 | 1297918.000 |    9882.872 |
(B) patched                   | 1248700.839 | 1247168.000 |    8454.891 |
                              |      -3.93% |      -3.91% |             |
page_fault3_per_thread_ops    |             |             |             |
(A) base                      |  387216.963 |  387115.000 |    1605.760 |
(B) patched                   |  368538.213 |  368826.000 |    1852.594 |
                              |      -4.82% |      -4.72% |             |
page_fault3_scalability       |             |             |             |
(A) base                      |     0.59909 |     0.59367 |     0.01256 |
(B) patched                   |     0.59995 |     0.59769 |    0.010088 |
                              |      +0.14% |      +0.68% |             |
There are some microbenchmark regressions (and some minute
improvements), but nothing outside the normal variance of this benchmark
between kernel versions. The fix for [2] assumed that 3% is noise (and
there were no further practical complaints), so hopefully this means
that such variations in these microbenchmarks do not reflect on
practical workloads.
[1] https://github.com/google/neper
[2] https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
mm/memcontrol.c | 49 +++++++++++++++++++++++++++++++++----------------
1 file changed, 33 insertions(+), 16 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a393f1399a2b..9a586893bd3e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -627,6 +627,9 @@ struct memcg_vmstats_percpu {
/* Cgroup1: threshold notifications & softlimit tree updates */
unsigned long nr_page_events;
unsigned long targets[MEM_CGROUP_NTARGETS];
+
+ /* Stats updates since the last flush */
+ unsigned int stats_updates;
};
struct memcg_vmstats {
@@ -641,6 +644,9 @@ struct memcg_vmstats {
/* Pending child counts during tree propagation */
long state_pending[MEMCG_NR_STAT];
unsigned long events_pending[NR_MEMCG_EVENTS];
+
+ /* Stats updates since the last flush */
+ atomic64_t stats_updates;
};
/*
@@ -660,9 +666,7 @@ struct memcg_vmstats {
*/
static void flush_memcg_stats_dwork(struct work_struct *w);
static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
-static DEFINE_PER_CPU(unsigned int, stats_updates);
static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
-static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
static u64 flush_last_time;
#define FLUSH_TIME (2UL*HZ)
@@ -689,26 +693,37 @@ static void memcg_stats_unlock(void)
preempt_enable_nested();
}
+
+static bool memcg_should_flush_stats(struct mem_cgroup *memcg)
+{
+ return atomic64_read(&memcg->vmstats->stats_updates) >
+ MEMCG_CHARGE_BATCH * num_online_cpus();
+}
+
static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
{
+ int cpu = smp_processor_id();
unsigned int x;
if (!val)
return;
- cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
+ cgroup_rstat_updated(memcg->css.cgroup, cpu);
+
+ for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+ x = __this_cpu_add_return(memcg->vmstats_percpu->stats_updates,
+ abs(val));
+
+ if (x < MEMCG_CHARGE_BATCH)
+ continue;
- x = __this_cpu_add_return(stats_updates, abs(val));
- if (x > MEMCG_CHARGE_BATCH) {
/*
- * If stats_flush_threshold exceeds the threshold
- * (>num_online_cpus()), cgroup stats update will be triggered
- * in __mem_cgroup_flush_stats(). Increasing this var further
- * is redundant and simply adds overhead in atomic update.
+ * If @memcg is already flush-able, increasing stats_updates is
+ * redundant. Avoid the overhead of the atomic update.
*/
- if (atomic_read(&stats_flush_threshold) <= num_online_cpus())
- atomic_add(x / MEMCG_CHARGE_BATCH, &stats_flush_threshold);
- __this_cpu_write(stats_updates, 0);
+ if (!memcg_should_flush_stats(memcg))
+ atomic64_add(x, &memcg->vmstats->stats_updates);
+ __this_cpu_write(memcg->vmstats_percpu->stats_updates, 0);
}
}
@@ -727,13 +742,12 @@ static void do_flush_stats(void)
cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
- atomic_set(&stats_flush_threshold, 0);
atomic_set(&stats_flush_ongoing, 0);
}
void mem_cgroup_flush_stats(void)
{
- if (atomic_read(&stats_flush_threshold) > num_online_cpus())
+ if (memcg_should_flush_stats(root_mem_cgroup))
do_flush_stats();
}
@@ -747,8 +761,8 @@ void mem_cgroup_flush_stats_ratelimited(void)
static void flush_memcg_stats_dwork(struct work_struct *w)
{
/*
- * Always flush here so that flushing in latency-sensitive paths is
- * as cheap as possible.
+ * Deliberately ignore memcg_should_flush_stats() here so that flushing
+ * in latency-sensitive paths is as cheap as possible.
*/
do_flush_stats();
queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
@@ -5803,6 +5817,9 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
}
}
}
+ /* We are in a per-cpu loop here, only do the atomic write once */
+ if (atomic64_read(&memcg->vmstats->stats_updates))
+ atomic64_set(&memcg->vmstats->stats_updates, 0);
}
#ifdef CONFIG_MMU
--
2.42.0.609.gbb76f46606-goog
Hello,
kernel test robot noticed a -25.8% regression of will-it-scale.per_thread_ops on:
commit: 51d74c18a9c61e7ee33bc90b522dd7f6e5b80bb5 ("[PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg")
url: https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/all/20231010032117.1577496-4-yosryahmed@google.com/
patch subject: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg
testcase: will-it-scale
test machine: 104 threads 2 sockets (Skylake) with 192G memory
parameters:
nr_task: 100%
mode: thread
test: fallocate1
cpufreq_governor: performance
In addition to that, the commit also has significant impact on the following tests:
+------------------+---------------------------------------------------------------+
| testcase: change | will-it-scale: will-it-scale.per_thread_ops -30.0% regression |
| test machine | 104 threads 2 sockets (Skylake) with 192G memory |
| test parameters | cpufreq_governor=performance |
| | mode=thread |
| | nr_task=50% |
| | test=fallocate1 |
+------------------+---------------------------------------------------------------+
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202310202303.c68e7639-oliver.sang@intel.com
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231020/202310202303.c68e7639-oliver.sang@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-8.3/thread/100%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale
commit:
130617edc1 ("mm: memcg: move vmstats structs definition above flushing code")
51d74c18a9 ("mm: memcg: make stats flushing threshold per-memcg")
130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522
---------------- ---------------------------
%stddev %change %stddev
\ | \
2.09 -0.5 1.61 ± 2% mpstat.cpu.all.usr%
27.58 +3.7% 28.59 turbostat.RAMWatt
3324 -10.0% 2993 vmstat.system.cs
1056 -100.0% 0.00 numa-meminfo.node0.Inactive(file)
6.67 ±141% +15799.3% 1059 numa-meminfo.node1.Inactive(file)
120.83 ± 11% +79.6% 217.00 ± 9% perf-c2c.DRAM.local
594.50 ± 6% +43.8% 854.83 ± 5% perf-c2c.DRAM.remote
3797041 -25.8% 2816352 will-it-scale.104.threads
36509 -25.8% 27079 will-it-scale.per_thread_ops
3797041 -25.8% 2816352 will-it-scale.workload
1.142e+09 -26.2% 8.437e+08 numa-numastat.node0.local_node
1.143e+09 -26.1% 8.439e+08 numa-numastat.node0.numa_hit
1.148e+09 -25.4% 8.563e+08 ± 2% numa-numastat.node1.local_node
1.149e+09 -25.4% 8.564e+08 ± 2% numa-numastat.node1.numa_hit
32933 -2.6% 32068 proc-vmstat.nr_slab_reclaimable
2.291e+09 -25.8% 1.7e+09 proc-vmstat.numa_hit
2.291e+09 -25.8% 1.7e+09 proc-vmstat.numa_local
2.29e+09 -25.8% 1.699e+09 proc-vmstat.pgalloc_normal
2.289e+09 -25.8% 1.699e+09 proc-vmstat.pgfree
1.00 ± 93% +154.2% 2.55 ± 16% perf-sched.sch_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
191.10 ± 2% +18.0% 225.55 ± 2% perf-sched.wait_and_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
385.50 ± 14% +39.6% 538.17 ± 12% perf-sched.wait_and_delay.count.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
118.67 ± 11% -62.6% 44.33 ±100% perf-sched.wait_and_delay.count.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
5043 ± 2% -13.0% 4387 ± 6% perf-sched.wait_and_delay.count.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
167.12 ±222% +200.1% 501.48 ± 99% perf-sched.wait_and_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
191.09 ± 2% +18.0% 225.53 ± 2% perf-sched.wait_time.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
293.46 ± 4% +12.8% 330.98 ± 6% perf-sched.wait_time.max.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
199.33 -100.0% 0.00 numa-vmstat.node0.nr_active_file
264.00 -100.0% 0.00 numa-vmstat.node0.nr_inactive_file
199.33 -100.0% 0.00 numa-vmstat.node0.nr_zone_active_file
264.00 -100.0% 0.00 numa-vmstat.node0.nr_zone_inactive_file
1.143e+09 -26.1% 8.439e+08 numa-vmstat.node0.numa_hit
1.142e+09 -26.2% 8.437e+08 numa-vmstat.node0.numa_local
1.67 ±141% +15799.3% 264.99 numa-vmstat.node1.nr_inactive_file
1.67 ±141% +15799.3% 264.99 numa-vmstat.node1.nr_zone_inactive_file
1.149e+09 -25.4% 8.564e+08 ± 2% numa-vmstat.node1.numa_hit
1.148e+09 -25.4% 8.563e+08 ± 2% numa-vmstat.node1.numa_local
0.59 ± 3% +125.2% 1.32 ± 2% perf-stat.i.MPKI
9.027e+09 -17.9% 7.408e+09 perf-stat.i.branch-instructions
0.64 -0.0 0.60 perf-stat.i.branch-miss-rate%
58102855 -23.3% 44580037 ± 2% perf-stat.i.branch-misses
15.28 +7.0 22.27 perf-stat.i.cache-miss-rate%
25155306 ± 2% +82.7% 45953601 ± 3% perf-stat.i.cache-misses
1.644e+08 +25.4% 2.062e+08 ± 2% perf-stat.i.cache-references
3258 -10.3% 2921 perf-stat.i.context-switches
6.73 +23.3% 8.30 perf-stat.i.cpi
145.97 -1.3% 144.13 perf-stat.i.cpu-migrations
11519 ± 3% -45.4% 6293 ± 3% perf-stat.i.cycles-between-cache-misses
0.04 -0.0 0.03 perf-stat.i.dTLB-load-miss-rate%
3921408 -25.3% 2929564 perf-stat.i.dTLB-load-misses
1.098e+10 -18.1% 8.993e+09 perf-stat.i.dTLB-loads
0.00 ± 2% +0.0 0.00 ± 4% perf-stat.i.dTLB-store-miss-rate%
5.606e+09 -23.2% 4.304e+09 perf-stat.i.dTLB-stores
95.65 -1.2 94.49 perf-stat.i.iTLB-load-miss-rate%
3876741 -25.0% 2905764 perf-stat.i.iTLB-load-misses
4.286e+10 -18.9% 3.477e+10 perf-stat.i.instructions
11061 +8.2% 11969 perf-stat.i.instructions-per-iTLB-miss
0.15 -18.9% 0.12 perf-stat.i.ipc
48.65 ± 2% +46.2% 71.11 ± 2% perf-stat.i.metric.K/sec
247.84 -18.9% 201.05 perf-stat.i.metric.M/sec
3138385 ± 2% +77.7% 5578401 ± 2% perf-stat.i.node-load-misses
375827 ± 3% +69.2% 635857 ± 11% perf-stat.i.node-loads
1343194 -26.8% 983668 perf-stat.i.node-store-misses
51550 ± 3% -19.0% 41748 ± 7% perf-stat.i.node-stores
0.59 ± 3% +125.1% 1.32 ± 2% perf-stat.overall.MPKI
0.64 -0.0 0.60 perf-stat.overall.branch-miss-rate%
15.30 +7.0 22.28 perf-stat.overall.cache-miss-rate%
6.73 +23.3% 8.29 perf-stat.overall.cpi
11470 ± 2% -45.3% 6279 ± 2% perf-stat.overall.cycles-between-cache-misses
0.04 -0.0 0.03 perf-stat.overall.dTLB-load-miss-rate%
0.00 ± 2% +0.0 0.00 ± 4% perf-stat.overall.dTLB-store-miss-rate%
95.56 -1.4 94.17 perf-stat.overall.iTLB-load-miss-rate%
11059 +8.2% 11967 perf-stat.overall.instructions-per-iTLB-miss
0.15 -18.9% 0.12 perf-stat.overall.ipc
3396437 +9.5% 3718021 perf-stat.overall.path-length
8.997e+09 -17.9% 7.383e+09 perf-stat.ps.branch-instructions
57910417 -23.3% 44426577 ± 2% perf-stat.ps.branch-misses
25075498 ± 2% +82.7% 45803186 ± 3% perf-stat.ps.cache-misses
1.639e+08 +25.4% 2.056e+08 ± 2% perf-stat.ps.cache-references
3247 -10.3% 2911 perf-stat.ps.context-switches
145.47 -1.3% 143.61 perf-stat.ps.cpu-migrations
3908900 -25.3% 2920218 perf-stat.ps.dTLB-load-misses
1.094e+10 -18.1% 8.963e+09 perf-stat.ps.dTLB-loads
5.587e+09 -23.2% 4.289e+09 perf-stat.ps.dTLB-stores
3863663 -25.0% 2895895 perf-stat.ps.iTLB-load-misses
4.272e+10 -18.9% 3.466e+10 perf-stat.ps.instructions
3128132 ± 2% +77.7% 5559939 ± 2% perf-stat.ps.node-load-misses
375403 ± 3% +69.0% 634300 ± 11% perf-stat.ps.node-loads
1338688 -26.8% 980311 perf-stat.ps.node-store-misses
51546 ± 3% -19.1% 41692 ± 7% perf-stat.ps.node-stores
1.29e+13 -18.8% 1.047e+13 perf-stat.total.instructions
0.96 -0.3 0.70 ± 2% perf-profile.calltrace.cycles-pp.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
0.97 -0.3 0.72 perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.fallocate64
0.76 ± 2% -0.2 0.54 ± 3% perf-profile.calltrace.cycles-pp.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
0.82 -0.2 0.60 ± 2% perf-profile.calltrace.cycles-pp.alloc_pages_mpol.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.91 -0.2 0.72 perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
0.68 +0.1 0.76 ± 2% perf-profile.calltrace.cycles-pp.lru_add_fn.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp
1.67 +0.1 1.77 perf-profile.calltrace.cycles-pp.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
1.78 ± 2% +0.1 1.92 ± 2% perf-profile.calltrace.cycles-pp.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change
0.69 ± 5% +0.1 0.84 ± 4% perf-profile.calltrace.cycles-pp.get_mem_cgroup_from_mm.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
1.56 ± 2% +0.2 1.76 ± 2% perf-profile.calltrace.cycles-pp.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr
0.85 ± 4% +0.4 1.23 ± 2% perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.78 ± 4% +0.4 1.20 ± 3% perf-profile.calltrace.cycles-pp.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range
0.73 ± 4% +0.4 1.17 ± 3% perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio
48.39 +0.8 49.14 perf-profile.calltrace.cycles-pp.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
0.00 +0.8 0.77 ± 4% perf-profile.calltrace.cycles-pp.mem_cgroup_commit_charge.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
40.24 +0.8 41.03 perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp
40.22 +0.8 41.01 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio
0.00 +0.8 0.79 ± 3% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp
40.19 +0.8 40.98 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru
1.33 ± 5% +0.8 2.13 ± 4% perf-profile.calltrace.cycles-pp.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
48.16 +0.8 48.98 perf-profile.calltrace.cycles-pp.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
0.00 +0.9 0.88 ± 2% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio
47.92 +0.9 48.81 perf-profile.calltrace.cycles-pp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe
47.07 +0.9 48.01 perf-profile.calltrace.cycles-pp.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
46.59 +1.1 47.64 perf-profile.calltrace.cycles-pp.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate
0.99 -0.3 0.73 ± 2% perf-profile.children.cycles-pp.syscall_return_via_sysret
0.96 -0.3 0.70 ± 2% perf-profile.children.cycles-pp.shmem_alloc_folio
0.78 ± 2% -0.2 0.56 ± 3% perf-profile.children.cycles-pp.shmem_inode_acct_blocks
0.83 -0.2 0.61 ± 2% perf-profile.children.cycles-pp.alloc_pages_mpol
0.92 -0.2 0.73 perf-profile.children.cycles-pp.syscall_exit_to_user_mode
0.74 ± 2% -0.2 0.55 ± 2% perf-profile.children.cycles-pp.xas_store
0.67 -0.2 0.50 ± 3% perf-profile.children.cycles-pp.__alloc_pages
0.43 -0.1 0.31 ± 2% perf-profile.children.cycles-pp.__entry_text_start
0.41 ± 2% -0.1 0.30 ± 3% perf-profile.children.cycles-pp.free_unref_page_list
0.35 -0.1 0.25 ± 2% perf-profile.children.cycles-pp.xas_load
0.35 ± 2% -0.1 0.25 ± 4% perf-profile.children.cycles-pp.__mod_lruvec_state
0.39 -0.1 0.30 ± 2% perf-profile.children.cycles-pp.get_page_from_freelist
0.27 ± 2% -0.1 0.19 ± 4% perf-profile.children.cycles-pp.__mod_node_page_state
0.32 ± 3% -0.1 0.24 ± 3% perf-profile.children.cycles-pp.find_lock_entries
0.23 ± 2% -0.1 0.15 ± 4% perf-profile.children.cycles-pp.xas_descend
0.28 ± 3% -0.1 0.20 ± 3% perf-profile.children.cycles-pp._raw_spin_lock
0.25 ± 3% -0.1 0.18 ± 3% perf-profile.children.cycles-pp.__dquot_alloc_space
0.16 ± 3% -0.1 0.10 ± 5% perf-profile.children.cycles-pp.xas_find_conflict
0.26 ± 2% -0.1 0.20 ± 3% perf-profile.children.cycles-pp.filemap_get_entry
0.26 -0.1 0.20 ± 2% perf-profile.children.cycles-pp.rmqueue
0.20 ± 3% -0.1 0.14 ± 3% perf-profile.children.cycles-pp.truncate_cleanup_folio
0.19 ± 5% -0.1 0.14 ± 4% perf-profile.children.cycles-pp.xas_clear_mark
0.17 ± 5% -0.0 0.12 ± 4% perf-profile.children.cycles-pp.xas_init_marks
0.15 ± 4% -0.0 0.10 ± 4% perf-profile.children.cycles-pp.free_unref_page_commit
0.18 ± 3% -0.0 0.14 ± 3% perf-profile.children.cycles-pp.__cond_resched
0.07 ± 5% -0.0 0.02 ± 99% perf-profile.children.cycles-pp.xas_find
0.13 ± 2% -0.0 0.09 perf-profile.children.cycles-pp.security_vm_enough_memory_mm
0.14 ± 4% -0.0 0.10 ± 7% perf-profile.children.cycles-pp.__fget_light
0.06 ± 6% -0.0 0.02 ± 99% perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
0.12 ± 4% -0.0 0.08 ± 4% perf-profile.children.cycles-pp.xas_start
0.08 ± 5% -0.0 0.05 perf-profile.children.cycles-pp.__folio_throttle_swaprate
0.12 -0.0 0.08 ± 5% perf-profile.children.cycles-pp.folio_unlock
0.14 ± 3% -0.0 0.11 ± 3% perf-profile.children.cycles-pp.try_charge_memcg
0.12 ± 6% -0.0 0.08 ± 5% perf-profile.children.cycles-pp.free_unref_page_prepare
0.12 ± 3% -0.0 0.09 ± 4% perf-profile.children.cycles-pp.noop_dirty_folio
0.20 ± 2% -0.0 0.17 ± 5% perf-profile.children.cycles-pp.page_counter_uncharge
0.10 -0.0 0.07 ± 5% perf-profile.children.cycles-pp.cap_vm_enough_memory
0.09 ± 6% -0.0 0.06 ± 6% perf-profile.children.cycles-pp._raw_spin_trylock
0.09 ± 5% -0.0 0.06 ± 7% perf-profile.children.cycles-pp.inode_add_bytes
0.06 ± 6% -0.0 0.03 ± 70% perf-profile.children.cycles-pp.filemap_free_folio
0.06 ± 6% -0.0 0.03 ± 70% perf-profile.children.cycles-pp.percpu_counter_add_batch
0.12 ± 3% -0.0 0.09 ± 5% perf-profile.children.cycles-pp.__folio_cancel_dirty
0.12 ± 3% -0.0 0.10 ± 5% perf-profile.children.cycles-pp.shmem_recalc_inode
0.09 ± 5% -0.0 0.07 ± 7% perf-profile.children.cycles-pp.__vm_enough_memory
0.08 ± 5% -0.0 0.06 perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
0.08 ± 5% -0.0 0.06 perf-profile.children.cycles-pp.security_file_permission
0.08 ± 6% -0.0 0.05 ± 7% perf-profile.children.cycles-pp.apparmor_file_permission
0.09 ± 4% -0.0 0.07 ± 8% perf-profile.children.cycles-pp.__percpu_counter_limited_add
0.08 ± 6% -0.0 0.06 ± 8% perf-profile.children.cycles-pp.__list_add_valid_or_report
0.07 ± 8% -0.0 0.05 perf-profile.children.cycles-pp.get_pfnblock_flags_mask
0.14 ± 3% -0.0 0.12 ± 6% perf-profile.children.cycles-pp.cgroup_rstat_updated
0.07 ± 5% -0.0 0.05 perf-profile.children.cycles-pp.policy_nodemask
0.24 ± 2% -0.0 0.22 ± 2% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
0.08 -0.0 0.07 ± 7% perf-profile.children.cycles-pp.xas_create
0.69 +0.1 0.78 perf-profile.children.cycles-pp.lru_add_fn
1.72 ± 2% +0.1 1.80 perf-profile.children.cycles-pp.shmem_add_to_page_cache
1.79 ± 2% +0.1 1.93 ± 2% perf-profile.children.cycles-pp.filemap_remove_folio
0.13 ± 5% +0.1 0.28 perf-profile.children.cycles-pp.file_modified
0.69 ± 5% +0.1 0.84 ± 3% perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
0.09 ± 7% +0.2 0.24 ± 2% perf-profile.children.cycles-pp.inode_needs_update_time
1.58 ± 3% +0.2 1.77 ± 2% perf-profile.children.cycles-pp.__filemap_remove_folio
0.15 ± 3% +0.4 0.50 ± 3% perf-profile.children.cycles-pp.__count_memcg_events
0.79 ± 4% +0.4 1.20 ± 3% perf-profile.children.cycles-pp.filemap_unaccount_folio
0.36 ± 5% +0.4 0.77 ± 4% perf-profile.children.cycles-pp.mem_cgroup_commit_charge
98.33 +0.5 98.78 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
97.74 +0.6 98.34 perf-profile.children.cycles-pp.do_syscall_64
48.39 +0.8 49.15 perf-profile.children.cycles-pp.__x64_sys_fallocate
1.34 ± 5% +0.8 2.14 ± 4% perf-profile.children.cycles-pp.__mem_cgroup_charge
1.61 ± 4% +0.8 2.42 ± 2% perf-profile.children.cycles-pp.__mod_lruvec_page_state
48.17 +0.8 48.98 perf-profile.children.cycles-pp.vfs_fallocate
47.94 +0.9 48.82 perf-profile.children.cycles-pp.shmem_fallocate
47.10 +0.9 48.04 perf-profile.children.cycles-pp.shmem_get_folio_gfp
84.34 +0.9 85.28 perf-profile.children.cycles-pp.folio_lruvec_lock_irqsave
84.31 +0.9 85.26 perf-profile.children.cycles-pp._raw_spin_lock_irqsave
84.24 +1.0 85.21 perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
46.65 +1.1 47.70 perf-profile.children.cycles-pp.shmem_alloc_and_add_folio
1.23 ± 4% +1.4 2.58 ± 2% perf-profile.children.cycles-pp.__mod_memcg_lruvec_state
0.98 -0.3 0.73 ± 2% perf-profile.self.cycles-pp.syscall_return_via_sysret
0.88 -0.2 0.70 perf-profile.self.cycles-pp.syscall_exit_to_user_mode
0.60 -0.2 0.45 perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.41 ± 3% -0.1 0.27 ± 3% perf-profile.self.cycles-pp.release_pages
0.41 -0.1 0.30 ± 3% perf-profile.self.cycles-pp.xas_store
0.41 ± 3% -0.1 0.29 ± 2% perf-profile.self.cycles-pp.folio_batch_move_lru
0.30 ± 3% -0.1 0.18 ± 5% perf-profile.self.cycles-pp.shmem_add_to_page_cache
0.38 ± 2% -0.1 0.27 ± 2% perf-profile.self.cycles-pp.__entry_text_start
0.30 ± 3% -0.1 0.20 ± 6% perf-profile.self.cycles-pp.lru_add_fn
0.28 ± 2% -0.1 0.20 ± 5% perf-profile.self.cycles-pp.shmem_fallocate
0.26 ± 2% -0.1 0.18 ± 5% perf-profile.self.cycles-pp.__mod_node_page_state
0.27 ± 3% -0.1 0.20 ± 2% perf-profile.self.cycles-pp._raw_spin_lock
0.21 ± 2% -0.1 0.15 ± 4% perf-profile.self.cycles-pp.__alloc_pages
0.20 ± 2% -0.1 0.14 ± 3% perf-profile.self.cycles-pp.xas_descend
0.26 ± 3% -0.1 0.20 ± 4% perf-profile.self.cycles-pp.find_lock_entries
0.18 ± 4% -0.0 0.13 ± 5% perf-profile.self.cycles-pp.xas_clear_mark
0.15 ± 7% -0.0 0.10 ± 11% perf-profile.self.cycles-pp.shmem_inode_acct_blocks
0.16 ± 4% -0.0 0.12 ± 4% perf-profile.self.cycles-pp.__dquot_alloc_space
0.13 ± 4% -0.0 0.09 ± 5% perf-profile.self.cycles-pp.free_unref_page_commit
0.13 -0.0 0.09 ± 5% perf-profile.self.cycles-pp._raw_spin_lock_irq
0.16 ± 4% -0.0 0.12 ± 4% perf-profile.self.cycles-pp.shmem_alloc_and_add_folio
0.13 ± 5% -0.0 0.09 ± 7% perf-profile.self.cycles-pp.__filemap_remove_folio
0.13 ± 2% -0.0 0.09 ± 5% perf-profile.self.cycles-pp.get_page_from_freelist
0.12 ± 4% -0.0 0.09 ± 5% perf-profile.self.cycles-pp.vfs_fallocate
0.06 ± 7% -0.0 0.02 ± 99% perf-profile.self.cycles-pp.apparmor_file_permission
0.13 ± 3% -0.0 0.10 ± 5% perf-profile.self.cycles-pp.fallocate64
0.11 ± 4% -0.0 0.07 perf-profile.self.cycles-pp.xas_start
0.07 ± 5% -0.0 0.03 ± 70% perf-profile.self.cycles-pp.shmem_alloc_folio
0.14 ± 4% -0.0 0.10 ± 7% perf-profile.self.cycles-pp.__fget_light
0.10 ± 4% -0.0 0.06 ± 7% perf-profile.self.cycles-pp.rmqueue
0.12 ± 3% -0.0 0.09 ± 4% perf-profile.self.cycles-pp.xas_load
0.11 ± 4% -0.0 0.08 ± 7% perf-profile.self.cycles-pp.folio_unlock
0.10 ± 4% -0.0 0.07 ± 8% perf-profile.self.cycles-pp.alloc_pages_mpol
0.15 ± 2% -0.0 0.12 ± 5% perf-profile.self.cycles-pp.shmem_get_folio_gfp
0.10 -0.0 0.07 perf-profile.self.cycles-pp.cap_vm_enough_memory
0.16 ± 2% -0.0 0.13 ± 6% perf-profile.self.cycles-pp.page_counter_uncharge
0.12 ± 5% -0.0 0.09 ± 4% perf-profile.self.cycles-pp.__cond_resched
0.06 ± 6% -0.0 0.03 ± 70% perf-profile.self.cycles-pp.filemap_free_folio
0.12 ± 3% -0.0 0.10 ± 5% perf-profile.self.cycles-pp.free_unref_page_list
0.12 -0.0 0.09 ± 4% perf-profile.self.cycles-pp.noop_dirty_folio
0.10 ± 3% -0.0 0.07 ± 5% perf-profile.self.cycles-pp.filemap_remove_folio
0.10 ± 5% -0.0 0.07 ± 5% perf-profile.self.cycles-pp.try_charge_memcg
0.12 ± 3% -0.0 0.10 ± 8% perf-profile.self.cycles-pp.cgroup_rstat_updated
0.09 ± 4% -0.0 0.07 ± 7% perf-profile.self.cycles-pp.__folio_cancel_dirty
0.08 ± 4% -0.0 0.06 ± 8% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.08 ± 5% -0.0 0.06 perf-profile.self.cycles-pp._raw_spin_trylock
0.08 -0.0 0.06 ± 6% perf-profile.self.cycles-pp.folio_add_lru
0.08 ± 8% -0.0 0.06 ± 6% perf-profile.self.cycles-pp.__mod_lruvec_state
0.07 ± 5% -0.0 0.05 perf-profile.self.cycles-pp.xas_find_conflict
0.08 ± 10% -0.0 0.06 ± 9% perf-profile.self.cycles-pp.truncate_cleanup_folio
0.07 ± 10% -0.0 0.05 perf-profile.self.cycles-pp.xas_init_marks
0.08 ± 4% -0.0 0.06 ± 7% perf-profile.self.cycles-pp.__percpu_counter_limited_add
0.07 ± 7% -0.0 0.05 perf-profile.self.cycles-pp.get_pfnblock_flags_mask
0.07 ± 5% -0.0 0.06 ± 8% perf-profile.self.cycles-pp.__list_add_valid_or_report
0.02 ±141% +0.0 0.06 ± 8% perf-profile.self.cycles-pp.uncharge_batch
0.21 ± 9% +0.1 0.31 ± 7% perf-profile.self.cycles-pp.mem_cgroup_commit_charge
0.69 ± 5% +0.1 0.83 ± 4% perf-profile.self.cycles-pp.get_mem_cgroup_from_mm
0.06 ± 6% +0.2 0.22 ± 2% perf-profile.self.cycles-pp.inode_needs_update_time
0.14 ± 8% +0.3 0.42 ± 7% perf-profile.self.cycles-pp.__mem_cgroup_charge
0.13 ± 7% +0.4 0.49 ± 3% perf-profile.self.cycles-pp.__count_memcg_events
84.24 +1.0 85.21 perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
1.12 ± 5% +1.4 2.50 ± 2% perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
***************************************************************************************************
lkp-skl-fpga01: 104 threads 2 sockets (Skylake) with 192G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-8.3/thread/50%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale
commit:
130617edc1 ("mm: memcg: move vmstats structs definition above flushing code")
51d74c18a9 ("mm: memcg: make stats flushing threshold per-memcg")
130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522
---------------- ---------------------------
%stddev %change %stddev
\ | \
1.87 -0.4 1.43 ± 3% mpstat.cpu.all.usr%
3171 -5.3% 3003 ± 2% vmstat.system.cs
84.83 ± 9% +55.8% 132.17 ± 16% perf-c2c.DRAM.local
484.17 ± 3% +37.1% 663.67 ± 10% perf-c2c.DRAM.remote
72763 ± 5% +14.4% 83212 ± 12% turbostat.C1
0.08 -25.0% 0.06 turbostat.IPC
27.90 +4.6% 29.18 turbostat.RAMWatt
3982212 -30.0% 2785941 will-it-scale.52.threads
76580 -30.0% 53575 will-it-scale.per_thread_ops
3982212 -30.0% 2785941 will-it-scale.workload
1.175e+09 ± 2% -28.6% 8.392e+08 ± 2% numa-numastat.node0.local_node
1.175e+09 ± 2% -28.6% 8.394e+08 ± 2% numa-numastat.node0.numa_hit
1.231e+09 ± 2% -31.3% 8.463e+08 ± 3% numa-numastat.node1.local_node
1.232e+09 ± 2% -31.3% 8.466e+08 ± 3% numa-numastat.node1.numa_hit
1.175e+09 ± 2% -28.6% 8.394e+08 ± 2% numa-vmstat.node0.numa_hit
1.175e+09 ± 2% -28.6% 8.392e+08 ± 2% numa-vmstat.node0.numa_local
1.232e+09 ± 2% -31.3% 8.466e+08 ± 3% numa-vmstat.node1.numa_hit
1.231e+09 ± 2% -31.3% 8.463e+08 ± 3% numa-vmstat.node1.numa_local
2.408e+09 -30.0% 1.686e+09 proc-vmstat.numa_hit
2.406e+09 -30.0% 1.685e+09 proc-vmstat.numa_local
2.404e+09 -29.9% 1.684e+09 proc-vmstat.pgalloc_normal
2.404e+09 -29.9% 1.684e+09 proc-vmstat.pgfree
0.04 ± 9% -19.3% 0.03 ± 6% perf-sched.wait_and_delay.avg.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
0.04 ± 8% -22.3% 0.03 ± 5% perf-sched.wait_and_delay.avg.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
0.91 ± 2% +11.3% 1.01 ± 5% perf-sched.wait_and_delay.avg.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
0.04 ± 13% -90.3% 0.00 ±223% perf-sched.wait_and_delay.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
1.14 +15.1% 1.31 perf-sched.wait_and_delay.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
189.94 ± 3% +18.3% 224.73 ± 4% perf-sched.wait_and_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
1652 ± 4% -13.4% 1431 ± 4% perf-sched.wait_and_delay.count.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
83.67 ± 7% -87.6% 10.33 ±223% perf-sched.wait_and_delay.count.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
3827 ± 4% -13.0% 3328 ± 3% perf-sched.wait_and_delay.count.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
1.71 ±165% -83.4% 0.28 ± 21% perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
0.43 ± 17% -43.8% 0.24 ± 26% perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.46 ± 17% -36.7% 0.29 ± 12% perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
0.30 ± 34% -90.7% 0.03 ±223% perf-sched.wait_and_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
0.04 ± 9% -19.3% 0.03 ± 6% perf-sched.wait_time.avg.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
0.04 ± 8% -22.3% 0.03 ± 5% perf-sched.wait_time.avg.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
0.04 ± 11% -33.1% 0.03 ± 17% perf-sched.wait_time.avg.ms.__cond_resched.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.90 ± 2% +11.5% 1.00 ± 5% perf-sched.wait_time.avg.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
0.04 ± 13% -26.6% 0.03 ± 12% perf-sched.wait_time.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
1.13 +15.2% 1.30 perf-sched.wait_time.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
189.93 ± 3% +18.3% 224.72 ± 4% perf-sched.wait_time.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
1.71 ±165% -83.4% 0.28 ± 21% perf-sched.wait_time.max.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
0.43 ± 17% -43.8% 0.24 ± 26% perf-sched.wait_time.max.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.46 ± 17% -36.7% 0.29 ± 12% perf-sched.wait_time.max.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
0.75 +142.0% 1.83 ± 2% perf-stat.i.MPKI
8.47e+09 -24.4% 6.407e+09 perf-stat.i.branch-instructions
0.66 -0.0 0.63 perf-stat.i.branch-miss-rate%
56364992 -28.3% 40421603 ± 3% perf-stat.i.branch-misses
14.64 +6.7 21.30 perf-stat.i.cache-miss-rate%
30868184 +81.3% 55977240 ± 3% perf-stat.i.cache-misses
2.107e+08 +24.7% 2.627e+08 ± 2% perf-stat.i.cache-references
3106 -5.5% 2934 ± 2% perf-stat.i.context-switches
3.55 +33.4% 4.74 perf-stat.i.cpi
4722 -44.8% 2605 ± 3% perf-stat.i.cycles-between-cache-misses
0.04 -0.0 0.04 perf-stat.i.dTLB-load-miss-rate%
4117232 -29.1% 2917107 perf-stat.i.dTLB-load-misses
1.051e+10 -24.1% 7.979e+09 perf-stat.i.dTLB-loads
0.00 ± 3% +0.0 0.00 ± 6% perf-stat.i.dTLB-store-miss-rate%
5.886e+09 -27.5% 4.269e+09 perf-stat.i.dTLB-stores
78.16 -6.6 71.51 perf-stat.i.iTLB-load-miss-rate%
4131074 ± 3% -30.0% 2891515 perf-stat.i.iTLB-load-misses
4.098e+10 -25.0% 3.072e+10 perf-stat.i.instructions
9929 ± 2% +7.0% 10627 perf-stat.i.instructions-per-iTLB-miss
0.28 -25.0% 0.21 perf-stat.i.ipc
63.49 +43.8% 91.27 ± 3% perf-stat.i.metric.K/sec
241.12 -24.6% 181.87 perf-stat.i.metric.M/sec
3735316 +78.6% 6669641 ± 3% perf-stat.i.node-load-misses
377465 ± 4% +86.1% 702512 ± 11% perf-stat.i.node-loads
1322217 -27.6% 957081 ± 5% perf-stat.i.node-store-misses
37459 ± 3% -23.0% 28826 ± 5% perf-stat.i.node-stores
0.75 +141.8% 1.82 ± 2% perf-stat.overall.MPKI
0.67 -0.0 0.63 perf-stat.overall.branch-miss-rate%
14.65 +6.7 21.30 perf-stat.overall.cache-miss-rate%
3.55 +33.4% 4.73 perf-stat.overall.cpi
4713 -44.8% 2601 ± 3% perf-stat.overall.cycles-between-cache-misses
0.04 -0.0 0.04 perf-stat.overall.dTLB-load-miss-rate%
0.00 ± 3% +0.0 0.00 ± 5% perf-stat.overall.dTLB-store-miss-rate%
78.14 -6.7 71.47 perf-stat.overall.iTLB-load-miss-rate%
9927 ± 2% +7.0% 10624 perf-stat.overall.instructions-per-iTLB-miss
0.28 -25.0% 0.21 perf-stat.overall.ipc
3098901 +7.1% 3318983 perf-stat.overall.path-length
8.441e+09 -24.4% 6.385e+09 perf-stat.ps.branch-instructions
56179581 -28.3% 40286337 ± 3% perf-stat.ps.branch-misses
30759982 +81.3% 55777812 ± 3% perf-stat.ps.cache-misses
2.1e+08 +24.6% 2.618e+08 ± 2% perf-stat.ps.cache-references
3095 -5.5% 2923 ± 2% perf-stat.ps.context-switches
4103292 -29.1% 2907270 perf-stat.ps.dTLB-load-misses
1.048e+10 -24.1% 7.952e+09 perf-stat.ps.dTLB-loads
5.866e+09 -27.5% 4.255e+09 perf-stat.ps.dTLB-stores
4117020 ± 3% -30.0% 2881750 perf-stat.ps.iTLB-load-misses
4.084e+10 -25.0% 3.062e+10 perf-stat.ps.instructions
3722149 +78.5% 6645867 ± 3% perf-stat.ps.node-load-misses
376240 ± 4% +86.1% 700053 ± 11% perf-stat.ps.node-loads
1317772 -27.6% 953773 ± 5% perf-stat.ps.node-store-misses
37408 ± 3% -23.2% 28748 ± 5% perf-stat.ps.node-stores
1.234e+13 -25.1% 9.246e+12 perf-stat.total.instructions
1.28 -0.4 0.90 ± 2% perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.fallocate64
1.26 ± 2% -0.4 0.90 ± 3% perf-profile.calltrace.cycles-pp.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
1.08 ± 2% -0.3 0.77 ± 3% perf-profile.calltrace.cycles-pp.alloc_pages_mpol.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.92 ± 2% -0.3 0.62 ± 3% perf-profile.calltrace.cycles-pp.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
0.84 ± 3% -0.2 0.61 ± 3% perf-profile.calltrace.cycles-pp.__alloc_pages.alloc_pages_mpol.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp
1.26 -0.2 1.08 perf-profile.calltrace.cycles-pp.folio_batch_move_lru.lru_add_drain_cpu.__folio_batch_release.shmem_undo_range.shmem_setattr
1.26 -0.2 1.08 perf-profile.calltrace.cycles-pp.lru_add_drain_cpu.__folio_batch_release.shmem_undo_range.shmem_setattr.notify_change
1.24 -0.2 1.06 perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu.__folio_batch_release.shmem_undo_range
1.24 -0.2 1.06 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu.__folio_batch_release
1.23 -0.2 1.06 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu
1.20 -0.2 1.04 ± 2% perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
0.68 ± 3% +0.0 0.72 ± 4% perf-profile.calltrace.cycles-pp.__mem_cgroup_uncharge_list.release_pages.__folio_batch_release.shmem_undo_range.shmem_setattr
1.08 +0.1 1.20 perf-profile.calltrace.cycles-pp.lru_add_fn.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp
2.91 +0.3 3.18 ± 2% perf-profile.calltrace.cycles-pp.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change.do_truncate
2.56 +0.4 2.92 ± 2% perf-profile.calltrace.cycles-pp.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change
1.36 ± 3% +0.4 1.76 ± 9% perf-profile.calltrace.cycles-pp.get_mem_cgroup_from_mm.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
2.22 +0.5 2.68 ± 2% perf-profile.calltrace.cycles-pp.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr
0.00 +0.6 0.60 ± 2% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.release_pages.__folio_batch_release.shmem_undo_range.shmem_setattr
2.33 +0.6 2.94 perf-profile.calltrace.cycles-pp.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
0.00 +0.7 0.72 ± 2% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.lru_add_fn.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio
0.69 ± 4% +0.8 1.47 ± 3% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio
1.24 ± 2% +0.8 2.04 ± 2% perf-profile.calltrace.cycles-pp.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range
0.00 +0.8 0.82 ± 4% perf-profile.calltrace.cycles-pp.__count_memcg_events.mem_cgroup_commit_charge.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp
1.17 ± 2% +0.8 2.00 ± 2% perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio
0.59 ± 4% +0.9 1.53 perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp
1.38 +1.0 2.33 ± 2% perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.62 ± 3% +1.0 1.66 ± 5% perf-profile.calltrace.cycles-pp.mem_cgroup_commit_charge.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
38.70 +1.2 39.90 perf-profile.calltrace.cycles-pp.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
38.34 +1.3 39.65 perf-profile.calltrace.cycles-pp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe
37.24 +1.6 38.86 perf-profile.calltrace.cycles-pp.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
36.64 +1.8 38.40 perf-profile.calltrace.cycles-pp.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate
2.47 ± 2% +2.1 4.59 ± 8% perf-profile.calltrace.cycles-pp.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
1.30 -0.4 0.92 ± 2% perf-profile.children.cycles-pp.syscall_return_via_sysret
1.28 ± 2% -0.4 0.90 ± 3% perf-profile.children.cycles-pp.shmem_alloc_folio
1.10 ± 2% -0.3 0.78 ± 3% perf-profile.children.cycles-pp.alloc_pages_mpol
0.96 ± 2% -0.3 0.64 ± 3% perf-profile.children.cycles-pp.shmem_inode_acct_blocks
0.88 -0.3 0.58 ± 2% perf-profile.children.cycles-pp.xas_store
0.88 ± 3% -0.2 0.64 ± 3% perf-profile.children.cycles-pp.__alloc_pages
0.61 ± 2% -0.2 0.43 ± 3% perf-profile.children.cycles-pp.__entry_text_start
1.26 -0.2 1.09 perf-profile.children.cycles-pp.lru_add_drain_cpu
0.56 -0.2 0.39 ± 4% perf-profile.children.cycles-pp.free_unref_page_list
1.22 -0.2 1.06 ± 2% perf-profile.children.cycles-pp.syscall_exit_to_user_mode
0.46 -0.1 0.32 ± 3% perf-profile.children.cycles-pp.__mod_lruvec_state
0.41 ± 3% -0.1 0.28 ± 4% perf-profile.children.cycles-pp.xas_load
0.44 ± 4% -0.1 0.31 ± 4% perf-profile.children.cycles-pp.find_lock_entries
0.50 ± 3% -0.1 0.37 ± 2% perf-profile.children.cycles-pp.get_page_from_freelist
0.24 ± 7% -0.1 0.12 ± 5% perf-profile.children.cycles-pp.__list_add_valid_or_report
0.34 ± 2% -0.1 0.24 ± 4% perf-profile.children.cycles-pp.__mod_node_page_state
0.38 ± 3% -0.1 0.28 ± 4% perf-profile.children.cycles-pp._raw_spin_lock
0.32 ± 2% -0.1 0.22 ± 5% perf-profile.children.cycles-pp.__dquot_alloc_space
0.26 ± 2% -0.1 0.17 ± 2% perf-profile.children.cycles-pp.xas_descend
0.22 ± 3% -0.1 0.14 ± 4% perf-profile.children.cycles-pp.free_unref_page_commit
0.25 -0.1 0.17 ± 3% perf-profile.children.cycles-pp.xas_clear_mark
0.32 ± 4% -0.1 0.25 ± 3% perf-profile.children.cycles-pp.rmqueue
0.23 ± 2% -0.1 0.16 ± 2% perf-profile.children.cycles-pp.xas_init_marks
0.24 ± 2% -0.1 0.17 ± 5% perf-profile.children.cycles-pp.__cond_resched
0.25 ± 4% -0.1 0.18 ± 2% perf-profile.children.cycles-pp.truncate_cleanup_folio
0.30 ± 3% -0.1 0.23 ± 4% perf-profile.children.cycles-pp.filemap_get_entry
0.20 ± 2% -0.1 0.13 ± 5% perf-profile.children.cycles-pp.folio_unlock
0.16 ± 4% -0.1 0.10 ± 5% perf-profile.children.cycles-pp.xas_find_conflict
0.19 ± 3% -0.1 0.13 ± 5% perf-profile.children.cycles-pp._raw_spin_lock_irq
0.17 ± 5% -0.1 0.12 ± 3% perf-profile.children.cycles-pp.noop_dirty_folio
0.13 ± 4% -0.1 0.08 ± 9% perf-profile.children.cycles-pp.security_vm_enough_memory_mm
0.18 ± 8% -0.1 0.13 ± 4% perf-profile.children.cycles-pp.shmem_recalc_inode
0.16 ± 2% -0.1 0.11 ± 3% perf-profile.children.cycles-pp.free_unref_page_prepare
0.09 ± 5% -0.1 0.04 ± 45% perf-profile.children.cycles-pp.mem_cgroup_update_lru_size
0.10 ± 7% -0.0 0.05 ± 45% perf-profile.children.cycles-pp.cap_vm_enough_memory
0.14 ± 5% -0.0 0.10 perf-profile.children.cycles-pp.__folio_cancel_dirty
0.14 ± 5% -0.0 0.10 ± 4% perf-profile.children.cycles-pp.security_file_permission
0.10 ± 5% -0.0 0.06 ± 6% perf-profile.children.cycles-pp.xas_find
0.15 ± 4% -0.0 0.11 ± 3% perf-profile.children.cycles-pp.__fget_light
0.14 ± 5% -0.0 0.11 ± 3% perf-profile.children.cycles-pp.file_modified
0.12 ± 3% -0.0 0.09 ± 7% perf-profile.children.cycles-pp.__vm_enough_memory
0.12 ± 3% -0.0 0.09 ± 4% perf-profile.children.cycles-pp.apparmor_file_permission
0.12 ± 3% -0.0 0.08 ± 5% perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
0.12 ± 4% -0.0 0.08 ± 4% perf-profile.children.cycles-pp.xas_start
0.09 -0.0 0.06 ± 8% perf-profile.children.cycles-pp.__folio_throttle_swaprate
0.12 ± 6% -0.0 0.08 ± 8% perf-profile.children.cycles-pp._raw_spin_trylock
0.12 ± 4% -0.0 0.08 ± 4% perf-profile.children.cycles-pp.__percpu_counter_limited_add
0.12 ± 4% -0.0 0.09 ± 4% perf-profile.children.cycles-pp.inode_add_bytes
0.20 ± 2% -0.0 0.17 ± 7% perf-profile.children.cycles-pp.try_charge_memcg
0.10 ± 5% -0.0 0.07 ± 7% perf-profile.children.cycles-pp.policy_nodemask
0.09 ± 6% -0.0 0.06 ± 6% perf-profile.children.cycles-pp.get_pfnblock_flags_mask
0.09 ± 6% -0.0 0.06 ± 7% perf-profile.children.cycles-pp.filemap_free_folio
0.07 ± 6% -0.0 0.05 ± 7% perf-profile.children.cycles-pp.down_write
0.08 ± 4% -0.0 0.06 ± 8% perf-profile.children.cycles-pp.get_task_policy
0.09 ± 5% -0.0 0.07 ± 5% perf-profile.children.cycles-pp.xas_create
0.09 ± 7% -0.0 0.07 perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
0.09 ± 7% -0.0 0.07 perf-profile.children.cycles-pp.inode_needs_update_time
0.16 ± 2% -0.0 0.14 ± 5% perf-profile.children.cycles-pp.cgroup_rstat_updated
0.08 ± 7% -0.0 0.06 ± 9% perf-profile.children.cycles-pp.percpu_counter_add_batch
0.07 ± 5% -0.0 0.05 ± 7% perf-profile.children.cycles-pp.folio_mark_dirty
0.08 ± 10% -0.0 0.06 ± 6% perf-profile.children.cycles-pp.shmem_is_huge
0.07 ± 6% +0.0 0.09 ± 10% perf-profile.children.cycles-pp.propagate_protected_usage
0.43 ± 3% +0.0 0.46 ± 5% perf-profile.children.cycles-pp.uncharge_batch
0.68 ± 3% +0.0 0.73 ± 4% perf-profile.children.cycles-pp.__mem_cgroup_uncharge_list
1.11 +0.1 1.22 perf-profile.children.cycles-pp.lru_add_fn
2.91 +0.3 3.18 ± 2% perf-profile.children.cycles-pp.truncate_inode_folio
2.56 +0.4 2.92 ± 2% perf-profile.children.cycles-pp.filemap_remove_folio
1.37 ± 3% +0.4 1.76 ± 9% perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
2.24 +0.5 2.70 ± 2% perf-profile.children.cycles-pp.__filemap_remove_folio
2.38 +0.6 2.97 perf-profile.children.cycles-pp.shmem_add_to_page_cache
0.18 ± 4% +0.7 0.91 ± 4% perf-profile.children.cycles-pp.__count_memcg_events
1.26 +0.8 2.04 ± 2% perf-profile.children.cycles-pp.filemap_unaccount_folio
0.63 ± 2% +1.0 1.67 ± 5% perf-profile.children.cycles-pp.mem_cgroup_commit_charge
38.71 +1.2 39.91 perf-profile.children.cycles-pp.vfs_fallocate
38.37 +1.3 39.66 perf-profile.children.cycles-pp.shmem_fallocate
37.28 +1.6 38.89 perf-profile.children.cycles-pp.shmem_get_folio_gfp
36.71 +1.7 38.45 perf-profile.children.cycles-pp.shmem_alloc_and_add_folio
2.58 +1.8 4.36 ± 2% perf-profile.children.cycles-pp.__mod_lruvec_page_state
2.48 ± 2% +2.1 4.60 ± 8% perf-profile.children.cycles-pp.__mem_cgroup_charge
1.93 ± 3% +2.4 4.36 ± 2% perf-profile.children.cycles-pp.__mod_memcg_lruvec_state
1.30 -0.4 0.92 ± 2% perf-profile.self.cycles-pp.syscall_return_via_sysret
0.73 -0.2 0.52 ± 2% perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.54 ± 2% -0.2 0.36 ± 3% perf-profile.self.cycles-pp.release_pages
0.48 -0.2 0.30 ± 3% perf-profile.self.cycles-pp.xas_store
0.54 ± 2% -0.2 0.38 ± 3% perf-profile.self.cycles-pp.__entry_text_start
1.17 -0.1 1.03 ± 2% perf-profile.self.cycles-pp.syscall_exit_to_user_mode
0.36 ± 2% -0.1 0.22 ± 3% perf-profile.self.cycles-pp.shmem_add_to_page_cache
0.43 ± 5% -0.1 0.30 ± 7% perf-profile.self.cycles-pp.lru_add_fn
0.24 ± 7% -0.1 0.12 ± 6% perf-profile.self.cycles-pp.__list_add_valid_or_report
0.38 ± 4% -0.1 0.27 ± 4% perf-profile.self.cycles-pp._raw_spin_lock
0.52 ± 3% -0.1 0.41 perf-profile.self.cycles-pp.folio_batch_move_lru
0.32 ± 2% -0.1 0.22 ± 4% perf-profile.self.cycles-pp.__mod_node_page_state
0.36 ± 4% -0.1 0.26 ± 4% perf-profile.self.cycles-pp.find_lock_entries
0.36 ± 2% -0.1 0.26 ± 2% perf-profile.self.cycles-pp.shmem_fallocate
0.28 ± 3% -0.1 0.20 ± 5% perf-profile.self.cycles-pp.__alloc_pages
0.24 ± 2% -0.1 0.16 ± 4% perf-profile.self.cycles-pp.xas_descend
0.23 ± 2% -0.1 0.16 ± 3% perf-profile.self.cycles-pp.xas_clear_mark
0.18 ± 3% -0.1 0.11 ± 6% perf-profile.self.cycles-pp.free_unref_page_commit
0.18 ± 3% -0.1 0.12 ± 4% perf-profile.self.cycles-pp.shmem_inode_acct_blocks
0.21 ± 3% -0.1 0.15 ± 2% perf-profile.self.cycles-pp.shmem_alloc_and_add_folio
0.18 ± 2% -0.1 0.12 ± 3% perf-profile.self.cycles-pp.__filemap_remove_folio
0.18 ± 7% -0.1 0.12 ± 7% perf-profile.self.cycles-pp.vfs_fallocate
0.20 ± 2% -0.1 0.14 ± 6% perf-profile.self.cycles-pp.__dquot_alloc_space
0.18 ± 2% -0.1 0.13 ± 3% perf-profile.self.cycles-pp.folio_unlock
0.18 ± 2% -0.1 0.12 ± 3% perf-profile.self.cycles-pp.get_page_from_freelist
0.15 ± 3% -0.1 0.10 ± 7% perf-profile.self.cycles-pp.xas_load
0.17 ± 3% -0.1 0.12 ± 8% perf-profile.self.cycles-pp.__cond_resched
0.17 ± 2% -0.1 0.12 ± 3% perf-profile.self.cycles-pp._raw_spin_lock_irq
0.17 ± 5% -0.1 0.12 ± 3% perf-profile.self.cycles-pp.noop_dirty_folio
0.10 ± 7% -0.0 0.05 ± 45% perf-profile.self.cycles-pp.cap_vm_enough_memory
0.12 ± 3% -0.0 0.08 ± 4% perf-profile.self.cycles-pp.rmqueue
0.07 ± 5% -0.0 0.02 ± 99% perf-profile.self.cycles-pp.xas_find
0.13 ± 3% -0.0 0.09 ± 6% perf-profile.self.cycles-pp.alloc_pages_mpol
0.07 ± 6% -0.0 0.03 ± 70% perf-profile.self.cycles-pp.xas_find_conflict
0.16 ± 2% -0.0 0.12 ± 6% perf-profile.self.cycles-pp.free_unref_page_list
0.12 ± 5% -0.0 0.08 ± 4% perf-profile.self.cycles-pp.fallocate64
0.20 ± 4% -0.0 0.16 ± 3% perf-profile.self.cycles-pp.shmem_get_folio_gfp
0.06 ± 7% -0.0 0.02 ± 99% perf-profile.self.cycles-pp.shmem_recalc_inode
0.13 ± 3% -0.0 0.09 perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.22 ± 3% -0.0 0.19 ± 6% perf-profile.self.cycles-pp.page_counter_uncharge
0.14 ± 3% -0.0 0.10 ± 6% perf-profile.self.cycles-pp.filemap_remove_folio
0.15 ± 5% -0.0 0.11 ± 3% perf-profile.self.cycles-pp.__fget_light
0.12 ± 4% -0.0 0.08 perf-profile.self.cycles-pp.__folio_cancel_dirty
0.11 ± 4% -0.0 0.08 ± 7% perf-profile.self.cycles-pp._raw_spin_trylock
0.12 ± 3% -0.0 0.09 ± 5% perf-profile.self.cycles-pp.__mod_lruvec_state
0.11 ± 5% -0.0 0.08 ± 4% perf-profile.self.cycles-pp.truncate_cleanup_folio
0.11 ± 3% -0.0 0.08 ± 6% perf-profile.self.cycles-pp.__percpu_counter_limited_add
0.11 ± 3% -0.0 0.08 ± 6% perf-profile.self.cycles-pp.xas_start
0.10 ± 6% -0.0 0.07 ± 5% perf-profile.self.cycles-pp.xas_init_marks
0.09 ± 6% -0.0 0.06 ± 6% perf-profile.self.cycles-pp.get_pfnblock_flags_mask
0.11 -0.0 0.08 ± 5% perf-profile.self.cycles-pp.folio_add_lru
0.09 ± 6% -0.0 0.06 ± 7% perf-profile.self.cycles-pp.filemap_free_folio
0.09 ± 4% -0.0 0.06 ± 6% perf-profile.self.cycles-pp.shmem_alloc_folio
0.14 ± 5% -0.0 0.12 ± 5% perf-profile.self.cycles-pp.cgroup_rstat_updated
0.10 ± 4% -0.0 0.08 ± 4% perf-profile.self.cycles-pp.apparmor_file_permission
0.07 ± 7% -0.0 0.04 ± 44% perf-profile.self.cycles-pp.policy_nodemask
0.07 ± 11% -0.0 0.04 ± 45% perf-profile.self.cycles-pp.shmem_is_huge
0.08 ± 4% -0.0 0.06 ± 8% perf-profile.self.cycles-pp.get_task_policy
0.08 ± 6% -0.0 0.05 ± 8% perf-profile.self.cycles-pp.__x64_sys_fallocate
0.12 ± 3% -0.0 0.10 ± 6% perf-profile.self.cycles-pp.try_charge_memcg
0.07 -0.0 0.05 perf-profile.self.cycles-pp.free_unref_page_prepare
0.07 ± 6% -0.0 0.06 ± 9% perf-profile.self.cycles-pp.percpu_counter_add_batch
0.08 ± 4% -0.0 0.06 perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
0.09 ± 7% -0.0 0.07 ± 5% perf-profile.self.cycles-pp.filemap_get_entry
0.07 ± 9% +0.0 0.09 ± 10% perf-profile.self.cycles-pp.propagate_protected_usage
0.96 ± 2% +0.2 1.12 ± 7% perf-profile.self.cycles-pp.__mod_lruvec_page_state
0.45 ± 4% +0.4 0.82 ± 8% perf-profile.self.cycles-pp.mem_cgroup_commit_charge
1.36 ± 3% +0.4 1.75 ± 9% perf-profile.self.cycles-pp.get_mem_cgroup_from_mm
0.29 +0.7 1.00 ± 10% perf-profile.self.cycles-pp.__mem_cgroup_charge
0.16 ± 4% +0.7 0.90 ± 4% perf-profile.self.cycles-pp.__count_memcg_events
1.80 ± 2% +2.5 4.26 ± 2% perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Fri, Oct 20, 2023 at 9:18 AM kernel test robot <oliver.sang@intel.com> wrote:
>
>
>
> Hello,
>
> kernel test robot noticed a -25.8% regression of will-it-scale.per_thread_ops on:
>
>
> commit: 51d74c18a9c61e7ee33bc90b522dd7f6e5b80bb5 ("[PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg")
> url: https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257
> base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/all/20231010032117.1577496-4-yosryahmed@google.com/
> patch subject: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg
>
> testcase: will-it-scale
> test machine: 104 threads 2 sockets (Skylake) with 192G memory
> parameters:
>
> nr_task: 100%
> mode: thread
> test: fallocate1
> cpufreq_governor: performance
>
>
> In addition to that, the commit also has significant impact on the following tests:
>
> +------------------+---------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_thread_ops -30.0% regression |
> | test machine | 104 threads 2 sockets (Skylake) with 192G memory |
> | test parameters | cpufreq_governor=performance |
> | | mode=thread |
> | | nr_task=50% |
> | | test=fallocate1 |
> +------------------+---------------------------------------------------------------+
>
Yosry, I don't think a 25% to 30% regression can be ignored. Unless
there is a quick fix, IMO this series should be skipped for the
upcoming merge window.
On Fri, Oct 20, 2023 at 10:23 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Fri, Oct 20, 2023 at 9:18 AM kernel test robot <oliver.sang@intel.com> wrote:
> > [...]
>
> Yosry, I don't think a 25% to 30% regression can be ignored. Unless
> there is a quick fix, IMO this series should be skipped for the
> upcoming merge window.
I am currently looking into it. It's reasonable to skip the next merge
window if a quick fix isn't found soon.
I am surprised by the size of the regression given the following:
1.12 ± 5% +1.4 2.50 ± 2%
perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
IIUC we are only spending 1% more time in __mod_memcg_lruvec_state().
On Sat, Oct 21, 2023 at 01:42:58AM +0800, Yosry Ahmed wrote:
> On Fri, Oct 20, 2023 at 10:23 AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Fri, Oct 20, 2023 at 9:18 AM kernel test robot <oliver.sang@intel.com> wrote:
> > > [...]
> >
> > Yosry, I don't think a 25% to 30% regression can be ignored. Unless
> > there is a quick fix, IMO this series should be skipped for the
> > upcoming merge window.
>
> I am currently looking into it. It's reasonable to skip the next merge
> window if a quick fix isn't found soon.
>
> I am surprised by the size of the regression given the following:
> 1.12 ± 5% +1.4 2.50 ± 2%
> perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
>
> IIUC we are only spending 1% more time in __mod_memcg_lruvec_state().
Yes, this is kind of confusing. We have seen similar cases before,
especially for microbenchmarks like will-it-scale, stress-ng, netperf,
etc., where a change in hot-path functions is greatly amplified in the
final benchmark score.
In a netperf case, https://lore.kernel.org/lkml/20220619150456.GB34471@xsang-OptiPlex-9020/
the affected functions showed only around a 10% change in perf's
cpu-cycles, yet triggered a 69% regression. IIRC, microbenchmarks are
very sensitive to such statistics updates, like memcg's and vmstat's.
Thanks,
Feng
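[Editorial note: the amplification Feng describes largely comes down to cacheline contention on shared atomics, which is exactly what per-CPU/per-thread batching avoids. The toy userspace C below illustrates the two update schemes; it is not kernel code, and the names and constants (`BATCH`, `update_contended`, etc.) are invented for this sketch, with `BATCH` standing in for a MEMCG_CHARGE_BATCH-style value.]

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define NTHREADS 4
#define NUPDATES 100000
#define BATCH    64   /* invented; mirrors a MEMCG_CHARGE_BATCH-style batch */

static atomic_long shared_counter;

/* Contended scheme: every update dirties the shared cacheline. */
static void *update_contended(void *arg)
{
    (void)arg;
    for (long i = 0; i < NUPDATES; i++)
        atomic_fetch_add(&shared_counter, 1);
    return NULL;
}

/* Batched scheme: a thread-local count is folded into the shared
 * atomic only once per BATCH updates, like a percpu stats batch. */
static void *update_batched(void *arg)
{
    (void)arg;
    long local = 0;
    for (long i = 0; i < NUPDATES; i++) {
        if (++local >= BATCH) {
            atomic_fetch_add(&shared_counter, local);
            local = 0;
        }
    }
    atomic_fetch_add(&shared_counter, local); /* drain the remainder */
    return NULL;
}

/* Run NTHREADS workers of one flavor and return the final count. */
static long run(void *(*fn)(void *))
{
    pthread_t tids[NTHREADS];
    atomic_store(&shared_counter, 0);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, fn, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);
    return atomic_load(&shared_counter);
}
```

Both flavors produce the same total; the batched one simply takes the contended atomic path ~BATCH times less often, at the cost of up to BATCH-1 in-flight updates per thread being invisible to readers, which is the same accuracy-for-throughput trade-off discussed in the patch description.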
On Sun, Oct 22, 2023 at 6:34 PM Feng Tang <feng.tang@intel.com> wrote:
>
> On Sat, Oct 21, 2023 at 01:42:58AM +0800, Yosry Ahmed wrote:
> > On Fri, Oct 20, 2023 at 10:23 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > [...]
> >
> > I am currently looking into it. It's reasonable to skip the next merge
> > window if a quick fix isn't found soon.
> >
> > I am surprised by the size of the regression given the following:
> > 1.12 ± 5% +1.4 2.50 ± 2%
> > perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
> >
> > IIUC we are only spending 1% more time in __mod_memcg_lruvec_state().
>
> Yes, this is kind of confusing. We have seen similar cases before,
> especially for micro-benchmarks like will-it-scale, stress-ng, netperf,
> etc., where changes to functions in the hot path were greatly amplified
> in the final benchmark score.
>
> In a netperf case, https://lore.kernel.org/lkml/20220619150456.GB34471@xsang-OptiPlex-9020/
> the affected functions showed around a 10% change in perf's cpu-cycles,
> yet triggered a 69% regression. IIRC, micro-benchmarks are very sensitive
> to those statistics updates, like memcg's and vmstat's.
>
Thanks for clarifying. I am still trying to reproduce locally but I am
running into some quirks with tooling. I may have to run a modified
version of the fallocate test manually. Meanwhile, I noticed that the
patch was tested without the fixlet that I posted [1] for it,
understandably. Would it be possible to get some numbers with that
fixlet? It should reduce the total number of contended atomic
operations, so it may help.
[1] https://lore.kernel.org/lkml/CAJD7tkZDarDn_38ntFg5bK2fAmFdSe+Rt6DKOZA7Sgs_kERoVA@mail.gmail.com/
I am also wondering if aligning the stats_updates atomic will help.
Right now it may share a cacheline with some items of the
events_pending array. The latter may be dirtied during a flush and
unnecessarily dirty the former, but the chances are slim to be honest.
If it's easy to test such a diff, that would be nice, but I don't
expect a lot of difference:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7cbc7d94eb65..a35fce653262 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -646,7 +646,7 @@ struct memcg_vmstats {
unsigned long events_pending[NR_MEMCG_EVENTS];
/* Stats updates since the last flush */
- atomic64_t stats_updates;
+ atomic64_t stats_updates ____cacheline_aligned_in_smp;
};
/*
On Mon, Oct 23, 2023 at 11:25 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Sun, Oct 22, 2023 at 6:34 PM Feng Tang <feng.tang@intel.com> wrote:
> >
> > On Sat, Oct 21, 2023 at 01:42:58AM +0800, Yosry Ahmed wrote:
> > > On Fri, Oct 20, 2023 at 10:23 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > >
> > > > On Fri, Oct 20, 2023 at 9:18 AM kernel test robot <oliver.sang@intel.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > Hello,
> > > > >
> > > > > kernel test robot noticed a -25.8% regression of will-it-scale.per_thread_ops on:
> > > > >
> > > > >
> > > > > commit: 51d74c18a9c61e7ee33bc90b522dd7f6e5b80bb5 ("[PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg")
> > > > > url: https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257
> > > > > base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
> > > > > patch link: https://lore.kernel.org/all/20231010032117.1577496-4-yosryahmed@google.com/
> > > > > patch subject: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg
> > > > >
> > > > > testcase: will-it-scale
> > > > > test machine: 104 threads 2 sockets (Skylake) with 192G memory
> > > > > parameters:
> > > > >
> > > > > nr_task: 100%
> > > > > mode: thread
> > > > > test: fallocate1
> > > > > cpufreq_governor: performance
> > > > >
> > > > >
> > > > > In addition to that, the commit also has significant impact on the following tests:
> > > > >
> > > > > +------------------+---------------------------------------------------------------+
> > > > > | testcase: change | will-it-scale: will-it-scale.per_thread_ops -30.0% regression |
> > > > > | test machine | 104 threads 2 sockets (Skylake) with 192G memory |
> > > > > | test parameters | cpufreq_governor=performance |
> > > > > | | mode=thread |
> > > > > | | nr_task=50% |
> > > > > | | test=fallocate1 |
> > > > > +------------------+---------------------------------------------------------------+
> > > > >
> > > >
> > > > Yosry, I don't think 25% to 30% regression can be ignored. Unless
> > > > there is a quick fix, IMO this series should be skipped for the
> > > > upcoming kernel open window.
> > >
> > > I am currently looking into it. It's reasonable to skip the next merge
> > > window if a quick fix isn't found soon.
> > >
> > > I am surprised by the size of the regression given the following:
> > > 1.12 ± 5% +1.4 2.50 ± 2%
> > > perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
> > >
> > > IIUC we are only spending 1% more time in __mod_memcg_lruvec_state().
> >
> > Yes, this is kind of confusing. We have seen similar cases before,
> > especially for micro-benchmarks like will-it-scale, stress-ng, netperf,
> > etc., where changes to functions in the hot path were greatly amplified
> > in the final benchmark score.
> >
> > In a netperf case, https://lore.kernel.org/lkml/20220619150456.GB34471@xsang-OptiPlex-9020/
> > the affected functions showed around a 10% change in perf's cpu-cycles,
> > yet triggered a 69% regression. IIRC, micro-benchmarks are very sensitive
> > to those statistics updates, like memcg's and vmstat's.
> >
>
> Thanks for clarifying. I am still trying to reproduce locally but I am
> running into some quirks with tooling. I may have to run a modified
> version of the fallocate test manually. Meanwhile, I noticed that the
> patch was tested without the fixlet that I posted [1] for it,
> understandably. Would it be possible to get some numbers with that
> fixlet? It should reduce the total number of contended atomic
> operations, so it may help.
>
> [1] https://lore.kernel.org/lkml/CAJD7tkZDarDn_38ntFg5bK2fAmFdSe+Rt6DKOZA7Sgs_kERoVA@mail.gmail.com/
>
> I am also wondering if aligning the stats_updates atomic will help.
> Right now it may share a cacheline with some items of the
> events_pending array. The latter may be dirtied during a flush and
> unnecessarily dirty the former, but the chances are slim to be honest.
> If it's easy to test such a diff, that would be nice, but I don't
> expect a lot of difference:
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7cbc7d94eb65..a35fce653262 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -646,7 +646,7 @@ struct memcg_vmstats {
> unsigned long events_pending[NR_MEMCG_EVENTS];
>
> /* Stats updates since the last flush */
> - atomic64_t stats_updates;
> + atomic64_t stats_updates ____cacheline_aligned_in_smp;
> };
>
> /*
I still could not run the benchmark, but I used a version of
fallocate1.c that does 1 million iterations. I ran 100 in parallel.
This showed ~13% regression with the patch, so not the same as the
will-it-scale version, but it could be an indicator.
With that, I did not see any improvement with the fixlet above or
____cacheline_aligned_in_smp. So you can scratch that.
I did, however, see some improvement with reducing the indirection
layers by moving stats_updates directly into struct mem_cgroup. The
regression in my manual testing went down to 9%. Still not great, but
I am wondering how this reflects on the benchmark. If you're able to
test it that would be great, the diff is below. Meanwhile I am still
looking for other improvements that can be made.
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f64ac140083e..b4dfcd8b9cc1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -270,6 +270,9 @@ struct mem_cgroup {
CACHELINE_PADDING(_pad1_);
+ /* Stats updates since the last flush */
+ atomic64_t stats_updates;
+
/* memory.stat */
struct memcg_vmstats *vmstats;
@@ -309,6 +312,7 @@ struct mem_cgroup {
atomic_t moving_account;
struct task_struct *move_lock_task;
+ unsigned int __percpu *stats_updates_percpu;
struct memcg_vmstats_percpu __percpu *vmstats_percpu;
#ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7cbc7d94eb65..e5d2f3d4d874 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -627,9 +627,6 @@ struct memcg_vmstats_percpu {
/* Cgroup1: threshold notifications & softlimit tree updates */
unsigned long nr_page_events;
unsigned long targets[MEM_CGROUP_NTARGETS];
-
- /* Stats updates since the last flush */
- unsigned int stats_updates;
};
struct memcg_vmstats {
@@ -644,9 +641,6 @@ struct memcg_vmstats {
/* Pending child counts during tree propagation */
long state_pending[MEMCG_NR_STAT];
unsigned long events_pending[NR_MEMCG_EVENTS];
-
- /* Stats updates since the last flush */
- atomic64_t stats_updates;
};
/*
@@ -695,14 +689,14 @@ static void memcg_stats_unlock(void)
static bool memcg_should_flush_stats(struct mem_cgroup *memcg)
{
- return atomic64_read(&memcg->vmstats->stats_updates) >
+ return atomic64_read(&memcg->stats_updates) >
MEMCG_CHARGE_BATCH * num_online_cpus();
}
static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
{
int cpu = smp_processor_id();
- unsigned int x;
+ unsigned int *stats_updates_percpu;
if (!val)
return;
@@ -710,10 +704,10 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
cgroup_rstat_updated(memcg->css.cgroup, cpu);
for (; memcg; memcg = parent_mem_cgroup(memcg)) {
- x = __this_cpu_add_return(memcg->vmstats_percpu->stats_updates,
- abs(val));
+ stats_updates_percpu = this_cpu_ptr(memcg->stats_updates_percpu);
- if (x < MEMCG_CHARGE_BATCH)
+ *stats_updates_percpu += abs(val);
+ if (*stats_updates_percpu < MEMCG_CHARGE_BATCH)
continue;
/*
@@ -721,8 +715,8 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
* redundant. Avoid the overhead of the atomic update.
*/
if (!memcg_should_flush_stats(memcg))
- atomic64_add(x, &memcg->vmstats->stats_updates);
- __this_cpu_write(memcg->vmstats_percpu->stats_updates, 0);
+ atomic64_add(*stats_updates_percpu, &memcg->stats_updates);
+ *stats_updates_percpu = 0;
}
}
@@ -5467,6 +5461,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
free_mem_cgroup_per_node_info(memcg, node);
kfree(memcg->vmstats);
free_percpu(memcg->vmstats_percpu);
+ free_percpu(memcg->stats_updates_percpu);
kfree(memcg);
}
@@ -5504,6 +5499,11 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
if (!memcg->vmstats_percpu)
goto fail;
+ memcg->stats_updates_percpu = alloc_percpu_gfp(unsigned int,
+ GFP_KERNEL_ACCOUNT);
+ if (!memcg->stats_updates_percpu)
+ goto fail;
+
for_each_node(node)
if (alloc_mem_cgroup_per_node_info(memcg, node))
goto fail;
@@ -5735,10 +5735,12 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
struct mem_cgroup *parent = parent_mem_cgroup(memcg);
struct memcg_vmstats_percpu *statc;
+ int *stats_updates_percpu;
long delta, delta_cpu, v;
int i, nid;
statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
+ stats_updates_percpu = per_cpu_ptr(memcg->stats_updates_percpu, cpu);
for (i = 0; i < MEMCG_NR_STAT; i++) {
/*
@@ -5826,10 +5828,10 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
}
}
}
- statc->stats_updates = 0;
+ *stats_updates_percpu = 0;
/* We are in a per-cpu loop here, only do the atomic write once */
- if (atomic64_read(&memcg->vmstats->stats_updates))
- atomic64_set(&memcg->vmstats->stats_updates, 0);
+ if (atomic64_read(&memcg->stats_updates))
+ atomic64_set(&memcg->stats_updates, 0);
}
#ifdef CONFIG_MMU
hi, Yosry Ahmed,
On Mon, Oct 23, 2023 at 07:13:50PM -0700, Yosry Ahmed wrote:
...
>
> I still could not run the benchmark, but I used a version of
> fallocate1.c that does 1 million iterations. I ran 100 in parallel.
> This showed ~13% regression with the patch, so not the same as the
> will-it-scale version, but it could be an indicator.
>
> With that, I did not see any improvement with the fixlet above or
> ____cacheline_aligned_in_smp. So you can scratch that.
>
> I did, however, see some improvement with reducing the indirection
> layers by moving stats_updates directly into struct mem_cgroup. The
> regression in my manual testing went down to 9%. Still not great, but
> I am wondering how this reflects on the benchmark. If you're able to
> test it that would be great, the diff is below. Meanwhile I am still
> looking for other improvements that can be made.
We applied the previous patch set as below:
c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything <---- the base our tool picked for the patch set
I tried to apply the below patch to either 51d74c18a9c61 or c5f50d8b23c79,
but failed. Could you guide us on how to apply this patch?
Thanks
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index f64ac140083e..b4dfcd8b9cc1 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -270,6 +270,9 @@ struct mem_cgroup {
>
> CACHELINE_PADDING(_pad1_);
>
> + /* Stats updates since the last flush */
> + atomic64_t stats_updates;
> +
> /* memory.stat */
> struct memcg_vmstats *vmstats;
>
> @@ -309,6 +312,7 @@ struct mem_cgroup {
> atomic_t moving_account;
> struct task_struct *move_lock_task;
>
> + unsigned int __percpu *stats_updates_percpu;
> struct memcg_vmstats_percpu __percpu *vmstats_percpu;
>
> #ifdef CONFIG_CGROUP_WRITEBACK
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7cbc7d94eb65..e5d2f3d4d874 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -627,9 +627,6 @@ struct memcg_vmstats_percpu {
> /* Cgroup1: threshold notifications & softlimit tree updates */
> unsigned long nr_page_events;
> unsigned long targets[MEM_CGROUP_NTARGETS];
> -
> - /* Stats updates since the last flush */
> - unsigned int stats_updates;
> };
>
> struct memcg_vmstats {
> @@ -644,9 +641,6 @@ struct memcg_vmstats {
> /* Pending child counts during tree propagation */
> long state_pending[MEMCG_NR_STAT];
> unsigned long events_pending[NR_MEMCG_EVENTS];
> -
> - /* Stats updates since the last flush */
> - atomic64_t stats_updates;
> };
>
> /*
> @@ -695,14 +689,14 @@ static void memcg_stats_unlock(void)
>
> static bool memcg_should_flush_stats(struct mem_cgroup *memcg)
> {
> - return atomic64_read(&memcg->vmstats->stats_updates) >
> + return atomic64_read(&memcg->stats_updates) >
> MEMCG_CHARGE_BATCH * num_online_cpus();
> }
>
> static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> {
> int cpu = smp_processor_id();
> - unsigned int x;
> + unsigned int *stats_updates_percpu;
>
> if (!val)
> return;
> @@ -710,10 +704,10 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> cgroup_rstat_updated(memcg->css.cgroup, cpu);
>
> for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> - x = __this_cpu_add_return(memcg->vmstats_percpu->stats_updates,
> - abs(val));
> + stats_updates_percpu = this_cpu_ptr(memcg->stats_updates_percpu);
>
> - if (x < MEMCG_CHARGE_BATCH)
> + *stats_updates_percpu += abs(val);
> + if (*stats_updates_percpu < MEMCG_CHARGE_BATCH)
> continue;
>
> /*
> @@ -721,8 +715,8 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> * redundant. Avoid the overhead of the atomic update.
> */
> if (!memcg_should_flush_stats(memcg))
> - atomic64_add(x, &memcg->vmstats->stats_updates);
> - __this_cpu_write(memcg->vmstats_percpu->stats_updates, 0);
> + atomic64_add(*stats_updates_percpu, &memcg->stats_updates);
> + *stats_updates_percpu = 0;
> }
> }
>
> @@ -5467,6 +5461,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
> free_mem_cgroup_per_node_info(memcg, node);
> kfree(memcg->vmstats);
> free_percpu(memcg->vmstats_percpu);
> + free_percpu(memcg->stats_updates_percpu);
> kfree(memcg);
> }
>
> @@ -5504,6 +5499,11 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
> if (!memcg->vmstats_percpu)
> goto fail;
>
> + memcg->stats_updates_percpu = alloc_percpu_gfp(unsigned int,
> + GFP_KERNEL_ACCOUNT);
> + if (!memcg->stats_updates_percpu)
> + goto fail;
> +
> for_each_node(node)
> if (alloc_mem_cgroup_per_node_info(memcg, node))
> goto fail;
> @@ -5735,10 +5735,12 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> struct mem_cgroup *parent = parent_mem_cgroup(memcg);
> struct memcg_vmstats_percpu *statc;
> + int *stats_updates_percpu;
> long delta, delta_cpu, v;
> int i, nid;
>
> statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
> + stats_updates_percpu = per_cpu_ptr(memcg->stats_updates_percpu, cpu);
>
> for (i = 0; i < MEMCG_NR_STAT; i++) {
> /*
> @@ -5826,10 +5828,10 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> }
> }
> }
> - statc->stats_updates = 0;
> + *stats_updates_percpu = 0;
> /* We are in a per-cpu loop here, only do the atomic write once */
> - if (atomic64_read(&memcg->vmstats->stats_updates))
> - atomic64_set(&memcg->vmstats->stats_updates, 0);
> + if (atomic64_read(&memcg->stats_updates))
> + atomic64_set(&memcg->stats_updates, 0);
> }
>
> #ifdef CONFIG_MMU
>
On Mon, Oct 23, 2023 at 11:56 PM Oliver Sang <oliver.sang@intel.com> wrote:
>
> hi, Yosry Ahmed,
>
> On Mon, Oct 23, 2023 at 07:13:50PM -0700, Yosry Ahmed wrote:
>
> ...
>
> >
> > I still could not run the benchmark, but I used a version of
> > fallocate1.c that does 1 million iterations. I ran 100 in parallel.
> > This showed ~13% regression with the patch, so not the same as the
> > will-it-scale version, but it could be an indicator.
> >
> > With that, I did not see any improvement with the fixlet above or
> > ____cacheline_aligned_in_smp. So you can scratch that.
> >
> > I did, however, see some improvement with reducing the indirection
> > layers by moving stats_updates directly into struct mem_cgroup. The
> > regression in my manual testing went down to 9%. Still not great, but
> > I am wondering how this reflects on the benchmark. If you're able to
> > test it that would be great, the diff is below. Meanwhile I am still
> > looking for other improvements that can be made.
>
> we applied previous patch-set as below:
>
> c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
> ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
> 51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
> 130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
> 26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
> 25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything <---- the base our tool picked for the patch set
>
> I tried to apply below patch to either 51d74c18a9c61 or c5f50d8b23c79,
> but failed. could you guide how to apply this patch?
> Thanks
>
Thanks for looking into this. I rebased the diff on top of
c5f50d8b23c79. Please find it attached.
From 0b0dffdfe192382a3aacfa313beee68b33bf7d86 Mon Sep 17 00:00:00 2001
From: Yosry Ahmed <yosryahmed@google.com>
Date: Tue, 24 Oct 2023 07:02:02 +0000
Subject: [PATCH] memcg: move stats_updates to struct mem_cgroup
---
include/linux/memcontrol.h | 4 ++++
mm/memcontrol.c | 33 ++++++++++++++++++---------------
2 files changed, 22 insertions(+), 15 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f64ac140083ee..b4dfcd8b9cc1c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -270,6 +270,9 @@ struct mem_cgroup {
CACHELINE_PADDING(_pad1_);
+ /* Stats updates since the last flush */
+ atomic64_t stats_updates;
+
/* memory.stat */
struct memcg_vmstats *vmstats;
@@ -309,6 +312,7 @@ struct mem_cgroup {
atomic_t moving_account;
struct task_struct *move_lock_task;
+ unsigned int __percpu *stats_updates_percpu;
struct memcg_vmstats_percpu __percpu *vmstats_percpu;
#ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 182b4f215fc64..e5d2f3d4d8747 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -627,9 +627,6 @@ struct memcg_vmstats_percpu {
/* Cgroup1: threshold notifications & softlimit tree updates */
unsigned long nr_page_events;
unsigned long targets[MEM_CGROUP_NTARGETS];
-
- /* Stats updates since the last flush */
- unsigned int stats_updates;
};
struct memcg_vmstats {
@@ -644,9 +641,6 @@ struct memcg_vmstats {
/* Pending child counts during tree propagation */
long state_pending[MEMCG_NR_STAT];
unsigned long events_pending[NR_MEMCG_EVENTS];
-
- /* Stats updates since the last flush */
- atomic64_t stats_updates;
};
/*
@@ -695,14 +689,14 @@ static void memcg_stats_unlock(void)
static bool memcg_should_flush_stats(struct mem_cgroup *memcg)
{
- return atomic64_read(&memcg->vmstats->stats_updates) >
+ return atomic64_read(&memcg->stats_updates) >
MEMCG_CHARGE_BATCH * num_online_cpus();
}
static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
{
int cpu = smp_processor_id();
- unsigned int x;
+ unsigned int *stats_updates_percpu;
if (!val)
return;
@@ -710,10 +704,10 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
cgroup_rstat_updated(memcg->css.cgroup, cpu);
for (; memcg; memcg = parent_mem_cgroup(memcg)) {
- x = __this_cpu_add_return(memcg->vmstats_percpu->stats_updates,
- abs(val));
+ stats_updates_percpu = this_cpu_ptr(memcg->stats_updates_percpu);
- if (x < MEMCG_CHARGE_BATCH)
+ *stats_updates_percpu += abs(val);
+ if (*stats_updates_percpu < MEMCG_CHARGE_BATCH)
continue;
/*
@@ -721,8 +715,8 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
* redundant. Avoid the overhead of the atomic update.
*/
if (!memcg_should_flush_stats(memcg))
- atomic64_add(x, &memcg->vmstats->stats_updates);
- __this_cpu_write(memcg->vmstats_percpu->stats_updates, 0);
+ atomic64_add(*stats_updates_percpu, &memcg->stats_updates);
+ *stats_updates_percpu = 0;
}
}
@@ -5467,6 +5461,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
free_mem_cgroup_per_node_info(memcg, node);
kfree(memcg->vmstats);
free_percpu(memcg->vmstats_percpu);
+ free_percpu(memcg->stats_updates_percpu);
kfree(memcg);
}
@@ -5504,6 +5499,11 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
if (!memcg->vmstats_percpu)
goto fail;
+ memcg->stats_updates_percpu = alloc_percpu_gfp(unsigned int,
+ GFP_KERNEL_ACCOUNT);
+ if (!memcg->stats_updates_percpu)
+ goto fail;
+
for_each_node(node)
if (alloc_mem_cgroup_per_node_info(memcg, node))
goto fail;
@@ -5735,10 +5735,12 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
struct mem_cgroup *parent = parent_mem_cgroup(memcg);
struct memcg_vmstats_percpu *statc;
+ int *stats_updates_percpu;
long delta, delta_cpu, v;
int i, nid;
statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
+ stats_updates_percpu = per_cpu_ptr(memcg->stats_updates_percpu, cpu);
for (i = 0; i < MEMCG_NR_STAT; i++) {
/*
@@ -5826,9 +5828,10 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
}
}
}
+ *stats_updates_percpu = 0;
/* We are in a per-cpu loop here, only do the atomic write once */
- if (atomic64_read(&memcg->vmstats->stats_updates))
- atomic64_set(&memcg->vmstats->stats_updates, 0);
+ if (atomic64_read(&memcg->stats_updates))
+ atomic64_set(&memcg->stats_updates, 0);
}
#ifdef CONFIG_MMU
--
2.42.0.758.gaed0368e0e-goog
hi, Yosry Ahmed,
On Tue, Oct 24, 2023 at 12:14:42AM -0700, Yosry Ahmed wrote:
> On Mon, Oct 23, 2023 at 11:56 PM Oliver Sang <oliver.sang@intel.com> wrote:
> >
> > hi, Yosry Ahmed,
> >
> > On Mon, Oct 23, 2023 at 07:13:50PM -0700, Yosry Ahmed wrote:
> >
> > ...
> >
> > >
> > > I still could not run the benchmark, but I used a version of
> > > fallocate1.c that does 1 million iterations. I ran 100 in parallel.
> > > This showed ~13% regression with the patch, so not the same as the
> > > will-it-scale version, but it could be an indicator.
> > >
> > > With that, I did not see any improvement with the fixlet above or
> > > ____cacheline_aligned_in_smp. So you can scratch that.
> > >
> > > I did, however, see some improvement with reducing the indirection
> > > layers by moving stats_updates directly into struct mem_cgroup. The
> > > regression in my manual testing went down to 9%. Still not great, but
> > > I am wondering how this reflects on the benchmark. If you're able to
> > > test it that would be great, the diff is below. Meanwhile I am still
> > > looking for other improvements that can be made.
> >
> > we applied previous patch-set as below:
> >
> > c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
> > ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
> > 51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
> > 130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
> > 26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
> > 25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything <---- the base our tool picked for the patch set
> >
> > I tried to apply below patch to either 51d74c18a9c61 or c5f50d8b23c79,
> > but failed. could you guide how to apply this patch?
> > Thanks
> >
>
> Thanks for looking into this. I rebased the diff on top of
> c5f50d8b23c79. Please find it attached.
From our tests, this patch has little impact.
It was applied as commit ac6a9444dec85 below:
ac6a9444dec85 (linux-devel/fixup-c5f50d8b23c79) memcg: move stats_updates to struct mem_cgroup
c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything
For the first regression reported in the original report, the data are very close
for 51d74c18a9c61, c5f50d8b23c79 (patch set tip, parent of ac6a9444dec85),
and ac6a9444dec85.
full comparison is as [1]
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-8.3/thread/100%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale
130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 c5f50d8b23c7982ac875791755b ac6a9444dec85dc50c6bfbc4ee7
---------------- --------------------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev %change %stddev
\ | \ | \ | \
36509 -25.8% 27079 -25.2% 27305 -25.0% 27383 will-it-scale.per_thread_ops
For the second regression reported in the original report, there seems to be
a small impact from ac6a9444dec85.
full comparison is as [2]
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-8.3/thread/50%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale
130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 c5f50d8b23c7982ac875791755b ac6a9444dec85dc50c6bfbc4ee7
---------------- --------------------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev %change %stddev
\ | \ | \ | \
76580 -30.0% 53575 -28.9% 54415 -26.7% 56152 will-it-scale.per_thread_ops
[1]
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-8.3/thread/100%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale
130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 c5f50d8b23c7982ac875791755b ac6a9444dec85dc50c6bfbc4ee7
---------------- --------------------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev %change %stddev
\ | \ | \ | \
2.09 -0.5 1.61 ± 2% -0.5 1.61 -0.5 1.60 mpstat.cpu.all.usr%
3324 -10.0% 2993 +3.6% 3444 ± 20% -6.2% 3118 ± 4% vmstat.system.cs
120.83 ± 11% +79.6% 217.00 ± 9% +105.8% 248.67 ± 10% +115.2% 260.00 ± 10% perf-c2c.DRAM.local
594.50 ± 6% +43.8% 854.83 ± 5% +56.6% 931.17 ± 10% +21.2% 720.67 ± 7% perf-c2c.DRAM.remote
-16.64 +39.7% -23.25 +177.3% -46.14 +13.9% -18.94 sched_debug.cpu.nr_uninterruptible.min
6.59 ± 13% +6.5% 7.02 ± 11% +84.7% 12.18 ± 51% -6.6% 6.16 ± 10% sched_debug.cpu.nr_uninterruptible.stddev
0.04 -20.8% 0.03 ± 11% -20.8% 0.03 ± 11% -25.0% 0.03 turbostat.IPC
27.58 +3.7% 28.59 +4.2% 28.74 +3.8% 28.63 turbostat.RAMWatt
71000 ± 68% +66.4% 118174 ± 60% -49.8% 35634 ± 13% -59.9% 28485 ± 10% numa-meminfo.node0.AnonHugePages
1056 -100.0% 0.00 +1.9% 1076 -12.6% 923.33 ± 44% numa-meminfo.node0.Inactive(file)
6.67 ±141% +15799.3% 1059 -100.0% 0.00 +2669.8% 184.65 ±223% numa-meminfo.node1.Inactive(file)
3797041 -25.8% 2816352 -25.2% 2839803 -25.0% 2847955 will-it-scale.104.threads
36509 -25.8% 27079 -25.2% 27305 -25.0% 27383 will-it-scale.per_thread_ops
3797041 -25.8% 2816352 -25.2% 2839803 -25.0% 2847955 will-it-scale.workload
1.142e+09 -26.2% 8.437e+08 -26.6% 8.391e+08 -25.7% 8.489e+08 numa-numastat.node0.local_node
1.143e+09 -26.1% 8.439e+08 -26.6% 8.392e+08 -25.7% 8.491e+08 numa-numastat.node0.numa_hit
1.148e+09 -25.4% 8.563e+08 ± 2% -23.7% 8.756e+08 ± 2% -24.2% 8.702e+08 numa-numastat.node1.local_node
1.149e+09 -25.4% 8.564e+08 ± 2% -23.8% 8.758e+08 ± 2% -24.2% 8.707e+08 numa-numastat.node1.numa_hit
10842 +0.9% 10941 +2.9% 11153 ± 2% +0.3% 10873 proc-vmstat.nr_mapped
32933 -2.6% 32068 +0.1% 32956 ± 2% -1.5% 32450 ± 2% proc-vmstat.nr_slab_reclaimable
2.291e+09 -25.8% 1.7e+09 -25.1% 1.715e+09 -24.9% 1.72e+09 proc-vmstat.numa_hit
2.291e+09 -25.8% 1.7e+09 -25.1% 1.715e+09 -25.0% 1.719e+09 proc-vmstat.numa_local
2.29e+09 -25.8% 1.699e+09 -25.1% 1.714e+09 -24.9% 1.718e+09 proc-vmstat.pgalloc_normal
2.289e+09 -25.8% 1.699e+09 -25.1% 1.714e+09 -24.9% 1.718e+09 proc-vmstat.pgfree
199.33 -100.0% 0.00 -0.3% 198.66 -16.4% 166.67 ± 44% numa-vmstat.node0.nr_active_file
264.00 -100.0% 0.00 +1.9% 269.00 -12.6% 230.83 ± 44% numa-vmstat.node0.nr_inactive_file
199.33 -100.0% 0.00 -0.3% 198.66 -16.4% 166.67 ± 44% numa-vmstat.node0.nr_zone_active_file
264.00 -100.0% 0.00 +1.9% 269.00 -12.6% 230.83 ± 44% numa-vmstat.node0.nr_zone_inactive_file
1.143e+09 -26.1% 8.439e+08 -26.6% 8.392e+08 -25.7% 8.491e+08 numa-vmstat.node0.numa_hit
1.142e+09 -26.2% 8.437e+08 -26.6% 8.391e+08 -25.7% 8.489e+08 numa-vmstat.node0.numa_local
1.67 ±141% +15799.3% 264.99 -100.0% 0.00 +2669.8% 46.16 ±223% numa-vmstat.node1.nr_inactive_file
1.67 ±141% +15799.3% 264.99 -100.0% 0.00 +2669.8% 46.16 ±223% numa-vmstat.node1.nr_zone_inactive_file
1.149e+09 -25.4% 8.564e+08 ± 2% -23.8% 8.758e+08 ± 2% -24.2% 8.707e+08 numa-vmstat.node1.numa_hit
1.148e+09 -25.4% 8.563e+08 ± 2% -23.7% 8.756e+08 ± 2% -24.2% 8.702e+08 numa-vmstat.node1.numa_local
0.04 ±108% -76.2% 0.01 ± 23% +154.8% 0.10 ± 34% +110.0% 0.08 ± 88% perf-sched.sch_delay.avg.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
1.00 ± 93% +154.2% 2.55 ± 16% +133.4% 2.34 ± 39% +174.6% 2.76 ± 22% perf-sched.sch_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
0.71 ±131% -91.3% 0.06 ± 74% +184.4% 2.02 ± 40% +122.6% 1.58 ± 98% perf-sched.sch_delay.max.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
1.84 ± 45% +35.2% 2.48 ± 31% +66.1% 3.05 ± 25% +61.9% 2.98 ± 10% perf-sched.sch_delay.max.ms.schedule_hrtimeout_range_clock.do_select.core_sys_select.kern_select
191.10 ± 2% +18.0% 225.55 ± 2% +18.9% 227.22 ± 4% +19.8% 228.89 ± 4% perf-sched.wait_and_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
3484 -7.8% 3211 ± 6% -7.3% 3230 ± 7% -11.0% 3101 ± 3% perf-sched.wait_and_delay.count.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
385.50 ± 14% +39.6% 538.17 ± 12% +104.5% 788.17 ± 54% +30.9% 504.67 ± 41% perf-sched.wait_and_delay.count.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
3784 -7.5% 3500 ± 6% -7.1% 3516 ± 2% -10.6% 3383 ± 4% perf-sched.wait_and_delay.count.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
118.67 ± 11% -62.6% 44.33 ±100% -45.9% 64.17 ± 71% -64.9% 41.67 ±100% perf-sched.wait_and_delay.count.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
5043 ± 2% -13.0% 4387 ± 6% -14.7% 4301 ± 3% -16.5% 4210 ± 4% perf-sched.wait_and_delay.count.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
167.12 ±222% +200.1% 501.48 ± 99% +2.9% 171.99 ±215% +399.7% 835.05 ± 44% perf-sched.wait_and_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
2.17 ± 21% +8.9% 2.36 ± 16% +94.3% 4.21 ± 36% +40.4% 3.04 ± 21% perf-sched.wait_time.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
191.09 ± 2% +18.0% 225.53 ± 2% +18.9% 227.21 ± 4% +19.8% 228.88 ± 4% perf-sched.wait_time.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
293.46 ± 4% +12.8% 330.98 ± 6% +21.0% 355.18 ± 16% +7.1% 314.31 ± 26% perf-sched.wait_time.max.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
30.33 ±105% -35.1% 19.69 ±138% +494.1% 180.18 ± 79% +135.5% 71.43 ± 76% perf-sched.wait_time.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
0.59 ± 3% +125.2% 1.32 ± 2% +139.3% 1.41 +128.6% 1.34 perf-stat.i.MPKI
9.027e+09 -17.9% 7.408e+09 -17.5% 7.446e+09 -17.3% 7.465e+09 perf-stat.i.branch-instructions
0.64 -0.0 0.60 -0.0 0.60 -0.0 0.60 perf-stat.i.branch-miss-rate%
58102855 -23.3% 44580037 ± 2% -23.4% 44524712 ± 2% -22.9% 44801374 perf-stat.i.branch-misses
15.28 +7.0 22.27 +7.9 23.14 +7.2 22.50 perf-stat.i.cache-miss-rate%
25155306 ± 2% +82.7% 45953601 ± 3% +95.2% 49105558 ± 2% +87.7% 47212483 perf-stat.i.cache-misses
1.644e+08 +25.4% 2.062e+08 ± 2% +29.0% 2.12e+08 +27.6% 2.098e+08 perf-stat.i.cache-references
3258 -10.3% 2921 +2.5% 3341 ± 19% -6.7% 3041 ± 4% perf-stat.i.context-switches
6.73 +23.3% 8.30 +22.7% 8.26 +21.8% 8.20 perf-stat.i.cpi
145.97 -1.3% 144.13 -1.4% 143.89 -1.2% 144.29 perf-stat.i.cpu-migrations
11519 ± 3% -45.4% 6293 ± 3% -48.9% 5892 ± 2% -46.9% 6118 perf-stat.i.cycles-between-cache-misses
0.04 -0.0 0.03 -0.0 0.03 -0.0 0.03 perf-stat.i.dTLB-load-miss-rate%
3921408 -25.3% 2929564 -24.6% 2957991 -24.5% 2961168 perf-stat.i.dTLB-load-misses
1.098e+10 -18.1% 8.993e+09 -17.6% 9.045e+09 -16.3% 9.185e+09 perf-stat.i.dTLB-loads
0.00 ± 2% +0.0 0.00 ± 4% +0.0 0.00 ± 5% +0.0 0.00 ± 3% perf-stat.i.dTLB-store-miss-rate%
5.606e+09 -23.2% 4.304e+09 -22.6% 4.338e+09 -22.4% 4.349e+09 perf-stat.i.dTLB-stores
95.65 -1.2 94.49 -0.9 94.74 -0.8 94.87 perf-stat.i.iTLB-load-miss-rate%
3876741 -25.0% 2905764 -24.8% 2915184 -25.0% 2909099 perf-stat.i.iTLB-load-misses
4.286e+10 -18.9% 3.477e+10 -18.4% 3.496e+10 -17.9% 3.517e+10 perf-stat.i.instructions
11061 +8.2% 11969 +8.4% 11996 +9.3% 12091 perf-stat.i.instructions-per-iTLB-miss
0.15 -18.9% 0.12 -18.5% 0.12 -17.9% 0.12 perf-stat.i.ipc
0.01 ± 96% -8.9% 0.01 ± 96% +72.3% 0.01 ± 73% +174.6% 0.02 ± 32% perf-stat.i.major-faults
48.65 ± 2% +46.2% 71.11 ± 2% +57.0% 76.37 ± 2% +45.4% 70.72 perf-stat.i.metric.K/sec
247.84 -18.9% 201.05 -18.4% 202.30 -17.7% 203.92 perf-stat.i.metric.M/sec
89.33 +0.5 89.79 -0.7 88.67 -2.1 87.23 perf-stat.i.node-load-miss-rate%
3138385 ± 2% +77.7% 5578401 ± 2% +89.9% 5958861 ± 2% +70.9% 5363943 perf-stat.i.node-load-misses
375827 ± 3% +69.2% 635857 ± 11% +102.6% 761334 ± 4% +109.3% 786773 ± 5% perf-stat.i.node-loads
1343194 -26.8% 983668 -22.6% 1039799 ± 2% -23.6% 1026076 perf-stat.i.node-store-misses
51550 ± 3% -19.0% 41748 ± 7% -22.5% 39954 ± 4% -20.6% 40921 ± 7% perf-stat.i.node-stores
0.59 ± 3% +125.1% 1.32 ± 2% +139.2% 1.40 +128.7% 1.34 perf-stat.overall.MPKI
0.64 -0.0 0.60 -0.0 0.60 -0.0 0.60 perf-stat.overall.branch-miss-rate%
15.30 +7.0 22.28 +7.9 23.16 +7.2 22.50 perf-stat.overall.cache-miss-rate%
6.73 +23.3% 8.29 +22.6% 8.25 +21.9% 8.20 perf-stat.overall.cpi
11470 ± 2% -45.3% 6279 ± 2% -48.8% 5875 ± 2% -46.7% 6108 perf-stat.overall.cycles-between-cache-misses
0.04 -0.0 0.03 -0.0 0.03 -0.0 0.03 perf-stat.overall.dTLB-load-miss-rate%
0.00 ± 2% +0.0 0.00 ± 4% +0.0 0.00 ± 5% +0.0 0.00 ± 4% perf-stat.overall.dTLB-store-miss-rate%
95.56 -1.4 94.17 -1.0 94.56 -0.9 94.66 perf-stat.overall.iTLB-load-miss-rate%
11059 +8.2% 11967 +8.5% 11994 +9.3% 12091 perf-stat.overall.instructions-per-iTLB-miss
0.15 -18.9% 0.12 -18.4% 0.12 -17.9% 0.12 perf-stat.overall.ipc
89.29 +0.5 89.78 -0.6 88.67 -2.1 87.20 perf-stat.overall.node-load-miss-rate%
3396437 +9.5% 3718021 +9.1% 3705386 +9.6% 3721307 perf-stat.overall.path-length
8.997e+09 -17.9% 7.383e+09 -17.5% 7.421e+09 -17.3% 7.44e+09 perf-stat.ps.branch-instructions
57910417 -23.3% 44426577 ± 2% -23.4% 44376780 ± 2% -22.9% 44649215 perf-stat.ps.branch-misses
25075498 ± 2% +82.7% 45803186 ± 3% +95.2% 48942749 ± 2% +87.7% 47057228 perf-stat.ps.cache-misses
1.639e+08 +25.4% 2.056e+08 ± 2% +28.9% 2.113e+08 +27.6% 2.091e+08 perf-stat.ps.cache-references
3247 -10.3% 2911 +2.5% 3329 ± 19% -6.7% 3030 ± 4% perf-stat.ps.context-switches
145.47 -1.3% 143.61 -1.4% 143.38 -1.2% 143.70 perf-stat.ps.cpu-migrations
3908900 -25.3% 2920218 -24.6% 2949112 -24.5% 2951633 perf-stat.ps.dTLB-load-misses
1.094e+10 -18.1% 8.963e+09 -17.6% 9.014e+09 -16.3% 9.154e+09 perf-stat.ps.dTLB-loads
5.587e+09 -23.2% 4.289e+09 -22.6% 4.324e+09 -22.4% 4.335e+09 perf-stat.ps.dTLB-stores
3863663 -25.0% 2895895 -24.8% 2905355 -25.0% 2899323 perf-stat.ps.iTLB-load-misses
4.272e+10 -18.9% 3.466e+10 -18.4% 3.484e+10 -17.9% 3.505e+10 perf-stat.ps.instructions
3128132 ± 2% +77.7% 5559939 ± 2% +89.9% 5938929 ± 2% +70.9% 5346027 perf-stat.ps.node-load-misses
375403 ± 3% +69.0% 634300 ± 11% +102.3% 759484 ± 4% +109.1% 784913 ± 5% perf-stat.ps.node-loads
1338688 -26.8% 980311 -22.6% 1036279 ± 2% -23.6% 1022618 perf-stat.ps.node-store-misses
51546 ± 3% -19.1% 41692 ± 7% -22.6% 39921 ± 4% -20.7% 40875 ± 7% perf-stat.ps.node-stores
1.29e+13 -18.8% 1.047e+13 -18.4% 1.052e+13 -17.8% 1.06e+13 perf-stat.total.instructions
0.96 -0.3 0.70 ± 2% -0.3 0.70 ± 2% -0.3 0.70 perf-profile.calltrace.cycles-pp.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
0.97 -0.3 0.72 -0.2 0.72 -0.2 0.72 perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.fallocate64
0.76 ± 2% -0.2 0.54 ± 3% -0.2 0.59 ± 3% -0.1 0.68 perf-profile.calltrace.cycles-pp.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
0.82 -0.2 0.60 ± 2% -0.2 0.60 ± 2% -0.2 0.60 perf-profile.calltrace.cycles-pp.alloc_pages_mpol.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.91 -0.2 0.72 -0.2 0.72 -0.2 0.70 ± 2% perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
51.50 -0.0 51.47 -0.5 50.99 -0.3 51.21 perf-profile.calltrace.cycles-pp.fallocate64
48.31 +0.0 48.35 +0.5 48.83 +0.3 48.61 perf-profile.calltrace.cycles-pp.ftruncate64
48.29 +0.0 48.34 +0.5 48.81 +0.3 48.60 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.ftruncate64
48.28 +0.0 48.33 +0.5 48.80 +0.3 48.59 perf-profile.calltrace.cycles-pp.do_sys_ftruncate.do_syscall_64.entry_SYSCALL_64_after_hwframe.ftruncate64
48.29 +0.1 48.34 +0.5 48.82 +0.3 48.60 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.ftruncate64
48.28 +0.1 48.33 +0.5 48.80 +0.3 48.58 perf-profile.calltrace.cycles-pp.do_truncate.do_sys_ftruncate.do_syscall_64.entry_SYSCALL_64_after_hwframe.ftruncate64
48.27 +0.1 48.33 +0.5 48.80 +0.3 48.58 perf-profile.calltrace.cycles-pp.notify_change.do_truncate.do_sys_ftruncate.do_syscall_64.entry_SYSCALL_64_after_hwframe
48.27 +0.1 48.32 +0.5 48.80 +0.3 48.58 perf-profile.calltrace.cycles-pp.shmem_setattr.notify_change.do_truncate.do_sys_ftruncate.do_syscall_64
48.25 +0.1 48.31 +0.5 48.78 +0.3 48.57 perf-profile.calltrace.cycles-pp.shmem_undo_range.shmem_setattr.notify_change.do_truncate.do_sys_ftruncate
2.06 ± 2% +0.1 2.13 ± 2% +0.1 2.16 +0.0 2.09 perf-profile.calltrace.cycles-pp.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change.do_truncate
0.68 +0.1 0.76 ± 2% +0.1 0.75 +0.1 0.74 perf-profile.calltrace.cycles-pp.lru_add_fn.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp
1.67 +0.1 1.77 +0.1 1.81 ± 2% +0.0 1.71 perf-profile.calltrace.cycles-pp.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
45.76 +0.1 45.86 +0.5 46.29 +0.4 46.13 perf-profile.calltrace.cycles-pp.__folio_batch_release.shmem_undo_range.shmem_setattr.notify_change.do_truncate
1.78 ± 2% +0.1 1.92 ± 2% +0.2 1.95 +0.1 1.88 perf-profile.calltrace.cycles-pp.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change
0.69 ± 5% +0.1 0.84 ± 4% +0.2 0.86 ± 5% +0.1 0.79 ± 2% perf-profile.calltrace.cycles-pp.get_mem_cgroup_from_mm.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
1.56 ± 2% +0.2 1.76 ± 2% +0.2 1.79 +0.2 1.71 perf-profile.calltrace.cycles-pp.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr
0.85 ± 4% +0.4 1.23 ± 2% +0.4 1.26 ± 3% +0.3 1.14 perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.78 ± 4% +0.4 1.20 ± 3% +0.4 1.22 +0.3 1.11 perf-profile.calltrace.cycles-pp.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range
0.73 ± 4% +0.4 1.17 ± 3% +0.5 1.19 ± 2% +0.4 1.08 perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio
41.60 +0.7 42.30 +0.1 41.73 +0.5 42.06 perf-profile.calltrace.cycles-pp.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
41.50 +0.7 42.23 +0.2 41.66 +0.5 41.99 perf-profile.calltrace.cycles-pp.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
48.39 +0.8 49.14 +0.2 48.64 +0.5 48.89 perf-profile.calltrace.cycles-pp.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
0.00 +0.8 0.77 ± 4% +0.8 0.80 ± 2% +0.8 0.78 ± 2% perf-profile.calltrace.cycles-pp.mem_cgroup_commit_charge.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
40.24 +0.8 41.03 +0.2 40.48 +0.6 40.80 perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp
40.22 +0.8 41.01 +0.2 40.47 +0.6 40.79 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio
0.00 +0.8 0.79 ± 3% +0.8 0.82 ± 3% +0.8 0.79 ± 2% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp
40.19 +0.8 40.98 +0.3 40.44 +0.6 40.76 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru
1.33 ± 5% +0.8 2.13 ± 4% +0.9 2.21 ± 4% +0.8 2.09 ± 2% perf-profile.calltrace.cycles-pp.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
48.16 +0.8 48.98 +0.3 48.48 +0.6 48.72 perf-profile.calltrace.cycles-pp.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
0.00 +0.9 0.88 ± 2% +0.9 0.91 +0.9 0.86 perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio
47.92 +0.9 48.81 +0.4 48.30 +0.6 48.56 perf-profile.calltrace.cycles-pp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe
47.07 +0.9 48.01 +0.5 47.60 +0.7 47.79 perf-profile.calltrace.cycles-pp.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
46.59 +1.1 47.64 +0.7 47.24 +0.8 47.44 perf-profile.calltrace.cycles-pp.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate
0.99 -0.3 0.73 ± 2% -0.3 0.74 -0.3 0.74 perf-profile.children.cycles-pp.syscall_return_via_sysret
0.96 -0.3 0.70 ± 2% -0.3 0.70 ± 2% -0.3 0.71 perf-profile.children.cycles-pp.shmem_alloc_folio
0.78 ± 2% -0.2 0.56 ± 3% -0.2 0.61 ± 3% -0.1 0.69 ± 2% perf-profile.children.cycles-pp.shmem_inode_acct_blocks
0.83 -0.2 0.61 ± 2% -0.2 0.61 ± 2% -0.2 0.62 perf-profile.children.cycles-pp.alloc_pages_mpol
0.92 -0.2 0.73 -0.2 0.73 -0.2 0.71 ± 2% perf-profile.children.cycles-pp.syscall_exit_to_user_mode
0.74 ± 2% -0.2 0.55 ± 2% -0.2 0.56 ± 2% -0.2 0.58 ± 3% perf-profile.children.cycles-pp.xas_store
0.67 -0.2 0.50 ± 3% -0.2 0.50 ± 2% -0.2 0.50 perf-profile.children.cycles-pp.__alloc_pages
0.43 -0.1 0.31 ± 2% -0.1 0.31 -0.1 0.31 perf-profile.children.cycles-pp.__entry_text_start
0.41 ± 2% -0.1 0.30 ± 3% -0.1 0.31 ± 2% -0.1 0.31 ± 2% perf-profile.children.cycles-pp.free_unref_page_list
0.35 -0.1 0.25 ± 2% -0.1 0.25 ± 2% -0.1 0.25 perf-profile.children.cycles-pp.xas_load
0.35 ± 2% -0.1 0.25 ± 4% -0.1 0.25 ± 2% -0.1 0.26 ± 2% perf-profile.children.cycles-pp.__mod_lruvec_state
0.39 -0.1 0.30 ± 2% -0.1 0.29 ± 3% -0.1 0.30 perf-profile.children.cycles-pp.get_page_from_freelist
0.27 ± 2% -0.1 0.19 ± 4% -0.1 0.19 ± 5% -0.1 0.19 ± 3% perf-profile.children.cycles-pp.__mod_node_page_state
0.32 ± 3% -0.1 0.24 ± 3% -0.1 0.25 -0.1 0.26 ± 4% perf-profile.children.cycles-pp.find_lock_entries
0.23 ± 2% -0.1 0.15 ± 4% -0.1 0.16 ± 3% -0.1 0.16 ± 5% perf-profile.children.cycles-pp.xas_descend
0.25 ± 3% -0.1 0.18 ± 3% -0.1 0.18 ± 3% -0.1 0.18 ± 2% perf-profile.children.cycles-pp.__dquot_alloc_space
0.28 ± 3% -0.1 0.20 ± 3% -0.1 0.21 ± 2% -0.1 0.20 ± 2% perf-profile.children.cycles-pp._raw_spin_lock
0.16 ± 3% -0.1 0.10 ± 5% -0.1 0.10 ± 4% -0.1 0.10 ± 4% perf-profile.children.cycles-pp.xas_find_conflict
0.26 ± 2% -0.1 0.20 ± 3% -0.1 0.19 ± 3% -0.1 0.19 perf-profile.children.cycles-pp.filemap_get_entry
0.26 -0.1 0.20 ± 2% -0.1 0.20 ± 4% -0.1 0.20 ± 2% perf-profile.children.cycles-pp.rmqueue
0.20 ± 3% -0.1 0.14 ± 3% -0.0 0.15 ± 3% -0.0 0.16 ± 3% perf-profile.children.cycles-pp.truncate_cleanup_folio
0.19 ± 5% -0.1 0.14 ± 4% -0.0 0.15 ± 5% -0.0 0.15 ± 4% perf-profile.children.cycles-pp.xas_clear_mark
0.17 ± 5% -0.0 0.12 ± 4% -0.0 0.12 ± 6% -0.0 0.13 ± 3% perf-profile.children.cycles-pp.xas_init_marks
0.15 ± 4% -0.0 0.10 ± 4% -0.0 0.10 ± 4% -0.0 0.11 ± 3% perf-profile.children.cycles-pp.free_unref_page_commit
0.15 ± 12% -0.0 0.10 ± 20% -0.1 0.10 ± 15% -0.1 0.10 ± 14% perf-profile.children.cycles-pp._raw_spin_lock_irq
51.56 -0.0 51.51 -0.5 51.03 -0.3 51.26 perf-profile.children.cycles-pp.fallocate64
0.18 ± 3% -0.0 0.14 ± 3% -0.0 0.13 ± 5% -0.0 0.14 ± 2% perf-profile.children.cycles-pp.__cond_resched
0.07 ± 5% -0.0 0.02 ± 99% -0.0 0.04 ± 44% -0.0 0.04 ± 44% perf-profile.children.cycles-pp.xas_find
0.13 ± 2% -0.0 0.09 -0.0 0.10 ± 5% -0.0 0.12 ± 4% perf-profile.children.cycles-pp.security_vm_enough_memory_mm
0.14 ± 4% -0.0 0.10 ± 7% -0.0 0.10 ± 6% -0.0 0.10 ± 3% perf-profile.children.cycles-pp.__fget_light
0.06 ± 6% -0.0 0.02 ± 99% -0.0 0.05 -0.0 0.05 perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
0.12 ± 4% -0.0 0.08 ± 4% -0.0 0.08 ± 4% -0.0 0.08 perf-profile.children.cycles-pp.xas_start
0.08 ± 5% -0.0 0.05 -0.0 0.05 -0.0 0.05 ± 7% perf-profile.children.cycles-pp.__folio_throttle_swaprate
0.12 -0.0 0.08 ± 5% -0.0 0.08 ± 5% -0.0 0.08 ± 5% perf-profile.children.cycles-pp.folio_unlock
0.14 ± 3% -0.0 0.11 ± 3% -0.0 0.11 ± 4% -0.0 0.12 ± 3% perf-profile.children.cycles-pp.try_charge_memcg
0.12 ± 6% -0.0 0.08 ± 5% -0.0 0.09 ± 5% -0.0 0.09 ± 7% perf-profile.children.cycles-pp.free_unref_page_prepare
0.12 ± 3% -0.0 0.09 ± 4% -0.0 0.09 ± 7% -0.0 0.09 perf-profile.children.cycles-pp.noop_dirty_folio
0.20 ± 2% -0.0 0.17 ± 5% -0.0 0.18 -0.0 0.19 ± 2% perf-profile.children.cycles-pp.page_counter_uncharge
0.10 -0.0 0.07 ± 5% -0.0 0.08 ± 8% +0.0 0.10 ± 4% perf-profile.children.cycles-pp.cap_vm_enough_memory
0.09 ± 6% -0.0 0.06 ± 6% -0.0 0.06 ± 7% -0.0 0.06 ± 7% perf-profile.children.cycles-pp._raw_spin_trylock
0.09 ± 5% -0.0 0.06 ± 7% -0.0 0.06 ± 7% -0.0 0.07 ± 7% perf-profile.children.cycles-pp.inode_add_bytes
0.06 ± 6% -0.0 0.03 ± 70% -0.0 0.04 ± 44% -0.0 0.05 ± 7% perf-profile.children.cycles-pp.filemap_free_folio
0.06 ± 6% -0.0 0.03 ± 70% +0.0 0.07 ± 7% +0.1 0.14 ± 6% perf-profile.children.cycles-pp.percpu_counter_add_batch
0.12 ± 3% -0.0 0.10 ± 5% -0.0 0.09 ± 4% -0.0 0.09 perf-profile.children.cycles-pp.shmem_recalc_inode
0.12 ± 3% -0.0 0.09 ± 5% -0.0 0.09 ± 5% -0.0 0.10 ± 4% perf-profile.children.cycles-pp.__folio_cancel_dirty
0.09 ± 5% -0.0 0.07 ± 7% -0.0 0.09 ± 4% +0.1 0.16 ± 7% perf-profile.children.cycles-pp.__vm_enough_memory
0.08 ± 5% -0.0 0.06 -0.0 0.06 ± 6% -0.0 0.06 ± 6% perf-profile.children.cycles-pp.security_file_permission
0.08 ± 5% -0.0 0.06 -0.0 0.06 ± 6% -0.0 0.06 ± 6% perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
0.08 ± 6% -0.0 0.05 ± 7% -0.0 0.05 ± 8% -0.0 0.05 ± 7% perf-profile.children.cycles-pp.apparmor_file_permission
0.09 ± 4% -0.0 0.07 ± 8% -0.0 0.09 ± 8% -0.0 0.07 ± 6% perf-profile.children.cycles-pp.__percpu_counter_limited_add
0.08 ± 6% -0.0 0.06 ± 8% -0.0 0.06 -0.0 0.06 ± 6% perf-profile.children.cycles-pp.__list_add_valid_or_report
0.07 ± 8% -0.0 0.05 -0.0 0.05 ± 7% -0.0 0.06 ± 9% perf-profile.children.cycles-pp.get_pfnblock_flags_mask
0.14 ± 3% -0.0 0.12 ± 6% -0.0 0.12 ± 3% -0.0 0.13 ± 3% perf-profile.children.cycles-pp.cgroup_rstat_updated
0.07 ± 5% -0.0 0.05 -0.0 0.05 -0.0 0.05 perf-profile.children.cycles-pp.policy_nodemask
0.24 ± 2% -0.0 0.22 ± 2% -0.0 0.22 ± 2% -0.0 0.22 ± 2% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
0.08 -0.0 0.07 ± 7% -0.0 0.06 ± 6% -0.0 0.07 ± 6% perf-profile.children.cycles-pp.xas_create
0.08 ± 8% -0.0 0.06 ± 7% -0.0 0.06 ± 7% -0.0 0.07 ± 7% perf-profile.children.cycles-pp.mem_cgroup_update_lru_size
0.00 +0.0 0.00 +0.0 0.00 +0.1 0.08 ± 8% perf-profile.children.cycles-pp.__file_remove_privs
0.28 ± 2% +0.0 0.28 ± 4% +0.0 0.30 +0.0 0.30 perf-profile.children.cycles-pp.uncharge_batch
0.14 ± 5% +0.0 0.17 ± 4% +0.0 0.17 ± 2% +0.0 0.16 perf-profile.children.cycles-pp.uncharge_folio
0.43 +0.0 0.46 ± 4% +0.0 0.48 +0.0 0.47 perf-profile.children.cycles-pp.__mem_cgroup_uncharge_list
48.31 +0.0 48.35 +0.5 48.83 +0.3 48.61 perf-profile.children.cycles-pp.ftruncate64
48.28 +0.0 48.33 +0.5 48.80 +0.3 48.59 perf-profile.children.cycles-pp.do_sys_ftruncate
48.28 +0.1 48.33 +0.5 48.80 +0.3 48.58 perf-profile.children.cycles-pp.do_truncate
48.27 +0.1 48.33 +0.5 48.80 +0.3 48.58 perf-profile.children.cycles-pp.notify_change
48.27 +0.1 48.32 +0.5 48.80 +0.3 48.58 perf-profile.children.cycles-pp.shmem_setattr
48.26 +0.1 48.32 +0.5 48.79 +0.3 48.57 perf-profile.children.cycles-pp.shmem_undo_range
2.06 ± 2% +0.1 2.13 ± 2% +0.1 2.16 +0.0 2.10 perf-profile.children.cycles-pp.truncate_inode_folio
0.69 +0.1 0.78 +0.1 0.77 +0.1 0.76 perf-profile.children.cycles-pp.lru_add_fn
1.72 ± 2% +0.1 1.80 +0.1 1.83 ± 2% +0.0 1.74 perf-profile.children.cycles-pp.shmem_add_to_page_cache
45.77 +0.1 45.86 +0.5 46.29 +0.4 46.13 perf-profile.children.cycles-pp.__folio_batch_release
1.79 ± 2% +0.1 1.93 ± 2% +0.2 1.96 +0.1 1.88 perf-profile.children.cycles-pp.filemap_remove_folio
0.13 ± 5% +0.1 0.28 +0.1 0.19 ± 5% +0.1 0.24 ± 2% perf-profile.children.cycles-pp.file_modified
0.69 ± 5% +0.1 0.84 ± 3% +0.2 0.86 ± 5% +0.1 0.79 ± 2% perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
0.09 ± 7% +0.2 0.24 ± 2% +0.1 0.15 ± 3% +0.0 0.14 ± 4% perf-profile.children.cycles-pp.inode_needs_update_time
1.58 ± 3% +0.2 1.77 ± 2% +0.2 1.80 +0.1 1.72 perf-profile.children.cycles-pp.__filemap_remove_folio
0.15 ± 3% +0.4 0.50 ± 3% +0.4 0.52 ± 2% +0.4 0.52 ± 2% perf-profile.children.cycles-pp.__count_memcg_events
0.79 ± 4% +0.4 1.20 ± 3% +0.4 1.22 +0.3 1.12 perf-profile.children.cycles-pp.filemap_unaccount_folio
0.36 ± 5% +0.4 0.77 ± 4% +0.4 0.81 ± 2% +0.4 0.78 ± 2% perf-profile.children.cycles-pp.mem_cgroup_commit_charge
98.33 +0.5 98.78 +0.4 98.77 +0.4 98.77 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
97.74 +0.6 98.34 +0.6 98.32 +0.6 98.33 perf-profile.children.cycles-pp.do_syscall_64
41.62 +0.7 42.33 +0.1 41.76 +0.5 42.08 perf-profile.children.cycles-pp.folio_add_lru
43.91 +0.7 44.64 +0.2 44.09 +0.5 44.40 perf-profile.children.cycles-pp.folio_batch_move_lru
48.39 +0.8 49.15 +0.2 48.64 +0.5 48.89 perf-profile.children.cycles-pp.__x64_sys_fallocate
1.34 ± 5% +0.8 2.14 ± 4% +0.9 2.22 ± 4% +0.8 2.10 ± 2% perf-profile.children.cycles-pp.__mem_cgroup_charge
1.61 ± 4% +0.8 2.42 ± 2% +0.9 2.47 ± 2% +0.6 2.24 perf-profile.children.cycles-pp.__mod_lruvec_page_state
48.17 +0.8 48.98 +0.3 48.48 +0.6 48.72 perf-profile.children.cycles-pp.vfs_fallocate
47.94 +0.9 48.82 +0.4 48.32 +0.6 48.56 perf-profile.children.cycles-pp.shmem_fallocate
47.10 +0.9 48.04 +0.5 47.64 +0.7 47.83 perf-profile.children.cycles-pp.shmem_get_folio_gfp
84.34 +0.9 85.28 +0.8 85.11 +0.9 85.28 perf-profile.children.cycles-pp.folio_lruvec_lock_irqsave
84.31 +0.9 85.26 +0.8 85.08 +0.9 85.26 perf-profile.children.cycles-pp._raw_spin_lock_irqsave
84.24 +1.0 85.21 +0.8 85.04 +1.0 85.21 perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
46.65 +1.1 47.70 +0.7 47.30 +0.8 47.48 perf-profile.children.cycles-pp.shmem_alloc_and_add_folio
1.23 ± 4% +1.4 2.58 ± 2% +1.4 2.63 ± 2% +1.3 2.52 perf-profile.children.cycles-pp.__mod_memcg_lruvec_state
0.98 -0.3 0.73 ± 2% -0.2 0.74 -0.2 0.74 perf-profile.self.cycles-pp.syscall_return_via_sysret
0.88 -0.2 0.70 -0.2 0.70 -0.2 0.69 ± 2% perf-profile.self.cycles-pp.syscall_exit_to_user_mode
0.60 -0.2 0.45 -0.1 0.46 ± 2% -0.2 0.46 ± 3% perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.41 ± 3% -0.1 0.27 ± 3% -0.1 0.27 ± 2% -0.1 0.28 ± 2% perf-profile.self.cycles-pp.release_pages
0.41 ± 3% -0.1 0.29 ± 2% -0.1 0.28 ± 3% -0.1 0.29 ± 2% perf-profile.self.cycles-pp.folio_batch_move_lru
0.41 -0.1 0.30 ± 3% -0.1 0.30 ± 2% -0.1 0.32 ± 4% perf-profile.self.cycles-pp.xas_store
0.30 ± 3% -0.1 0.18 ± 5% -0.1 0.19 ± 2% -0.1 0.19 ± 2% perf-profile.self.cycles-pp.shmem_add_to_page_cache
0.38 ± 2% -0.1 0.27 ± 2% -0.1 0.27 ± 2% -0.1 0.27 perf-profile.self.cycles-pp.__entry_text_start
0.30 ± 3% -0.1 0.20 ± 6% -0.1 0.20 ± 5% -0.1 0.21 ± 2% perf-profile.self.cycles-pp.lru_add_fn
0.28 ± 2% -0.1 0.20 ± 5% -0.1 0.20 ± 2% -0.1 0.20 ± 3% perf-profile.self.cycles-pp.shmem_fallocate
0.26 ± 2% -0.1 0.18 ± 5% -0.1 0.18 ± 4% -0.1 0.19 ± 3% perf-profile.self.cycles-pp.__mod_node_page_state
0.27 ± 3% -0.1 0.20 ± 2% -0.1 0.20 ± 3% -0.1 0.20 ± 3% perf-profile.self.cycles-pp._raw_spin_lock
0.21 ± 2% -0.1 0.15 ± 4% -0.1 0.15 ± 4% -0.1 0.16 ± 2% perf-profile.self.cycles-pp.__alloc_pages
0.20 ± 2% -0.1 0.14 ± 3% -0.1 0.14 ± 2% -0.1 0.14 ± 5% perf-profile.self.cycles-pp.xas_descend
0.26 ± 3% -0.1 0.20 ± 4% -0.1 0.21 ± 3% -0.0 0.22 ± 4% perf-profile.self.cycles-pp.find_lock_entries
0.06 ± 6% -0.1 0.00 +0.0 0.06 ± 7% +0.1 0.13 ± 6% perf-profile.self.cycles-pp.percpu_counter_add_batch
0.18 ± 4% -0.0 0.13 ± 5% -0.0 0.13 ± 3% -0.0 0.14 ± 4% perf-profile.self.cycles-pp.xas_clear_mark
0.15 ± 7% -0.0 0.10 ± 11% -0.0 0.11 ± 8% -0.0 0.10 ± 6% perf-profile.self.cycles-pp.shmem_inode_acct_blocks
0.13 ± 4% -0.0 0.09 ± 5% -0.0 0.08 ± 5% -0.0 0.09 perf-profile.self.cycles-pp.free_unref_page_commit
0.13 -0.0 0.09 ± 5% -0.0 0.09 ± 5% -0.0 0.09 ± 6% perf-profile.self.cycles-pp._raw_spin_lock_irq
0.16 ± 4% -0.0 0.12 ± 4% -0.0 0.12 ± 3% -0.0 0.12 ± 4% perf-profile.self.cycles-pp.__dquot_alloc_space
0.16 ± 4% -0.0 0.12 ± 4% -0.0 0.11 ± 6% -0.0 0.11 perf-profile.self.cycles-pp.shmem_alloc_and_add_folio
0.13 ± 5% -0.0 0.09 ± 7% -0.0 0.09 -0.0 0.10 ± 7% perf-profile.self.cycles-pp.__filemap_remove_folio
0.13 ± 2% -0.0 0.09 ± 5% -0.0 0.09 ± 4% -0.0 0.09 ± 4% perf-profile.self.cycles-pp.get_page_from_freelist
0.06 ± 7% -0.0 0.02 ± 99% -0.0 0.02 ± 99% -0.0 0.02 ±141% perf-profile.self.cycles-pp.apparmor_file_permission
0.12 ± 4% -0.0 0.09 ± 5% -0.0 0.09 ± 5% -0.0 0.08 ± 8% perf-profile.self.cycles-pp.vfs_fallocate
0.13 ± 3% -0.0 0.10 ± 5% -0.0 0.10 ± 4% -0.0 0.10 ± 4% perf-profile.self.cycles-pp.fallocate64
0.11 ± 4% -0.0 0.07 -0.0 0.08 ± 6% -0.0 0.08 ± 6% perf-profile.self.cycles-pp.xas_start
0.07 ± 5% -0.0 0.03 ± 70% -0.0 0.04 ± 44% -0.1 0.02 ±141% perf-profile.self.cycles-pp.shmem_alloc_folio
0.14 ± 4% -0.0 0.10 ± 7% -0.0 0.10 ± 5% -0.0 0.10 ± 3% perf-profile.self.cycles-pp.__fget_light
0.10 ± 4% -0.0 0.06 ± 7% -0.0 0.06 ± 7% -0.0 0.06 ± 6% perf-profile.self.cycles-pp.rmqueue
0.10 ± 4% -0.0 0.07 ± 8% -0.0 0.07 ± 5% -0.0 0.07 ± 5% perf-profile.self.cycles-pp.alloc_pages_mpol
0.12 ± 3% -0.0 0.09 ± 4% -0.0 0.09 ± 4% -0.0 0.09 ± 4% perf-profile.self.cycles-pp.xas_load
0.11 ± 4% -0.0 0.08 ± 7% -0.0 0.08 ± 5% -0.0 0.08 ± 4% perf-profile.self.cycles-pp.folio_unlock
0.15 ± 2% -0.0 0.12 ± 5% -0.0 0.12 ± 4% -0.0 0.12 ± 4% perf-profile.self.cycles-pp.shmem_get_folio_gfp
0.10 -0.0 0.07 -0.0 0.08 ± 7% +0.0 0.10 ± 4% perf-profile.self.cycles-pp.cap_vm_enough_memory
0.16 ± 2% -0.0 0.13 ± 6% -0.0 0.14 -0.0 0.14 perf-profile.self.cycles-pp.page_counter_uncharge
0.12 ± 5% -0.0 0.09 ± 4% -0.0 0.09 ± 7% -0.0 0.09 ± 5% perf-profile.self.cycles-pp.__cond_resched
0.06 ± 6% -0.0 0.03 ± 70% -0.0 0.04 ± 44% -0.0 0.05 perf-profile.self.cycles-pp.filemap_free_folio
0.12 -0.0 0.09 ± 4% -0.0 0.09 ± 4% -0.0 0.09 perf-profile.self.cycles-pp.noop_dirty_folio
0.12 ± 3% -0.0 0.10 ± 5% -0.0 0.10 ± 7% -0.0 0.10 ± 5% perf-profile.self.cycles-pp.free_unref_page_list
0.10 ± 3% -0.0 0.07 ± 5% -0.0 0.07 ± 5% -0.0 0.08 ± 6% perf-profile.self.cycles-pp.filemap_remove_folio
0.10 ± 5% -0.0 0.07 ± 5% -0.0 0.07 -0.0 0.08 ± 4% perf-profile.self.cycles-pp.try_charge_memcg
0.12 ± 3% -0.0 0.10 ± 8% -0.0 0.10 -0.0 0.10 ± 4% perf-profile.self.cycles-pp.cgroup_rstat_updated
0.09 ± 4% -0.0 0.07 ± 7% -0.0 0.07 ± 5% -0.0 0.07 ± 7% perf-profile.self.cycles-pp.__folio_cancel_dirty
0.08 ± 4% -0.0 0.06 ± 8% -0.0 0.06 ± 6% -0.0 0.06 ± 8% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.08 ± 5% -0.0 0.06 -0.0 0.06 -0.0 0.06 perf-profile.self.cycles-pp._raw_spin_trylock
0.08 -0.0 0.06 ± 6% -0.0 0.06 ± 8% -0.0 0.06 perf-profile.self.cycles-pp.folio_add_lru
0.07 ± 5% -0.0 0.05 -0.0 0.05 -0.0 0.04 ± 44% perf-profile.self.cycles-pp.xas_find_conflict
0.08 ± 8% -0.0 0.06 ± 6% -0.0 0.06 ± 6% -0.0 0.06 ± 7% perf-profile.self.cycles-pp.__mod_lruvec_state
0.56 ± 6% -0.0 0.54 ± 9% -0.0 0.55 ± 5% -0.2 0.40 ± 3% perf-profile.self.cycles-pp.__mod_lruvec_page_state
0.08 ± 10% -0.0 0.06 ± 9% -0.0 0.06 -0.0 0.06 perf-profile.self.cycles-pp.truncate_cleanup_folio
0.07 ± 10% -0.0 0.05 -0.0 0.05 ± 7% -0.0 0.05 ± 8% perf-profile.self.cycles-pp.xas_init_marks
0.08 ± 4% -0.0 0.06 ± 7% +0.0 0.08 ± 4% -0.0 0.07 ± 10% perf-profile.self.cycles-pp.__percpu_counter_limited_add
0.07 ± 7% -0.0 0.05 -0.0 0.05 ± 7% -0.0 0.06 ± 9% perf-profile.self.cycles-pp.get_pfnblock_flags_mask
0.07 ± 5% -0.0 0.06 ± 8% -0.0 0.06 ± 6% -0.0 0.05 ± 7% perf-profile.self.cycles-pp.__list_add_valid_or_report
0.07 ± 5% -0.0 0.06 ± 9% -0.0 0.06 ± 7% -0.0 0.06 perf-profile.self.cycles-pp.mem_cgroup_update_lru_size
0.08 ± 4% -0.0 0.07 ± 5% -0.0 0.06 -0.0 0.06 ± 6% perf-profile.self.cycles-pp.filemap_get_entry
0.00 +0.0 0.00 +0.0 0.00 +0.1 0.08 ± 8% perf-profile.self.cycles-pp.__file_remove_privs
0.14 ± 2% +0.0 0.16 ± 6% +0.0 0.17 ± 3% +0.0 0.16 perf-profile.self.cycles-pp.uncharge_folio
0.02 ±141% +0.0 0.06 ± 8% +0.0 0.06 +0.0 0.06 ± 9% perf-profile.self.cycles-pp.uncharge_batch
0.21 ± 9% +0.1 0.31 ± 7% +0.1 0.32 ± 5% +0.1 0.30 ± 4% perf-profile.self.cycles-pp.mem_cgroup_commit_charge
0.69 ± 5% +0.1 0.83 ± 4% +0.2 0.86 ± 5% +0.1 0.79 ± 2% perf-profile.self.cycles-pp.get_mem_cgroup_from_mm
0.06 ± 6% +0.2 0.22 ± 2% +0.1 0.13 ± 5% +0.1 0.11 ± 4% perf-profile.self.cycles-pp.inode_needs_update_time
0.14 ± 8% +0.3 0.42 ± 7% +0.3 0.44 ± 6% +0.3 0.40 ± 3% perf-profile.self.cycles-pp.__mem_cgroup_charge
0.13 ± 7% +0.4 0.49 ± 3% +0.4 0.51 ± 2% +0.4 0.51 ± 2% perf-profile.self.cycles-pp.__count_memcg_events
84.24 +1.0 85.21 +0.8 85.04 +1.0 85.21 perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
1.12 ± 5% +1.4 2.50 ± 2% +1.4 2.55 ± 2% +1.3 2.43 perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
[2]
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-8.3/thread/50%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale
130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 c5f50d8b23c7982ac875791755b ac6a9444dec85dc50c6bfbc4ee7
---------------- --------------------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev %change %stddev
\ | \ | \ | \
10544810 ± 11% +1.7% 10720938 ± 4% +1.7% 10719232 ± 4% +24.8% 13160448 meminfo.DirectMap2M
1.87 -0.4 1.43 ± 3% -0.4 1.47 ± 2% -0.4 1.46 mpstat.cpu.all.usr%
3171 -5.3% 3003 ± 2% +17.4% 3725 ± 30% +2.6% 3255 ± 5% vmstat.system.cs
93.97 ±130% +360.8% 433.04 ± 83% +5204.4% 4984 ±150% +1540.1% 1541 ± 56% boot-time.boot
6762 ±101% +96.3% 13275 ± 75% +3212.0% 223971 ±150% +752.6% 57655 ± 60% boot-time.idle
84.83 ± 9% +55.8% 132.17 ± 16% +75.6% 149.00 ± 11% +98.0% 168.00 ± 6% perf-c2c.DRAM.local
484.17 ± 3% +37.1% 663.67 ± 10% +44.1% 697.67 ± 7% -0.2% 483.00 ± 5% perf-c2c.DRAM.remote
72763 ± 5% +14.4% 83212 ± 12% +141.5% 175744 ± 83% +55.7% 113321 ± 21% turbostat.C1
0.08 -25.0% 0.06 -27.1% 0.06 ± 6% -25.0% 0.06 turbostat.IPC
27.90 +4.6% 29.18 +4.9% 29.27 +3.9% 29.00 turbostat.RAMWatt
3982212 -30.0% 2785941 -28.9% 2829631 -26.7% 2919929 will-it-scale.52.threads
76580 -30.0% 53575 -28.9% 54415 -26.7% 56152 will-it-scale.per_thread_ops
3982212 -30.0% 2785941 -28.9% 2829631 -26.7% 2919929 will-it-scale.workload
1.175e+09 ± 2% -28.6% 8.392e+08 ± 2% -28.2% 8.433e+08 ± 2% -25.4% 8.762e+08 numa-numastat.node0.local_node
1.175e+09 ± 2% -28.6% 8.394e+08 ± 2% -28.3% 8.434e+08 ± 2% -25.4% 8.766e+08 numa-numastat.node0.numa_hit
1.231e+09 ± 2% -31.3% 8.463e+08 ± 3% -29.5% 8.683e+08 ± 3% -27.7% 8.901e+08 numa-numastat.node1.local_node
1.232e+09 ± 2% -31.3% 8.466e+08 ± 3% -29.5% 8.688e+08 ± 3% -27.7% 8.907e+08 numa-numastat.node1.numa_hit
2.408e+09 -30.0% 1.686e+09 -28.9% 1.712e+09 -26.6% 1.767e+09 proc-vmstat.numa_hit
2.406e+09 -30.0% 1.685e+09 -28.9% 1.712e+09 -26.6% 1.766e+09 proc-vmstat.numa_local
2.404e+09 -29.9% 1.684e+09 -28.8% 1.71e+09 -26.6% 1.765e+09 proc-vmstat.pgalloc_normal
2.404e+09 -29.9% 1.684e+09 -28.8% 1.71e+09 -26.6% 1.765e+09 proc-vmstat.pgfree
2302080 -0.9% 2280448 -0.5% 2290432 -1.2% 2274688 proc-vmstat.unevictable_pgs_scanned
83444 ± 71% +34.2% 111978 ± 65% -9.1% 75877 ± 86% -76.2% 19883 ± 12% numa-meminfo.node0.AnonHugePages
150484 ± 55% +9.3% 164434 ± 46% -9.3% 136435 ± 53% -62.4% 56548 ± 18% numa-meminfo.node0.AnonPages
167427 ± 50% +8.2% 181159 ± 41% -8.3% 153613 ± 47% -56.1% 73487 ± 14% numa-meminfo.node0.Inactive
166720 ± 50% +8.7% 181159 ± 41% -8.3% 152902 ± 48% -56.6% 72379 ± 14% numa-meminfo.node0.Inactive(anon)
111067 ± 62% -13.7% 95819 ± 59% +14.6% 127294 ± 60% +86.1% 206693 ± 8% numa-meminfo.node1.AnonHugePages
179594 ± 47% -4.2% 172027 ± 43% +9.3% 196294 ± 39% +55.8% 279767 ± 3% numa-meminfo.node1.AnonPages
257406 ± 30% -2.1% 251990 ± 32% +9.9% 282766 ± 26% +42.2% 366131 ± 8% numa-meminfo.node1.AnonPages.max
196741 ± 43% -3.6% 189753 ± 39% +8.1% 212645 ± 36% +50.9% 296827 ± 3% numa-meminfo.node1.Inactive
196385 ± 43% -3.9% 188693 ± 39% +8.1% 212288 ± 36% +51.1% 296827 ± 3% numa-meminfo.node1.Inactive(anon)
37621 ± 55% +9.3% 41115 ± 46% -9.3% 34116 ± 53% -62.4% 14141 ± 18% numa-vmstat.node0.nr_anon_pages
41664 ± 50% +8.6% 45233 ± 41% -8.2% 38240 ± 47% -56.6% 18079 ± 14% numa-vmstat.node0.nr_inactive_anon
41677 ± 50% +8.6% 45246 ± 41% -8.2% 38250 ± 47% -56.6% 18092 ± 14% numa-vmstat.node0.nr_zone_inactive_anon
1.175e+09 ± 2% -28.6% 8.394e+08 ± 2% -28.3% 8.434e+08 ± 2% -25.4% 8.766e+08 numa-vmstat.node0.numa_hit
1.175e+09 ± 2% -28.6% 8.392e+08 ± 2% -28.2% 8.433e+08 ± 2% -25.4% 8.762e+08 numa-vmstat.node0.numa_local
44903 ± 47% -4.2% 43015 ± 43% +9.3% 49079 ± 39% +55.8% 69957 ± 3% numa-vmstat.node1.nr_anon_pages
49030 ± 43% -3.9% 47139 ± 39% +8.3% 53095 ± 36% +51.4% 74210 ± 3% numa-vmstat.node1.nr_inactive_anon
49035 ± 43% -3.9% 47135 ± 39% +8.3% 53098 ± 36% +51.3% 74212 ± 3% numa-vmstat.node1.nr_zone_inactive_anon
1.232e+09 ± 2% -31.3% 8.466e+08 ± 3% -29.5% 8.688e+08 ± 3% -27.7% 8.907e+08 numa-vmstat.node1.numa_hit
1.231e+09 ± 2% -31.3% 8.463e+08 ± 3% -29.5% 8.683e+08 ± 3% -27.7% 8.901e+08 numa-vmstat.node1.numa_local
5256095 ± 59% +557.5% 34561019 ± 89% +4549.1% 2.444e+08 ±146% +1646.7% 91810708 ± 50% sched_debug.cfs_rq:/.avg_vruntime.avg
8288083 ± 52% +365.0% 38543329 ± 81% +3020.3% 2.586e+08 ±145% +1133.9% 1.023e+08 ± 49% sched_debug.cfs_rq:/.avg_vruntime.max
1364475 ± 40% +26.7% 1728262 ± 29% +346.8% 6096205 ±118% +180.4% 3826288 ± 41% sched_debug.cfs_rq:/.avg_vruntime.stddev
161.62 ± 99% -42.4% 93.09 ±144% -57.3% 69.01 ± 74% -86.6% 21.73 ± 10% sched_debug.cfs_rq:/.load_avg.avg
902.70 ±107% -46.8% 480.28 ±171% -57.3% 385.28 ±120% -94.8% 47.03 ± 8% sched_debug.cfs_rq:/.load_avg.stddev
5256095 ± 59% +557.5% 34561019 ± 89% +4549.1% 2.444e+08 ±146% +1646.7% 91810708 ± 50% sched_debug.cfs_rq:/.min_vruntime.avg
8288083 ± 52% +365.0% 38543329 ± 81% +3020.3% 2.586e+08 ±145% +1133.9% 1.023e+08 ± 49% sched_debug.cfs_rq:/.min_vruntime.max
1364475 ± 40% +26.7% 1728262 ± 29% +346.8% 6096205 ±118% +180.4% 3826288 ± 41% sched_debug.cfs_rq:/.min_vruntime.stddev
31.84 ±161% -71.8% 8.98 ± 44% -84.0% 5.10 ± 43% -79.0% 6.68 ± 24% sched_debug.cfs_rq:/.removed.load_avg.avg
272.14 ±192% -84.9% 41.10 ± 29% -89.7% 28.08 ± 21% -87.8% 33.19 ± 12% sched_debug.cfs_rq:/.removed.load_avg.stddev
334.70 ± 17% +32.4% 443.13 ± 19% +34.3% 449.66 ± 11% +14.6% 383.66 ± 24% sched_debug.cfs_rq:/.util_est_enqueued.avg
322.95 ± 23% +12.5% 363.30 ± 19% +27.9% 412.92 ± 6% +11.2% 359.17 ± 18% sched_debug.cfs_rq:/.util_est_enqueued.stddev
240924 ± 52% +136.5% 569868 ± 62% +2031.9% 5136297 ±145% +600.7% 1688103 ± 51% sched_debug.cpu.clock.avg
240930 ± 52% +136.5% 569874 ± 62% +2031.9% 5136304 ±145% +600.7% 1688109 ± 51% sched_debug.cpu.clock.max
240917 ± 52% +136.5% 569861 ± 62% +2032.0% 5136290 ±145% +600.7% 1688095 ± 51% sched_debug.cpu.clock.min
239307 ± 52% +136.6% 566140 ± 62% +2009.9% 5049095 ±145% +600.7% 1676912 ± 51% sched_debug.cpu.clock_task.avg
239479 ± 52% +136.5% 566334 ± 62% +2014.9% 5064818 ±145% +600.4% 1677208 ± 51% sched_debug.cpu.clock_task.max
232462 ± 53% +140.6% 559281 ± 63% +2064.0% 5030381 ±146% +617.9% 1668793 ± 52% sched_debug.cpu.clock_task.min
683.22 ± 3% +0.7% 688.14 ± 4% +1762.4% 12724 ±138% +19.2% 814.55 ± 8% sched_debug.cpu.clock_task.stddev
3267 ± 57% +146.0% 8040 ± 63% +2127.2% 72784 ±146% +652.5% 24591 ± 52% sched_debug.cpu.curr->pid.avg
10463 ± 39% +101.0% 21030 ± 54% +1450.9% 162275 ±143% +448.5% 57391 ± 49% sched_debug.cpu.curr->pid.max
3373 ± 57% +149.1% 8403 ± 64% +2141.6% 75621 ±146% +657.7% 25561 ± 52% sched_debug.cpu.curr->pid.stddev
58697 ± 14% +1.6% 59612 ± 7% +1.9e+05% 1.142e+08 ±156% +105.4% 120565 ± 32% sched_debug.cpu.nr_switches.max
6023 ± 10% +13.6% 6843 ± 11% +2.9e+05% 17701514 ±151% +124.8% 13541 ± 32% sched_debug.cpu.nr_switches.stddev
240917 ± 52% +136.5% 569862 ± 62% +2032.0% 5136291 ±145% +600.7% 1688096 ± 51% sched_debug.cpu_clk
240346 ± 52% +136.9% 569288 ± 62% +2036.8% 5135723 ±145% +602.1% 1687529 ± 51% sched_debug.ktime
241481 ± 51% +136.2% 570443 ± 62% +2027.2% 5136856 ±145% +599.3% 1688672 ± 51% sched_debug.sched_clk
0.04 ± 9% -19.3% 0.03 ± 6% -19.7% 0.03 ± 6% -14.3% 0.03 ± 8% perf-sched.wait_and_delay.avg.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
0.04 ± 11% -18.0% 0.03 ± 13% -22.8% 0.03 ± 10% -14.0% 0.04 ± 15% perf-sched.wait_and_delay.avg.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.04 ± 8% -22.3% 0.03 ± 5% -19.4% 0.03 ± 3% -12.6% 0.04 ± 9% perf-sched.wait_and_delay.avg.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
0.91 ± 2% +11.3% 1.01 ± 5% +65.3% 1.51 ± 53% +28.8% 1.17 ± 11% perf-sched.wait_and_delay.avg.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
0.04 ± 13% -90.3% 0.00 ±223% -66.4% 0.01 ±101% -83.8% 0.01 ±223% perf-sched.wait_and_delay.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
24.11 ± 3% -8.5% 22.08 ± 11% -25.2% 18.04 ± 50% -29.5% 17.01 ± 21% perf-sched.wait_and_delay.avg.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
1.14 +15.1% 1.31 -24.1% 0.86 ± 70% +13.7% 1.29 perf-sched.wait_and_delay.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
189.94 ± 3% +18.3% 224.73 ± 4% +20.3% 228.52 ± 3% +22.1% 231.82 ± 3% perf-sched.wait_and_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
1652 ± 4% -13.4% 1431 ± 4% -13.4% 1431 ± 2% -14.3% 1416 ± 6% perf-sched.wait_and_delay.count.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
1628 ± 8% -15.0% 1383 ± 9% -16.6% 1357 ± 2% -16.6% 1358 ± 7% perf-sched.wait_and_delay.count.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
83.67 ± 7% -87.6% 10.33 ±223% -59.2% 34.17 ±100% -85.5% 12.17 ±223% perf-sched.wait_and_delay.count.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
2835 ± 3% +10.6% 3135 ± 10% +123.8% 6345 ± 80% +48.4% 4207 ± 19% perf-sched.wait_and_delay.count.pipe_read.vfs_read.ksys_read.do_syscall_64
3827 ± 4% -13.0% 3328 ± 3% -12.9% 3335 ± 2% -14.7% 3264 ± 2% perf-sched.wait_and_delay.count.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
1.71 ±165% -83.4% 0.28 ± 21% -82.3% 0.30 ± 16% -74.6% 0.43 ± 60% perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
0.43 ± 17% -43.8% 0.24 ± 26% -44.4% 0.24 ± 27% -32.9% 0.29 ± 23% perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.46 ± 17% -36.7% 0.29 ± 12% -35.7% 0.30 ± 19% -35.3% 0.30 ± 21% perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
45.41 ± 4% +13.4% 51.51 ± 12% +148.6% 112.88 ± 86% +56.7% 71.18 ± 21% perf-sched.wait_and_delay.max.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
0.30 ± 34% -90.7% 0.03 ±223% -66.0% 0.10 ±110% -88.2% 0.04 ±223% perf-sched.wait_and_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
2.39 +10.7% 2.65 ± 2% -24.3% 1.81 ± 70% +12.1% 2.68 ± 2% perf-sched.wait_and_delay.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
0.04 ± 9% -19.3% 0.03 ± 6% -19.7% 0.03 ± 6% -14.3% 0.03 ± 8% perf-sched.wait_time.avg.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
0.04 ± 11% -18.0% 0.03 ± 13% -22.8% 0.03 ± 10% -14.0% 0.04 ± 15% perf-sched.wait_time.avg.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.04 ± 8% -22.3% 0.03 ± 5% -19.4% 0.03 ± 3% -12.6% 0.04 ± 9% perf-sched.wait_time.avg.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
0.04 ± 11% -33.1% 0.03 ± 17% -32.3% 0.03 ± 22% -16.3% 0.04 ± 12% perf-sched.wait_time.avg.ms.__cond_resched.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.90 ± 2% +11.5% 1.00 ± 5% +66.1% 1.50 ± 53% +29.2% 1.16 ± 11% perf-sched.wait_time.avg.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
0.04 ± 13% -26.6% 0.03 ± 12% -33.6% 0.03 ± 11% -18.1% 0.04 ± 16% perf-sched.wait_time.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
24.05 ± 3% -9.0% 21.90 ± 10% -25.0% 18.04 ± 50% -29.4% 16.97 ± 21% perf-sched.wait_time.avg.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
1.13 +15.2% 1.30 +15.0% 1.30 +13.7% 1.29 perf-sched.wait_time.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
189.93 ± 3% +18.3% 224.72 ± 4% +20.3% 228.50 ± 3% +22.1% 231.81 ± 3% perf-sched.wait_time.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
1.71 ±165% -83.4% 0.28 ± 21% -82.3% 0.30 ± 16% -74.6% 0.43 ± 60% perf-sched.wait_time.max.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
0.43 ± 17% -43.8% 0.24 ± 26% -44.4% 0.24 ± 27% -32.9% 0.29 ± 23% perf-sched.wait_time.max.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.46 ± 17% -36.7% 0.29 ± 12% -35.7% 0.30 ± 19% -35.3% 0.30 ± 21% perf-sched.wait_time.max.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
0.31 ± 26% -42.1% 0.18 ± 58% -64.1% 0.11 ± 40% -28.5% 0.22 ± 30% perf-sched.wait_time.max.ms.__cond_resched.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe
45.41 ± 4% +13.4% 51.50 ± 12% +148.6% 112.87 ± 86% +56.8% 71.18 ± 21% perf-sched.wait_time.max.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
2.39 +10.7% 2.64 ± 2% +12.9% 2.69 ± 2% +12.1% 2.68 ± 2% perf-sched.wait_time.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
0.75 +142.0% 1.83 ± 2% +146.9% 1.86 +124.8% 1.70 perf-stat.i.MPKI
8.47e+09 -24.4% 6.407e+09 -23.2% 6.503e+09 -21.2% 6.674e+09 perf-stat.i.branch-instructions
0.66 -0.0 0.63 -0.0 0.64 -0.0 0.63 perf-stat.i.branch-miss-rate%
56364992 -28.3% 40421603 ± 3% -26.0% 41734061 ± 2% -25.8% 41829975 perf-stat.i.branch-misses
14.64 +6.7 21.30 +6.9 21.54 +6.5 21.10 perf-stat.i.cache-miss-rate%
30868184 +81.3% 55977240 ± 3% +87.7% 57950237 +76.2% 54404466 perf-stat.i.cache-misses
2.107e+08 +24.7% 2.627e+08 ± 2% +27.6% 2.69e+08 +22.3% 2.578e+08 perf-stat.i.cache-references
3106 -5.5% 2934 ± 2% +16.4% 3615 ± 29% +2.4% 3181 ± 5% perf-stat.i.context-switches
3.55 +33.4% 4.74 +31.5% 4.67 +27.4% 4.52 perf-stat.i.cpi
4722 -44.8% 2605 ± 3% -46.7% 2515 -43.3% 2675 perf-stat.i.cycles-between-cache-misses
0.04 -0.0 0.04 -0.0 0.04 -0.0 0.04 perf-stat.i.dTLB-load-miss-rate%
4117232 -29.1% 2917107 -28.1% 2961876 -25.8% 3056956 perf-stat.i.dTLB-load-misses
1.051e+10 -24.1% 7.979e+09 -23.0% 8.1e+09 -19.7% 8.44e+09 perf-stat.i.dTLB-loads
0.00 ± 3% +0.0 0.00 ± 6% +0.0 0.00 ± 5% +0.0 0.00 ± 4% perf-stat.i.dTLB-store-miss-rate%
5.886e+09 -27.5% 4.269e+09 -26.3% 4.34e+09 -24.1% 4.467e+09 perf-stat.i.dTLB-stores
78.16 -6.6 71.51 -6.4 71.75 -5.9 72.23 perf-stat.i.iTLB-load-miss-rate%
4131074 ± 3% -30.0% 2891515 -29.2% 2922789 -26.2% 3048227 perf-stat.i.iTLB-load-misses
4.098e+10 -25.0% 3.072e+10 -23.9% 3.119e+10 -21.6% 3.214e+10 perf-stat.i.instructions
9929 ± 2% +7.0% 10627 +7.5% 10673 +6.2% 10547 perf-stat.i.instructions-per-iTLB-miss
0.28 -25.0% 0.21 -23.9% 0.21 -21.5% 0.22 perf-stat.i.ipc
63.49 +43.8% 91.27 ± 3% +48.2% 94.07 +38.6% 87.97 perf-stat.i.metric.K/sec
241.12 -24.6% 181.87 -23.4% 184.70 -20.9% 190.75 perf-stat.i.metric.M/sec
90.84 -0.4 90.49 -0.9 89.98 -2.9 87.93 perf-stat.i.node-load-miss-rate%
3735316 +78.6% 6669641 ± 3% +83.1% 6839047 +62.4% 6067727 perf-stat.i.node-load-misses
377465 ± 4% +86.1% 702512 ± 11% +101.7% 761510 ± 4% +120.8% 833359 perf-stat.i.node-loads
1322217 -27.6% 957081 ± 5% -22.9% 1019779 ± 2% -19.4% 1066178 perf-stat.i.node-store-misses
37459 ± 3% -23.0% 28826 ± 5% -19.2% 30253 ± 6% -23.4% 28682 ± 3% perf-stat.i.node-stores
0.75 +141.8% 1.82 ± 2% +146.6% 1.86 +124.7% 1.69 perf-stat.overall.MPKI
0.67 -0.0 0.63 -0.0 0.64 -0.0 0.63 perf-stat.overall.branch-miss-rate%
14.65 +6.7 21.30 +6.9 21.54 +6.5 21.11 perf-stat.overall.cache-miss-rate%
3.55 +33.4% 4.73 +31.4% 4.66 +27.4% 4.52 perf-stat.overall.cpi
4713 -44.8% 2601 ± 3% -46.7% 2511 -43.3% 2671 perf-stat.overall.cycles-between-cache-misses
0.04 -0.0 0.04 -0.0 0.04 -0.0 0.04 perf-stat.overall.dTLB-load-miss-rate%
0.00 ± 3% +0.0 0.00 ± 5% +0.0 0.00 ± 5% +0.0 0.00 perf-stat.overall.dTLB-store-miss-rate%
78.14 -6.7 71.47 -6.4 71.70 -5.9 72.20 perf-stat.overall.iTLB-load-miss-rate%
9927 ± 2% +7.0% 10624 +7.5% 10672 +6.2% 10547 perf-stat.overall.instructions-per-iTLB-miss
0.28 -25.0% 0.21 -23.9% 0.21 -21.5% 0.22 perf-stat.overall.ipc
90.82 -0.3 90.49 -0.8 89.98 -2.9 87.92 perf-stat.overall.node-load-miss-rate%
3098901 +7.1% 3318983 +6.9% 3313112 +7.0% 3316044 perf-stat.overall.path-length
8.441e+09 -24.4% 6.385e+09 -23.2% 6.48e+09 -21.2% 6.652e+09 perf-stat.ps.branch-instructions
56179581 -28.3% 40286337 ± 3% -26.0% 41593521 ± 2% -25.8% 41687151 perf-stat.ps.branch-misses
30759982 +81.3% 55777812 ± 3% +87.7% 57746279 +76.3% 54217757 perf-stat.ps.cache-misses
2.1e+08 +24.6% 2.618e+08 ± 2% +27.6% 2.68e+08 +22.3% 2.569e+08 perf-stat.ps.cache-references
3095 -5.5% 2923 ± 2% +16.2% 3597 ± 29% +2.3% 3167 ± 5% perf-stat.ps.context-switches
135.89 -0.8% 134.84 -0.7% 134.99 -1.0% 134.55 perf-stat.ps.cpu-migrations
4103292 -29.1% 2907270 -28.1% 2951746 -25.7% 3046739 perf-stat.ps.dTLB-load-misses
1.048e+10 -24.1% 7.952e+09 -23.0% 8.072e+09 -19.7% 8.412e+09 perf-stat.ps.dTLB-loads
5.866e+09 -27.5% 4.255e+09 -26.3% 4.325e+09 -24.1% 4.452e+09 perf-stat.ps.dTLB-stores
4117020 ± 3% -30.0% 2881750 -29.3% 2912744 -26.2% 3037970 perf-stat.ps.iTLB-load-misses
4.084e+10 -25.0% 3.062e+10 -23.9% 3.109e+10 -21.6% 3.203e+10 perf-stat.ps.instructions
3722149 +78.5% 6645867 ± 3% +83.1% 6814976 +62.5% 6046854 perf-stat.ps.node-load-misses
376240 ± 4% +86.1% 700053 ± 11% +101.7% 758898 ± 4% +120.8% 830575 perf-stat.ps.node-loads
1317772 -27.6% 953773 ± 5% -22.9% 1016183 ± 2% -19.4% 1062457 perf-stat.ps.node-store-misses
37408 ± 3% -23.2% 28748 ± 5% -19.3% 30192 ± 6% -23.5% 28607 ± 3% perf-stat.ps.node-stores
1.234e+13 -25.1% 9.246e+12 -24.0% 9.375e+12 -21.5% 9.683e+12 perf-stat.total.instructions
1.28 -0.4 0.90 ± 2% -0.4 0.91 -0.3 0.94 ± 2% perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.fallocate64
1.26 ± 2% -0.4 0.90 ± 3% -0.3 0.92 ± 2% -0.3 0.94 ± 2% perf-profile.calltrace.cycles-pp.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
1.08 ± 2% -0.3 0.77 ± 3% -0.3 0.79 ± 2% -0.3 0.81 ± 2% perf-profile.calltrace.cycles-pp.alloc_pages_mpol.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.92 ± 2% -0.3 0.62 ± 3% -0.3 0.63 -0.3 0.66 ± 2% perf-profile.calltrace.cycles-pp.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
0.84 ± 3% -0.2 0.61 ± 3% -0.2 0.63 ± 2% -0.2 0.65 ± 2% perf-profile.calltrace.cycles-pp.__alloc_pages.alloc_pages_mpol.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp
29.27 -0.2 29.09 -1.0 28.32 -0.2 29.04 perf-profile.calltrace.cycles-pp.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
1.26 -0.2 1.08 -0.2 1.07 -0.2 1.10 perf-profile.calltrace.cycles-pp.folio_batch_move_lru.lru_add_drain_cpu.__folio_batch_release.shmem_undo_range.shmem_setattr
1.26 -0.2 1.08 -0.2 1.07 -0.2 1.10 perf-profile.calltrace.cycles-pp.lru_add_drain_cpu.__folio_batch_release.shmem_undo_range.shmem_setattr.notify_change
1.24 -0.2 1.06 -0.2 1.05 -0.2 1.08 perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu.__folio_batch_release.shmem_undo_range
1.23 -0.2 1.06 -0.2 1.05 -0.2 1.08 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu
1.24 -0.2 1.06 -0.2 1.05 -0.2 1.08 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu.__folio_batch_release
29.15 -0.2 28.99 -0.9 28.23 -0.2 28.94 perf-profile.calltrace.cycles-pp.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
1.20 -0.2 1.04 ± 2% -0.2 1.05 -0.2 1.02 ± 2% perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
27.34 -0.1 27.22 ± 2% -0.9 26.49 -0.1 27.20 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio
27.36 -0.1 27.24 ± 2% -0.9 26.51 -0.1 27.22 perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp
27.28 -0.1 27.17 ± 2% -0.8 26.44 -0.1 27.16 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru
25.74 -0.1 25.67 ± 2% +0.2 25.98 +0.9 26.62 perf-profile.calltrace.cycles-pp.release_pages.__folio_batch_release.shmem_undo_range.shmem_setattr.notify_change
23.43 +0.0 23.43 ± 2% +0.3 23.70 +0.9 24.34 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.__folio_batch_release.shmem_undo_range
23.45 +0.0 23.45 ± 2% +0.3 23.73 +0.9 24.35 perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.release_pages.__folio_batch_release.shmem_undo_range.shmem_setattr
23.37 +0.0 23.39 ± 2% +0.3 23.67 +0.9 24.30 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.__folio_batch_release
0.68 ± 3% +0.0 0.72 ± 4% +0.1 0.73 ± 3% +0.1 0.74 perf-profile.calltrace.cycles-pp.__mem_cgroup_uncharge_list.release_pages.__folio_batch_release.shmem_undo_range.shmem_setattr
1.08 +0.1 1.20 +0.1 1.17 +0.1 1.15 ± 2% perf-profile.calltrace.cycles-pp.lru_add_fn.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp
2.91 +0.3 3.18 ± 2% +0.3 3.23 +0.1 3.02 perf-profile.calltrace.cycles-pp.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change.do_truncate
2.56 +0.4 2.92 ± 2% +0.4 2.98 +0.2 2.75 perf-profile.calltrace.cycles-pp.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change
1.36 ± 3% +0.4 1.76 ± 9% +0.4 1.75 ± 5% +0.3 1.68 ± 3% perf-profile.calltrace.cycles-pp.get_mem_cgroup_from_mm.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
2.22 +0.5 2.68 ± 2% +0.5 2.73 +0.3 2.50 perf-profile.calltrace.cycles-pp.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr
0.00 +0.6 0.60 ± 2% +0.6 0.61 ± 2% +0.6 0.61 perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.release_pages.__folio_batch_release.shmem_undo_range.shmem_setattr
2.33 +0.6 2.94 +0.6 2.96 ± 3% +0.3 2.59 perf-profile.calltrace.cycles-pp.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
0.00 +0.7 0.72 ± 2% +0.7 0.72 ± 2% +0.7 0.68 ± 2% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.lru_add_fn.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio
0.69 ± 4% +0.8 1.47 ± 3% +0.8 1.48 ± 2% +0.7 1.42 perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio
1.24 ± 2% +0.8 2.04 ± 2% +0.8 2.07 ± 2% +0.6 1.82 perf-profile.calltrace.cycles-pp.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range
0.00 +0.8 0.82 ± 4% +0.8 0.85 ± 3% +0.8 0.78 ± 2% perf-profile.calltrace.cycles-pp.__count_memcg_events.mem_cgroup_commit_charge.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp
1.17 ± 2% +0.8 2.00 ± 2% +0.9 2.04 ± 2% +0.6 1.77 perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio
0.59 ± 4% +0.9 1.53 +0.9 1.53 ± 4% +0.8 1.37 ± 2% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp
1.38 +1.0 2.33 ± 2% +1.0 2.34 ± 3% +0.6 1.94 ± 2% perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.62 ± 3% +1.0 1.66 ± 5% +1.1 1.68 ± 4% +1.0 1.57 ± 2% perf-profile.calltrace.cycles-pp.mem_cgroup_commit_charge.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
38.70 +1.2 39.90 +0.5 39.23 +0.7 39.45 perf-profile.calltrace.cycles-pp.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
38.34 +1.3 39.65 +0.6 38.97 +0.9 39.20 perf-profile.calltrace.cycles-pp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe
37.24 +1.6 38.86 +0.9 38.17 +1.1 38.35 perf-profile.calltrace.cycles-pp.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
36.64 +1.8 38.40 +1.1 37.72 +1.2 37.88 perf-profile.calltrace.cycles-pp.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate
2.47 ± 2% +2.1 4.59 ± 8% +2.1 4.61 ± 5% +1.9 4.37 ± 2% perf-profile.calltrace.cycles-pp.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
1.30 -0.4 0.92 ± 2% -0.4 0.93 -0.4 0.96 perf-profile.children.cycles-pp.syscall_return_via_sysret
1.28 ± 2% -0.4 0.90 ± 3% -0.3 0.93 ± 2% -0.3 0.95 ± 2% perf-profile.children.cycles-pp.shmem_alloc_folio
30.44 -0.3 30.11 -1.1 29.33 -0.4 30.07 perf-profile.children.cycles-pp.folio_batch_move_lru
1.10 ± 2% -0.3 0.78 ± 3% -0.3 0.81 ± 2% -0.3 0.82 ± 2% perf-profile.children.cycles-pp.alloc_pages_mpol
0.96 ± 2% -0.3 0.64 ± 3% -0.3 0.65 -0.3 0.68 ± 2% perf-profile.children.cycles-pp.shmem_inode_acct_blocks
0.88 -0.3 0.58 ± 2% -0.3 0.60 ± 2% -0.3 0.62 ± 2% perf-profile.children.cycles-pp.xas_store
0.88 ± 3% -0.2 0.64 ± 3% -0.2 0.66 ± 2% -0.2 0.67 ± 2% perf-profile.children.cycles-pp.__alloc_pages
29.29 -0.2 29.10 -1.0 28.33 -0.2 29.06 perf-profile.children.cycles-pp.folio_add_lru
0.61 ± 2% -0.2 0.43 ± 3% -0.2 0.44 ± 2% -0.2 0.45 ± 3% perf-profile.children.cycles-pp.__entry_text_start
1.26 -0.2 1.09 -0.2 1.08 -0.2 1.10 perf-profile.children.cycles-pp.lru_add_drain_cpu
0.56 -0.2 0.39 ± 4% -0.2 0.40 ± 3% -0.2 0.40 ± 3% perf-profile.children.cycles-pp.free_unref_page_list
1.22 -0.2 1.06 ± 2% -0.2 1.06 -0.2 1.04 ± 2% perf-profile.children.cycles-pp.syscall_exit_to_user_mode
0.46 -0.1 0.32 ± 3% -0.1 0.32 -0.1 0.32 ± 3% perf-profile.children.cycles-pp.__mod_lruvec_state
0.41 ± 3% -0.1 0.28 ± 4% -0.1 0.28 ± 3% -0.1 0.29 ± 2% perf-profile.children.cycles-pp.xas_load
0.44 ± 4% -0.1 0.31 ± 4% -0.1 0.32 ± 2% -0.1 0.34 ± 3% perf-profile.children.cycles-pp.find_lock_entries
0.50 ± 3% -0.1 0.37 ± 2% -0.1 0.39 ± 4% -0.1 0.39 ± 2% perf-profile.children.cycles-pp.get_page_from_freelist
0.24 ± 7% -0.1 0.12 ± 5% -0.1 0.13 ± 2% -0.1 0.13 ± 3% perf-profile.children.cycles-pp.__list_add_valid_or_report
25.89 -0.1 25.78 ± 2% +0.2 26.08 +0.8 26.73 perf-profile.children.cycles-pp.release_pages
0.34 ± 2% -0.1 0.24 ± 4% -0.1 0.23 ± 2% -0.1 0.23 ± 4% perf-profile.children.cycles-pp.__mod_node_page_state
0.38 ± 3% -0.1 0.28 ± 4% -0.1 0.29 ± 3% -0.1 0.28 perf-profile.children.cycles-pp._raw_spin_lock
0.32 ± 2% -0.1 0.22 ± 5% -0.1 0.23 ± 2% -0.1 0.23 ± 2% perf-profile.children.cycles-pp.__dquot_alloc_space
0.26 ± 2% -0.1 0.17 ± 2% -0.1 0.18 ± 3% -0.1 0.18 ± 2% perf-profile.children.cycles-pp.xas_descend
0.22 ± 3% -0.1 0.14 ± 4% -0.1 0.14 ± 3% -0.1 0.14 ± 2% perf-profile.children.cycles-pp.free_unref_page_commit
0.25 -0.1 0.17 ± 3% -0.1 0.18 ± 4% -0.1 0.18 ± 4% perf-profile.children.cycles-pp.xas_clear_mark
0.32 ± 4% -0.1 0.25 ± 3% -0.1 0.26 ± 4% -0.1 0.26 ± 2% perf-profile.children.cycles-pp.rmqueue
0.23 ± 2% -0.1 0.16 ± 2% -0.1 0.16 ± 4% -0.1 0.16 ± 6% perf-profile.children.cycles-pp.xas_init_marks
0.24 ± 2% -0.1 0.17 ± 5% -0.1 0.17 ± 4% -0.1 0.18 ± 2% perf-profile.children.cycles-pp.__cond_resched
0.25 ± 4% -0.1 0.18 ± 2% -0.1 0.18 ± 2% -0.1 0.18 ± 4% perf-profile.children.cycles-pp.truncate_cleanup_folio
0.30 ± 3% -0.1 0.23 ± 4% -0.1 0.22 ± 3% -0.1 0.22 ± 2% perf-profile.children.cycles-pp.filemap_get_entry
0.20 ± 2% -0.1 0.13 ± 5% -0.1 0.13 ± 3% -0.1 0.14 ± 4% perf-profile.children.cycles-pp.folio_unlock
0.16 ± 4% -0.1 0.10 ± 5% -0.1 0.10 ± 7% -0.1 0.11 ± 6% perf-profile.children.cycles-pp.xas_find_conflict
0.19 ± 3% -0.1 0.13 ± 5% -0.0 0.14 ± 12% -0.1 0.14 ± 5% perf-profile.children.cycles-pp._raw_spin_lock_irq
0.17 ± 5% -0.1 0.12 ± 3% -0.1 0.12 ± 4% -0.0 0.13 ± 3% perf-profile.children.cycles-pp.noop_dirty_folio
0.13 ± 4% -0.1 0.08 ± 9% -0.1 0.08 ± 8% -0.0 0.09 perf-profile.children.cycles-pp.security_vm_enough_memory_mm
0.18 ± 8% -0.1 0.13 ± 4% -0.0 0.13 ± 5% -0.0 0.13 ± 5% perf-profile.children.cycles-pp.shmem_recalc_inode
0.16 ± 2% -0.1 0.11 ± 3% -0.0 0.12 ± 4% -0.0 0.12 ± 6% perf-profile.children.cycles-pp.free_unref_page_prepare
0.09 ± 5% -0.1 0.04 ± 45% -0.0 0.05 -0.0 0.05 ± 7% perf-profile.children.cycles-pp.mem_cgroup_update_lru_size
0.10 ± 7% -0.0 0.05 ± 45% -0.0 0.06 ± 13% -0.0 0.06 ± 7% perf-profile.children.cycles-pp.cap_vm_enough_memory
0.14 ± 5% -0.0 0.10 -0.0 0.10 ± 4% -0.0 0.11 ± 5% perf-profile.children.cycles-pp.__folio_cancel_dirty
0.14 ± 5% -0.0 0.10 ± 4% -0.0 0.10 ± 3% -0.0 0.10 ± 6% perf-profile.children.cycles-pp.security_file_permission
0.10 ± 5% -0.0 0.06 ± 6% -0.0 0.06 ± 7% -0.0 0.07 ± 10% perf-profile.children.cycles-pp.xas_find
0.15 ± 4% -0.0 0.11 ± 3% -0.0 0.11 ± 6% -0.0 0.11 ± 3% perf-profile.children.cycles-pp.__fget_light
0.12 ± 3% -0.0 0.09 ± 7% -0.0 0.09 ± 7% -0.0 0.09 ± 6% perf-profile.children.cycles-pp.__vm_enough_memory
0.12 ± 3% -0.0 0.09 ± 4% -0.0 0.09 ± 4% -0.0 0.09 ± 6% perf-profile.children.cycles-pp.apparmor_file_permission
0.12 ± 3% -0.0 0.08 ± 5% -0.0 0.08 ± 5% -0.0 0.09 ± 5% perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
0.14 ± 5% -0.0 0.11 ± 3% -0.0 0.11 ± 4% -0.0 0.12 ± 3% perf-profile.children.cycles-pp.file_modified
0.12 ± 4% -0.0 0.08 ± 4% -0.0 0.08 ± 7% -0.0 0.09 ± 5% perf-profile.children.cycles-pp.xas_start
0.09 -0.0 0.06 ± 8% -0.0 0.04 ± 45% -0.0 0.06 ± 6% perf-profile.children.cycles-pp.__folio_throttle_swaprate
0.12 ± 4% -0.0 0.08 ± 4% -0.0 0.08 ± 4% -0.0 0.09 ± 5% perf-profile.children.cycles-pp.__percpu_counter_limited_add
0.12 ± 6% -0.0 0.08 ± 8% -0.0 0.08 ± 8% -0.0 0.08 ± 4% perf-profile.children.cycles-pp._raw_spin_trylock
0.12 ± 4% -0.0 0.09 ± 4% -0.0 0.09 ± 4% -0.0 0.09 perf-profile.children.cycles-pp.inode_add_bytes
0.20 ± 2% -0.0 0.17 ± 7% -0.0 0.17 ± 4% -0.0 0.18 ± 3% perf-profile.children.cycles-pp.try_charge_memcg
0.10 ± 5% -0.0 0.07 ± 7% -0.0 0.07 ± 7% -0.0 0.06 ± 7% perf-profile.children.cycles-pp.policy_nodemask
0.09 ± 6% -0.0 0.06 ± 6% -0.0 0.06 ± 7% -0.0 0.06 ± 6% perf-profile.children.cycles-pp.get_pfnblock_flags_mask
0.09 ± 6% -0.0 0.06 ± 7% -0.0 0.06 ± 7% -0.0 0.07 ± 5% perf-profile.children.cycles-pp.filemap_free_folio
0.07 ± 6% -0.0 0.05 ± 7% -0.0 0.06 ± 9% -0.0 0.06 ± 8% perf-profile.children.cycles-pp.down_write
0.08 ± 4% -0.0 0.06 ± 8% -0.0 0.06 ± 9% -0.0 0.06 ± 8% perf-profile.children.cycles-pp.get_task_policy
0.09 ± 7% -0.0 0.07 -0.0 0.07 ± 7% -0.0 0.07 perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
0.09 ± 7% -0.0 0.07 -0.0 0.07 ± 5% -0.0 0.08 ± 6% perf-profile.children.cycles-pp.inode_needs_update_time
0.09 ± 5% -0.0 0.07 ± 5% -0.0 0.08 ± 4% -0.0 0.08 ± 4% perf-profile.children.cycles-pp.xas_create
0.16 ± 2% -0.0 0.14 ± 5% -0.0 0.14 ± 2% -0.0 0.15 ± 4% perf-profile.children.cycles-pp.cgroup_rstat_updated
0.08 ± 7% -0.0 0.06 ± 9% -0.0 0.06 ± 6% -0.0 0.06 perf-profile.children.cycles-pp.percpu_counter_add_batch
0.07 ± 5% -0.0 0.05 ± 7% -0.0 0.03 ± 70% -0.0 0.06 ± 14% perf-profile.children.cycles-pp.folio_mark_dirty
0.08 ± 10% -0.0 0.06 ± 6% -0.0 0.06 ± 13% -0.0 0.05 perf-profile.children.cycles-pp.shmem_is_huge
0.07 ± 6% +0.0 0.09 ± 10% +0.0 0.09 ± 5% +0.0 0.09 ± 6% perf-profile.children.cycles-pp.propagate_protected_usage
0.43 ± 3% +0.0 0.46 ± 5% +0.0 0.47 ± 3% +0.0 0.48 ± 2% perf-profile.children.cycles-pp.uncharge_batch
0.68 ± 3% +0.0 0.73 ± 4% +0.0 0.74 ± 3% +0.1 0.74 perf-profile.children.cycles-pp.__mem_cgroup_uncharge_list
1.11 +0.1 1.22 +0.1 1.19 +0.1 1.17 ± 2% perf-profile.children.cycles-pp.lru_add_fn
2.91 +0.3 3.18 ± 2% +0.3 3.23 +0.1 3.02 perf-profile.children.cycles-pp.truncate_inode_folio
2.56 +0.4 2.92 ± 2% +0.4 2.98 +0.2 2.75 perf-profile.children.cycles-pp.filemap_remove_folio
1.37 ± 3% +0.4 1.76 ± 9% +0.4 1.76 ± 5% +0.3 1.69 ± 2% perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
2.24 +0.5 2.70 ± 2% +0.5 2.75 +0.3 2.51 perf-profile.children.cycles-pp.__filemap_remove_folio
2.38 +0.6 2.97 +0.6 2.99 ± 3% +0.2 2.63 perf-profile.children.cycles-pp.shmem_add_to_page_cache
0.18 ± 4% +0.7 0.91 ± 4% +0.8 0.94 ± 4% +0.7 0.87 ± 2% perf-profile.children.cycles-pp.__count_memcg_events
1.26 +0.8 2.04 ± 2% +0.8 2.08 ± 2% +0.6 1.82 perf-profile.children.cycles-pp.filemap_unaccount_folio
0.63 ± 2% +1.0 1.67 ± 5% +1.1 1.68 ± 5% +1.0 1.58 ± 2% perf-profile.children.cycles-pp.mem_cgroup_commit_charge
38.71 +1.2 39.91 +0.5 39.23 +0.7 39.46 perf-profile.children.cycles-pp.vfs_fallocate
38.37 +1.3 39.66 +0.6 38.99 +0.8 39.21 perf-profile.children.cycles-pp.shmem_fallocate
37.28 +1.6 38.89 +0.9 38.20 +1.1 38.39 perf-profile.children.cycles-pp.shmem_get_folio_gfp
36.71 +1.7 38.45 +1.1 37.77 +1.2 37.94 perf-profile.children.cycles-pp.shmem_alloc_and_add_folio
2.58 +1.8 4.36 ± 2% +1.8 4.40 ± 3% +1.2 3.74 perf-profile.children.cycles-pp.__mod_lruvec_page_state
2.48 ± 2% +2.1 4.60 ± 8% +2.1 4.62 ± 5% +1.9 4.38 ± 2% perf-profile.children.cycles-pp.__mem_cgroup_charge
1.93 ± 3% +2.4 4.36 ± 2% +2.5 4.38 ± 3% +2.2 4.09 perf-profile.children.cycles-pp.__mod_memcg_lruvec_state
1.30 -0.4 0.92 ± 2% -0.4 0.93 -0.3 0.95 perf-profile.self.cycles-pp.syscall_return_via_sysret
0.73 -0.2 0.52 ± 2% -0.2 0.53 -0.2 0.54 ± 2% perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.54 ± 2% -0.2 0.36 ± 3% -0.2 0.36 ± 3% -0.2 0.37 ± 2% perf-profile.self.cycles-pp.release_pages
0.48 -0.2 0.30 ± 3% -0.2 0.32 ± 3% -0.2 0.33 ± 2% perf-profile.self.cycles-pp.xas_store
0.54 ± 2% -0.2 0.38 ± 3% -0.1 0.39 ± 2% -0.1 0.39 ± 3% perf-profile.self.cycles-pp.__entry_text_start
1.17 -0.1 1.03 ± 2% -0.1 1.03 -0.2 1.00 ± 2% perf-profile.self.cycles-pp.syscall_exit_to_user_mode
0.36 ± 2% -0.1 0.22 ± 3% -0.1 0.22 ± 3% -0.1 0.24 ± 2% perf-profile.self.cycles-pp.shmem_add_to_page_cache
0.43 ± 5% -0.1 0.30 ± 7% -0.2 0.27 ± 7% -0.1 0.29 ± 2% perf-profile.self.cycles-pp.lru_add_fn
0.24 ± 7% -0.1 0.12 ± 6% -0.1 0.13 ± 2% -0.1 0.12 ± 6% perf-profile.self.cycles-pp.__list_add_valid_or_report
0.38 ± 4% -0.1 0.27 ± 4% -0.1 0.28 ± 3% -0.1 0.28 ± 2% perf-profile.self.cycles-pp._raw_spin_lock
0.52 ± 3% -0.1 0.41 -0.1 0.41 -0.1 0.43 ± 3% perf-profile.self.cycles-pp.folio_batch_move_lru
0.32 ± 2% -0.1 0.22 ± 4% -0.1 0.22 ± 3% -0.1 0.22 ± 5% perf-profile.self.cycles-pp.__mod_node_page_state
0.36 ± 2% -0.1 0.26 ± 2% -0.1 0.26 ± 2% -0.1 0.27 perf-profile.self.cycles-pp.shmem_fallocate
0.36 ± 4% -0.1 0.26 ± 4% -0.1 0.26 ± 3% -0.1 0.27 ± 3% perf-profile.self.cycles-pp.find_lock_entries
0.28 ± 3% -0.1 0.20 ± 5% -0.1 0.20 ± 2% -0.1 0.21 ± 3% perf-profile.self.cycles-pp.__alloc_pages
0.24 ± 2% -0.1 0.16 ± 4% -0.1 0.16 ± 4% -0.1 0.16 ± 3% perf-profile.self.cycles-pp.xas_descend
0.09 ± 5% -0.1 0.01 ±223% -0.1 0.03 ± 70% -0.1 0.03 ± 70% perf-profile.self.cycles-pp.mem_cgroup_update_lru_size
0.23 ± 2% -0.1 0.16 ± 3% -0.1 0.16 ± 2% -0.1 0.16 ± 4% perf-profile.self.cycles-pp.xas_clear_mark
0.18 ± 3% -0.1 0.11 ± 6% -0.1 0.12 ± 4% -0.1 0.11 ± 4% perf-profile.self.cycles-pp.free_unref_page_commit
0.18 ± 3% -0.1 0.12 ± 4% -0.1 0.12 ± 3% -0.0 0.13 ± 5% perf-profile.self.cycles-pp.shmem_inode_acct_blocks
0.21 ± 3% -0.1 0.15 ± 2% -0.1 0.15 ± 2% -0.1 0.16 ± 3% perf-profile.self.cycles-pp.shmem_alloc_and_add_folio
0.18 ± 2% -0.1 0.12 ± 3% -0.1 0.12 ± 4% -0.1 0.12 ± 3% perf-profile.self.cycles-pp.__filemap_remove_folio
0.18 ± 7% -0.1 0.12 ± 7% -0.0 0.13 ± 5% -0.1 0.12 ± 3% perf-profile.self.cycles-pp.vfs_fallocate
0.18 ± 2% -0.1 0.13 ± 3% -0.1 0.13 -0.1 0.13 ± 5% perf-profile.self.cycles-pp.folio_unlock
0.20 ± 2% -0.1 0.14 ± 6% -0.1 0.15 ± 3% -0.1 0.15 ± 6% perf-profile.self.cycles-pp.__dquot_alloc_space
0.18 ± 2% -0.1 0.12 ± 3% -0.1 0.13 ± 3% -0.0 0.13 ± 4% perf-profile.self.cycles-pp.get_page_from_freelist
0.15 ± 3% -0.1 0.10 ± 7% -0.0 0.10 ± 3% -0.0 0.10 ± 3% perf-profile.self.cycles-pp.xas_load
0.17 ± 3% -0.1 0.12 ± 8% -0.1 0.12 ± 3% -0.0 0.12 ± 4% perf-profile.self.cycles-pp.__cond_resched
0.17 ± 2% -0.1 0.12 ± 3% -0.1 0.12 ± 7% -0.0 0.13 ± 2% perf-profile.self.cycles-pp._raw_spin_lock_irq
0.17 ± 5% -0.1 0.12 ± 3% -0.0 0.12 ± 4% -0.0 0.12 ± 6% perf-profile.self.cycles-pp.noop_dirty_folio
0.10 ± 7% -0.0 0.05 ± 45% -0.0 0.06 ± 13% -0.0 0.06 ± 7% perf-profile.self.cycles-pp.cap_vm_enough_memory
0.12 ± 3% -0.0 0.08 ± 4% -0.0 0.08 -0.0 0.08 ± 4% perf-profile.self.cycles-pp.rmqueue
0.06 -0.0 0.02 ±141% -0.0 0.03 ± 70% -0.0 0.04 ± 44% perf-profile.self.cycles-pp.inode_needs_update_time
0.07 ± 5% -0.0 0.02 ± 99% -0.0 0.05 -0.0 0.05 ± 7% perf-profile.self.cycles-pp.xas_find
0.13 ± 3% -0.0 0.09 ± 6% -0.0 0.10 ± 5% -0.0 0.09 ± 7% perf-profile.self.cycles-pp.alloc_pages_mpol
0.07 ± 6% -0.0 0.03 ± 70% -0.0 0.04 ± 44% -0.0 0.05 perf-profile.self.cycles-pp.xas_find_conflict
0.16 ± 2% -0.0 0.12 ± 6% -0.0 0.12 ± 3% -0.0 0.13 ± 5% perf-profile.self.cycles-pp.free_unref_page_list
0.12 ± 5% -0.0 0.08 ± 4% -0.0 0.08 ± 4% -0.0 0.09 ± 7% perf-profile.self.cycles-pp.fallocate64
0.20 ± 4% -0.0 0.16 ± 3% -0.0 0.16 ± 3% -0.0 0.18 ± 4% perf-profile.self.cycles-pp.shmem_get_folio_gfp
0.06 ± 7% -0.0 0.02 ± 99% -0.0 0.02 ± 99% -0.0 0.03 ± 70% perf-profile.self.cycles-pp.shmem_recalc_inode
0.13 ± 3% -0.0 0.09 -0.0 0.09 ± 6% -0.0 0.09 ± 6% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.22 ± 3% -0.0 0.19 ± 6% -0.0 0.20 ± 3% -0.0 0.21 ± 4% perf-profile.self.cycles-pp.page_counter_uncharge
0.14 ± 3% -0.0 0.10 ± 6% -0.0 0.10 ± 8% -0.0 0.10 ± 4% perf-profile.self.cycles-pp.filemap_remove_folio
0.15 ± 5% -0.0 0.11 ± 3% -0.0 0.11 ± 6% -0.0 0.11 ± 3% perf-profile.self.cycles-pp.__fget_light
0.12 ± 4% -0.0 0.08 -0.0 0.08 ± 5% -0.0 0.08 ± 4% perf-profile.self.cycles-pp.__folio_cancel_dirty
0.11 ± 4% -0.0 0.08 ± 7% -0.0 0.08 ± 8% -0.0 0.08 ± 4% perf-profile.self.cycles-pp._raw_spin_trylock
0.11 ± 3% -0.0 0.08 ± 6% -0.0 0.07 ± 9% -0.0 0.08 ± 6% perf-profile.self.cycles-pp.xas_start
0.11 ± 3% -0.0 0.08 ± 6% -0.0 0.08 ± 6% -0.0 0.08 ± 6% perf-profile.self.cycles-pp.__percpu_counter_limited_add
0.12 ± 3% -0.0 0.09 ± 5% -0.0 0.08 ± 5% -0.0 0.08 ± 4% perf-profile.self.cycles-pp.__mod_lruvec_state
0.11 ± 5% -0.0 0.08 ± 4% -0.0 0.08 ± 6% -0.0 0.08 ± 4% perf-profile.self.cycles-pp.truncate_cleanup_folio
0.10 ± 6% -0.0 0.07 ± 5% -0.0 0.07 ± 7% -0.0 0.07 ± 11% perf-profile.self.cycles-pp.xas_init_marks
0.09 ± 6% -0.0 0.06 ± 6% -0.0 0.06 ± 7% -0.0 0.06 ± 6% perf-profile.self.cycles-pp.get_pfnblock_flags_mask
0.11 -0.0 0.08 ± 5% -0.0 0.08 -0.0 0.09 ± 5% perf-profile.self.cycles-pp.folio_add_lru
0.09 ± 6% -0.0 0.06 ± 7% -0.0 0.06 ± 7% -0.0 0.07 ± 5% perf-profile.self.cycles-pp.filemap_free_folio
0.09 ± 4% -0.0 0.06 ± 6% -0.0 0.06 ± 6% -0.0 0.06 ± 6% perf-profile.self.cycles-pp.shmem_alloc_folio
0.10 ± 4% -0.0 0.08 ± 4% -0.0 0.08 ± 6% -0.0 0.08 ± 7% perf-profile.self.cycles-pp.apparmor_file_permission
0.14 ± 5% -0.0 0.12 ± 5% -0.0 0.12 ± 3% -0.0 0.13 ± 4% perf-profile.self.cycles-pp.cgroup_rstat_updated
0.07 ± 7% -0.0 0.04 ± 44% -0.0 0.04 ± 44% -0.0 0.04 ± 71% perf-profile.self.cycles-pp.policy_nodemask
0.07 ± 11% -0.0 0.04 ± 45% -0.0 0.05 ± 7% -0.0 0.03 ± 70% perf-profile.self.cycles-pp.shmem_is_huge
0.08 ± 4% -0.0 0.06 ± 8% -0.0 0.06 ± 9% -0.0 0.06 ± 9% perf-profile.self.cycles-pp.get_task_policy
0.08 ± 6% -0.0 0.05 ± 8% -0.0 0.06 ± 8% -0.0 0.05 ± 8% perf-profile.self.cycles-pp.__x64_sys_fallocate
0.12 ± 3% -0.0 0.10 ± 6% -0.0 0.10 ± 6% -0.0 0.10 ± 3% perf-profile.self.cycles-pp.try_charge_memcg
0.07 -0.0 0.05 -0.0 0.05 -0.0 0.04 ± 45% perf-profile.self.cycles-pp.free_unref_page_prepare
0.07 ± 6% -0.0 0.06 ± 9% -0.0 0.06 ± 8% -0.0 0.06 ± 9% perf-profile.self.cycles-pp.percpu_counter_add_batch
0.08 ± 4% -0.0 0.06 -0.0 0.06 ± 6% -0.0 0.06 perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
0.09 ± 7% -0.0 0.07 ± 5% -0.0 0.07 ± 5% -0.0 0.07 ± 7% perf-profile.self.cycles-pp.filemap_get_entry
0.07 ± 9% +0.0 0.09 ± 10% +0.0 0.09 ± 5% +0.0 0.09 ± 6% perf-profile.self.cycles-pp.propagate_protected_usage
0.96 ± 2% +0.2 1.12 ± 7% +0.2 1.16 ± 4% -0.2 0.72 ± 2% perf-profile.self.cycles-pp.__mod_lruvec_page_state
0.45 ± 4% +0.4 0.82 ± 8% +0.4 0.81 ± 6% +0.3 0.77 ± 3% perf-profile.self.cycles-pp.mem_cgroup_commit_charge
1.36 ± 3% +0.4 1.75 ± 9% +0.4 1.75 ± 5% +0.3 1.68 ± 2% perf-profile.self.cycles-pp.get_mem_cgroup_from_mm
0.29 +0.7 1.00 ± 10% +0.7 1.01 ± 7% +0.6 0.93 ± 2% perf-profile.self.cycles-pp.__mem_cgroup_charge
0.16 ± 4% +0.7 0.90 ± 4% +0.8 0.92 ± 4% +0.7 0.85 ± 2% perf-profile.self.cycles-pp.__count_memcg_events
1.80 ± 2% +2.5 4.26 ± 2% +2.5 4.28 ± 3% +2.2 3.98 perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
On Tue, Oct 24, 2023 at 11:09 PM Oliver Sang <oliver.sang@intel.com> wrote:
>
> hi, Yosry Ahmed,
>
> On Tue, Oct 24, 2023 at 12:14:42AM -0700, Yosry Ahmed wrote:
> > On Mon, Oct 23, 2023 at 11:56 PM Oliver Sang <oliver.sang@intel.com> wrote:
> > >
> > > hi, Yosry Ahmed,
> > >
> > > On Mon, Oct 23, 2023 at 07:13:50PM -0700, Yosry Ahmed wrote:
> > >
> > > ...
> > >
> > > >
> > > > I still could not run the benchmark, but I used a version of
> > > > fallocate1.c that does 1 million iterations. I ran 100 in parallel.
> > > > This showed ~13% regression with the patch, so not the same as the
> > > > will-it-scale version, but it could be an indicator.
> > > >
> > > > With that, I did not see any improvement with the fixlet above or
> > > > ___cacheline_aligned_in_smp. So you can scratch that.
> > > >
> > > > I did, however, see some improvement with reducing the indirection
> > > > layers by moving stats_updates directly into struct mem_cgroup. The
> > > > regression in my manual testing went down to 9%. Still not great, but
> > > > I am wondering how this reflects on the benchmark. If you're able to
> > > > test it that would be great, the diff is below. Meanwhile I am still
> > > > looking for other improvements that can be made.
> > >
> > > we applied previous patch-set as below:
> > >
> > > c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
> > > ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
> > > 51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
> > > 130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
> > > 26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
> > > 25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything   <---- the base our tool picked for the patch set
> > >
> > > I tried to apply below patch to either 51d74c18a9c61 or c5f50d8b23c79,
> > > but failed. could you guide how to apply this patch?
> > > Thanks
> > >
> >
> > Thanks for looking into this. I rebased the diff on top of
> > c5f50d8b23c79. Please find it attached.
>
> from our tests, this patch has little impact.
>
> it was applied as below ac6a9444dec85:
>
> ac6a9444dec85 (linux-devel/fixup-c5f50d8b23c79) memcg: move stats_updates to struct mem_cgroup
> c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
> ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
> 51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
> 130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
> 26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
> 25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything
>
> for the first regression reported in the original report, data are very close
> for 51d74c18a9c61, c5f50d8b23c79 (patch-set tip, parent of ac6a9444dec85),
> and ac6a9444dec85.
> full comparison is as [1]
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
>   gcc-12/performance/x86_64-rhel-8.3/thread/100%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale
>
> 130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 c5f50d8b23c7982ac875791755b ac6a9444dec85dc50c6bfbc4ee7
> ---------------- --------------------------- --------------------------- ---------------------------
>        %stddev      %change   %stddev      %change   %stddev      %change   %stddev
>            \            |         \            |         \            |         \
>      36509          -25.8%      27079        -25.2%      27305        -25.0%      27383        will-it-scale.per_thread_ops
>
> for the second regression reported in the original report, seems a small impact
> from ac6a9444dec85.
> full comparison is as [2]
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
>   gcc-12/performance/x86_64-rhel-8.3/thread/50%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale
>
> 130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 c5f50d8b23c7982ac875791755b ac6a9444dec85dc50c6bfbc4ee7
> ---------------- --------------------------- --------------------------- ---------------------------
>        %stddev      %change   %stddev      %change   %stddev      %change   %stddev
>            \            |         \            |         \            |         \
>      76580          -30.0%      53575        -28.9%      54415        -26.7%      56152        will-it-scale.per_thread_ops
>
> [1]

Thanks Oliver for running the numbers. If I understand correctly the
will-it-scale.fallocate1 microbenchmark is the only one showing
significant regression here, is this correct?

In my runs, other more representative microbenchmarks like netperf
and will-it-scale.page_fault* show minimal regression. I would expect
practical workloads to have high concurrency of page faults or
networking, but maybe not fallocate/ftruncate.

Oliver, in your experience, how often does such a regression in such a
microbenchmark translate to a real regression that people care about?
(or how often do people dismiss it?)

I tried optimizing this further for the fallocate/ftruncate case but
without luck. I even tried moving stats_updates into cgroup core
(struct cgroup_rstat_cpu) to reuse the existing loop in
cgroup_rstat_updated() -- but it somehow made it worse.

On the other hand, we do have some machines in production running this
series together with a previous optimization for non-hierarchical
stats [1] on an older kernel, and we do see significant reduction in
cpu time spent on reading the stats. Domenico did a similar experiment
with only this series and reported similar results [2].

Shakeel, Johannes, (and other memcg folks), I personally think the
benefits here outweigh a regression in this particular benchmark, but
I am obviously biased. What do you think?

[1]https://lore.kernel.org/lkml/20230726153223.821757-2-yosryahmed@google.com/
[2]https://lore.kernel.org/lkml/CAFYChMv_kv_KXOMRkrmTN-7MrfgBHMcK3YXv0dPYEL7nK77e2A@mail.gmail.com/
On Tue, Oct 24, 2023 at 11:23 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> [...]
>
> Thanks Oliver for running the numbers. If I understand correctly the
> will-it-scale.fallocate1 microbenchmark is the only one showing
> significant regression here, is this correct?
>
> In my runs, other more representative microbenchmarks like netperf
> and will-it-scale.page_fault* show minimal regression. I would expect
> practical workloads to have high concurrency of page faults or
> networking, but maybe not fallocate/ftruncate.
>
> Oliver, in your experience, how often does such a regression in such a
> microbenchmark translate to a real regression that people care about?
> (or how often do people dismiss it?)
>
> I tried optimizing this further for the fallocate/ftruncate case but
> without luck. I even tried moving stats_updates into cgroup core
> (struct cgroup_rstat_cpu) to reuse the existing loop in
> cgroup_rstat_updated() -- but it somehow made it worse.
>
> On the other hand, we do have some machines in production running this
> series together with a previous optimization for non-hierarchical
> stats [1] on an older kernel, and we do see significant reduction in
> cpu time spent on reading the stats. Domenico did a similar experiment
> with only this series and reported similar results [2].
>
> Shakeel, Johannes, (and other memcg folks), I personally think the
> benefits here outweigh a regression in this particular benchmark, but
> I am obviously biased. What do you think?
>
> [1]https://lore.kernel.org/lkml/20230726153223.821757-2-yosryahmed@google.com/
> [2]https://lore.kernel.org/lkml/CAFYChMv_kv_KXOMRkrmTN-7MrfgBHMcK3YXv0dPYEL7nK77e2A@mail.gmail.com/

I still am not convinced of the benefits outweighing the regression,
but I would not block this. So let's do this: skip this open window,
get the patch series reviewed, and hopefully we can work together on
fixing that regression and make an informed decision about accepting
it for the next cycle.
On Wed, Oct 25, 2023 at 10:06 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Tue, Oct 24, 2023 at 11:23 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > [...]
> >
> > Shakeel, Johannes, (and other memcg folks), I personally think the
> > benefits here outweigh a regression in this particular benchmark, but
> > I am obviously biased. What do you think?
> >
> > [1]https://lore.kernel.org/lkml/20230726153223.821757-2-yosryahmed@google.com/
> > [2]https://lore.kernel.org/lkml/CAFYChMv_kv_KXOMRkrmTN-7MrfgBHMcK3YXv0dPYEL7nK77e2A@mail.gmail.com/
>
> I still am not convinced of the benefits outweighing the regression
> but I would not block this. So, let's do this, skip this open window,
> get the patch series reviewed and hopefully we can work together on
> fixing that regression and we can make an informed decision of
> accepting the regression for this series for the next cycle.

Skipping this open window sounds okay to me.

FWIW, I think with this patch series we can keep the old behavior
(roughly) and hide the changes behind a tunable (config option or
sysfs file). I think the only changes that need to be done to the code
to approximate the previous behavior are:
- Use root when updating the pending stats in memcg_rstat_updated()
  instead of the passed memcg.
- Use root in mem_cgroup_flush_stats() instead of the passed memcg.
- Use mutex_trylock() instead of mutex_lock() in
  mem_cgroup_flush_stats().

So I think it should be doable to hide most changes behind a tunable,
but let's not do this unless necessary.
On Mon, Oct 9, 2023 at 8:21 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> A global counter for the magnitude of memcg stats update is maintained
> on the memcg side to avoid invoking rstat flushes when the pending
> updates are not significant. This avoids unnecessary flushes, which are
> not very cheap even if there isn't a lot of stats to flush. It also
> avoids unnecessary lock contention on the underlying global rstat lock.
>
> [...]
>
> This patch was tested to make sure no significant regressions are
> introduced on the update path as follows. The following benchmarks were
> ran in a cgroup that is 4 levels deep (/sys/fs/cgroup/a/b/c/d), which is
> deeper than a usual setup:
>
> (a) neper [1] with 1000 flows and 100 threads (single machine). The
> values in the table are the average of server and client throughputs in
> mbps after 30 iterations, each running for 30s:
>
>                       tcp_rr          tcp_stream
> Base                  9504218.56      357366.84
> Patched               9656205.68      356978.39
> Delta                 +1.6%           -0.1%
> Standard Deviation    0.95%           1.03%
>
> An increase in the performance of tcp_rr doesn't really make sense, but
> it's probably in the noise. The same tests were ran with 1 flow and 1
> thread but the throughput was too noisy to make any conclusions (the
> averages did not show regressions nonetheless).
>
> Looking at perf for one iteration of the above test, __mod_memcg_state()
> (which is where memcg_rstat_updated() is called) does not show up at all
> without this patch, but it shows up with this patch as 1.06% for tcp_rr
> and 0.36% for tcp_stream.
>
> (b) "stress-ng --vm 0 -t 1m --times --perf". I don't understand
> stress-ng very well, so I am not sure that's the best way to test this,
> but it spawns 384 workers and spits a lot of metrics which looks nice :)
> I picked a few ones that seem to be relevant to the stats update path. I
> also included cache misses as this patch introduces more atomics that may
> bounce between cpu caches:
>
> Metric                  Base            Patched         Delta
> Cache Misses            3.394 B/sec     3.433 B/sec     +1.14%
> Cache L1D Read          0.148 T/sec     0.154 T/sec     +4.05%
> Cache L1D Read Miss     20.430 B/sec    21.820 B/sec    +6.8%
> Page Faults Total       4.304 M/sec     4.535 M/sec     +5.4%
> Page Faults Minor       4.304 M/sec     4.535 M/sec     +5.4%
> Page Faults Major       18.794 /sec     0.000 /sec
> Kmalloc                 0.153 M/sec     0.152 M/sec     -0.65%
> Kfree                   0.152 M/sec     0.153 M/sec     +0.65%
> MM Page Alloc           4.640 M/sec     4.898 M/sec     +5.56%
> MM Page Free            4.639 M/sec     4.897 M/sec     +5.56%
> Lock Contention Begin   0.362 M/sec     0.479 M/sec     +32.32%
> Lock Contention End     0.362 M/sec     0.479 M/sec     +32.32%
> page-cache add          238.057 /sec    0.000 /sec
> page-cache del          6.265 /sec      6.267 /sec      -0.03%
>
> This is only using a single run in each case. I am not sure what to
> make out of most of these numbers, but they mostly seem in the noise
> (some better, some worse). The lock contention numbers are interesting.
> I am not sure if higher is better or worse here. No new locks or lock
> sections are introduced by this patch either way.
>
> Looking at perf, __mod_memcg_state() shows up as 0.00% with and without
> this patch. This is suspicious, but I verified while stress-ng is
> running that all the threads are in the right cgroup.
>
> (3) will-it-scale page_fault tests. These tests (specifically
> per_process_ops in page_fault3 test) detected a 25.9% regression before
> for a change in the stats update path [2]. These are the numbers from
> 30 runs (+ is good):
>
> LABEL                         | MEAN        | MEDIAN      | STDDEV      |
> ------------------------------+-------------+-------------+-------------
> page_fault1_per_process_ops   |             |             |             |
> (A) base                      | 265207.738  | 262941.000  | 12112.379   |
> (B) patched                   | 249249.191  | 248781.000  | 8767.457    |
>                               | -6.02%      | -5.39%      |             |
> page_fault1_per_thread_ops    |             |             |             |
> (A) base                      | 241618.484  | 240209.000  | 10162.207   |
> (B) patched                   | 229820.671  | 229108.000  | 7506.582    |
>                               | -4.88%      | -4.62%      |             |
> page_fault1_scalability       |             |             |
> (A) base                      | 0.03545     | 0.035705    | 0.0015837   |
> (B) patched                   | 0.029952    | 0.029957    | 0.0013551   |
>                               | -9.29%      | -9.35%      |             |

This much regression is not acceptable.

In addition, I ran netperf with the same 4 level hierarchy as you have
run and I am seeing ~11% regression. More specifically on a machine
with 44 CPUs (HT disabled ixion machine):

# for server
$ netserver -6

# 22 instances of netperf clients
$ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

(averaged over 4 runs)

base (next-20231009): 33081 MBPS
patched: 29267 MBPS

So, this series is not acceptable unless this regression is resolved.
On Tue, Oct 10, 2023 at 1:45 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Mon, Oct 9, 2023 at 8:21 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > A global counter for the magnitude of memcg stats update is maintained
> > on the memcg side to avoid invoking rstat flushes when the pending
> > updates are not significant. This avoids unnecessary flushes, which are
> > not very cheap even if there isn't a lot of stats to flush. It also
> > avoids unnecessary lock contention on the underlying global rstat lock.
> >
> > [...]
> >
> > (3) will-it-scale page_fault tests. These tests (specifically
> > per_process_ops in page_fault3 test) detected a 25.9% regression before
> > for a change in the stats update path [2]. These are the numbers from
> > 30 runs (+ is good):
> >
> > [...]
> >
> > page_fault1_scalability       |             |             |
> > (A) base                      | 0.03545     | 0.035705    | 0.0015837   |
> > (B) patched                   | 0.029952    | 0.029957    | 0.0013551   |
> >                               | -9.29%      | -9.35%      |             |
>
> This much regression is not acceptable.
>
> In addition, I ran netperf with the same 4 level hierarchy as you have
> run and I am seeing ~11% regression.

Interesting, I thought neper and netperf should be similar. Let me try
to reproduce this.

Thanks for testing!

> More specifically on a machine with 44 CPUs (HT disabled ixion machine):
>
> # for server
> $ netserver -6
>
> # 22 instances of netperf clients
> $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> (averaged over 4 runs)
>
> base (next-20231009): 33081 MBPS
> patched: 29267 MBPS
>
> So, this series is not acceptable unless this regression is resolved.
On Tue, Oct 10, 2023 at 2:02 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Oct 10, 2023 at 1:45 PM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > [...]
> >
> > This much regression is not acceptable.
> >
> > In addition, I ran netperf with the same 4 level hierarchy as you have
> > run and I am seeing ~11% regression.
>
> Interesting, I thought neper and netperf should be similar. Let me try
> to reproduce this.
>
> Thanks for testing!
>
> > More specifically on a machine with 44 CPUs (HT disabled ixion machine):
> >
> > # for server
> > $ netserver -6
> >
> > # 22 instances of netperf clients
> > $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> >
> > (averaged over 4 runs)
> >
> > base (next-20231009): 33081 MBPS
> > patched: 29267 MBPS
> >
> > So, this series is not acceptable unless this regression is resolved.
I tried this on a machine with 72 cpus (also ixion), running both netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows: # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control # mkdir /sys/fs/cgroup/a # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control # mkdir /sys/fs/cgroup/a/b # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control # mkdir /sys/fs/cgroup/a/b/c # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control # mkdir /sys/fs/cgroup/a/b/c/d # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs # ./netserver -6 # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K; done Base: 540000 262144 10240 60.00 54613.89 540000 262144 10240 60.00 54940.52 540000 262144 10240 60.00 55168.86 540000 262144 10240 60.00 54800.15 540000 262144 10240 60.00 54452.55 540000 262144 10240 60.00 54501.60 540000 262144 10240 60.00 55036.11 540000 262144 10240 60.00 52018.91 540000 262144 10240 60.00 54877.78 540000 262144 10240 60.00 55342.38 Average: 54575.275 Patched: 540000 262144 10240 60.00 53694.86 540000 262144 10240 60.00 54807.68 540000 262144 10240 60.00 54782.89 540000 262144 10240 60.00 51404.91 540000 262144 10240 60.00 55024.00 540000 262144 10240 60.00 54725.84 540000 262144 10240 60.00 51400.40 540000 262144 10240 60.00 54212.63 540000 262144 10240 60.00 51951.47 540000 262144 10240 60.00 51978.27 Average: 53398.295 That's ~2% regression. Did I do anything incorrectly?
On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote:
[...]
>
> I tried this on a machine with 72 cpus (also ixion), running both
> netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
> # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> # mkdir /sys/fs/cgroup/a
> # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
> # mkdir /sys/fs/cgroup/a/b
> # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
> # mkdir /sys/fs/cgroup/a/b/c
> # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
> # mkdir /sys/fs/cgroup/a/b/c/d
> # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> # ./netserver -6
>
> # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
> -m 10K; done
You are missing '&' at the end. Use something like below:
#!/bin/bash
for i in {1..22}
do
/data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
done
wait
On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote:
> [...]
> >
> > I tried this on a machine with 72 cpus (also ixion), running both
> > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
> > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> > # mkdir /sys/fs/cgroup/a
> > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
> > # mkdir /sys/fs/cgroup/a/b
> > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
> > # mkdir /sys/fs/cgroup/a/b/c
> > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
> > # mkdir /sys/fs/cgroup/a/b/c/d
> > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > # ./netserver -6
> >
> > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
> > -m 10K; done
>
> You are missing '&' at the end. Use something like below:
>
> #!/bin/bash
> for i in {1..22}
> do
> /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
> done
> wait
>
Oh sorry I missed the fact that you are running instances in parallel, my bad.
So I ran 36 instances on a machine with 72 cpus. I did this 10 times
and got an average from all instances for all runs to reduce noise:
#!/bin/bash

ITER=10
NR_INSTANCES=36

for i in $(seq $ITER); do
        echo "iteration $i"
        for j in $(seq $NR_INSTANCES); do
                echo "iteration $i" >> "out$j"
                ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" &
        done
        wait
done

cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}'
Base: 22169 mbps
Patched: 21331.9 mbps
The difference is ~3.7% in my runs. I am not sure what's different.
Perhaps it's the number of runs?
On Tue, Oct 10, 2023 at 6:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote:
> > [...]
> > >
> > > I tried this on a machine with 72 cpus (also ixion), running both
> > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
> > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> > > # mkdir /sys/fs/cgroup/a
> > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
> > > # mkdir /sys/fs/cgroup/a/b
> > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
> > > # mkdir /sys/fs/cgroup/a/b/c
> > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
> > > # mkdir /sys/fs/cgroup/a/b/c/d
> > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > # ./netserver -6
> > >
> > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
> > > -m 10K; done
> >
> > You are missing '&' at the end. Use something like below:
> >
> > #!/bin/bash
> > for i in {1..22}
> > do
> > /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
> > done
> > wait
> >
>
> Oh sorry I missed the fact that you are running instances in parallel, my bad.
>
> So I ran 36 instances on a machine with 72 cpus. I did this 10 times
> and got an average from all instances for all runs to reduce noise:
>
> #!/bin/bash
>
> ITER=10
> NR_INSTANCES=36
>
> for i in $(seq $ITER); do
> echo "iteration $i"
> for j in $(seq $NR_INSTANCES); do
> echo "iteration $i" >> "out$j"
> ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" &
> done
> wait
> done
>
> cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}'
>
> Base: 22169 mbps
> Patched: 21331.9 mbps
>
> The difference is ~3.7% in my runs. I am not sure what's different.
> Perhaps it's the number of runs?
My base kernel is next-20231009 and I am running experiments with
hyperthreading disabled.
On Wed, Oct 11, 2023 at 5:46 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Tue, Oct 10, 2023 at 6:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote:
> > > [...]
> > > >
> > > > I tried this on a machine with 72 cpus (also ixion), running both
> > > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
> > > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> > > > # mkdir /sys/fs/cgroup/a
> > > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
> > > > # mkdir /sys/fs/cgroup/a/b
> > > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
> > > > # mkdir /sys/fs/cgroup/a/b/c
> > > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
> > > > # mkdir /sys/fs/cgroup/a/b/c/d
> > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > # ./netserver -6
> > > >
> > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
> > > > -m 10K; done
> > >
> > > You are missing '&' at the end. Use something like below:
> > >
> > > #!/bin/bash
> > > for i in {1..22}
> > > do
> > > /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
> > > done
> > > wait
> > >
> >
> > Oh sorry I missed the fact that you are running instances in parallel, my bad.
> >
> > So I ran 36 instances on a machine with 72 cpus. I did this 10 times
> > and got an average from all instances for all runs to reduce noise:
> >
> > #!/bin/bash
> >
> > ITER=10
> > NR_INSTANCES=36
> >
> > for i in $(seq $ITER); do
> > echo "iteration $i"
> > for j in $(seq $NR_INSTANCES); do
> > echo "iteration $i" >> "out$j"
> > ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" &
> > done
> > wait
> > done
> >
> > cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}'
> >
> > Base: 22169 mbps
> > Patched: 21331.9 mbps
> >
> > The difference is ~3.7% in my runs. I am not sure what's different.
> > Perhaps it's the number of runs?
>
> My base kernel is next-20231009 and I am running experiments with
> hyperthreading disabled.
Using next-20231009 and a similar 44 core machine with hyperthreading
disabled, I ran 22 instances of netperf in parallel and got the
following numbers from averaging 20 runs:
Base: 33076.5 mbps
Patched: 31410.1 mbps
That's about 5% diff. I guess the number of iterations helps reduce
the noise? I am not sure.
Please also keep in mind that in this case all netperf instances are
in the same cgroup and at a 4-level depth. I imagine that in a practical
setup processes would be a little more spread out, which means fewer
common ancestors and hence less contention on atomic operations.
On Wed, Oct 11, 2023 at 8:13 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Oct 11, 2023 at 5:46 AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Tue, Oct 10, 2023 at 6:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <shakeelb@google.com> wrote:
> > > >
> > > > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote:
> > > > [...]
> > > > >
> > > > > I tried this on a machine with 72 cpus (also ixion), running both
> > > > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
> > > > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> > > > > # mkdir /sys/fs/cgroup/a
> > > > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
> > > > > # mkdir /sys/fs/cgroup/a/b
> > > > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
> > > > > # mkdir /sys/fs/cgroup/a/b/c
> > > > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
> > > > > # mkdir /sys/fs/cgroup/a/b/c/d
> > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > # ./netserver -6
> > > > >
> > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
> > > > > -m 10K; done
> > > >
> > > > You are missing '&' at the end. Use something like below:
> > > >
> > > > #!/bin/bash
> > > > for i in {1..22}
> > > > do
> > > > /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
> > > > done
> > > > wait
> > > >
> > >
> > > Oh sorry I missed the fact that you are running instances in parallel, my bad.
> > >
> > > So I ran 36 instances on a machine with 72 cpus. I did this 10 times
> > > and got an average from all instances for all runs to reduce noise:
> > >
> > > #!/bin/bash
> > >
> > > ITER=10
> > > NR_INSTANCES=36
> > >
> > > for i in $(seq $ITER); do
> > > echo "iteration $i"
> > > for j in $(seq $NR_INSTANCES); do
> > > echo "iteration $i" >> "out$j"
> > > ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" &
> > > done
> > > wait
> > > done
> > >
> > > cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}'
> > >
> > > Base: 22169 mbps
> > > Patched: 21331.9 mbps
> > >
> > > The difference is ~3.7% in my runs. I am not sure what's different.
> > > Perhaps it's the number of runs?
> >
> > My base kernel is next-20231009 and I am running experiments with
> > hyperthreading disabled.
>
> Using next-20231009 and a similar 44 core machine with hyperthreading
> disabled, I ran 22 instances of netperf in parallel and got the
> following numbers from averaging 20 runs:
>
> Base: 33076.5 mbps
> Patched: 31410.1 mbps
>
> That's about 5% diff. I guess the number of iterations helps reduce
> the noise? I am not sure.
>
> Please also keep in mind that in this case all netperf instances are
> in the same cgroup and at a 4-level depth. I imagine in a practical
> setup processes would be a little more spread out, which means less
> common ancestors, so less contended atomic operations.
(Resending the reply as I messed up the last one, was not in plain text)
I was curious, so I ran the same testing in a cgroup 2 levels deep
(i.e. /sys/fs/cgroup/a/b), which is a much more common setup in my
experience. Here are the numbers:
Base: 40198.0 mbps
Patched: 38629.7 mbps
The regression is reduced to ~3.9%.
What's more interesting is that going from a level 2 cgroup to a level
4 cgroup is already a big hit with or without this patch:
Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
Patched: 38629.7 -> 31410.1 mbps (~18.7% regression)
So going from level 2 to 4 is already a significant regression for
other reasons (e.g. hierarchical charging). This patch only makes it
marginally worse. This puts the numbers more into perspective imo than
comparing values at level 4. What do you think?
On Thu, Oct 12, 2023 at 1:04 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Oct 11, 2023 at 8:13 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Wed, Oct 11, 2023 at 5:46 AM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Tue, Oct 10, 2023 at 6:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <shakeelb@google.com> wrote:
> > > > >
> > > > > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote:
> > > > > [...]
> > > > > >
> > > > > > I tried this on a machine with 72 cpus (also ixion), running both
> > > > > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
> > > > > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> > > > > > # mkdir /sys/fs/cgroup/a
> > > > > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
> > > > > > # mkdir /sys/fs/cgroup/a/b
> > > > > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
> > > > > > # mkdir /sys/fs/cgroup/a/b/c
> > > > > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
> > > > > > # mkdir /sys/fs/cgroup/a/b/c/d
> > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > > # ./netserver -6
> > > > > >
> > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
> > > > > > -m 10K; done
> > > > >
> > > > > You are missing '&' at the end. Use something like below:
> > > > >
> > > > > #!/bin/bash
> > > > > for i in {1..22}
> > > > > do
> > > > > /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
> > > > > done
> > > > > wait
> > > > >
> > > >
> > > > Oh sorry I missed the fact that you are running instances in parallel, my bad.
> > > >
> > > > So I ran 36 instances on a machine with 72 cpus. I did this 10 times
> > > > and got an average from all instances for all runs to reduce noise:
> > > >
> > > > #!/bin/bash
> > > >
> > > > ITER=10
> > > > NR_INSTANCES=36
> > > >
> > > > for i in $(seq $ITER); do
> > > > echo "iteration $i"
> > > > for j in $(seq $NR_INSTANCES); do
> > > > echo "iteration $i" >> "out$j"
> > > > ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" &
> > > > done
> > > > wait
> > > > done
> > > >
> > > > cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}'
> > > >
> > > > Base: 22169 mbps
> > > > Patched: 21331.9 mbps
> > > >
> > > > The difference is ~3.7% in my runs. I am not sure what's different.
> > > > Perhaps it's the number of runs?
> > >
> > > My base kernel is next-20231009 and I am running experiments with
> > > hyperthreading disabled.
> >
> > Using next-20231009 and a similar 44 core machine with hyperthreading
> > disabled, I ran 22 instances of netperf in parallel and got the
> > following numbers from averaging 20 runs:
> >
> > Base: 33076.5 mbps
> > Patched: 31410.1 mbps
> >
> > That's about 5% diff. I guess the number of iterations helps reduce
> > the noise? I am not sure.
> >
> > Please also keep in mind that in this case all netperf instances are
> > in the same cgroup and at a 4-level depth. I imagine in a practical
> > setup processes would be a little more spread out, which means less
> > common ancestors, so less contended atomic operations.
>
>
> (Resending the reply as I messed up the last one, was not in plain text)
>
> I was curious, so I ran the same testing in a cgroup 2 levels deep
> (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my
> experience. Here are the numbers:
>
> Base: 40198.0 mbps
> Patched: 38629.7 mbps
>
> The regression is reduced to ~3.9%.
>
> What's more interesting is that going from a level 2 cgroup to a level
> 4 cgroup is already a big hit with or without this patch:
>
> Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
> Patched: 38629.7 -> 31410.1 (~18.7% regression)
>
> So going from level 2 to 4 is already a significant regression for
> other reasons (e.g. hierarchical charging). This patch only makes it
> marginally worse. This puts the numbers more into perspective imo than
> comparing values at level 4. What do you think?
This is weird as we are running the experiments on the same machine. I
will rerun with 2 levels as well. Also can you rerun the page fault
benchmark as well which was showing 9% regression in your original
commit message?
On Thu, Oct 12, 2023 at 6:35 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Oct 12, 2023 at 1:04 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Wed, Oct 11, 2023 at 8:13 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Wed, Oct 11, 2023 at 5:46 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > >
> > > > On Tue, Oct 10, 2023 at 6:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > >
> > > > > On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <shakeelb@google.com> wrote:
> > > > > >
> > > > > > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote:
> > > > > > [...]
> > > > > > >
> > > > > > > I tried this on a machine with 72 cpus (also ixion), running both
> > > > > > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
> > > > > > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> > > > > > > # mkdir /sys/fs/cgroup/a
> > > > > > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
> > > > > > > # mkdir /sys/fs/cgroup/a/b
> > > > > > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
> > > > > > > # mkdir /sys/fs/cgroup/a/b/c
> > > > > > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
> > > > > > > # mkdir /sys/fs/cgroup/a/b/c/d
> > > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > > > # ./netserver -6
> > > > > > >
> > > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
> > > > > > > -m 10K; done
> > > > > >
> > > > > > You are missing '&' at the end. Use something like below:
> > > > > >
> > > > > > #!/bin/bash
> > > > > > for i in {1..22}
> > > > > > do
> > > > > > /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
> > > > > > done
> > > > > > wait
> > > > > >
> > > > >
> > > > > Oh sorry I missed the fact that you are running instances in parallel, my bad.
> > > > >
> > > > > So I ran 36 instances on a machine with 72 cpus. I did this 10 times
> > > > > and got an average from all instances for all runs to reduce noise:
> > > > >
> > > > > #!/bin/bash
> > > > >
> > > > > ITER=10
> > > > > NR_INSTANCES=36
> > > > >
> > > > > for i in $(seq $ITER); do
> > > > > echo "iteration $i"
> > > > > for j in $(seq $NR_INSTANCES); do
> > > > > echo "iteration $i" >> "out$j"
> > > > > ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" &
> > > > > done
> > > > > wait
> > > > > done
> > > > >
> > > > > cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}'
> > > > >
> > > > > Base: 22169 mbps
> > > > > Patched: 21331.9 mbps
> > > > >
> > > > > The difference is ~3.7% in my runs. I am not sure what's different.
> > > > > Perhaps it's the number of runs?
> > > >
> > > > My base kernel is next-20231009 and I am running experiments with
> > > > hyperthreading disabled.
> > >
> > > Using next-20231009 and a similar 44 core machine with hyperthreading
> > > disabled, I ran 22 instances of netperf in parallel and got the
> > > following numbers from averaging 20 runs:
> > >
> > > Base: 33076.5 mbps
> > > Patched: 31410.1 mbps
> > >
> > > That's about 5% diff. I guess the number of iterations helps reduce
> > > the noise? I am not sure.
> > >
> > > Please also keep in mind that in this case all netperf instances are
> > > in the same cgroup and at a 4-level depth. I imagine in a practical
> > > setup processes would be a little more spread out, which means less
> > > common ancestors, so less contended atomic operations.
> >
> >
> > (Resending the reply as I messed up the last one, was not in plain text)
> >
> > I was curious, so I ran the same testing in a cgroup 2 levels deep
> > (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my
> > experience. Here are the numbers:
> >
> > Base: 40198.0 mbps
> > Patched: 38629.7 mbps
> >
> > The regression is reduced to ~3.9%.
> >
> > What's more interesting is that going from a level 2 cgroup to a level
> > 4 cgroup is already a big hit with or without this patch:
> >
> > Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
> > Patched: 38629.7 -> 31410.1 (~18.7% regression)
> >
> > So going from level 2 to 4 is already a significant regression for
> > other reasons (e.g. hierarchical charging). This patch only makes it
> > marginally worse. This puts the numbers more into perspective imo than
> > comparing values at level 4. What do you think?
>
> This is weird as we are running the experiments on the same machine. I
> will rerun with 2 levels as well. Also can you rerun the page fault
> benchmark as well which was showing 9% regression in your original
> commit message?
Thanks. I will re-run the page_fault tests, but keep in mind that the
page fault benchmarks in will-it-scale are highly variable. We run
them between kernel versions internally, and I think we ignore any
changes below 10% as the benchmark is naturally noisy.
I have a couple of runs for page_fault3_scalability showing a 2-3%
improvement with this patch :)
[..]
> > > >
> > > > Using next-20231009 and a similar 44 core machine with hyperthreading
> > > > disabled, I ran 22 instances of netperf in parallel and got the
> > > > following numbers from averaging 20 runs:
> > > >
> > > > Base: 33076.5 mbps
> > > > Patched: 31410.1 mbps
> > > >
> > > > That's about 5% diff. I guess the number of iterations helps reduce
> > > > the noise? I am not sure.
> > > >
> > > > Please also keep in mind that in this case all netperf instances are
> > > > in the same cgroup and at a 4-level depth. I imagine in a practical
> > > > setup processes would be a little more spread out, which means less
> > > > common ancestors, so less contended atomic operations.
> > >
> > >
> > > (Resending the reply as I messed up the last one, was not in plain text)
> > >
> > > I was curious, so I ran the same testing in a cgroup 2 levels deep
> > > (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my
> > > experience. Here are the numbers:
> > >
> > > Base: 40198.0 mbps
> > > Patched: 38629.7 mbps
> > >
> > > The regression is reduced to ~3.9%.
> > >
> > > What's more interesting is that going from a level 2 cgroup to a level
> > > 4 cgroup is already a big hit with or without this patch:
> > >
> > > Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
> > > Patched: 38629.7 -> 31410.1 (~18.7% regression)
> > >
> > > So going from level 2 to 4 is already a significant regression for
> > > other reasons (e.g. hierarchical charging). This patch only makes it
> > > marginally worse. This puts the numbers more into perspective imo than
> > > comparing values at level 4. What do you think?
> >
> > This is weird as we are running the experiments on the same machine. I
> > will rerun with 2 levels as well. Also can you rerun the page fault
> > benchmark as well which was showing 9% regression in your original
> > commit message?
>
> Thanks. I will re-run the page_fault tests, but keep in mind that the
> page fault benchmarks in will-it-scale are highly variable. We run
> them between kernel versions internally, and I think we ignore any
> changes below 10% as the benchmark is naturally noisy.
>
> I have a couple of runs for page_fault3_scalability showing a 2-3%
> improvement with this patch :)
I ran the page_fault tests for 10 runs on a machine with 256 cpus in a
level 2 cgroup, here are the results (the results in the original
commit message are for 384 cpus in a level 4 cgroup):
LABEL                         |     MEAN    |   MEDIAN    |   STDDEV   |
------------------------------+-------------+-------------+-------------
page_fault1_per_process_ops   |             |             |            |
(A) base                      |  270249.164 |  265437.000 |  13451.836 |
(B) patched                   |  261368.709 |  255725.000 |  13394.767 |
                              | -3.29%      | -3.66%      |            |
page_fault1_per_thread_ops    |             |             |            |
(A) base                      |  242111.345 |  239737.000 |  10026.031 |
(B) patched                   |  237057.109 |  235305.000 |   9769.687 |
                              | -2.09%      | -1.85%      |            |
page_fault1_scalability       |             |             |            |
(A) base                      |    0.034387 |    0.035168 |  0.0018283 |
(B) patched                   |    0.033988 |    0.034573 |  0.0018056 |
                              | -1.16%      | -1.69%      |            |
page_fault2_per_process_ops   |             |             |            |
(A) base                      |  203561.836 |  203301.000 |   2550.764 |
(B) patched                   |  197195.945 |  197746.000 |   2264.263 |
                              | -3.13%      | -2.73%      |            |
page_fault2_per_thread_ops    |             |             |            |
(A) base                      |  171046.473 |  170776.000 |   1509.679 |
(B) patched                   |  166626.327 |  166406.000 |    768.753 |
                              | -2.58%      | -2.56%      |            |
page_fault2_scalability       |             |             |            |
(A) base                      |    0.054026 |    0.053821 | 0.00062121 |
(B) patched                   |    0.053329 |     0.05306 | 0.00048394 |
                              | -1.29%      | -1.41%      |            |
page_fault3_per_process_ops   |             |             |            |
(A) base                      | 1295807.782 | 1297550.000 |   5907.585 |
(B) patched                   | 1275579.873 | 1273359.000 |   8759.160 |
                              | -1.56%      | -1.86%      |            |
page_fault3_per_thread_ops    |             |             |            |
(A) base                      |  391234.164 |  390860.000 |   1760.720 |
(B) patched                   |  377231.273 |  376369.000 |   1874.971 |
                              | -3.58%      | -3.71%      |            |
page_fault3_scalability       |             |             |            |
(A) base                      |     0.60369 |     0.60072 |  0.0083029 |
(B) patched                   |     0.61733 |     0.61544 |   0.009855 |
                              | +2.26%      | +2.45%      |            |
The numbers are much better. I can modify the commit log to include
the testing in the replies instead of what's currently there if this
helps (22 netperf instances on 44 cpus and will-it-scale page_fault on
256 cpus -- all in a level 2 cgroup).
On Thu, Oct 12, 2023 at 2:06 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> [..]
> > > > >
> > > > > Using next-20231009 and a similar 44 core machine with hyperthreading
> > > > > disabled, I ran 22 instances of netperf in parallel and got the
> > > > > following numbers from averaging 20 runs:
> > > > >
> > > > > Base: 33076.5 mbps
> > > > > Patched: 31410.1 mbps
> > > > >
> > > > > That's about 5% diff. I guess the number of iterations helps reduce
> > > > > the noise? I am not sure.
> > > > >
> > > > > Please also keep in mind that in this case all netperf instances are
> > > > > in the same cgroup and at a 4-level depth. I imagine in a practical
> > > > > setup processes would be a little more spread out, which means less
> > > > > common ancestors, so less contended atomic operations.
> > > >
> > > >
> > > > (Resending the reply as I messed up the last one, was not in plain text)
> > > >
> > > > I was curious, so I ran the same testing in a cgroup 2 levels deep
> > > > (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my
> > > > experience. Here are the numbers:
> > > >
> > > > Base: 40198.0 mbps
> > > > Patched: 38629.7 mbps
> > > >
> > > > The regression is reduced to ~3.9%.
> > > >
> > > > What's more interesting is that going from a level 2 cgroup to a level
> > > > 4 cgroup is already a big hit with or without this patch:
> > > >
> > > > Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
> > > > Patched: 38629.7 -> 31410.1 (~18.7% regression)
> > > >
> > > > So going from level 2 to 4 is already a significant regression for
> > > > other reasons (e.g. hierarchical charging). This patch only makes it
> > > > marginally worse. This puts the numbers more into perspective imo than
> > > > comparing values at level 4. What do you think?
> > >
> > > This is weird as we are running the experiments on the same machine. I
> > > will rerun with 2 levels as well. Also can you rerun the page fault
> > > benchmark as well which was showing 9% regression in your original
> > > commit message?
> >
> > Thanks. I will re-run the page_fault tests, but keep in mind that the
> > page fault benchmarks in will-it-scale are highly variable. We run
> > them between kernel versions internally, and I think we ignore any
> > changes below 10% as the benchmark is naturally noisy.
> >
> > I have a couple of runs for page_fault3_scalability showing a 2-3%
> > improvement with this patch :)
>
> I ran the page_fault tests for 10 runs on a machine with 256 cpus in a
> level 2 cgroup, here are the results (the results in the original
> commit message are for 384 cpus in a level 4 cgroup):
>
> LABEL                         |     MEAN    |   MEDIAN    |   STDDEV   |
> ------------------------------+-------------+-------------+-------------
> page_fault1_per_process_ops   |             |             |            |
> (A) base                      |  270249.164 |  265437.000 |  13451.836 |
> (B) patched                   |  261368.709 |  255725.000 |  13394.767 |
>                               | -3.29%      | -3.66%      |            |
> page_fault1_per_thread_ops    |             |             |            |
> (A) base                      |  242111.345 |  239737.000 |  10026.031 |
> (B) patched                   |  237057.109 |  235305.000 |   9769.687 |
>                               | -2.09%      | -1.85%      |            |
> page_fault1_scalability       |             |             |            |
> (A) base                      |    0.034387 |    0.035168 |  0.0018283 |
> (B) patched                   |    0.033988 |    0.034573 |  0.0018056 |
>                               | -1.16%      | -1.69%      |            |
> page_fault2_per_process_ops   |             |             |            |
> (A) base                      |  203561.836 |  203301.000 |   2550.764 |
> (B) patched                   |  197195.945 |  197746.000 |   2264.263 |
>                               | -3.13%      | -2.73%      |            |
> page_fault2_per_thread_ops    |             |             |            |
> (A) base                      |  171046.473 |  170776.000 |   1509.679 |
> (B) patched                   |  166626.327 |  166406.000 |    768.753 |
>                               | -2.58%      | -2.56%      |            |
> page_fault2_scalability       |             |             |            |
> (A) base                      |    0.054026 |    0.053821 | 0.00062121 |
> (B) patched                   |    0.053329 |     0.05306 | 0.00048394 |
>                               | -1.29%      | -1.41%      |            |
> page_fault3_per_process_ops   |             |             |            |
> (A) base                      | 1295807.782 | 1297550.000 |   5907.585 |
> (B) patched                   | 1275579.873 | 1273359.000 |   8759.160 |
>                               | -1.56%      | -1.86%      |            |
> page_fault3_per_thread_ops    |             |             |            |
> (A) base                      |  391234.164 |  390860.000 |   1760.720 |
> (B) patched                   |  377231.273 |  376369.000 |   1874.971 |
>                               | -3.58%      | -3.71%      |            |
> page_fault3_scalability       |             |             |            |
> (A) base                      |     0.60369 |     0.60072 |  0.0083029 |
> (B) patched                   |     0.61733 |     0.61544 |   0.009855 |
>                               | +2.26%      | +2.45%      |            |
>
> The numbers are much better. I can modify the commit log to include
> the testing in the replies instead of what's currently there if this
> helps (22 netperf instances on 44 cpus and will-it-scale page_fault on
> 256 cpus -- all in a level 2 cgroup).

Yes this looks better. I think we should also ask intel perf and
phoronix folks to run their benchmarks as well (but no need to block
on them).
On Thu, Oct 12, 2023 at 2:16 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Oct 12, 2023 at 2:06 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > [..]
> >
> > I ran the page_fault tests for 10 runs on a machine with 256 cpus in a
> > level 2 cgroup, here are the results (the results in the original
> > commit message are for 384 cpus in a level 4 cgroup):
> >
> > [...]
> >
> > The numbers are much better. I can modify the commit log to include
> > the testing in the replies instead of what's currently there if this
> > helps (22 netperf instances on 44 cpus and will-it-scale page_fault on
> > 256 cpus -- all in a level 2 cgroup).
>
> Yes this looks better. I think we should also ask intel perf and
> phoronix folks to run their benchmarks as well (but no need to block
> on them).

Anything I need to do for this to happen? (I thought such testing is
already done on linux-next)

Also, any further comments on the patch (or the series in general)? If
not, I can send a new commit message for this patch in-place.
On Thu, Oct 12, 2023 at 2:20 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> [...]
>
> > Yes this looks better. I think we should also ask intel perf and
> > phoronix folks to run their benchmarks as well (but no need to block
> > on them).
>
> Anything I need to do for this to happen? (I thought such testing is
> already done on linux-next)

Just Cced the relevant folks.

Michael, Oliver & Feng, if you have some time/resource available,
please do trigger your performance benchmarks on the following series
(but nothing urgent):

https://lore.kernel.org/all/20231010032117.1577496-1-yosryahmed@google.com/

> Also, any further comments on the patch (or the series in general)? If
> not, I can send a new commit message for this patch in-place.

Sorry, I haven't taken a look yet but will try in a week or so.
On Thu, Oct 12, 2023 at 2:39 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Oct 12, 2023 at 2:20 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> [...]
> > >
> > > Yes this looks better. I think we should also ask intel perf and
> > > phoronix folks to run their benchmarks as well (but no need to block
> > > on them).
> >
> > Anything I need to do for this to happen? (I thought such testing is
> > already done on linux-next)
>
> Just Cced the relevant folks.
>
> Michael, Oliver & Feng, if you have some time/resource available,
> please do trigger your performance benchmarks on the following series
> (but nothing urgent):
>
> https://lore.kernel.org/all/20231010032117.1577496-1-yosryahmed@google.com/
Thanks for that.
>
> >
> > Also, any further comments on the patch (or the series in general)? If
> > not, I can send a new commit message for this patch in-place.
>
> Sorry, I haven't taken a look yet but will try in a week or so.
Sounds good, thanks.
Meanwhile, Andrew, could you please replace the commit log of this
patch as follows for more updated testing info:
Subject: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg
A global counter for the magnitude of memcg stats update is maintained
on the memcg side to avoid invoking rstat flushes when the pending
updates are not significant. This avoids unnecessary flushes, which are
not very cheap even if there isn't a lot of stats to flush. It also
avoids unnecessary lock contention on the underlying global rstat lock.
Make this threshold per-memcg. The same scheme is used: percpu (now
also per-memcg) counters are incremented in the update path, and only
propagated to per-memcg atomics when they exceed a certain threshold.
This provides two benefits:
(a) On large machines with a lot of memcgs, the global threshold can be
reached relatively fast, so guarding the underlying lock becomes less
effective. Making the threshold per-memcg avoids this.
(b) Having a global threshold makes it hard to do subtree flushes, as we
cannot reset the global counter except for a full flush. Per-memcg
counters remove this as a blocker for subtree flushes, which
helps avoid unnecessary work when the stats of a small subtree are
needed.
Nothing is free, of course. This comes at a cost:
(a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4
bytes. The extra memory usage is insignificant.
(b) More work on the update side, although in the common case it will
only be percpu counter updates. The amount of work scales with the
number of ancestors (i.e. tree depth). This is not a new concept; adding
a cgroup to the rstat tree involves a parent loop, as does charging.
Testing results below show no significant regressions.
(c) The error margin in the stats for the system as a whole increases
from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH *
NR_MEMCGS. This is probably fine because we have a similar per-memcg
error in charges coming from percpu stocks, and we have a periodic
flusher that makes sure we always flush all the stats every 2s anyway.
This patch was tested to make sure no significant regressions are
introduced on the update path as follows. The following benchmarks were
run in a cgroup that is 2 levels deep (/sys/fs/cgroup/a/b/):
(1) Running 22 instances of netperf on a 44 cpu machine with
hyperthreading disabled. All instances are run in a level 2 cgroup, as
well as netserver:
# netserver -6
# netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
Averaging 20 runs, the numbers are as follows:
Base: 40198.0 mbps
Patched: 38629.7 mbps (-3.9%)
The regression is minimal, especially for 22 instances in the same
cgroup sharing all ancestors (so updating the same atomics).
(2) will-it-scale page_fault tests. These tests (specifically
per_process_ops in the page_fault3 test) previously detected a 25.9%
regression for a change in the stats update path [1]. These are the
numbers from 10 runs (+ is good) on a machine with 256 cpus:
LABEL | MEAN | MEDIAN | STDDEV |
------------------------------+-------------+-------------+-------------
page_fault1_per_process_ops | | | |
(A) base | 270249.164 | 265437.000 | 13451.836 |
(B) patched | 261368.709 | 255725.000 | 13394.767 |
| -3.29% | -3.66% | |
page_fault1_per_thread_ops | | | |
(A) base | 242111.345 | 239737.000 | 10026.031 |
(B) patched | 237057.109 | 235305.000 | 9769.687 |
| -2.09% | -1.85% | |
page_fault1_scalability | | | |
(A) base | 0.034387 | 0.035168 | 0.0018283 |
(B) patched | 0.033988 | 0.034573 | 0.0018056 |
| -1.16% | -1.69% | |
page_fault2_per_process_ops | | | |
(A) base | 203561.836 | 203301.000 | 2550.764 |
(B) patched | 197195.945 | 197746.000 | 2264.263 |
| -3.13% | -2.73% | |
page_fault2_per_thread_ops | | | |
(A) base | 171046.473 | 170776.000 | 1509.679 |
(B) patched | 166626.327 | 166406.000 | 768.753 |
| -2.58% | -2.56% | |
page_fault2_scalability | | | |
(A) base | 0.054026 | 0.053821 | 0.00062121 |
(B) patched | 0.053329 | 0.05306 | 0.00048394 |
| -1.29% | -1.41% | |
page_fault3_per_process_ops | | | |
(A) base | 1295807.782 | 1297550.000 | 5907.585 |
(B) patched | 1275579.873 | 1273359.000 | 8759.160 |
| -1.56% | -1.86% | |
page_fault3_per_thread_ops | | | |
(A) base | 391234.164 | 390860.000 | 1760.720 |
(B) patched | 377231.273 | 376369.000 | 1874.971 |
| -3.58% | -3.71% | |
page_fault3_scalability | | | |
(A) base | 0.60369 | 0.60072 | 0.0083029 |
(B) patched | 0.61733 | 0.61544 | 0.009855 |
| +2.26% | +2.45% | |
All regressions seem to be minimal, and within the normal variance for
the benchmark. The fix for [1] assumed that 3% is noise (and there were no
further practical complaints), so hopefully this means that such variations
in these microbenchmarks do not reflect on practical workloads.
(3) I also ran stress-ng in a nested cgroup and did not observe any
obvious regressions.
[1]https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/
hi, Yosry Ahmed, hi, Shakeel Butt,

On Thu, Oct 12, 2023 at 03:23:06PM -0700, Yosry Ahmed wrote:
> On Thu, Oct 12, 2023 at 2:39 PM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > [...]
> >
> > Michael, Oliver & Feng, if you have some time/resource available,
> > please do trigger your performance benchmarks on the following series
> > (but nothing urgent):
> >
> > https://lore.kernel.org/all/20231010032117.1577496-1-yosryahmed@google.com/
>
> Thanks for that.

we (0day team) have already applied the patch-set as:

c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything <---- the base our tool picked for the patch set

they are already in our so-called hourly kernels, which are under various
function and performance tests.

our 0day test logic is: if we find any regression in these hourly kernels
compared to a base (e.g. a milestone release), auto-bisect will be triggered.
we then only report when we capture a first bad commit for a regression.

based on this, if you don't receive any report in the following 2-3 weeks, you
can assume that 0day did not capture any regression from your patch-set.

*However*, please be aware that 0day is not a traditional CI system, and also
due to resource constraints, we cannot guarantee coverage, and we cannot
trigger specific tests for your patchset, either.
(sorry if this is not your expectation)

> > > Also, any further comments on the patch (or the series in general)? If
> > > not, I can send a new commit message for this patch in-place.
> >
> > Sorry, I haven't taken a look yet but will try in a week or so.
>
> Sounds good, thanks.
>
> Meanwhile, Andrew, could you please replace the commit log of this
> patch as follows for more updated testing info:
>
> [...]
On Wed, Oct 18, 2023 at 1:22 AM Oliver Sang <oliver.sang@intel.com> wrote:
>
> hi, Yosry Ahmed, hi, Shakeel Butt,
>
> [...]
>
> we (0day team) have already applied the patch-set as:
>
> [...]
>
> based on this, if you don't receive any report in the following 2-3 weeks, you
> can assume that 0day did not capture any regression from your patch-set.
>
> *However*, please be aware that 0day is not a traditional CI system, and also
> due to resource constraints, we cannot guarantee coverage, and we cannot
> trigger specific tests for your patchset, either.
> (sorry if this is not your expectation)

Thanks for taking a look and clarifying this, much appreciated. Fingers
crossed for not getting any reports :)
On Thu, 12 Oct 2023 15:23:06 -0700 Yosry Ahmed <yosryahmed@google.com> wrote:

> Meanwhile, Andrew, could you please replace the commit log of this
> patch as follows for more updated testing info:

Done.
On Sat, Oct 14, 2023 at 4:08 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 12 Oct 2023 15:23:06 -0700 Yosry Ahmed <yosryahmed@google.com> wrote:
>
> > Meanwhile, Andrew, could you please replace the commit log of this
> > patch as follows for more updated testing info:
>
> Done.
Sorry Andrew, but could you please also take this fixlet?
From: Yosry Ahmed <yosryahmed@google.com>
Date: Tue, 17 Oct 2023 23:07:59 +0000
Subject: [PATCH] mm: memcg: clear percpu stats_pending during stats flush
When flushing memcg stats, we clear the per-memcg count of pending stat
updates, as they are captured by the flush. Also clear the percpu count
for the cpu being flushed.
Suggested-by: Wei Xu <weixugc@google.com>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
mm/memcontrol.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0b1377b16b3e0..fa92de780ac89 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5653,6 +5653,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
}
}
}
+ statc->stats_updates = 0;
/* We are in a per-cpu loop here, only do the atomic write once */
if (atomic64_read(&memcg->vmstats->stats_updates))
atomic64_set(&memcg->vmstats->stats_updates, 0);
--
2.42.0.655.g421f12c284-goog
On Sat, Oct 14, 2023 at 4:08 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 12 Oct 2023 15:23:06 -0700 Yosry Ahmed <yosryahmed@google.com> wrote:
>
> > Meanwhile, Andrew, could you please replace the commit log of this
> > patch as follows for more updated testing info:
>
> Done.

Thanks!
On Thu, Oct 12, 2023 at 01:04:03AM -0700, Yosry Ahmed wrote:
> On Wed, Oct 11, 2023 at 8:13 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Wed, Oct 11, 2023 at 5:46 AM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Tue, Oct 10, 2023 at 6:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <shakeelb@google.com> wrote:
> > > > >
> > > > > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote:
> > > > > [...]
> > > > > >
> > > > > > I tried this on a machine with 72 cpus (also ixion), running both
> > > > > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
> > > > > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> > > > > > # mkdir /sys/fs/cgroup/a
> > > > > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
> > > > > > # mkdir /sys/fs/cgroup/a/b
> > > > > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
> > > > > > # mkdir /sys/fs/cgroup/a/b/c
> > > > > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
> > > > > > # mkdir /sys/fs/cgroup/a/b/c/d
> > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > > # ./netserver -6
> > > > > >
> > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
> > > > > > -m 10K; done
> > > > >
> > > > > You are missing '&' at the end. Use something like below:
> > > > >
> > > > > #!/bin/bash
> > > > > for i in {1..22}
> > > > > do
> > > > > /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
> > > > > done
> > > > > wait
> > > > >
> > > >
> > > > Oh sorry I missed the fact that you are running instances in parallel, my bad.
> > > >
> > > > So I ran 36 instances on a machine with 72 cpus. I did this 10 times
> > > > and got an average from all instances for all runs to reduce noise:
> > > >
> > > > #!/bin/bash
> > > >
> > > > ITER=10
> > > > NR_INSTANCES=36
> > > >
> > > > for i in $(seq $ITER); do
> > > > echo "iteration $i"
> > > > for j in $(seq $NR_INSTANCES); do
> > > > echo "iteration $i" >> "out$j"
> > > > ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" &
> > > > done
> > > > wait
> > > > done
> > > >
> > > > cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}'
> > > >
> > > > Base: 22169 mbps
> > > > Patched: 21331.9 mbps
> > > >
> > > > The difference is ~3.7% in my runs. I am not sure what's different.
> > > > Perhaps it's the number of runs?
> > >
> > > My base kernel is next-20231009 and I am running experiments with
> > > hyperthreading disabled.
> >
> > Using next-20231009 and a similar 44 core machine with hyperthreading
> > disabled, I ran 22 instances of netperf in parallel and got the
> > following numbers from averaging 20 runs:
> >
> > Base: 33076.5 mbps
> > Patched: 31410.1 mbps
> >
> > That's about 5% diff. I guess the number of iterations helps reduce
> > the noise? I am not sure.
> >
> > Please also keep in mind that in this case all netperf instances are
> > in the same cgroup and at a 4-level depth. I imagine in a practical
> > setup processes would be a little more spread out, which means less
> > common ancestors, so less contended atomic operations.
>
>
> (Resending the reply as I messed up the last one, was not in plain text)
>
> I was curious, so I ran the same testing in a cgroup 2 levels deep
> (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my
> experience. Here are the numbers:
>
> Base: 40198.0 mbps
> Patched: 38629.7 mbps
>
> The regression is reduced to ~3.9%.
>
> What's more interesting is that going from a level 2 cgroup to a level
> 4 cgroup is already a big hit with or without this patch:
>
> Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
> Patched: 38629.7 -> 31410.1 (~18.7% regression)
>
> So going from level 2 to 4 is already a significant regression for
> other reasons (e.g. hierarchical charging). This patch only makes it
> marginally worse. This puts the numbers more into perspective imo than
> comparing values at level 4. What do you think?
I think it's reasonable.
Especially comparing to how many cachelines we used to touch on the
write side when all flushing happened there. This looks like a good
trade-off to me.
[..]
> > [...]
> >
> > So going from level 2 to 4 is already a significant regression for
> > other reasons (e.g. hierarchical charging). This patch only makes it
> > marginally worse. This puts the numbers more into perspective imo than
> > comparing values at level 4. What do you think?
>
> I think it's reasonable.
>
> Especially comparing to how many cachelines we used to touch on the
> write side when all flushing happened there. This looks like a good
> trade-off to me.

Thanks.

Still wanting to figure out if this patch is what you suggested in our
previous discussion [1], to add a Suggested-by if appropriate :)

[1]https://lore.kernel.org/lkml/20230913153758.GB45543@cmpxchg.org/
On Thu, Oct 12, 2023 at 04:28:49PM -0700, Yosry Ahmed wrote:
> [..]
>
> [...]
>
> Thanks.
>
> Still wanting to figure out if this patch is what you suggested in our
> previous discussion [1], to add a Suggested-by if appropriate :)
>
> [1]https://lore.kernel.org/lkml/20230913153758.GB45543@cmpxchg.org/

Haha, sort of. I suggested the cgroup-level flush-batching, but my
proposal was missing the clever upward propagation of the pending stat
updates that you added.

You can add the tag if you're feeling generous, but I wouldn't be mad
if you don't!
On Thu, Oct 12, 2023 at 7:33 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Oct 12, 2023 at 04:28:49PM -0700, Yosry Ahmed wrote:
> > [..]
> > > > >
> > > > > Using next-20231009 and a similar 44 core machine with hyperthreading
> > > > > disabled, I ran 22 instances of netperf in parallel and got the
> > > > > following numbers from averaging 20 runs:
> > > > >
> > > > > Base: 33076.5 mbps
> > > > > Patched: 31410.1 mbps
> > > > >
> > > > > That's about 5% diff. I guess the number of iterations helps reduce
> > > > > the noise? I am not sure.
> > > > >
> > > > > Please also keep in mind that in this case all netperf instances are
> > > > > in the same cgroup and at a 4-level depth. I imagine in a practical
> > > > > setup processes would be a little more spread out, which means less
> > > > > common ancestors, so less contended atomic operations.
> > > >
> > > > (Resending the reply as I messed up the last one, was not in plain text)
> > > >
> > > > I was curious, so I ran the same testing in a cgroup 2 levels deep
> > > > (i.e. /sys/fs/cgroup/a/b), which is a much more common setup in my
> > > > experience. Here are the numbers:
> > > >
> > > > Base: 40198.0 mbps
> > > > Patched: 38629.7 mbps
> > > >
> > > > The regression is reduced to ~3.9%.
> > > >
> > > > What's more interesting is that going from a level 2 cgroup to a level
> > > > 4 cgroup is already a big hit with or without this patch:
> > > >
> > > > Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
> > > > Patched: 38629.7 -> 31410.1 mbps (~18.7% regression)
> > > >
> > > > So going from level 2 to 4 is already a significant regression for
> > > > other reasons (e.g. hierarchical charging). This patch only makes it
> > > > marginally worse. This puts the numbers more into perspective imo than
> > > > comparing values at level 4. What do you think?
> > >
> > > I think it's reasonable.
> > >
> > > Especially comparing to how many cachelines we used to touch on the
> > > write side when all flushing happened there. This looks like a good
> > > trade-off to me.
> >
> > Thanks.
> >
> > Still wanting to figure out if this patch is what you suggested in our
> > previous discussion [1], to add a
> > Suggested-by if appropriate :)
> >
> > [1] https://lore.kernel.org/lkml/20230913153758.GB45543@cmpxchg.org/
>
> Haha, sort of. I suggested the cgroup-level flush-batching, but my
> proposal was missing the clever upward propagation of the pending stat
> updates that you added.
>
> You can add the tag if you're feeling generous, but I wouldn't be mad
> if you don't!

I like to think that I am a generous person :)
Will add it in the next respin.
On Mon, Oct 9, 2023 at 8:21 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> A global counter for the magnitude of memcg stats update is maintained
> on the memcg side to avoid invoking rstat flushes when the pending
> updates are not significant. This avoids unnecessary flushes, which are
> not very cheap even if there isn't a lot of stats to flush. It also
> avoids unnecessary lock contention on the underlying global rstat lock.
>
> Make this threshold per-memcg. The same scheme is used: percpu (now
> also per-memcg) counters are incremented in the update path, and only
> propagated to per-memcg atomics when they exceed a certain threshold.
>
> This provides two benefits:
> (a) On large machines with a lot of memcgs, the global threshold can be
> reached relatively fast, so guarding the underlying lock becomes less
> effective. Making the threshold per-memcg avoids this.
>
> (b) Having a global threshold makes it hard to do subtree flushes, as we
> cannot reset the global counter except for a full flush. Per-memcg
> counters remove this as a blocker from doing subtree flushes, which
> helps avoid unnecessary work when the stats of a small subtree are
> needed.
>
> Nothing is free, of course. This comes at a cost:
> (a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4
> bytes. The extra memory usage is insignificant.
>
> (b) More work on the update side, although in the common case it will
> only be percpu counter updates. The amount of work scales with the
> number of ancestors (i.e. tree depth). This is not a new concept: adding
> a cgroup to the rstat tree involves a parent loop, and so does charging.
> Testing results below show no significant regressions.
>
> (c) The error margin in the stats for the system as a whole increases
> from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH *
> NR_MEMCGS. This is probably fine because we have a similar per-memcg
> error in charges coming from percpu stocks, and we have a periodic
> flusher that makes sure we always flush all the stats every 2s anyway.
>
> This patch was tested to make sure no significant regressions are
> introduced on the update path as follows. The following benchmarks were
> run in a cgroup that is 4 levels deep (/sys/fs/cgroup/a/b/c/d), which is
> deeper than a usual setup:
>
> (a) neper [1] with 1000 flows and 100 threads (single machine). The
> values in the table are the average of server and client throughputs in
> mbps after 30 iterations, each running for 30s:
>
>                         tcp_rr          tcp_stream
> Base                    9504218.56      357366.84
> Patched                 9656205.68      356978.39
> Delta                   +1.6%           -0.1%
> Standard Deviation      0.95%           1.03%
>
> An increase in the performance of tcp_rr doesn't really make sense, but
> it's probably in the noise. The same tests were run with 1 flow and 1
> thread but the throughput was too noisy to make any conclusions (the
> averages did not show regressions nonetheless).
>
> Looking at perf for one iteration of the above test, __mod_memcg_state()
> (which is where memcg_rstat_updated() is called) does not show up at all
> without this patch, but it shows up with this patch as 1.06% for tcp_rr
> and 0.36% for tcp_stream.
>
> (b) "stress-ng --vm 0 -t 1m --times --perf". I don't understand
> stress-ng very well, so I am not sure that's the best way to test this,
> but it spawns 384 workers and spits a lot of metrics which looks nice :)
> I picked a few ones that seem to be relevant to the stats update path. I
> also included cache misses as this patch introduces more atomics that may
> bounce between cpu caches:
>
> Metric                  Base            Patched         Delta
> Cache Misses            3.394 B/sec     3.433 B/sec     +1.14%
> Cache L1D Read          0.148 T/sec     0.154 T/sec     +4.05%
> Cache L1D Read Miss     20.430 B/sec    21.820 B/sec    +6.8%
> Page Faults Total       4.304 M/sec     4.535 M/sec     +5.4%
> Page Faults Minor       4.304 M/sec     4.535 M/sec     +5.4%
> Page Faults Major       18.794 /sec     0.000 /sec
> Kmalloc                 0.153 M/sec     0.152 M/sec     -0.65%
> Kfree                   0.152 M/sec     0.153 M/sec     +0.65%
> MM Page Alloc           4.640 M/sec     4.898 M/sec     +5.56%
> MM Page Free            4.639 M/sec     4.897 M/sec     +5.56%
> Lock Contention Begin   0.362 M/sec     0.479 M/sec     +32.32%
> Lock Contention End     0.362 M/sec     0.479 M/sec     +32.32%
> page-cache add          238.057 /sec    0.000 /sec
> page-cache del          6.265 /sec      6.267 /sec      -0.03%
>
> This is only using a single run in each case. I am not sure what to
> make out of most of these numbers, but they mostly seem in the noise
> (some better, some worse). The lock contention numbers are interesting.
> I am not sure if higher is better or worse here. No new locks or lock
> sections are introduced by this patch either way.
>
> Looking at perf, __mod_memcg_state() shows up as 0.00% with and without
> this patch. This is suspicious, but I verified while stress-ng is
> running that all the threads are in the right cgroup.
>
> (c) will-it-scale page_fault tests. These tests (specifically
> per_process_ops in the page_fault3 test) detected a 25.9% regression
> before for a change in the stats update path [2]. These are the numbers
> from 30 runs (+ is good):
>
> LABEL                        |     MEAN    |    MEDIAN   |   STDDEV    |
> -----------------------------+-------------+-------------+-------------
> page_fault1_per_process_ops  |             |             |             |
> (A) base                     |  265207.738 |  262941.000 |   12112.379 |
> (B) patched                  |  249249.191 |  248781.000 |    8767.457 |
>                              |      -6.02% |      -5.39% |             |
> page_fault1_per_thread_ops   |             |             |             |
> (A) base                     |  241618.484 |  240209.000 |   10162.207 |
> (B) patched                  |  229820.671 |  229108.000 |    7506.582 |
>                              |      -4.88% |      -4.62% |             |
> page_fault1_scalability      |             |             |             |
> (A) base                     |     0.03545 |    0.035705 |   0.0015837 |
> (B) patched                  |    0.029952 |    0.029957 |   0.0013551 |
>                              |      -9.29% |      -9.35% |             |
> page_fault2_per_process_ops  |             |             |             |
> (A) base                     |  203916.148 |  203496.000 |    2908.331 |
> (B) patched                  |  186975.419 |  187023.000 |    1991.100 |
>                              |      -6.85% |      -6.90% |             |
> page_fault2_per_thread_ops   |             |             |             |
> (A) base                     |  170604.972 |  170532.000 |    1624.834 |
> (B) patched                  |  163100.260 |  163263.000 |    1517.967 |
>                              |      -4.40% |      -4.26% |             |
> page_fault2_scalability      |             |             |             |
> (A) base                     |    0.054603 |    0.054693 |  0.00080196 |
> (B) patched                  |    0.044882 |    0.044957 |   0.0011766 |
>                              |      -0.05% |      +0.33% |             |
> page_fault3_per_process_ops  |             |             |             |
> (A) base                     | 1299821.099 | 1297918.000 |    9882.872 |
> (B) patched                  | 1248700.839 | 1247168.000 |    8454.891 |
>                              |      -3.93% |      -3.91% |             |
> page_fault3_per_thread_ops   |             |             |             |
> (A) base                     |  387216.963 |  387115.000 |    1605.760 |
> (B) patched                  |  368538.213 |  368826.000 |    1852.594 |
>                              |      -4.82% |      -4.72% |             |
> page_fault3_scalability      |             |             |             |
> (A) base                     |     0.59909 |     0.59367 |     0.01256 |
> (B) patched                  |     0.59995 |     0.59769 |    0.010088 |
>                              |      +0.14% |      +0.68% |             |
>
> There are some microbenchmark regressions (and some minute improvements),
> but nothing outside the normal variance of this benchmark between kernel
> versions. The fix for [2] assumed that 3% is noise (and there were no
> further practical complaints), so hopefully this means that such variations
> in these microbenchmarks do not reflect on practical workloads.
>
> [1] https://github.com/google/neper
> [2] https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/
>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

Johannes, as I mentioned in a reply to v1, I think this might be what
you suggested in our previous discussion [1], but I am not sure this is
what you meant for the update path, so I did not add a Suggested-by.
Please let me know if this is what you meant and I can amend the tag as
such.

[1] https://lore.kernel.org/lkml/20230913153758.GB45543@cmpxchg.org/