The calculation of "((~_UL(0)) - (_UL(1) << (l)) + 1)" is to generate a
bitmask with "l" trailing zeroes, which is equivalent to
"(~_UL(0) << (l))".
Refactor the calculation so the number of arithmetic instruction will be
reduced from 3 to 1.
Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
---
Test is done to show the speed-up we can get from reducing the number of
instruction. The test machine runs with 6.9.0-0-generic kernel on x86_64
architecture with processor AMD Ryzen 7 5700X3D 8-Core Processor.
The test executes in the form of kernel module.
Test result:
[451675.644026] new __GENMASK() : 5985416
[451675.644030] old __GENMASK() : 6006406
Test script snippet:
/* switch to __BITS_PER_LONG_LONG when testing __GENMASK_ULL() */
\#define __GENMASK_NEW(h, l) \
((~_UL(0) << (l)) & \
(~_UL(0) >> (__BITS_PER_LONG - 1 - (h))))
int init_module(void)
{
ktime_t start, end, total1 = 0, total2 = 0;
for (int k = 0; k < 100; k++) {
for (int i = 0; i < __BITS_PER_LONG; i++) {
for (int j = 0; j <= i; j++) {
unsigned result1, result2;
start = ktime_get();
result1 = __GENMASK_NEW(i, j);
end = ktime_get();
total1 += (end - start);
start = ktime_get();
result2 = __GENMASK(i, j);
end = ktime_get();
total2 += (end - start);
if (result1 != result2) {
pr_info("Wrong calculation of GENMASK_NEW()\n");
return 0;
}
}
}
}
pr_info("new __GENMASK() : %lld\n", total1);
pr_info("old __GENMASK() : %lld\n", total2);
return 0;
}
---
include/uapi/linux/bits.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/uapi/linux/bits.h b/include/uapi/linux/bits.h
index 5ee30f882736..8fc7fea65288 100644
--- a/include/uapi/linux/bits.h
+++ b/include/uapi/linux/bits.h
@@ -5,7 +5,7 @@
#define _UAPI_LINUX_BITS_H
#define __GENMASK(h, l) \
- (((~_UL(0)) - (_UL(1) << (l)) + 1) & \
+ ((~_UL(0) << (l)) & \
(~_UL(0) >> (__BITS_PER_LONG - 1 - (h))))
#define __GENMASK_ULL(h, l) \
--
2.43.0
On Wed, Feb 12, 2025 at 12:24:11AM +0800, I Hsin Cheng wrote:
> The calculation of "((~_UL(0)) - (_UL(1) << (l)) + 1)" is to generate a
> bitmask with "l" trailing zeroes, which is equivalent to
> "(~_UL(0) << (l))".
I used to think that GENMASK() is a compile-time macro. __GENMASK() is
not, but it has very limited usage through the kernel, all in the uapi.
> Refactor the calculation so the number of arithmetic instruction will be
> reduced from 3 to 1.
I'd like to look at it! Please share disassembly.
> Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
> ---
> Test is done to show the speed-up we can get from reducing the number of
> instruction. The test machine runs with 6.9.0-0-generic kernel on x86_64
> architecture with processor AMD Ryzen 7 5700X3D 8-Core Processor.
So you CC arm maintainers and provide tests against x86_64. Are your
improvements consistent for arm, power and other arches?
Can you run bloat-o-meter against your change?
> The test executes in the form of kernel module.
>
> Test result:
> [451675.644026] new __GENMASK() : 5985416
> [451675.644030] old __GENMASK() : 6006406
The difference is 0.35%? That's impressive!
Can you please run your test multiple times and print those numbers
together with confidence interval?
> Test script snippet:
> /* switch to __BITS_PER_LONG_LONG when testing __GENMASK_ULL() */
> \#define __GENMASK_NEW(h, l) \
> ((~_UL(0) << (l)) & \
> (~_UL(0) >> (__BITS_PER_LONG - 1 - (h))))
>
> int init_module(void)
> {
> ktime_t start, end, total1 = 0, total2 = 0;
> for (int k = 0; k < 100; k++) {
> for (int i = 0; i < __BITS_PER_LONG; i++) {
> for (int j = 0; j <= i; j++) {
> unsigned result1, result2;
>
> start = ktime_get();
> result1 = __GENMASK_NEW(i, j);
> end = ktime_get();
> total1 += (end - start);
Nah, this is wrong. I feel like you measure performance of
ktime_get() rather than GENMASK().
Do it this way:
start = ktime_get();
for (...) {
__always_used result = GENMASK();
}
time = ktime_get() - start;
>
> start = ktime_get();
> result2 = __GENMASK(i, j);
Please test GENMASK(), not __GENMASK().
> end = ktime_get();
> total2 += (end - start);
>
> if (result1 != result2) {
> pr_info("Wrong calculation of GENMASK_NEW()\n");
> return 0;
> }
For functional testing we have lib/test_bits.c, lib/test_bitmap.c and
others. I expect you'll run them all to check correctness. Once you do
that, you don't need to test correctness here.
> }
> }
> }
>
> pr_info("new __GENMASK() : %lld\n", total1);
> pr_info("old __GENMASK() : %lld\n", total2);
>
> return 0;
> }
Please incorporate this in one of existing tests and prepend the
series with it. That way you'll let the others to run before/after for
your series.
> ---
> include/uapi/linux/bits.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/bits.h b/include/uapi/linux/bits.h
> index 5ee30f882736..8fc7fea65288 100644
> --- a/include/uapi/linux/bits.h
> +++ b/include/uapi/linux/bits.h
> @@ -5,7 +5,7 @@
> #define _UAPI_LINUX_BITS_H
>
> #define __GENMASK(h, l) \
> - (((~_UL(0)) - (_UL(1) << (l)) + 1) & \
> + ((~_UL(0) << (l)) & \
> (~_UL(0) >> (__BITS_PER_LONG - 1 - (h))))
Now it would fit a single line, right?
If it works, this is a nice simplification, even though it's only a
compile time thing.
Please address all the comments above and run low-level tests I
mentioned above, so that we'll make sure it has no side effects.
Thanks,
Yury
>
> #define __GENMASK_ULL(h, l) \
> --
> 2.43.0
On Tue, Feb 11, 2025 at 01:39:32PM -0500, Yury Norov wrote:
> On Wed, Feb 12, 2025 at 12:24:11AM +0800, I Hsin Cheng wrote:
> > The calculation of "((~_UL(0)) - (_UL(1) << (l)) + 1)" is to generate a
> > bitmask with "l" trailing zeroes, which is equivalent to
> > "(~_UL(0) << (l))".
>
> I used to think that GENMASK() is a compile-time macro. __GENMASK() is
> not, but it has very limited usage through the kernel, all in the uapi.
>
> > Refactor the calculation so the number of arithmetic instruction will be
> > reduced from 3 to 1.
>
> I'd like to look at it! Please share disassembly.
>
> > Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
> > ---
> > Test is done to show the speed-up we can get from reducing the number of
> > instruction. The test machine runs with 6.9.0-0-generic kernel on x86_64
> > architecture with processor AMD Ryzen 7 5700X3D 8-Core Processor.
>
> So you CC arm maintainers and provide tests against x86_64. Are your
> improvements consistent for arm, power and other arches?
>
> Can you run bloat-o-meter against your change?
>
> > The test executes in the form of kernel module.
> >
> > Test result:
> > [451675.644026] new __GENMASK() : 5985416
> > [451675.644030] old __GENMASK() : 6006406
>
> The difference is 0.35%? That's impressive!
>
> Can you please run your test multiple times and print those numbers
> together with confidence interval?
>
> > Test script snippet:
> > /* switch to __BITS_PER_LONG_LONG when testing __GENMASK_ULL() */
> > \#define __GENMASK_NEW(h, l) \
> > ((~_UL(0) << (l)) & \
> > (~_UL(0) >> (__BITS_PER_LONG - 1 - (h))))
> >
> > int init_module(void)
> > {
> > ktime_t start, end, total1 = 0, total2 = 0;
> > for (int k = 0; k < 100; k++) {
> > for (int i = 0; i < __BITS_PER_LONG; i++) {
> > for (int j = 0; j <= i; j++) {
> > unsigned result1, result2;
> >
> > start = ktime_get();
> > result1 = __GENMASK_NEW(i, j);
> > end = ktime_get();
> > total1 += (end - start);
>
> Nah, this is wrong. I feel like you measure performance of
> ktime_get() rather than GENMASK().
>
> Do it this way:
>
> start = ktime_get();
> for (...) {
> __always_used result = GENMASK();
> }
> time = ktime_get() - start;
>
>
> >
> > start = ktime_get();
> > result2 = __GENMASK(i, j);
>
> Please test GENMASK(), not __GENMASK().
>
> > end = ktime_get();
> > total2 += (end - start);
> >
> > if (result1 != result2) {
> > pr_info("Wrong calculation of GENMASK_NEW()\n");
> > return 0;
> > }
>
> For functional testing we have lib/test_bits.c, lib/test_bitmap.c and
> others. I expect you'll run them all to check correctness. Once you do
> that, you don't need to test correctness here.
>
> > }
> > }
> > }
> >
> > pr_info("new __GENMASK() : %lld\n", total1);
> > pr_info("old __GENMASK() : %lld\n", total2);
> >
> > return 0;
> > }
>
> Please incorporate this in one of existing tests and prepend the
> series with it. That way you'll let the others to run before/after for
> your series.
>
> > ---
> > include/uapi/linux/bits.h | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/include/uapi/linux/bits.h b/include/uapi/linux/bits.h
> > index 5ee30f882736..8fc7fea65288 100644
> > --- a/include/uapi/linux/bits.h
> > +++ b/include/uapi/linux/bits.h
> > @@ -5,7 +5,7 @@
> > #define _UAPI_LINUX_BITS_H
> >
> > #define __GENMASK(h, l) \
> > - (((~_UL(0)) - (_UL(1) << (l)) + 1) & \
> > + ((~_UL(0) << (l)) & \
> > (~_UL(0) >> (__BITS_PER_LONG - 1 - (h))))
>
> Now it would fit a single line, right?
>
> If it works, this is a nice simplification, even though it's only a
> compile time thing.
>
> Please address all the comments above and run low-level tests I
> mentioned above, so that we'll make sure it has no side effects.
>
> Thanks,
> Yury
>
> >
> > #define __GENMASK_ULL(h, l) \
> > --
> > 2.43.0
Hi Yury,
Thanks for your patience and detailed review!
> > Refactor the calculation so the number of arithmetic instruction will be
> > reduced from 3 to 1.
>
> I'd like to look at it! Please share disassembly.
What I am talking about here is current version has ("-", "<<", "+") as
operators and after the change there'll be only ("<<").
Though I don't think these 2 macro will have any differences in
assembly, because they're all evaluated to a number at compile time, I
can still test to see the result.
I wrote a function for simplicity and check the disassembly using objdump -d
void test_func() {
int a = 10, b = 2;
unsigned n = __GENMASK(a, b);
}
$objdump -d genmask_old
...
0000000000001129 <test_func>:
1129: f3 0f 1e fa endbr64
112d: 55 push %rbp
112e: 48 89 e5 mov %rsp,%rbp
1131: c7 45 f4 0a 00 00 00 movl $0xa,-0xc(%rbp)
1138: c7 45 f8 02 00 00 00 movl $0x2,-0x8(%rbp)
113f: 8b 45 f8 mov -0x8(%rbp),%eax
1142: ba 01 00 00 00 mov $0x1,%edx
1147: 89 c1 mov %eax,%ecx
1149: 48 d3 e2 shl %cl,%rdx
114c: 48 89 d0 mov %rdx,%rax
114f: f7 d8 neg %eax
1151: 89 c2 mov %eax,%edx
1153: b8 3f 00 00 00 mov $0x3f,%eax
1158: 2b 45 f4 sub -0xc(%rbp),%eax
115b: 48 c7 c6 ff ff ff ff mov $0xffffffffffffffff,%rsi
1162: 89 c1 mov %eax,%ecx
1164: 48 d3 ee shr %cl,%rsi
1167: 48 89 f0 mov %rsi,%rax
116a: 21 d0 and %edx,%eax
116c: 89 45 fc mov %eax,-0x4(%rbp)
116f: 90 nop
1170: 5d pop %rbp
1171: c3 ret
...
$objdump -d genmask_new
...
0000000000001129 <test_func>:
1129: f3 0f 1e fa endbr64
112d: 55 push %rbp
112e: 48 89 e5 mov %rsp,%rbp
1131: c7 45 f4 0a 00 00 00 movl $0xa,-0xc(%rbp)
1138: c7 45 f8 02 00 00 00 movl $0x2,-0x8(%rbp)
113f: 8b 45 f8 mov -0x8(%rbp),%eax
1142: 48 c7 c2 ff ff ff ff mov $0xffffffffffffffff,%rdx
1149: 89 c1 mov %eax,%ecx
114b: 48 d3 e2 shl %cl,%rdx
114e: 48 89 d0 mov %rdx,%rax
1151: 89 c2 mov %eax,%edx
1153: b8 3f 00 00 00 mov $0x3f,%eax
1158: 2b 45 f4 sub -0xc(%rbp),%eax
115b: 48 c7 c6 ff ff ff ff mov $0xffffffffffffffff,%rsi
1162: 89 c1 mov %eax,%ecx
1164: 48 d3 ee shr %cl,%rsi
1167: 48 89 f0 mov %rsi,%rax
116a: 21 d0 and %edx,%eax
116c: 89 45 fc mov %eax,-0x4(%rbp)
116f: 90 nop
1170: 5d pop %rbp
1171: c3 ret
...
> So you CC arm maintainers and provide tests against x86_64. Are your
> improvements consistent for arm, power and other arches?
I don't have available arm machines or others now, let me try to get them
and test.
> Nah, this is wrong. I feel like you measure performance of
> ktime_get() rather than GENMASK().
>
> Do it this way:
No problem, I rewrite the test like the following
start = ktime_get();
for (int k = 0; k < 1000; k++) {
for (int i = 0; i < __BITS_PER_LONG_LONG; i++) {
for (int j = 0; j <= i; j++) {
unsigned result1;
result1 = GENMASK(i, j);
}
}
}
end = ktime_get();
And here the result after running old version of GENMASK() and new
version of GENMASK(). Each running for 5 times.
[530775.667879] old GENMASK() : 60
[530811.687541] old GENMASK() : 60
[530824.527262] old GENMASK() : 50
[530826.583092] old GENMASK() : 40
[530831.487851] old GENMASK() : 50
[530893.618084] new GENMASK() : 50
[530897.734948] new GENMASK() : 50
[530900.253968] new GENMASK() : 50
[530902.415798] new GENMASK() : 40
[530904.136347] new GENMASK() : 50
On avgerage, old version of GENMASK() cost 52 time unit, new version of
GENMASK() costs 48 time unit.
> Please incorporate this in one of existing tests and prepend the
> series with it. That way you'll let the others to run before/after for
> your series.
> Now it would fit a single line, right?
>
> If it works, this is a nice simplification, even though it's only a
> compile time thing.
>
> Please address all the comments above and run low-level tests I
> mentioned above, so that we'll make sure it has no side effects.
Sure thing, I'll append the corresponding test into existing tests and
perform all the test again, though for different machine architecture
parts could be hard for me, I only have x86 machine now, I think I can
get an arm machine, but can't sure for other architectures.
I'll try to prepare everything as possible as I can and send v2. Thanks.
Best regards,
I Hsin Cheng.
On Tue, Feb 11, 2025 at 01:39:34PM -0500, Yury Norov wrote: > On Wed, Feb 12, 2025 at 12:24:11AM +0800, I Hsin Cheng wrote: > > The calculation of "((~_UL(0)) - (_UL(1) << (l)) + 1)" is to generate a > > bitmask with "l" trailing zeroes, which is equivalent to > > "(~_UL(0) << (l))". > > I used to think that GENMASK() is a compile-time macro. __GENMASK() is > not, but it has very limited usage through the kernel, all in the uapi. > > > Refactor the calculation so the number of arithmetic instruction will be > > reduced from 3 to 1. > > I'd like to look at it! Please share disassembly. > > > Signed-off-by: I Hsin Cheng <richard120310@gmail.com> > > --- > > Test is done to show the speed-up we can get from reducing the number of > > instruction. The test machine runs with 6.9.0-0-generic kernel on x86_64 > > architecture with processor AMD Ryzen 7 5700X3D 8-Core Processor. > > So you CC arm maintainers and provide tests against x86_64. Are your > improvements consistent for arm, power and other arches? > > Can you run bloat-o-meter against your change? Ah, sorry, overlooked you bloat-o-meter results in cover letter. Anyways, can you provide it for each patch individually?
On Tue, Feb 11, 2025 at 01:44:18PM -0500, Yury Norov wrote: > On Tue, Feb 11, 2025 at 01:39:34PM -0500, Yury Norov wrote: > > On Wed, Feb 12, 2025 at 12:24:11AM +0800, I Hsin Cheng wrote: > > > The calculation of "((~_UL(0)) - (_UL(1) << (l)) + 1)" is to generate a > > > bitmask with "l" trailing zeroes, which is equivalent to > > > "(~_UL(0) << (l))". > > > > I used to think that GENMASK() is a compile-time macro. __GENMASK() is > > not, but it has very limited usage through the kernel, all in the uapi. > > > > > Refactor the calculation so the number of arithmetic instruction will be > > > reduced from 3 to 1. > > > > I'd like to look at it! Please share disassembly. > > > > > Signed-off-by: I Hsin Cheng <richard120310@gmail.com> > > > --- > > > Test is done to show the speed-up we can get from reducing the number of > > > instruction. The test machine runs with 6.9.0-0-generic kernel on x86_64 > > > architecture with processor AMD Ryzen 7 5700X3D 8-Core Processor. > > > > So you CC arm maintainers and provide tests against x86_64. Are your > > improvements consistent for arm, power and other arches? > > > > Can you run bloat-o-meter against your change? > > Ah, sorry, overlooked you bloat-o-meter results in cover letter. > Anyways, can you provide it for each patch individually? Oh ok, let me paste them here first, I'll attach them along with v2 as well. In the section below, vmlinux_old uses old version of GENMASK() and GENMASK_ULL(), vmlinux_p1 use new version of GENMASK() and old version of GENMASK_ULL(), vmlinux_p2 use new version of GENMASK() and new version of GENMASK_ULL(). $ ./scripts/bloat-o-meter vmlinux_old vmlinux_p1 add/remove: 0/2 grow/shrink: 46/505 up/down: 464/-1717 (-1253) Function old new delta store_min_perf_pct 275 368 +93 store_hwp_dynamic_boost 138 225 +87 store_max_perf_pct 253 336 +83 rcu_exp_wait_wake 2620 2651 +31 hybrid_update_cpu_capacity_scaling 311 332 +21 cpuset_mem_spread_node 283 297 +14 perf_assign_events 901 914 +13 xt_init 297 305 +8 partition_sched_domains_locked 1126 1134 +8 srcu_torture_stats_print 534 541 +7 sock_prot_inuse_get 121 128 +7 ring_buffer_empty 271 278 +7 numa_policy_init 602 609 +7 gnet_stats_add_basic 183 190 +7 rcu_init 2383 2389 +6 amd_uncore_ctx_move 286 292 +6 vmalloc_init 1058 1062 +4 reclaim_throttle 756 760 +4 kstat_irqs_desc 118 122 +4 hidinput_connect 20926 20930 +4 dev_change_proto_down_reason 164 168 +4 cgroup_propagate_control 403 407 +4 tracing_total_entries_read 346 349 +3 remove_siblinginfo 1038 1041 +3 perf_event_init 838 841 +3 qdisc_alloc 472 474 +2 probes_profile_seq_show 387 389 +2 clocksource_verify_percpu.part 1065 1067 +2 blk_mq_init_allocated_queue 976 978 +2 __lru_add_drain_all 502 504 +2 zone_reclaimable_pages 692 693 +1 tcp_orphan_count_sum 107 108 +1 sock_inuse_get 102 103 +1 sched_balance_rq 3745 3746 +1 proc_nr_inodes 173 174 +1 percpu_ref_switch_to_atomic_rcu 375 376 +1 nr_processes 102 103 +1 netdev_refcnt_read 102 103 +1 mq_dump_class_stats 234 235 +1 mnt_get_writers 96 97 +1 mnt_get_count 99 100 +1 get_nr_inodes 106 107 +1 get_nr_dirty_inodes 153 154 +1 allow_direct_reclaim 318 319 +1 alloc_fd 376 377 +1 __is_kernel_percpu_address 177 178 +1 zoneinfo_show 823 822 -1 zhaoxin_pmu_handle_irq 772 771 -1 zhaoxin_arch_events_quirk 137 136 -1 workqueue_init 927 926 -1 wake_up_all_idle_cpus 142 141 -1 uprobe_buffer_disable 165 164 -1 update_per_cpu_data_slice_size 216 215 -1 unregister_netdevice_many_notify 2299 2298 -1 uncore_pmu_hrtimer 261 260 -1 uncore_event_cpu_offline 344 343 -1 uncore_device_to_die 163 162 -1 tsc_store_and_check_tsc_adjust 507 506 -1 tsc_restore_sched_clock_state 223 222 -1 tsc_refine_calibration_work 938 937 -1 try_to_unmap_one 2473 2472 -1 try_to_migrate_one 2166 2165 -1 trace_total_entries 192 191 -1 trace_buffered_event_enable 347 346 -1 tmigr_cpu_offline 404 403 -1 timer_list_start 159 158 -1 tick_handle_oneshot_broadcast 380 379 -1 tick_clock_notify 109 108 -1 tick_broadcast_setup_oneshot 288 287 -1 thaw_secondary_cpus 424 423 -1 tcp_sigpool_alloc_ahash 885 884 -1 tc_fill_qdisc 1125 1124 -1 sysctl_vm_numa_stat_handler 456 455 -1 sync_runqueues_membarrier_state 319 318 -1 sw_perf_event_destroy 133 132 -1 svc_create_pooled 799 798 -1 sugov_stop 141 140 -1 store_no_turbo 641 640 -1 snmp_fold_field 113 112 -1 smp_call_function_any 261 260 -1 show_stat 2323 2322 -1 setup_swap_info 235 234 -1 sched_update_numa 424 423 -1 sbitmap_init_node 498 497 -1 rto_next_cpu 137 136 -1 rtm_new_nexthop 5249 5248 -1 rt6_disable_ip 716 715 -1 relay_reset 198 197 -1 relay_flush 198 197 -1 relay_close 358 357 -1 reclaim_and_purge_vmap_areas 450 449 -1 rdmacg_resource_set_max 892 891 -1 rcu_tasks_trace_pregp_step 1143 1142 -1 rcu_spawn_gp_kthread 645 644 -1 rcu_spawn_exp_par_gp_kworker 387 386 -1 rcu_report_exp_cpu_mult 175 174 -1 rcu_barrier_tasks_generic 437 436 -1 rcu_barrier 874 873 -1 pull_dl_task 845 844 -1 pti_init 554 553 -1 print_ICs 299 298 -1 pfifo_fast_reset 349 348 -1 perf_trace_event_init 728 727 -1 perf_reboot 106 105 -1 perf_event_print_debug 1227 1226 -1 percpu_counter_set 127 126 -1 pcpu_setup_first_chunk 2020 2019 -1 pcpu_depopulate_chunk 412 411 -1 pcpu_alloc_noprof 1948 1947 -1 p4_pmu_handle_irq 631 630 -1 od_set_powersave_bias 216 215 -1 node_dev_init 189 188 -1 nh_fill_node 2439 2438 -1 nfs_show_stats 1147 1146 -1 neightbl_fill_info.constprop 1064 1063 -1 native_stop_other_cpus 455 454 -1 mm_init.isra 898 897 -1 memory_tier_late_init 1762 1761 -1 mem_init 520 519 -1 kswapd_init 110 109 -1 kmem_cache_init 369 368 -1 kcompactd_init 242 241 -1 irq_alloc_matrix 247 246 -1 ipv6_add_dev 1303 1302 -1 ipv4_mib_init_net 509 508 -1 ip_rt_init 589 588 -1 ip6_route_init 563 562 -1 iova_domain_init_rcaches 458 457 -1 iommu_put_dma_cookie 1044 1043 -1 ioc_timer_fn 4908 4907 -1 intel_pstate_driver_cleanup 223 222 -1 intel_gt_mcr_init 886 885 -1 intel_get_event_constraints 846 845 -1 intel_arch_events_quirk 137 136 -1 input_leds_connect 720 719 -1 init_sched_rt_class 101 100 -1 init_sched_dl_class 101 100 -1 init_gi_nodes 132 131 -1 inet6_net_init 644 643 -1 i915_pmu_cpu_offline 230 229 -1 hybrid_set_capacity_of_cpus 137 136 -1 housekeeping_init 124 123 -1 group_cpus_evenly 439 438 -1 gnet_stats_add_queue 213 212 -1 get_total_entries 199 198 -1 get_nohz_timer_target 317 316 -1 gen11_gt_irq_handler 924 923 -1 free_acpi_perf_data 75 74 -1 fpu__init_system_xstate 3179 3178 -1 fold_vm_zone_numa_events 232 231 -1 flow_limit_cpu_sysctl 525 524 -1 filelock_init 396 395 -1 ext4_mb_new_blocks 4028 4027 -1 elf_coredump_extra_notes_size 107 106 -1 ehci_hrtimer_func 233 232 -1 dst_cache_reset_now 163 162 -1 do_migrate_pages 807 806 -1 dm_stats_init 195 194 -1 dl_bw_cpus 136 135 -1 disable_freq_invariance_workfn 118 117 -1 default_send_IPI_mask_allbutself_phys 287 286 -1 crash_prepare_elf64_headers 566 565 -1 cpupri_init 182 181 -1 cpupri_find_fitness 416 415 -1 cpuhp_sysfs_init 265 264 -1 cpuhp_smt_disable 213 212 -1 cpuhp_rollback_install 186 185 -1 cpufreq_set_policy 801 800 -1 cpufreq_policy_free 385 384 -1 cpufreq_online 2578 2577 -1 cpufreq_notify_transition 281 280 -1 cpufreq_driver_fast_switch 207 206 -1 cpufreq_dbs_governor_start 443 442 -1 cpufreq_dbs_governor_init 681 680 -1 cpudl_init 159 158 -1 cpuacct_stats_show 380 379 -1 cpu_map_shared_cache 260 259 -1 cpu_dev_init 228 227 -1 compaction_zonelist_suitable 329 328 -1 clock_was_set 561 560 -1 cgroup_subtree_control_write 1008 1007 -1 cgroup_rstat_init 168 167 -1 cgroup_rstat_boot 105 104 -1 cgroup_post_fork 634 633 -1 cgroup_base_stat_cputime_show 760 759 -1 can_migrate_task 519 518 -1 call_function_init 171 170 -1 calibrate_delay_is_known 237 236 -1 build_all_zonelists_init 277 276 -1 bpf_prog_alloc 166 165 -1 blkg_alloc 481 480 -1 blkcg_print_stat 880 879 -1 blkcg_iolatency_done_bio 1775 1774 -1 blkcg_css_alloc 568 567 -1 blk_stat_timer_fn 378 377 -1 blk_stat_add_callback 244 243 -1 blk_mq_sysfs_deinit 123 122 -1 blk_mq_map_hw_queues 175 174 -1 blk_mq_init 358 357 -1 blk_iocost_init 620 619 -1 begin_new_exec 2857 2856 -1 asym_cpu_capacity_scan 642 641 -1 amd_uncore_umc_ctx_init 871 870 -1 amd_pstate_change_mode_without_dvr_change 133 132 -1 amd_pmu_v2_handle_irq 1053 1052 -1 alloc_iova_fast 680 679 -1 alloc_desc 492 491 -1 addrconf_disable_policy_idev 322 321 -1 add_to_avail_list 310 309 -1 acpi_cpc_valid 130 129 -1 _vm_unmap_aliases 646 645 -1 __snmp6_fill_stats64.constprop 282 281 -1 __percpu_ref_switch_mode 543 542 -1 __i915_gem_object_set_pages 547 546 -1 __ftrace_clear_event_pids 593 592 -1 __drain_swap_slots_cache 90 89 -1 __drain_all_pages 449 448 -1 __copy_xstate_to_uabi_buf 2416 2415 -1 __alloc_frozen_pages_noprof 3930 3929 -1 ___gnet_stats_copy_basic.isra 387 386 -1 zone_pcp_enable 127 125 -2 xhci_bus_resume 1011 1009 -2 xfrm_input_init 230 228 -2 xfeature_get_offset 137 135 -2 x86_release_hardware 322 320 -2 workqueue_online_cpu 750 748 -2 workqueue_offline_cpu 550 548 -2 weighted_interleave_nodes 331 329 -2 vmalloc_info_show 1000 998 -2 update_or_create_fnhe 910 908 -2 update_group_capacity 556 554 -2 tsc_init 882 880 -2 try_to_generate_entropy 629 627 -2 tracing_set_cpumask 284 282 -2 tracing_release 336 334 -2 tracing_entries_read 453 451 -2 trace_rb_cpu_prepare 213 211 -2 toggle_bp_slot.constprop 2861 2859 -2 timekeeping_get_mg_floor_swaps 111 109 -2 tcf_idr_create 639 637 -2 task_non_contending 837 835 -2 task_mm_cid_work 498 496 -2 srcu_readers_active 128 126 -2 srcu_barrier 515 513 -2 softirq_init 150 148 -2 snmp_seq_show_ipstats.isra 374 372 -2 snmp6_seq_show_item64.constprop 264 262 -2 snapshot_write_next 2486 2484 -2 snapshot_read_next 626 624 -2 smp_prepare_cpus_common 339 337 -2 skx_pmu_get_topology 313 311 -2 show_slab_objects 1071 1069 -2 show_numa_map 980 978 -2 show_interrupts 817 815 -2 shmem_cppc_enable 225 223 -2 setup_zone_pageset 327 325 -2 setup_per_cpu_pageset 266 264 -2 set_cpus_allowed_dl 276 274 -2 set_cpu_sibling_map 1693 1691 -2 set_buffer_entries 93 91 -2 seq_hlist_start_percpu 131 129 -2 seq_hlist_next_percpu 202 200 -2 send_pcc_cmd 656 654 -2 select_task_rq_fair 4420 4418 -2 scsi_show_rq 800 798 -2 schedule_on_each_cpu 335 333 -2 s_start 865 863 -2 ring_buffer_subbuf_order_set 1200 1198 -2 ring_buffer_overruns 107 105 -2 ring_buffer_entries 124 122 -2 recalc_bh_state.part 122 120 -2 rcu_gp_cleanup 1556 1554 -2 rcu_check_boost_fail 467 465 -2 rcec_assoc_rciep.isra 116 114 -2 pt_init 981 979 -2 process_srcu 1676 1674 -2 phys_package_first_cpu 122 120 -2 perf_output_sample_regs 181 179 -2 perf_event_mux_interval_ms_store 365 363 -2 percpu_is_read_locked 126 124 -2 per_cpu_ptr_to_phys 266 264 -2 part_stat_read_all 193 191 -2 part_inflight_show 308 306 -2 part_in_flight 133 131 -2 padata_do_serial 237 235 -2 padata_do_multithreaded 877 875 -2 padata_alloc_pd 482 480 -2 p4_pmu_enable_all 132 130 -2 numa_nearest_node 210 208 -2 nr_iowait 105 103 -2 nr_context_switches 112 110 -2 node_read_distance 234 232 -2 netif_get_num_default_rss_queues 157 155 -2 native_smp_cpus_done 657 655 -2 metadata_dst_alloc_percpu 370 368 -2 memcpy_count_show 176 174 -2 membarrier_global_expedited 330 328 -2 loopback_get_stats64 145 143 -2 kyber_timer_fn 599 597 -2 kvm_flush_tlb_multi 149 147 -2 kfree_rcu_shrink_count 157 155 -2 kcore_update_ram.isra 783 781 -2 iommu_dma_init_fq 450 448 -2 iolatency_pd_init 495 493 -2 interleave_nodes 166 164 -2 interleave_nid 217 215 -2 intel_pmu_init 14167 14165 -2 intel_pmu_drain_pebs_icl 866 864 -2 intel_gt_init_workarounds 4139 4137 -2 input_register_device 1348 1346 -2 init_sched_fair_class 213 211 -2 init_cpu_to_node 284 282 -2 inactive_task_timer 967 965 -2 ieee80211_sta_last_active 147 145 -2 ieee80211_rx_mgmt_beacon 5755 5753 -2 ieee80211_add_rx_radiotap_header 2267 2265 -2 idle_threads_init 181 179 -2 icl_display_core_init 2347 2345 -2 hw_breakpoint_is_used 210 208 -2 gro_cells_destroy 342 340 -2 ftrace_dump_one 731 729 -2 free_policy_dbs_info 124 122 -2 free_iova_rcaches 285 283 -2 free_area_init 4062 4060 -2 fq_flush_timeout 234 232 -2 flush_tlb_mm_range 427 425 -2 ext4_update_super 942 940 -2 ext4_mb_init 1965 1963 -2 ext4_journal_commit_callback 470 468 -2 ext4_fill_super 12856 12854 -2 drv_change_sta_links 621 619 -2 drop_slab 223 221 -2 do_ipt_set_ctl 1565 1563 -2 do_ip6t_set_ctl 1589 1587 -2 dma_channel_table_init 279 277 -2 dm_wait_for_completion 409 407 -2 dm_stats_message 4382 4380 -2 dl_cpuset_cpumask_can_shrink 268 266 -2 dl_clear_root_domain 176 174 -2 dl_add_task_root_domain 337 335 -2 dev_lstats_read 137 135 -2 dev_fetch_sw_netstats 179 177 -2 detect_cache_attributes 795 793 -2 del_gendisk 849 847 -2 default_hugepagesz_setup 454 452 -2 cpuusage_write 199 197 -2 cpuusage_user_read 120 118 -2 cpuusage_sys_read 125 123 -2 cpuusage_read 111 109 -2 cpuidle_driver_state_disabled 234 232 -2 cpufreq_show_cpus 172 170 -2 cpuacct_all_seq_show 292 290 -2 cpu_stop_init 178 176 -2 cpu_rmap_copy_neigh 140 138 -2 cpu_init_debugfs 277 275 -2 cppc_allow_fast_switch 133 131 -2 core_pmu_enable_all 399 397 -2 cleanup_srcu_struct 424 422 -2 cgroup_rstat_exit 191 189 -2 cgroup_release 302 300 -2 cgroup_exit 416 414 -2 cgroup_can_fork 1282 1280 -2 bytes_transferred_show 177 175 -2 blkcg_reset_stats 481 479 -2 blk_mq_update_queue_map 192 190 -2 blk_mq_sysfs_init 158 156 -2 blk_mq_hw_sysfs_cpus_show 289 287 -2 arch_tlbbatch_flush 248 246 -2 arch_enable_hybrid_capacity_scale 241 239 -2 apply_wqattrs_prepare 502 500 -2 apply_wqattrs_commit 408 406 -2 amd_uncore_ctx_init 516 514 -2 amd_pmu_enable_all 132 130 -2 amd_pmu_cpu_prepare 340 338 -2 amd_pmu_check_overflow 177 175 -2 amd_numa_init 901 899 -2 amd_detect_prefcore 306 304 -2 alloc_pages_bulk_mempolicy_noprof 1530 1528 -2 alloc_cpu_rmap 200 198 -2 all_vm_events 180 178 -2 ahci_reset_controller 301 299 -2 add_del_listener 565 563 -2 acpi_processor_power_state_has_changed 418 416 -2 acpi_map_cpuid 147 145 -2 acpi_get_psd_map 367 365 -2 acpi_cpufreq_probe 490 488 -2 _ieee80211_set_active_links 1341 1339 -2 __zone_set_pageset_high_and_batch 104 102 -2 __sync_rcu_exp_select_node_cpus 954 952 -2 __sched_clock_work 261 259 -2 __percpu_counter_sum 136 134 -2 __dm_stat_init_temporary_percpu_totals 443 441 -2 slabs_cpu_partial_show 326 323 -3 show_rcu_tasks_generic_gp_kthread 473 470 -3 show_free_areas 2871 2868 -3 sched_balance_trigger 923 920 -3 ring_buffer_reset_online_cpus 323 320 -3 relay_open 667 664 -3 register_netdevice 2022 2019 -3 refresh_zone_stat_thresholds 372 369 -3 rebind_subsystems 1264 1261 -3 page_alloc_init_late 754 751 -3 out_of_memory 1780 1777 -3 mgts_show 328 325 -3 init_mm_internals 689 686 -3 ieee80211_subif_start_xmit 1121 1118 -3 hugetlb_init 1602 1599 -3 hugetlb_acct_memory.part 1112 1109 -3 dm_stat_free 247 244 -3 crypto_scomp_init_tfm 261 258 -3 blk_mq_map_queues 213 210 -3 blk_mq_hw_queue_to_node 123 120 -3 amd_pmu_cpu_starting 379 376 -3 update_qos_request 250 246 -4 tracing_open 955 951 -4 snmp6_seq_show_item 369 365 -4 shrink_dcache_sb 327 323 -4 show_rcu_gp_kthreads 834 830 -4 set_pgdat_percpu_threshold 182 178 -4 select_fallback_rq 718 714 -4 sched_dl_do_global 347 343 -4 sbitmap_queue_show 405 401 -4 ring_buffer_reset 230 226 -4 rcu_tasks_kthread 191 187 -4 probe_event_enable 862 858 -4 perf_swevent_init 492 488 -4 pcpu_populate_chunk 1017 1013 -4 netstat_seq_show 717 713 -4 mce_end 651 647 -4 intel_psr_invalidate 667 663 -4 intel_pmu_cpu_starting 1706 1702 -4 intel_dp_compute_link_config 1317 1313 -4 ieee80211_vif_update_links 2562 2558 -4 hpet_open 491 487 -4 housekeeping_setup 617 613 -4 ext4_attr_show 1006 1002 -4 elf_coredump_extra_notes_write 453 449 -4 apply_wqattrs_cleanup.part 233 229 -4 zone_pcp_reset 178 173 -5 xt_free_table_info 134 129 -5 x86_pmu_enable_all 377 372 -5 wq_affn_dfl_set 253 248 -5 workqueue_init_topology 271 266 -5 try_check_zero 438 433 -5 swapfile_init 220 215 -5 start_kernel 1996 1991 -5 sta_get_last_rx_stats 124 119 -5 smpboot_register_percpu_thread 233 228 -5 ring_buffer_free 145 140 -5 release_callchain_buffers_rcu 111 106 -5 rcu_dump_cpu_stacks 433 428 -5 padata_do_parallel 532 527 -5 load_module 9704 9699 -5 kvm_alloc_cpumask 177 172 -5 kvfree_rcu_init 571 566 -5 gro_cells_init 257 252 -5 free_fair_sched_group 162 157 -5 dst_cache_destroy 141 136 -5 cpuhp_smt_enable 243 238 -5 cgroup_print_ss_mask 182 177 -5 alloc_buffer 1345 1340 -5 __list_lru_init 179 174 -5 __group_cpus_evenly 1254 1249 -5 __fib6_drop_pcpu_from.part 209 204 -5 x86_pmu_disable_all 411 405 -6 trace_empty 233 227 -6 tcp_v4_init 273 267 -6 svc_seq_show 328 322 -6 snmp_seq_show_tcp_udp.isra 942 936 -6 schedstat_next 131 125 -6 ring_buffer_wake_waiters 191 185 -6 perf_event_exit_cpu_context 600 594 -6 percpu_down_write 400 394 -6 pcpu_free_pages.isra 169 163 -6 pcpu_build_alloc_info 1307 1301 -6 packet_read_pending.isra 105 99 -6 nv_get_stats64 402 396 -6 net_dev_init 795 789 -6 kvm_smp_send_call_func_ipi 100 94 -6 irq_migrate_all_off_this_cpu 722 716 -6 intel_engines_init_mmio 3432 3426 -6 init_timers 170 164 -6 icmpv6_init 289 283 -6 dma_channel_rebalance 788 782 -6 dl_bw_manage 805 799 -6 cpufreq_dbs_governor_stop 142 136 -6 blk_mq_map_swqueue 1057 1051 -6 acpi_processor_throttling_init 788 782 -6 __percpu_counter_limited_add 481 475 -6 __do_sys_swapon 4260 4254 -6 __cpuacct_percpu_seq_show 163 157 -6 wq_update_node_max_active 529 522 -7 sysrq_timer_list_show 277 270 -7 smpboot_destroy_threads 150 143 -7 smp_call_function_many_cond 1320 1313 -7 setup_cpu_entry_areas 932 925 -7 regmap_field_init 122 115 -7 mst_stream_compute_config 1659 1652 -7 free_node_nr_active 165 158 -7 dl_server_apply_params 972 965 -7 cpu_down_maps_locked 203 196 -7 alloc_workqueue 2078 2071 -7 __is_module_percpu_address 283 276 -7 sta_set_sinfo 3014 3006 -8 show_softirqs 334 326 -8 populate_cache_leaves 1389 1381 -8 is_mddev_idle 471 463 -8 irq_matrix_reserve_managed 350 342 -8 intel_psr_flush 982 974 -8 get_callchain_buffers 414 406 -8 find_next_best_node 275 267 -8 drv_change_vif_links 635 627 -8 core_guest_get_msrs 347 339 -8 check_hw_exists 1074 1066 -8 __do_sys_swapoff 3766 3758 -8 __dl_server_attach_root 238 230 -8 x86_reserve_hardware 680 671 -9 smp_shutdown_nonboot_cpus 198 189 -9 schedstat_start 125 116 -9 do_kmem_cache_create 1252 1243 -9 cache_shared_cpu_map_remove 340 331 -9 bioset_exit 381 372 -9 reserve_ds_buffers 1595 1585 -10 hugetlb_cgroup_free 127 117 -10 flush_all_cpus_locked 349 339 -10 cpu_rmap_update 573 563 -10 compaction_proactiveness_sysctl_handler 298 288 -10 ahci_save_initial_config 1170 1160 -10 __build_all_zonelists 229 219 -10 numa_set_distance 578 567 -11 init_pod_type 588 577 -11 del_from_avail_list 220 209 -11 weighted_interleave_nid 385 373 -12 sched_dl_overflow 1326 1314 -12 rcu_sched_clock_irq 4343 4331 -12 icmp_init 286 274 -12 hugetlb_cgroup_read_numa_stat 607 595 -12 acpi_processor_preregister_performance 1174 1162 -12 sysctl_compaction_handler 196 183 -13 release_ds_buffers 310 297 -13 mempolicy_sysfs_init 664 651 -13 intel_drrs_activate 392 379 -13 cgroup_migrate_execute 1151 1138 -13 sched_init_numa 1916 1902 -14 dev_get_stats 816 802 -14 __reserve_bp_slot 679 664 -15 dl_task_timer 1511 1495 -16 __pfx_intel_pstate_update_policies 16 - -16 numa_init 729 712 -17 __acpi_processor_set_throttling 1132 1112 -20 ring_buffer_resize 1319 1292 -27 cppc_perf_ctrs_in_pcc 309 276 -33 intel_pstate_update_policies 90 - -90 Total: Before=22438085, After=22436832, chg -0.01% $ ./scripts/bloat-o-meter vmlinux_p1 vmlinux_p2 add/remove: 0/0 grow/shrink: 0/5 up/down: 0/-16 (-16) Function old new delta cpc_write_ffh 167 166 -1 mce_read_aux 394 392 -2 __log_error 494 492 -2 cpc_read_ffh 89 85 -4 lpit_read_residency_counter_us 309 302 -7 Total: Before=22436832, After=22436816, chg -0.00% I think this is everything bloat-o-meter generated, it really contains many lines so I didn't paste everything in the cover letter, I'll paste them individually in v2. Best regards, I Hsin Cheng
+ Andrew, Matthias
On Wed, Feb 12, 2025 at 10:01:12PM +0800, I Hsin Cheng wrote:
> On Tue, Feb 11, 2025 at 01:44:18PM -0500, Yury Norov wrote:
> > On Tue, Feb 11, 2025 at 01:39:34PM -0500, Yury Norov wrote:
> > > On Wed, Feb 12, 2025 at 12:24:11AM +0800, I Hsin Cheng wrote:
> > > > The calculation of "((~_UL(0)) - (_UL(1) << (l)) + 1)" is to generate a
> > > > bitmask with "l" trailing zeroes, which is equivalent to
> > > > "(~_UL(0) << (l))".
> > >
> > > I used to think that GENMASK() is a compile-time macro. __GENMASK() is
> > > not, but it has very limited usage through the kernel, all in the uapi.
> > >
> > > > Refactor the calculation so the number of arithmetic instruction will be
> > > > reduced from 3 to 1.
> > >
> > > I'd like to look at it! Please share disassembly.
> > >
> > > > Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
> > > > ---
> > > > Test is done to show the speed-up we can get from reducing the number of
> > > > instruction. The test machine runs with 6.9.0-0-generic kernel on x86_64
> > > > architecture with processor AMD Ryzen 7 5700X3D 8-Core Processor.
> > >
> > > So you CC arm maintainers and provide tests against x86_64. Are your
> > > improvements consistent for arm, power and other arches?
> > >
> > > Can you run bloat-o-meter against your change?
> >
> > Ah, sorry, overlooked you bloat-o-meter results in cover letter.
> > Anyways, can you provide it for each patch individually?
>
> Oh ok, let me paste them here first, I'll attach them along with v2 as
> well.
>
> In the section below, vmlinux_old uses old version of GENMASK() and
> GENMASK_ULL(), vmlinux_p1 use new version of GENMASK() and old version
> of GENMASK_ULL(), vmlinux_p2 use new version of GENMASK() and new
> version of GENMASK_ULL().
>
> $ ./scripts/bloat-o-meter vmlinux_old vmlinux_p1
> add/remove: 0/2 grow/shrink: 46/505 up/down: 464/-1717 (-1253)
So, I ran the build myself and I confirm that reverting c32ee3d9abd284b4
(which this patch actually does) helps to save over 1k of .text. In my
case it's 1269 bytes on x86_64 + defconfig for GENMASK() only.
I looked at some functions affected by this revert, and I found that
they call for_each_*_cpu() macros. This eventually boils down to bitmap
functions like this:
static __always_inline
unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
unsigned long offset)
{
if (small_const_nbits(size)) {
unsigned long val;
if (unlikely(offset >= size))
return size;
val = *addr & GENMASK(size - 1, offset);
return val ? __ffs(val) : size;
}
return _find_next_bit(addr, size, offset);
}
The 'size' here is NR_CPUS, which on my machine is 64.
GENMASK is used with non-constant parameters, but it's OK because
from compiler's point of view, the GENMASK_INPUT_CHECK() which is
just:
BUILD_BUG_ON_ZERO(const_true((l) > (h)))
It is passed as the 'offset > size', and it's is indeed a constant
expression at that point because it's explicitly tested before.
Is this for_each_cpu() thing the only case - I don't know, but it's
enough to consider reverting back to the original version.
I Hsing,
At first, thank you for catching this up.
I think performance testing part is trivial here, so let's focus on
code generation and consistency.
Can you please run your build against few recent GCC and clang versions?
Can you also build the kernel for ARM64? You don't need to run it on
real hardware (but maybe on VM). I just need to make sure that the
result is consistent for all important arches.
Matthias didn't mention how did he build his kernel. Did clang emit
that warning against W=0, 1 or higher, and which code triggered the
warning? Maybe clang already fixed it?
Can you please check how would it work if NR_CPUS is set to be say 1024?
This way the small_const_nbits optimization will be disabled.
If you will find that clang still emits warnings at lower than W=2,
can you please resend this patches together with a patch that
suppresses clang warning?
Can you also please attach one or two examples of code generation for
real functions, not an artificial one as you did before. And maybe a
link to goldbolt?
For the next iteration can you please make sure that you refer your
series as a revert of Matthias's patch?
Thank you for discovering this. I realize that I'm asking you to do
some extra work, but we all need to be 100% sure that it's not a fluke
before reverting the c32ee3d9abd284b4 because it potentially leads to
suppressing another clang warning.
Thanks,
Yury
On Thu, Feb 13, 2025 at 02:54:27PM -0500, Yury Norov wrote:
> + Andrew, Matthias
>
> On Wed, Feb 12, 2025 at 10:01:12PM +0800, I Hsin Cheng wrote:
> > On Tue, Feb 11, 2025 at 01:44:18PM -0500, Yury Norov wrote:
> > > On Tue, Feb 11, 2025 at 01:39:34PM -0500, Yury Norov wrote:
> > > > On Wed, Feb 12, 2025 at 12:24:11AM +0800, I Hsin Cheng wrote:
> > > > > The calculation of "((~_UL(0)) - (_UL(1) << (l)) + 1)" is to generate a
> > > > > bitmask with "l" trailing zeroes, which is equivalent to
> > > > > "(~_UL(0) << (l))".
> > > >
> > > > I used to think that GENMASK() is a compile-time macro. __GENMASK() is
> > > > not, but it has very limited usage through the kernel, all in the uapi.
> > > >
> > > > > Refactor the calculation so the number of arithmetic instruction will be
> > > > > reduced from 3 to 1.
> > > >
> > > > I'd like to look at it! Please share disassembly.
> > > >
> > > > > Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
> > > > > ---
> > > > > Test is done to show the speed-up we can get from reducing the number of
> > > > > instruction. The test machine runs with 6.9.0-0-generic kernel on x86_64
> > > > > architecture with processor AMD Ryzen 7 5700X3D 8-Core Processor.
> > > >
> > > > So you CC arm maintainers and provide tests against x86_64. Are your
> > > > improvements consistent for arm, power and other arches?
> > > >
> > > > Can you run bloat-o-meter against your change?
> > >
> > > Ah, sorry, overlooked you bloat-o-meter results in cover letter.
> > > Anyways, can you provide it for each patch individually?
> >
> > Oh ok, let me paste them here first, I'll attach them along with v2 as
> > well.
> >
> > In the section below, vmlinux_old uses old version of GENMASK() and
> > GENMASK_ULL(), vmlinux_p1 use new version of GENMASK() and old version
> > of GENMASK_ULL(), vmlinux_p2 use new version of GENMASK() and new
> > version of GENMASK_ULL().
> >
> > $ ./scripts/bloat-o-meter vmlinux_old vmlinux_p1
> > add/remove: 0/2 grow/shrink: 46/505 up/down: 464/-1717 (-1253)
>
> So, I ran the build myself and I confirm that reverting c32ee3d9abd284b4
> (which this patch actually does) helps to save over 1k of .text. In my
> case it's 1269 bytes on x86_64 + defconfig for GENMASK() only.
>
> I looked at some functions affected by this revert, and I found that
> they call for_each_*_cpu() macros. This eventually boils down to bitmap
> functions like this:
>
> static __always_inline
> unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
> unsigned long offset)
> {
> if (small_const_nbits(size)) {
> unsigned long val;
>
> if (unlikely(offset >= size))
> return size;
>
> val = *addr & GENMASK(size - 1, offset);
> return val ? __ffs(val) : size;
> }
>
> return _find_next_bit(addr, size, offset);
> }
>
> The 'size' here is NR_CPUS, which on my machine is 64.
>
> GENMASK is used with non-constant parameters, but it's OK because
> from compiler's point of view, the GENMASK_INPUT_CHECK() which is
> just:
> BUILD_BUG_ON_ZERO(const_true((l) > (h)))
>
> It is passed as the 'offset > size', and it's is indeed a constant
> expression at that point because it's explicitly tested before.
>
> Is this for_each_cpu() thing the only case - I don't know, but it's
> enough to consider reverting back to the original version.
>
> I Hsing,
>
> At first, thank you for catching this up.
>
> I think performance testing part is trivial here, so let's focus on
> code generation and consistency.
>
> Can you please run your build against few recent GCC and clang versions?
> Can you also build the kernel for ARM64? You don't need to run it on
> real hardware (but maybe on VM). I just need to make sure that the
> result is consistent for all important arches.
>
> Matthias didn't mention how did he build his kernel. Did clang emit
> that warning against W=0, 1 or higher, and which code triggered the
> warning? Maybe clang already fixed it?
>
> Can you please check how would it work if NR_CPUS is set to be say 1024?
> This way the small_const_nbits optimization will be disabled.
>
> If you will find that clang still emits warnings at lower than W=2,
> can you please resend this patches together with a patch that
> suppresses clang warning?
>
> Can you also please attach one or two examples of code generation for
> real functions, not an artificial one as you did before. And maybe a
> link to goldbolt?
>
> For the next iteration can you please make sure that you refer your
> series as a revert of Matthias's patch?
>
> Thank you for discovering this. I realize that I'm asking you to do
> some extra work, but we all need to be 100% sure that it's not a fluke
> before reverting the c32ee3d9abd284b4 because it potentially leads to
> suppressing another clang warning.
>
> Thanks,
> Yury
Hi Yury,
No problem ! I'll be happy to help with these tests, I'll send the next
iteration when I complete the things you mentioned.
> Is this for_each_cpu() thing the only case - I don't know, but it's
> enough to consider reverting back to the original version.
So basically the reason of sizing up is because for_each_cpu() put
non-constant parameter in GENMASK(), which is supposed to be used by
constant only?
> Can you also please attach one or two examples of code generation for
> real functions, not an artificial one as you did before. And maybe a
> link to goldbolt?
I have no problem of other tests but this one, I wrote a simple
artificial use case because most functions I found according to the
report generated by bloat-o-meter, doesn't use GENMASK() directly or
they're super long and GENMASK() only accounts for very small part of
them, it wasn't very trivial to sense the difference of disassembly at
least for me. Should I just pick 1~2 random use cases? or do you have
any suggestions?
Btw, are you talking about Online Compiler Explorer? I don't really know
what goldbolt means here, sorry XD .
Best regards,
I Hsin Cheng
> No problem ! I'll be happy to help with these tests, I'll send the next
> iteration when I complete the things you mentioned.
>
> > Is this for_each_cpu() thing the only case - I don't know, but it's
> > enough to consider reverting back to the original version.
>
> So basically the reason of sizing up is because for_each_cpu() put
> non-constant parameter in GENMASK(), which is supposed to be used by
> constant only?
The (l - h) is supposed to be a const_true. In most cases both l and h
are compile-time constants. In that case GENMASK() is evaluated at
compile time, and it allows to complicate it to workaround clang
warning.
But now we obviously have an example where workarounding sacrifices
code generation. This is not OK.
> > Can you also please attach one or two examples of code generation for
> > real functions, not an artificial one as you did before. And maybe a
> > link to goldbolt?
>
> I have no problem of other tests but this one, I wrote a simple
> artificial use case because most functions I found according to the
> report generated by bloat-o-meter, doesn't use GENMASK() directly or
> they're super long and GENMASK() only accounts for very small part of
> them, it wasn't very trivial to sense the difference of disassembly at
> least for me. Should I just pick 1~2 random use cases? or do you have
> any suggestions?
In bloat-o-meter output, pick some small function. The function size
is listed in 'old' and correspondingly 'new' columns. To begin with,
pick some with small difference. The cpuusage_read is one good example:
cpuusage_read 111 109 -2
It boils down to __cpuusage_read(), which is:
static u64 __cpuusage_read(struct cgroup_subsys_state *css,
enum cpuacct_stat_index index)
{
struct cpuacct *ca = css_ca(css);
u64 totalcpuusage = 0;
int i;
for_each_possible_cpu(i)
totalcpuusage += cpuacct_cpuusage_read(ca, i, index);
return totalcpuusage;
}
Now disassemble it like:
objdump --disassemble=cpuusage_read ../build-linux-x86_64/vmlinux.orig > cpuusage_read.orig
objdump --disassemble=cpuusage_read ../build-linux-x86_64/vmlinux.new > cpuusage_read.new
And if you put the listings side-by-side, you will see:
Orig New
ja ffffffff812efdf0 <cpuusage_read+0x60> ja ffffffff812efd2e <cpuusage_read+0x5e>
mov %r8,%rdx mov %r8,%rdx
shl %cl,%rdx shl %cl,%rdx
neg %rdx and %rax,%rdx
and %rax,%rdx je ffffffff812efd2e <cpuusage_read+0x5e>
je ffffffff812efdf0 <cpuusage_read+0x60> tzcnt %rdx,%rdx
tzcnt %rdx,%rdx
OK, now we see that new GENMASK saves one negation. Great success!
Can you repeat this exercise for few other functions both with
positive and negative impact and make a conclusion about how exactly
code generation is impacted?
> Btw, are you talking about Online Compiler Explorer? I don't really know
> what goldbolt means here, sorry XD .
https://godbolt.org/
The idea is that you can cherry-pick all necessary ifdefery from
kernel headers and build a simple example by using all supported
compilers. The godbolt's list of compilers is really long. You
will not need to install any of them locally to just check how
they work.
I actually cherry-picked GENMASK() recently for the other discussion,
so you can start from that:
https://lore.kernel.org/all/20250206020227.GA322224@yaz-khff2.amd.com/T/#m27bed06e15809d7b632966a1e861767c72a8722a
Thanks,
Yury
© 2016 - 2026 Red Hat, Inc.