Some recent commits incorrectly assumed 4-byte alignment of locks.
That assumption fails on Linux/m68k (and, interestingly, would have
failed on Linux/cris also). Specify the minimum alignment of atomic
variables for fewer surprises and (hopefully) better performance.
On an m68k system with 14 MB of RAM, this patch reduces the available
memory by a couple of percent. On a 64 MB system, the cost is under 1%
but still significant. I don't know whether the performance gain is
enough to justify the memory cost; that trade-off still has to be measured.
Link: https://lore.kernel.org/lkml/CAMuHMdW7Ab13DdGs2acMQcix5ObJK0O2dG_Fxzr8_g58Rc1_0g@mail.gmail.com/
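To make the failure mode concrete, here is a minimal sketch (a
hypothetical struct, not code from this patch) assuming the m68k ABI,
where plain int is only 2-byte aligned:

#include <stddef.h>

/* Stand-in for the kernel's __aligned() helper from
 * include/linux/compiler_attributes.h.
 */
#define __aligned(x)	__attribute__((__aligned__(x)))

struct unpatched {
	short tag;
	int counter;	/* lands at offset 2 on m68k: not 4-byte aligned */
};

struct patched {
	short tag;
	int __aligned(sizeof(int)) counter;	/* offset 4 everywhere */
};

/* Natural alignment is restored at the cost of 2 bytes of padding per
 * instance, which is where the memory overhead mentioned above comes
 * from.
 */
_Static_assert(offsetof(struct patched, counter) == sizeof(int),
	       "counter is naturally aligned");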
---
Changed since v2:
- Specify natural alignment for atomic64_t.
Changed since v1:
- atomic64_t now gets an __aligned attribute too.
- The 'Fixes' tag has been dropped because Lance sent a different fix
for commit e711faaafbe5 ("hung_task: replace blocker_mutex with encoded
blocker") that's suitable for -stable.
---
 include/asm-generic/atomic64.h | 2 +-
 include/linux/types.h          | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/asm-generic/atomic64.h b/include/asm-generic/atomic64.h
index 100d24b02e52..f22ccfc0df98 100644
--- a/include/asm-generic/atomic64.h
+++ b/include/asm-generic/atomic64.h
@@ -10,7 +10,7 @@
 #include <linux/types.h>
 
 typedef struct {
-	s64 counter;
+	s64 __aligned(sizeof(s64)) counter;
 } atomic64_t;
 
 #define ATOMIC64_INIT(i)	{ (i) }
diff --git a/include/linux/types.h b/include/linux/types.h
index 6dfdb8e8e4c3..a225a518c2c3 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -179,7 +179,7 @@ typedef phys_addr_t resource_size_t;
 typedef unsigned long irq_hw_number_t;
 
 typedef struct {
-	int counter;
+	int __aligned(sizeof(int)) counter;
 } atomic_t;
 #define ATOMIC_INIT(i)	{ (i) }
--
2.49.1
Hi Finn,

On Tue, 21 Oct 2025 at 07:39, Finn Thain <fthain@linux-m68k.org> wrote:
>
> Some recent commits incorrectly assumed 4-byte alignment of locks.
> That assumption fails on Linux/m68k (and, interestingly, would have
> failed on Linux/cris also). Specify the minimum alignment of atomic
> variables for fewer surprises and (hopefully) better performance.

FWIW I implemented jump labels for m68k and I think there is a problem
with this in there too.

jump_label_init() calls static_key_set_entries() and setting
key->entries in there is corrupting 'atomic_t enabled' at the start of
key.

With this patch the problem goes away.

Cheers,

Daniel
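The failure mode Daniel describes can be sketched like this (simplified
and hypothetical; the real code in kernel/jump_label.c keeps 'entries'
in a union and works with relative offsets):

typedef struct { int counter; } atomic_t;

struct static_key {
	atomic_t enabled;	/* first member */
	unsigned long entries;	/* low 2 bits reused as flag bits */
};

#define JUMP_TYPE_MASK	3UL

/* Mirrors what jump_entry_key() does: mask off the flag bits to
 * recover the key address.  If atomic_t is only 2-byte aligned, a
 * static_key may legitimately sit at an address with bit 1 set;
 * masking then yields a pointer 2 bytes below the real key, so a
 * store to ->entries through it partly overwrites ->enabled.
 */
static struct static_key *recover_key(unsigned long encoded)
{
	return (struct static_key *)(encoded & ~JUMP_TYPE_MASK);
}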
On Mon, 24 Nov 2025, Daniel Palmer wrote:

> On Tue, 21 Oct 2025 at 07:39, Finn Thain <fthain@linux-m68k.org> wrote:
> >
> > Some recent commits incorrectly assumed 4-byte alignment of locks.
> > That assumption fails on Linux/m68k (and, interestingly, would have
> > failed on Linux/cris also). Specify the minimum alignment of atomic
> > variables for fewer surprises and (hopefully) better performance.
>
> FWIW I implemented jump labels for m68k and I think there is a problem
> with this in there too.
>
> jump_label_init() calls static_key_set_entries() and setting
> key->entries in there is corrupting 'atomic_t enabled' at the start of
> key.
>
> With this patch the problem goes away.
>

That's interesting. I wonder whether the alignment requirements of
machine instructions permitted the "appropriation" of the low bits from
those pointers...

In any case, a modified jump label algorithm that did not use/abuse
pointer bits would need to execute as fast as the existing
implementation. And that might be quite difficult (especially a
portable algorithm).

Recently I had an opportunity to do some performance measurements on
m68k for this atomic_t alignment patch. I tested some kernel stressors
on an AWS 95 (33 MHz 68040, 128 MB RAM, 512 KiB L2$) and also on a
Mac IIfx (40 MHz 68030, 80 MB RAM, 32 KiB L2$).

The patch makes the kernel faster or slower, depending on the
workload. For example, the fifo, futex and shm stressors were
consistently faster whereas the splice, signal and msg stressors were
consistently slower.

There are no hardware counters for cache misses that might account for
part of the slowdown. OTOH, alignment also reduces instances of locks
split across page boundaries, which might account for the speed-up. (I
didn't look at VM performance counters.)

Finally, I should note that the stress-ng man page says "do NOT use"
it as a benchmark. OK, well, if anyone wishes to reproduce my results,
I can send you the statically linked binary I used. The job file is
attached.

I wonder whether others have done any throughput measurement for this
patch, using their favourite workloads?

run sequential
metrics-brief
timeout 180s
no-rand-seed
oomable
temp-path /tmp

clone 1
clone-ops 4
dentry 1
dentry-ops 8192
#dev 1
#dev-ops 300
dev-shm 1
dev-shm-ops 20
dnotify 1
dnotify-ops 1200
fault 1
fault-ops 8000
fifo 1
fifo-ops 24000
file-ioctl 1
file-ioctl-ops 20000
futex 1
futex-ops 40000
get 1
get-ops 3000
getdent 1
getdent-ops 10000
icmp-flood 1
icmp-flood-ops 40000
inotify 1
inotify-ops 400
ioprio 1
ioprio-ops 8000
kill 1
kill-ops 150000
memfd 1
memfd-bytes 32m
memfd-ops 8
mmapfork 1
mmapfork-ops 4
msg 1
msg-ops 300000
nop 1
nop-ops 3000
poll 1
poll-ops 8000
ptrace 1
ptrace-ops 50000
pty 1
pty-ops 2
rawpkt 1
rawpkt-ops 80000
rawudp 1
rawudp-ops 15000
resources 1
resources-ops 300
revio 1
revio-ops 50000
seek 1
seek-ops 12000
#sem 1
#sem-ops 4000
sem-sysv 1
sem-sysv-ops 300000
sendfile 1
sendfile-ops 1500
set 1
set-ops 20000
shm 1
shm-ops 15
sigchld 1
sigchld-ops 5000
signal 1
signal-ops 150000
sigsegv 1
sigsegv-ops 100000
sock 1
sock-ops 50
splice 1
splice-ops 10000
tee 1
tee-ops 1500
udp 1
udp-ops 30000
utime 1
utime-ops 4000
vm 1
vm-ops 2500
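For anyone reproducing this, the attachment above uses plain stress-ng
job syntax and is passed in via the --job option; the file name here is
arbitrary:

	stress-ng --job atomic-align.job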