[RFC v4 3/5] atomic: Specify alignment for atomic_t and atomic64_t

Finn Thain posted 5 patches 3 months, 2 weeks ago
There is a newer version of this series
Posted by Finn Thain 3 months, 2 weeks ago
Some recent commits incorrectly assumed 4-byte alignment of locks.
That assumption fails on Linux/m68k (and, interestingly, would have
failed on Linux/cris also). Specify the minimum alignment of atomic
variables for fewer surprises and (hopefully) better performance.

On an m68k system with 14 MB of RAM, this patch reduces the available
memory by a couple of percent. On a 64 MB system, the cost is under 1%
but still significant. I don't know whether there is sufficient
performance gain to justify the memory cost; it still has to be measured.

Link: https://lore.kernel.org/lkml/CAMuHMdW7Ab13DdGs2acMQcix5ObJK0O2dG_Fxzr8_g58Rc1_0g@mail.gmail.com/
---
Changed since v2:
 - Specify natural alignment for atomic64_t.
Changed since v1:
 - atomic64_t now gets an __aligned attribute too.
 - The 'Fixes' tag has been dropped because Lance sent a different fix
   for commit e711faaafbe5 ("hung_task: replace blocker_mutex with encoded
   blocker") that's suitable for -stable.
---
 include/asm-generic/atomic64.h | 2 +-
 include/linux/types.h          | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/atomic64.h b/include/asm-generic/atomic64.h
index 100d24b02e52..f22ccfc0df98 100644
--- a/include/asm-generic/atomic64.h
+++ b/include/asm-generic/atomic64.h
@@ -10,7 +10,7 @@
 #include <linux/types.h>
 
 typedef struct {
-	s64 counter;
+	s64 __aligned(sizeof(s64)) counter;
 } atomic64_t;
 
 #define ATOMIC64_INIT(i)	{ (i) }
diff --git a/include/linux/types.h b/include/linux/types.h
index 6dfdb8e8e4c3..a225a518c2c3 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -179,7 +179,7 @@ typedef phys_addr_t resource_size_t;
 typedef unsigned long irq_hw_number_t;
 
 typedef struct {
-	int counter;
+	int __aligned(sizeof(int)) counter;
 } atomic_t;
 
 #define ATOMIC_INIT(i) { (i) }
-- 
2.49.1
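
For readers without m68k hardware, the effect of the member attribute can be sketched in user space. This is only an illustration, assuming GCC or Clang, a 4-byte int and a 2-byte short. The int_a2 typedef (which emulates an ABI where int is only 2-byte aligned) and the holder structs are hypothetical names that do not appear in the patch.

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>	/* offsetof */

/* Assumption: emulate an ABI, such as m68k, where int is 2-byte aligned. */
typedef int int_a2 __attribute__((aligned(2)));

/* Without the attribute the struct inherits the 2-byte alignment. */
typedef struct {
	int_a2 counter;
} atomic_before_t;

/* With the attribute the member, and so the struct, is aligned to sizeof(int). */
typedef struct {
	int_a2 counter __attribute__((aligned(sizeof(int))));
} atomic_after_t;

/* Hypothetical containers: a 16-bit field followed by an atomic. */
struct holder_before {
	short flag;
	atomic_before_t a;
};

struct holder_after {
	short flag;
	atomic_after_t a;	/* preceded by padding */
};

static_assert(_Alignof(atomic_before_t) == 2, "emulated ABI alignment");
static_assert(_Alignof(atomic_after_t) == sizeof(int), "natural alignment");
```

Comparing offsetof(struct holder_before, a) with offsetof(struct holder_after, a), and the sizeof of each holder, shows where the padding goes. That padding is the source of the memory cost mentioned in the commit log.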
Re: [RFC v4 3/5] atomic: Specify alignment for atomic_t and atomic64_t
Posted by Daniel Palmer 2 months, 2 weeks ago
Hi Finn,

On Tue, 21 Oct 2025 at 07:39, Finn Thain <fthain@linux-m68k.org> wrote:
>
> Some recent commits incorrectly assumed 4-byte alignment of locks.
> That assumption fails on Linux/m68k (and, interestingly, would have
> failed on Linux/cris also). Specify the minimum alignment of atomic
> variables for fewer surprises and (hopefully) better performance.

FWIW I implemented jump labels for m68k and I think there is a problem
with this in there too.
jump_label_init() calls static_key_set_entries() and setting
key->entries in there is corrupting 'atomic_t enabled' at the start of
key.

With this patch the problem goes away.

Cheers,

Daniel
Re: [RFC v4 3/5] atomic: Specify alignment for atomic_t and atomic64_t
Posted by Finn Thain 2 months, 2 weeks ago
On Mon, 24 Nov 2025, Daniel Palmer wrote:

> On Tue, 21 Oct 2025 at 07:39, Finn Thain <fthain@linux-m68k.org> wrote:
> >
> > Some recent commits incorrectly assumed 4-byte alignment of locks.
> > That assumption fails on Linux/m68k (and, interestingly, would have
> > failed on Linux/cris also). Specify the minimum alignment of atomic
> > variables for fewer surprises and (hopefully) better performance.
> 
> FWIW I implemented jump labels for m68k and I think there is a problem
> with this in there too.
> jump_label_init() calls static_key_set_entries() and setting
> key->entries in there is corrupting 'atomic_t enabled' at the start of
> key.
> 
> With this patch the problem goes away.
> 

That's interesting. I wonder whether the alignment requirements of machine 
instructions permitted the "appropriation" of the low bits from those 
pointers...

In any case, a modified jump label algorithm that did not use/abuse pointer 
bits would need to execute as fast as the existing implementation. And 
that might be quite difficult (especially a portable algorithm).
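
To make the pointer-bits concern concrete, here is a generic sketch of low-bit pointer tagging. It is not the kernel's jump label code; TAG_MASK and the helper names are hypothetical. It assumes that two flag bits are kept in the low bits of an address, which only works if the pointed-to object is at least 4-byte aligned.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: two flag bits stored in the low bits of a pointer. */
#define TAG_MASK ((uintptr_t)3)

static uintptr_t tag_pointer(const void *p, unsigned int flags)
{
	return (uintptr_t)p | (flags & TAG_MASK);
}

static void *untag_pointer(uintptr_t v)
{
	return (void *)(v & ~TAG_MASK);
}

static unsigned int pointer_flags(uintptr_t v)
{
	return (unsigned int)(v & TAG_MASK);
}

/* Returns 1 if both the address and the flags survive a tag/untag cycle. */
static int round_trips(const void *p, unsigned int flags)
{
	uintptr_t v = tag_pointer(p, flags);

	return untag_pointer(v) == p && pointer_flags(v) == (flags & TAG_MASK);
}

/* A 4-byte aligned buffer; demo_buf + 2 is only 2-byte aligned. */
static _Alignas(4) unsigned char demo_buf[8];

static int aligned_case(void)
{
	return round_trips(demo_buf, 1);
}

static int misaligned_case(void)
{
	return round_trips(demo_buf + 2, 1);
}
```

In the misaligned case the masking step silently rounds the address down to a neighbouring location, so a later store through the untagged pointer lands in the wrong place.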

Recently I had an opportunity to do some performance measurements on m68k 
for this atomic_t alignment patch. I tested some kernel stressors on an 
AWS 95 (33 MHz 68040, 128 MB RAM, 512 KiB L2$) and also on a Mac IIfx (40 
MHz 68030, 80 MB RAM, 32 KiB L2$).

The patch makes the kernel faster or slower, depending on the workload. For 
example, the fifo, futex and shm stressors were consistently faster 
whereas the splice, signal and msg stressors were consistently slower.

There are no hardware counters for cache misses that might account for 
part of the slowdown. OTOH, alignment also reduces instances of locks 
split across page boundaries, which might account for the speed-up. (I 
didn't look at VM performance counters.)
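
The boundary-splitting point can be checked with a little arithmetic. The following sketch is only an illustration, with hypothetical helper names, of counting how many placements of an object within one block (a page or a cache line, whose size is passed in as a parameter) would cross into the next block for a given alignment.

```c
#include <stdint.h>

/* Returns 1 if [addr, addr + size) crosses a multiple of boundary. */
static int straddles(uintptr_t addr, uintptr_t size, uintptr_t boundary)
{
	return (addr / boundary) != ((addr + size - 1) / boundary);
}

/*
 * Count the offsets within one block of 'boundary' bytes, stepping by
 * 'align', at which an object of 'size' bytes spills into the next block.
 */
static unsigned int count_straddling_offsets(unsigned int size,
					     unsigned int align,
					     unsigned int boundary)
{
	unsigned int n = 0;

	for (unsigned int off = 0; off < boundary; off += align)
		if (straddles(off, size, boundary))
			n++;
	return n;
}
```

With natural alignment the count is zero for any boundary that is a multiple of the object size. With 2-byte alignment a 4-byte object has one straddling offset per block, so the effect should be more visible at cache-line granularity than at page granularity.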

Finally, I should note that the stress-ng man page says "do NOT use" as a 
benchmark. OK, well, if anyone wishes to reproduce my results, I can send 
you the statically linked binary I used. The job file is attached.

I wonder whether others have done any throughput measurement for this 
patch, using their favourite workloads?

run sequential
metrics-brief
timeout 180s
no-rand-seed
oomable
temp-path /tmp

clone 1
clone-ops 4

dentry 1
dentry-ops 8192

#dev 1
#dev-ops 300

dev-shm 1
dev-shm-ops 20

dnotify 1
dnotify-ops 1200

fault 1
fault-ops 8000

fifo 1
fifo-ops 24000

file-ioctl 1
file-ioctl-ops 20000

futex 1
futex-ops 40000

get 1
get-ops 3000

getdent 1
getdent-ops 10000

icmp-flood 1
icmp-flood-ops 40000

inotify 1
inotify-ops 400

ioprio 1
ioprio-ops 8000

kill 1
kill-ops 150000

memfd 1
memfd-bytes 32m
memfd-ops 8

mmapfork 1
mmapfork-ops 4

msg 1
msg-ops 300000

nop 1
nop-ops 3000

poll 1
poll-ops 8000

ptrace 1
ptrace-ops 50000

pty 1
pty-ops 2

rawpkt 1
rawpkt-ops 80000

rawudp 1
rawudp-ops 15000

resources 1
resources-ops 300

revio 1
revio-ops 50000

seek 1
seek-ops 12000

#sem 1
#sem-ops 4000

sem-sysv 1
sem-sysv-ops 300000

sendfile 1
sendfile-ops 1500

set 1
set-ops 20000

shm 1
shm-ops 15

sigchld 1
sigchld-ops 5000

signal 1
signal-ops 150000

sigsegv 1
sigsegv-ops 100000

sock 1
sock-ops 50

splice 1
splice-ops 10000

tee 1
tee-ops 1500

udp 1
udp-ops 30000

utime 1
utime-ops 4000

vm 1
vm-ops 2500