From: George Guo <guodongtai@kylinos.cn>

Implement 128-bit atomic compare-and-exchange using LoongArch's
LL.D/SC.Q instructions.

At the same time, fix BPF scheduler test failures (scx_central, scx_qmap)
caused by kmalloc_nolock_noprof() returning NULL due to the missing
128-bit atomics. The NULL returns led to -ENOMEM errors during
scheduler initialization, causing the test cases to fail.

Verified by building the scx_qmap scheduler (in tools/sched_ext/) with
`make` and running ./tools/sched_ext/build/bin/scx_qmap.

Signed-off-by: George Guo <guodongtai@kylinos.cn>
---
arch/loongarch/include/asm/cmpxchg.h | 47 ++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)

diff --git a/arch/loongarch/include/asm/cmpxchg.h b/arch/loongarch/include/asm/cmpxchg.h
index 979fde61bba8a42cb4f019f13ded2a3119d4aaf4..757f6e82b9880d04f4883dc9a802312111aa4588 100644
--- a/arch/loongarch/include/asm/cmpxchg.h
+++ b/arch/loongarch/include/asm/cmpxchg.h
@@ -111,6 +111,44 @@ __arch_xchg(volatile void *ptr, unsigned long x, int size)
__ret; \
})
 
+union __u128_halves {
+ u128 full;
+ struct {
+ u64 low;
+ u64 high;
+ };
+};
+
+#define __cmpxchg128_asm(ptr, old, new) \
+({ \
+ union __u128_halves __old, __new, __ret; \
+ volatile u64 *__ptr = (volatile u64 *)(ptr); \
+ \
+ __old.full = (old); \
+ __new.full = (new); \
+ \
+ __asm__ __volatile__( \
+ "1: ll.d %0, %3 # 128-bit cmpxchg low \n" \
+ " dbar 0 # memory barrier \n" \
+ " ld.d %1, %4 # 128-bit cmpxchg high \n" \
+ " bne %0, %z5, 2f \n" \
+ " bne %1, %z6, 2f \n" \
+ " move $t0, %z7 \n" \
+ " move $t1, %z8 \n" \
+ " sc.q $t0, $t1, %2 \n" \
+ " beqz $t0, 1b \n" \
+ "2: \n" \
+ __WEAK_LLSC_MB \
+ : "=&r" (__ret.low), "=&r" (__ret.high), \
+ "=ZB" (__ptr[0]) \
+ : "ZC" (__ptr[0]), "m" (__ptr[1]), \
+ "Jr" (__old.low), "Jr" (__old.high), \
+ "Jr" (__new.low), "Jr" (__new.high) \
+ : "t0", "t1", "memory"); \
+ \
+ __ret.full; \
+})
+
static inline unsigned int __cmpxchg_small(volatile void *ptr, unsigned int old,
unsigned int new, unsigned int size)
{
@@ -198,6 +236,15 @@ __cmpxchg(volatile void *ptr, unsigned long old, unsigned long new, unsigned int
__res; \
})
 
+/* cmpxchg128 */
+#define system_has_cmpxchg128() 1
+
+#define arch_cmpxchg128(ptr, o, n) \
+({ \
+ BUILD_BUG_ON(sizeof(*(ptr)) != 16); \
+ __cmpxchg128_asm(ptr, o, n); \
+})
+
#ifdef CONFIG_64BIT
#define arch_cmpxchg64_local(ptr, o, n) \
({ \
--
2.48.1
On Mon, Nov 24, 2025 at 5:28 PM George Guo <dongtai.guo@linux.dev> wrote:
>
> From: George Guo <guodongtai@kylinos.cn>
>
> [...]
>
> + __asm__ __volatile__( \
> + "1: ll.d %0, %3 # 128-bit cmpxchg low \n" \
> + " dbar 0 # memory barrier \n" \
> + " ld.d %1, %4 # 128-bit cmpxchg high \n" \
> + " bne %0, %z5, 2f \n" \
> + " bne %1, %z6, 2f \n" \
> + " move $t0, %z7 \n" \
> + " move $t1, %z8 \n" \
> + " sc.q $t0, $t1, %2 \n" \
> + " beqz $t0, 1b \n" \
> + "2: \n" \
> + __WEAK_LLSC_MB \
> + : "=&r" (__ret.low), "=&r" (__ret.high), \
> + "=ZB" (__ptr[0]) \
"ZB" isn't a legal constraint for the address operand in sc.q. When
assembled, it turns into something like sc.q $r,$r,$r,0, which clearly
doesn't match the instruction format, yet gas happily accepts it while
clang rightfully rejects it. Classic GNU-as leniency biting again. :)
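For illustration (hand-written here, not assembler output), the difference
is roughly:

	sc.q	$t0, $t1, $r4, 0	# what the "ZB" operand ends up emitting
	sc.q	$t0, $t1, $r4		# the actual instruction format: rd, rk, rj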
> + : "ZC" (__ptr[0]), "m" (__ptr[1]), \
> + "Jr" (__old.low), "Jr" (__old.high), \
> + "Jr" (__new.low), "Jr" (__new.high) \
> + : "t0", "t1", "memory"); \
> [...]
On Mon, 24 Nov 2025 19:37:40 +0800
hev <r@hev.cc> wrote:
> On Mon, Nov 24, 2025 at 5:28 PM George Guo <dongtai.guo@linux.dev>
> wrote:
> >
> > [...]
> > +	: "=&r" (__ret.low), "=&r" (__ret.high),			\
> > +	  "=ZB" (__ptr[0])						\
>
> "ZB" isn't a legal constraint for the address operand in sc.q. When
> assembled, it turns into something like sc.q $r,$r,$r,0, which clearly
> doesn't match the instruction format, yet gas happily accepts it while
> clang rightfully rejects it. Classic GNU-as leniency biting again. :)
>
Hi Hev,

Thanks for your advice. I tried sc.q with "r" and with "ZC"; the results are
below (with gcc 14.2.1 on Fedora 42):
- sc.q with "r" caused a system hang
- sc.q with "ZC" caused an assembler error:
{standard input}: Assembler messages:
{standard input}:10037: Fatal error: Immediate overflow.
format: u0:0 )
> > [...]
On Tue, Nov 25, 2025 at 10:43 AM George Guo <dongtai.guo@linux.dev> wrote:
>
> On Mon, 24 Nov 2025 19:37:40 +0800
> hev <r@hev.cc> wrote:
>
> > On Mon, Nov 24, 2025 at 5:28 PM George Guo <dongtai.guo@linux.dev>
> > wrote:
> > >
> > > [...]
> > > +	: "=&r" (__ret.low), "=&r" (__ret.high),			\
> > > +	  "=ZB" (__ptr[0])						\
> >
> > "ZB" isn't a legal constraint for the address operand in sc.q. When
> > assembled, it turns into something like sc.q $r,$r,$r,0, which clearly
> > doesn't match the instruction format, yet gas happily accepts it while
> > clang rightfully rejects it. Classic GNU-as leniency biting again. :)
> >
> Hi Hev,
>
> Thanks for your advice. I tried sc.q with "r" and with "ZC"; the results are
> below (with gcc 14.2.1 on Fedora 42):
> - sc.q with "r" caused a system hang
Pass the address itself in the input operands instead:

	: "r" (__ptr), ...
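For illustration only: an untested sketch of that operand layout (offsets
and operand numbering are mine, not from the posted patch):

	/* Untested sketch: address passed as a plain "r" input */
	__asm__ __volatile__(
	"1:	ll.d	%0, %2, 0	# load low 64 bits		\n"
	"	dbar	0						\n"
	"	ld.d	%1, %2, 8	# load high 64 bits		\n"
	"	bne	%0, %z3, 2f					\n"
	"	bne	%1, %z4, 2f					\n"
	"	move	$t0, %z5					\n"
	"	move	$t1, %z6					\n"
	"	sc.q	$t0, $t1, %2	# bare base register, no offset	\n"
	"	beqz	$t0, 1b						\n"
	"2:								\n"
	__WEAK_LLSC_MB
	: "=&r" (__ret.low), "=&r" (__ret.high)
	: "r" (__ptr),
	  "Jr" (__old.low), "Jr" (__old.high),
	  "Jr" (__new.low), "Jr" (__new.high)
	: "t0", "t1", "memory");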
> - sc.q with "ZC" caused an assembler error:
> {standard input}: Assembler messages:
> {standard input}:10037: Fatal error: Immediate overflow.
> format: u0:0 )
> > > [...]
On Tue, 2025-11-25 at 10:43 +0800, George Guo wrote:
> > > +	  "=ZB" (__ptr[0])						\
> >
> > "ZB" isn't a legal constraint for the address operand in sc.q. When
> > assembled, it turns into something like sc.q $r,$r,$r,0, which clearly
> > doesn't match the instruction format, yet gas happily accepts it while
> > clang rightfully rejects it. Classic GNU-as leniency biting again. :)
I clearly remember that when Jiajie submitted the sc.q support to GAS,
Qinggang was really insistent on supporting the additional ",0" here.
But I don't really understand why we must support it...
>
> Thanks for your advice. I tried sc.q with "r" and with "ZC"; the results are
> below (with gcc 14.2.1 on Fedora 42):
> - sc.q with "r" caused a system hang
It won't work because it'll pass the value (not address) of __ptr[0].
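To illustrate the difference with the names from the patch:

	: "r" (__ptr[0])	/* passes the 64-bit value loaded from __ptr[0] */
	: "r" (__ptr)		/* passes the address held in __ptr, which is
				   what ll.d/sc.q need as their base register */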
> - sc.q with "ZC" caused an assembler error:
> {standard input}: Assembler messages:
> {standard input}:10037: Fatal error: Immediate overflow.
It won't work because the only immediate sc.q accepts is 0, but ZC
would allow any multiple of 4 in [-32768, 32768). I.e. ZC is for
{ldptr,stptr,ll,sc}.{w,d}.
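Illustration (hand-written, not the exact output from the failing build):
with "ZC" the compiler is free to fold a non-zero offset into the operand,
which ll/sc.{w,d} can encode but sc.q cannot:

	ll.d	$t0, $r4, 8		# fine: ll/sc.{w,d} have an si14 << 2 offset field
	sc.q	$t0, $t1, $r4, 8	# no offset field; gas reports "Immediate overflow"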
As ZB is only used for sc.q (so far) in the GCC backend, maybe we can
change ZB to print simply $rX instead of $rX,0, and make LLVM do the
same. Would someone submit a GCC patch for that? Or is there already
such a constraint that I don't know about?
BTW, for the barrier between ll.d and ld.d, "dbar 0x700" is enough to
order two loads to the same address, and a Loongson hardware engineer
just confirmed to me privately that "same address" can be read as "in
the same cacheline" here. Thus it is enough in our case, and it has a
lower overhead than "dbar 0".
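A sketch of how the barrier line in the posted macro would change
(comments mine):

	"1:	ll.d	%0, %3	# load low 64 bits		\n"	\
	"	dbar	0x700	# load-load, same cacheline	\n"	\
	"	ld.d	%1, %4	# load high 64 bits		\n"	\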
--
Xi Ruoyao <xry111@xry111.site>