From: Weili Qian <qianweili@huawei.com>
Starting from ARMv8.4, stp and ldp instructions become atomic.
Currently, device drivers depend on 128-bit atomic memory IO access,
but these are implemented within the drivers. Therefore, this introduces
generic {__raw_read|__raw_write}128 function for 128-bit memory access.
Signed-off-by: Weili Qian <qianweili@huawei.com>
Signed-off-by: Chenghai Huang <huangchenghai2@huawei.com>
---
arch/arm64/include/asm/io.h | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
index 83e03abbb2ca..80430750a28c 100644
--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -50,6 +50,17 @@ static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
asm volatile("str %x0, %1" : : "rZ" (val), "Qo" (*ptr));
}
+#define __raw_write128 __raw_write128
+static __always_inline void __raw_write128(u128 val, volatile void __iomem *addr)
+{
+ u64 low, high;
+
+ low = val;
+ high = (u64)(val >> 64);
+
+ asm volatile ("stp %x0, %x1, [%2]\n" :: "rZ"(low), "rZ"(high), "r"(addr));
+}
+
#define __raw_readb __raw_readb
static __always_inline u8 __raw_readb(const volatile void __iomem *addr)
{
@@ -95,6 +106,16 @@ static __always_inline u64 __raw_readq(const volatile void __iomem *addr)
return val;
}
+#define __raw_read128 __raw_read128
+static __always_inline u128 __raw_read128(const volatile void __iomem *addr)
+{
+ u64 high, low;
+
+ asm volatile("ldp %0, %1, [%2]" : "=r" (low), "=r" (high) : "r" (addr));
+
+ return (((u128)high << 64) | (u128)low);
+}
+
/* IO barriers */
#define __io_ar(v) \
({ \
--
2.33.0
On Wed, Nov 12, 2025 at 09:58:46AM +0800, Chenghai Huang wrote:
> From: Weili Qian <qianweili@huawei.com>
>
> Starting from ARMv8.4, stp and ldp instructions become atomic.
That's not true for accesses to Device memory types.
Per ARM DDI 0487, L.b, section B2.2.1.1 ("Changes to single-copy atomicity in
Armv8.4"):
If FEAT_LSE2 is implemented, LDP, LDNP, and STP instructions that load
or store two 64-bit registers are single-copy atomic when all of the
following conditions are true:
• The overall memory access is aligned to 16 bytes.
• Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory.
IIUC when used for Device memory types, those can be split, and a part
of the access could be replayed multiple times (e.g. due to an
interrupt).
I don't think we can add this generally. It is not atomic, and not
generally safe.
Mark.
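
The conditions quoted above from the Arm ARM can be read as a simple predicate. The following is only an illustrative sketch: the enum, the function name, and the has_lse2 parameter are hypothetical and are not part of the patch or of any kernel API. It shows why a Device mapping fails the test regardless of alignment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical classification of the memory type the MMU assigns to a VA. */
enum mem_type {
	MEM_DEVICE,	/* any Device memory type, e.g. an MMIO mapping */
	MEM_NORMAL_NC,	/* Normal, Non-cacheable */
	MEM_NORMAL_WB,	/* Normal, Inner Write-Back, Outer Write-Back cacheable */
};

/*
 * Returns true only when the quoted conditions for a 2 x 64-bit
 * LDP/LDNP/STP to be single-copy atomic all hold. Hypothetical helper.
 */
static bool pair_access_is_single_copy_atomic(bool has_lse2, uintptr_t addr,
					       enum mem_type type)
{
	if (!has_lse2)
		return false;	/* FEAT_LSE2 must be implemented */
	if (addr & 0xf)
		return false;	/* overall access must be 16-byte aligned */
	return type == MEM_NORMAL_WB;	/* Device and Non-cacheable do not qualify */
}
```

Under this reading, an aligned access to a Device mapping still returns false, which is the case the patch targets.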
On 2025/11/12 20:28, Mark Rutland wrote:
> On Wed, Nov 12, 2025 at 09:58:46AM +0800, Chenghai Huang wrote:
>> From: Weili Qian <qianweili@huawei.com>
>>
>> Starting from ARMv8.4, stp and ldp instructions become atomic.
> That's not true for accesses to Device memory types.
>
> Per ARM DDI 0487, L.b, section B2.2.1.1 ("Changes to single-copy atomicity in
> Armv8.4"):
>
> If FEAT_LSE2 is implemented, LDP, LDNP, and STP instructions that load
> or store two 64-bit registers are single-copy atomic when all of the
> following conditions are true:
> • The overall memory access is aligned to 16 bytes.
> • Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory.
>
> IIUC when used for Device memory types, those can be split, and a part
> of the access could be replayed multiple times (e.g. due to an
> interrupt).
>
> I don't think we can add this generally. It is not atomic, and not
> generally safe.
>
> Mark.
Thanks for your correction. I misunderstood the behavior of LDP and
STP instructions. So, regarding device memory types, LDP and STP
instructions do not guarantee single-copy atomicity.
For devices that require 128-bit atomic access, is it only possible
to implement this functionality in the driver?
Chenghai
On Wed, 12 Nov 2025 12:28:01 +0000
Mark Rutland <mark.rutland@arm.com> wrote:
> On Wed, Nov 12, 2025 at 09:58:46AM +0800, Chenghai Huang wrote:
> > From: Weili Qian <qianweili@huawei.com>
> >
> > Starting from ARMv8.4, stp and ldp instructions become atomic.
>
> That's not true for accesses to Device memory types.
>
> Per ARM DDI 0487, L.b, section B2.2.1.1 ("Changes to single-copy atomicity in
> Armv8.4"):
>
> If FEAT_LSE2 is implemented, LDP, LDNP, and STP instructions that load
> or store two 64-bit registers are single-copy atomic when all of the
> following conditions are true:
> • The overall memory access is aligned to 16 bytes.
> • Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory.
>
> IIUC when used for Device memory types, those can be split, and a part
> of the access could be replayed multiple times (e.g. due to an
> interrupt).
That can't be right.
IO accesses can reference hardware FIFO so must only happen once.
(Or is 'Device memory' something different from 'Device register'?)
I'm also not sure that the bus cycles could get split by an interrupt,
that would require a mid-instruction interrupt - very unlikely.
Interleaving is most likely to come from another cpu.
More interesting would be whether the instructions generate a single
PCIe TLP? (perhaps even only most of the time.)
PCIe reads are high latency, anything that can be done to increase the
size of the TLP improves PIO throughput massively.
David
>
> I don't think we can add this generally. It is not atomic, and not
> generally safe.
>
> Mark.
...
On Wed, Nov 12, 2025 at 02:01:57PM +0000, David Laight wrote:
> On Wed, 12 Nov 2025 12:28:01 +0000
> Mark Rutland <mark.rutland@arm.com> wrote:
>
> > On Wed, Nov 12, 2025 at 09:58:46AM +0800, Chenghai Huang wrote:
> > > From: Weili Qian <qianweili@huawei.com>
> > >
> > > Starting from ARMv8.4, stp and ldp instructions become atomic.
> >
> > That's not true for accesses to Device memory types.
> >
> > Per ARM DDI 0487, L.b, section B2.2.1.1 ("Changes to single-copy atomicity in
> > Armv8.4"):
> >
> > If FEAT_LSE2 is implemented, LDP, LDNP, and STP instructions that load
> > or store two 64-bit registers are single-copy atomic when all of the
> > following conditions are true:
> > • The overall memory access is aligned to 16 bytes.
> > • Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory.
> >
> > IIUC when used for Device memory types, those can be split, and a part
> > of the access could be replayed multiple times (e.g. due to an
> > > interrupt).
>
> That can't be right.
For better or worse, the architecture permits this, and I understand
that there are implementations on which this can happen.
> IO accesses can reference hardware FIFO so must only happen once.
This has nothing to do with the endpoint, and so any FIFO in the
endpoint is immaterial.
I agree that we want to ensure that the accesses only happen once, which
is why I have raised that it is unsound to use LDP/LDNP/STP in this way.
> (Or is 'Device memory' something different from 'Device register'?)
I specifically said "Device memory type", which is an attribute that the
MMU associates with a VA, and determines how the MMU (and memory system
as a whole) treats accesses to that VA.
You can find the architecture documentation I referenced at:
https://developer.arm.com/documentation/ddi0487/lb/
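
As a rough illustration of "memory type as an MMU attribute": on arm64 the type for a mapping is selected through an 8-bit attribute field in MAIR_ELx. The sketch below decodes such a byte. The helper names are hypothetical, and the decoding deliberately ignores the special encodings added by FEAT_XS and FEAT_MTE2 (for example 0xf0, Tagged Normal), so treat it as a simplification rather than a complete decoder.

```c
#include <stdbool.h>
#include <stdint.h>

/* Device memory types are encoded as 0b0000dd00 (dd selects nGnRnE..GRE). */
static bool mair_attr_is_device(uint8_t attr)
{
	return (attr & 0xf3) == 0x00;
}

/*
 * One 4-bit cacheability field (outer or inner):
 *   0b11RW           Write-Back, non-transient
 *   0b01RW, RW != 0  Write-Back, transient
 * Everything else is Write-Through or Non-cacheable.
 */
static bool nibble_is_write_back(uint8_t n)
{
	if ((n & 0xc) == 0xc)
		return true;
	return (n & 0xc) == 0x4 && (n & 0x3) != 0;
}

/* Normal memory that is both Outer and Inner Write-Back cacheable. */
static bool mair_attr_is_normal_wb(uint8_t attr)
{
	uint8_t outer = attr >> 4;
	uint8_t inner = attr & 0xf;

	return nibble_is_write_back(outer) && nibble_is_write_back(inner);
}
```

With this decoding, 0x00 and 0x04 are Device types, 0xff is Normal Write-Back, and 0x44 is Normal Non-cacheable; only 0xff meets the memory-type condition in the quoted Arm ARM text.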
> I'm also not sure that the bus cycles could get split by an interrupt,
> that would require a mid-instruction interrupt - very unlikely.
There are various reasons why an implementation might split the accesses
made by a single instruction, and why an interrupt (or other event)
might occur between accesses and cause a replay of some of the
constituent accesses. This has nothing to do with splitting bus cycles.
Mark.
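
To make the replay concern concrete, here is a toy user-space model, not real hardware behaviour and not kernel code. It assumes a hypothetical endpoint register with read side effects (each 64-bit read pops one entry) and models a 128-bit pair load as two 64-bit reads whose first half may be issued again. Whether and how a given CPU splits or replays accesses is implementation specific; all names here are invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy endpoint: every 64-bit read of the data register pops one entry. */
struct toy_fifo {
	const uint64_t *entries;
	size_t len;
	size_t next;
};

static uint64_t toy_fifo_read64(struct toy_fifo *f)
{
	return f->next < f->len ? f->entries[f->next++] : 0;
}

struct toy_pair {
	uint64_t low;
	uint64_t high;
};

/*
 * Model a pair load that is split into two 64-bit accesses. When
 * replay_first_half is set, the first access is performed twice,
 * as if an event between the halves caused it to be reissued.
 */
static struct toy_pair toy_pair_read(struct toy_fifo *f, bool replay_first_half)
{
	struct toy_pair p;

	p.low = toy_fifo_read64(f);
	if (replay_first_half)
		p.low = toy_fifo_read64(f);	/* first half issued again */
	p.high = toy_fifo_read64(f);
	return p;
}
```

In this model a replayed first half consumes three entries instead of two and silently drops one, which is why a register with side effects needs each access to happen exactly once.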