[RFC PATCH v2 00/15] xen/arm: port Linux LL/SC and LSE atomics helpers to Xen.

Ash Wilding posted 15 patches 3 years, 5 months ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/xen tags/patchew/20201111215203.80336-1-ash.j.wilding@gmail.com
Maintainers: Andrew Cooper <andrew.cooper3@citrix.com>, Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>, Julien Grall <julien@xen.org>, George Dunlap <george.dunlap@citrix.com>, Stefano Stabellini <sstabellini@kernel.org>, Wei Liu <wl@xen.org>, Jan Beulich <jbeulich@suse.com>, Ian Jackson <iwj@xenproject.org>
xen/arch/arm/Kconfig                     |  11 +
xen/arch/arm/Makefile                    |   1 +
xen/arch/arm/lse.c                       |  13 +
xen/arch/arm/setup.c                     |  13 +-
xen/include/asm-arm/arm32/atomic.h       | 253 +++++++-----
xen/include/asm-arm/arm32/cmpxchg.h      | 388 +++++++++++-------
xen/include/asm-arm/arm32/system.h       |   2 +-
xen/include/asm-arm/arm64/atomic.h       | 236 +++++------
xen/include/asm-arm/arm64/atomic_ll_sc.h | 231 +++++++++++
xen/include/asm-arm/arm64/atomic_lse.h   | 246 +++++++++++
xen/include/asm-arm/arm64/cmpxchg.h      | 497 ++++++++++++++++-------
xen/include/asm-arm/arm64/lse.h          |  48 +++
xen/include/asm-arm/arm64/system.h       |   2 +-
xen/include/asm-arm/atomic.h             |  15 +-
xen/include/asm-arm/cpufeature.h         |  57 +--
xen/include/asm-arm/system.h             |   9 +-
xen/include/xen/rwonce.h                 |  21 +
17 files changed, 1469 insertions(+), 574 deletions(-)
create mode 100644 xen/arch/arm/lse.c
create mode 100644 xen/include/asm-arm/arm64/atomic_ll_sc.h
create mode 100644 xen/include/asm-arm/arm64/atomic_lse.h
create mode 100644 xen/include/asm-arm/arm64/lse.h
create mode 100644 xen/include/xen/rwonce.h
[RFC PATCH v2 00/15] xen/arm: port Linux LL/SC and LSE atomics helpers to Xen.
Posted by Ash Wilding 3 years, 5 months ago
From: Ash Wilding <ash.j.wilding@gmail.com>


Hey,

I've found some time to improve this series a bit:


Changes since RFC v1
====================

 - Broken up patches into smaller chunks to aid in readability.

 - As per Julien's feedback I've also introduced intermediary patches
   that first remove Xen's existing headers, then pull in the current
   Linux versions as-is. This means we only need to review the changes
   made while porting to Xen, rather than reviewing the existing Linux
   code.

 - Pull in Linux's <asm-generic/rwonce.h> as <xen/rwonce.h> for
   __READ_ONCE()/__WRITE_ONCE() instead of putting these in Xen's
   <xen/compiler.h>. While doing this, provide justification for
   dropping Linux's <linux/compiler_types.h> and relaxing the
   __READ_ONCE() usage of __unqual_scalar_typeof() to just typeof()
   (see patches #5 and #6).



Keeping the rest of the cover-letter unchanged as it would still be
good to discuss these things:


Arguments in favour of doing this
=================================

    - Lets SMMUv3 driver switch to using <asm/atomic.h> rather than
      maintaining its own implementation of the helpers.

    - Provides mitigation against XSA-295 [2], which affects both arm32
      and arm64, across all versions of Xen, and may allow a domU to
      maliciously or erroneously DoS the hypervisor.

    - All Armv8-A core implementations since ~2017 implement LSE so
      there is an argument to be made we are long overdue support in
      Xen. This is compounded by LSE atomics being more performant than
      LL/SC atomics in most real-world applications, especially at high
      core counts.

    - We may be able to get improved performance when using LL/SC too
      as Linux provides helpers with relaxed ordering requirements which
      are currently not available in Xen, though for this we would need
      to go back through existing code to see where the more relaxed
      versions can be safely used.

    - Anything else?


Arguments against doing this
============================

    - Limited testing infrastructure in place to ensure use of new
      atomics helpers does not introduce new bugs and regressions. This
      is a particularly strong argument given how difficult it can be to
      identify and debug malfunctioning atomics. The old adage applies,
      "If it ain't broke, don't fix it."

    - Anything else?


Disclaimers
===========

 - This is a very rough first-pass effort intended to spur the
   discussions along.

 - Only build-tested on arm64 and arm32, *not* run-tested. I did also
   build for x86_64 just to make I didn't inadvertently break that.

 - This version only tackles atomics and cmpxchg; I've not yet had a
   chance to look at locks so those are still using LL/SC.

 - The timeout variants of cmpxchg() (used by guest atomics) are still
   using LL/SC regardless of LSE being available, as these helpers are
   not provided by Linux so I just copied over the existing Xen ones.


Any further comments, guidance, and discussion on how to improve this 
approach and get LSE atomics support merged into Xen would be greatly
appreciated.

Thanks!
Ash.

[1] https://lore.kernel.org/xen-devel/13baac40-8b10-0def-4e44-0d8f655fcaf1@xen.org/
[2] https://xenbits.xen.org/xsa/advisory-295.txt

Ash Wilding (15):
  xen/arm: Support detection of CPU features in other ID registers
  xen/arm: Add detection of Armv8.1-LSE atomic instructions
  xen/arm: Add ARM64_HAS_LSE_ATOMICS hwcap
  xen/arm: Delete Xen atomics helpers
  xen/arm: pull in Linux atomics helpers and dependencies
  xen: port Linux <asm-generic/rwonce.h> to Xen
  xen/arm: prepare existing Xen headers for Linux atomics
  xen/arm64: port Linux LL/SC atomics helpers to Xen
  xen/arm64: port Linux LSE atomics helpers to Xen
  xen/arm64: port Linux's arm64 cmpxchg.h to Xen
  xen/arm64: port Linux's arm64 atomic.h to Xen
  xen/arm64: port Linux's arm64 lse.h to Xen
  xen/arm32: port Linux's arm32 atomic.h to Xen
  xen/arm32: port Linux's arm32 cmpxchg.h to Xen
  xen/arm: remove dependency on gcc built-in __sync_fetch_and_add()

 xen/arch/arm/Kconfig                     |  11 +
 xen/arch/arm/Makefile                    |   1 +
 xen/arch/arm/lse.c                       |  13 +
 xen/arch/arm/setup.c                     |  13 +-
 xen/include/asm-arm/arm32/atomic.h       | 253 +++++++-----
 xen/include/asm-arm/arm32/cmpxchg.h      | 388 +++++++++++-------
 xen/include/asm-arm/arm32/system.h       |   2 +-
 xen/include/asm-arm/arm64/atomic.h       | 236 +++++------
 xen/include/asm-arm/arm64/atomic_ll_sc.h | 231 +++++++++++
 xen/include/asm-arm/arm64/atomic_lse.h   | 246 +++++++++++
 xen/include/asm-arm/arm64/cmpxchg.h      | 497 ++++++++++++++++-------
 xen/include/asm-arm/arm64/lse.h          |  48 +++
 xen/include/asm-arm/arm64/system.h       |   2 +-
 xen/include/asm-arm/atomic.h             |  15 +-
 xen/include/asm-arm/cpufeature.h         |  57 +--
 xen/include/asm-arm/system.h             |   9 +-
 xen/include/xen/rwonce.h                 |  21 +
 17 files changed, 1469 insertions(+), 574 deletions(-)
 create mode 100644 xen/arch/arm/lse.c
 create mode 100644 xen/include/asm-arm/arm64/atomic_ll_sc.h
 create mode 100644 xen/include/asm-arm/arm64/atomic_lse.h
 create mode 100644 xen/include/asm-arm/arm64/lse.h
 create mode 100644 xen/include/xen/rwonce.h

-- 
2.24.3 (Apple Git-128)


Re: [RFC PATCH v2 00/15] xen/arm: port Linux LL/SC and LSE atomics helpers to Xen.
Posted by Julien Grall 3 years, 4 months ago

On 11/11/2020 21:51, Ash Wilding wrote:
> From: Ash Wilding <ash.j.wilding@gmail.com>
> 
> 
> Hey,

Hi Ash,

> 
> I've found some time to improve this series a bit:
> 
> 
> Changes since RFC v1
> ====================
> 
>   - Broken up patches into smaller chunks to aid in readability.
> 
>   - As per Julien's feedback I've also introduced intermediary patches
>     that first remove Xen's existing headers, then pull in the current
>     Linux versions as-is. This means we only need to review the changes
>     made while porting to Xen, rather than reviewing the existing Linux
>     code.
> 
>   - Pull in Linux's <asm-generic/rwonce.h> as <xen/rwonce.h> for
>     __READ_ONCE()/__WRITE_ONCE() instead of putting these in Xen's
>     <xen/compiler.h>. While doing this, provide justification for
>     dropping Linux's <linux/compiler_types.h> and relaxing the
>     __READ_ONCE() usage of __unqual_scalar_typeof() to just typeof()
>     (see patches #5 and #6).
> 
> 
> 
> Keeping the rest of the cover-letter unchanged as it would still be
> good to discuss these things:
> 
> 
> Arguments in favour of doing this
> =================================
> 
>      - Lets SMMUv3 driver switch to using <asm/atomic.h> rather than
>        maintaining its own implementation of the helpers.
> 
>      - Provides mitigation against XSA-295 [2], which affects both arm32
>        and arm64, across all versions of Xen, and may allow a domU to
>        maliciously or erroneously DoS the hypervisor.
> 
>      - All Armv8-A core implementations since ~2017 implement LSE so
>        there is an argument to be made we are long overdue support in
>        Xen. This is compounded by LSE atomics being more performant than
>        LL/SC atomics in most real-world applications, especially at high
>        core counts.
> 
>      - We may be able to get improved performance when using LL/SC too
>        as Linux provides helpers with relaxed ordering requirements which
>        are currently not available in Xen, though for this we would need
>        to go back through existing code to see where the more relaxed
>        versions can be safely used.
> 
>      - Anything else?
> 
> 
> Arguments against doing this
> ============================
> 
>      - Limited testing infrastructure in place to ensure use of new
>        atomics helpers does not introduce new bugs and regressions. This
>        is a particularly strong argument given how difficult it can be to
>        identify and debug malfunctioning atomics. The old adage applies,
>        "If it ain't broke, don't fix it."

I am not too concerned about the Linux atomics implementation. They are 
well tested and your changes doesn't seem to touch the implementation 
itself.

However, I vaguely remember that some of the atomics helper in Xen are 
somewhat much stronger than the Linux counterpart.

Would you be able to look at all the existing helpers and see whether 
the memory ordering diverge?

Once we have a list we could decide whether we want to make them 
stronger again or check the callers.

Cheers,

-- 
Julien Grall

Re: [RFC PATCH v2 00/15] xen/arm: port Linux LL/SC and LSE atomics helpers to Xen
Posted by Ash Wilding 3 years, 4 months ago
Hi Julien,

Thanks for taking a look at the patches and providing feedback. I've seen your
other comments and will reply to those separately when I get a chance (maybe at
the weekend or over the Christmas break).

RE the differences in ordering semantics between Xen's and Linux's atomics
helpers, please find my notes below.

Thoughts?

Cheers,
Ash.


The tables below use format AAA/BBB/CCC/DDD/EEE, where:

 - AAA is the memory barrier before the operation
 - BBB is the acquire semantics of the atomic operation
 - CCC is the release semantics of the atomic operation
 - DDD is whether the asm() block clobbers memory
 - EEE is the memory barrier after the operation

For example, ---/---/rel/mem/dmb would mean:

 - No memory barrier before the operation
 - The atomic does *not* have acquire semantics
 - The atomic *does* have release semantics
 - The asm() block clobbers memory
 - There is a DMB memory barrier after the atomic operation


    arm64 LL/SC
    ===========

        Xen Function            Xen                     Linux                   Inconsistent
        ============            ===                     =====                   ============

        atomic_add              ---/---/---/---/---     ---/---/---/---/---     ---
        atomic_add_return       ---/---/rel/mem/dmb     ---/---/rel/mem/dmb     --- (1)
        atomic_sub              ---/---/---/---/---     ---/---/---/---/---     ---
        atomic_sub_return       ---/---/rel/mem/dmb     ---/---/rel/mem/dmb     --- (1)        
        atomic_and              ---/---/---/---/---     ---/---/---/---/---     ---
        atomic_cmpxchg          dmb/---/---/---/dmb     ---/---/rel/mem/---     YES (2)
        atomic_xchg             ---/---/rel/mem/dmb     ---/acq/rel/mem/dmb     YES (3)

(1) It's actually interesting to me that Linux does it this way. As with the
    LSE atomics below, I'd have expected acq/rel semantics and ditch the DMB.
    Unless I'm missing something where there is a concern around taking an IRQ
    between the LDAXR and the STLXR, which can't happen in the LSE atomic case
    since it's a single instruction. But the exclusive monitor is cleared on
    exception return in AArch64 so I'm struggling to see what that potential
    issue may be. Regardless, Linux and Xen are consistent so we're OK ;-)

(2) The Linux version uses either STLXR with rel semantics if the comparison
    passes, or DMB if the comparison fails. This is weaker than Xen's version,
    which is quite blunt in always wrapping the operation between two DMBs. This
    may be a holdover from Xen's arm32 versions being ported to arm64, as we
    didn't support acq/rel semantics on LDREX and STREX in Armv7-A? Regardless,
    this is quite a big discrepancy and I've not yet given it enough thought to
    determine whether it would actually cause an issue. My feeling is that the
    Linux LL/SC atomic_cmpxchg() should have have acq semantics on the LL, but
    like you said these helpers are well tested so I'd be surprised if there
    is a bug. See (5) below though, where the Linux LSE atomic_cmpxchg() *does*
    have acq semantics.

(3) The Linux version just adds acq semantics to the LL, so we're OK here.


    arm64 LSE (comparison to Xen's LL/SC)
    =====================================

        Xen Function            Xen                     Linux                   Inconsistent
        ============            ===                     =====                   ============

        atomic_add              ---/---/---/---/---     ---/---/---/---/---     ---
        atomic_add_return       ---/---/rel/mem/dmb     ---/acq/rel/mem/---     YES (4)
        atomic_sub              ---/---/---/---/---     ---/---/---/---/---     ---
        atomic_sub_return       ---/---/rel/mem/dmb     ---/acq/rel/mem/---     YES (4)
        atomic_and              ---/---/---/---/---     ---/---/---/---/---     ---
        atomic_cmpxchg          dmb/---/---/---/dmb     ---/acq/rel/mem/---     YES (5)
        atomic_xchg             ---/---/rel/mem/dmb     ---/acq/rel/mem/---     YES (4)

(4) As noted in (1), this is how I would have expected Linux's LL/SC atomics to
    work too. I don't think this discrepancy will cause any issues.

(5) As with (2) above, this is quite a big discrepancy to Xen. However at least
    this version has acq semantics unlike the LL/SC version in (2), so I'm more
    confident that there won't be regressions going from Xen LL/SC to Linux LSE
    version of atomic_cmpxchg().


    arm32 LL/SC
    ===========

        Xen Function            Xen                     Linux                   Inconsistent
        ============            ===                     =====                   ============

        atomic_add              ---/---/---/---/---     ---/---/---/---/---     ---
        atomic_add_return       dmb/---/---/---/dmb     XXX/XXX/XXX/XXX/XXX     YES (6)
        atomic_sub              ---/---/---/---/---     ---/---/---/---/---     ---
        atomic_sub_return       dmb/---/---/---/dmb     XXX/XXX/XXX/XXX/XXX     YES (6)
        atomic_and              ---/---/---/---/---     ---/---/---/---/---     ---  
        atomic_cmpxchg          dmb/---/---/---/dmb     XXX/XXX/XXX/XXX/XXX     YES (6)
        atomic_xchg             dmb/---/---/---/dmb     XXX/XXX/XXX/XXX/XXX     YES (6)

(6) Linux only provides relaxed variants of these functions, such as
    atomic_add_return_relaxed() and atomic_xchg_relaxed(). Patches #13 and #14
    in the series add the stricter versions expected by Xen, wrapping calls to
    Linux's relaxed variants inbetween two calls to smb_mb(). This makes them
    consistent with Xen's existing helpers, though is quite blunt. It is worth
    noting that Armv8-A AArch32 does support acq/rel semantics on exclusive
    accesses, with LDAEX and STLEX, so I could imagine us introducing a new
    arm32 hwcap to detect whether we're on actual Armv7-A hardware or Armv8-A
    AArch32, then swap to lighterweight STLEX versions of these helpers rather
    than the heavyweight double DMB versions. Whether that would actually give
    measurable performance improvements is another story!

Re: [RFC PATCH v2 00/15] xen/arm: port Linux LL/SC and LSE atomics helpers to Xen
Posted by Ash Wilding 3 years, 4 months ago
Having pondered note (1) in my previous email a bit more, I imagine the reason
for using a DMB instead of acq/rel semantics is to prevent accesses following
the STLXR from being reordered between it and the LDAXR.

I won't be winning any awards for this ASCII art but hopefully it helps convey
the point.

Using just an LDAXR/STLXR pair and ditching the DMB, accesses to [D] and [E]
can be reodered between the LDAXR and STLXR:

                    ...
        +---------- LDR   [A]
        |           ...
        |           ...
        |    +----- STR   [B]
        |    |      ...
    ====|====|======LDAXR [C]================
        |    |      ...                  X
        |    +----> ...                  |
        |           ...                  |
        |           ...   <----------+   |
        X           ...              |   |
    ================STLXR [C]========|===|===
                    ...              |   |
                    ...              |   |
                    LDR   [D]--------+   |
                    ...                  |
                    STR   [E]------------+
                    ...


While dropping the acq semantics from the LDAXR and using a DMB instead will
prevent accesses to [D] and [E] being reordered between the LDXR/STLXR pair,
and keeping the rel semantics on the STLXR to prevents accesses to [A] and [B]
from being reordered after the STLXR:

                    ...
        +---------- LDR   [A]
        |           ...
        |           ...
        |    +----- STR   [B]
        |    |      ...
        |    |      LDXR  [C]
        |    |      ...
        |    +----> ...
        |           ...
        X           ...
    ================STLXR [C]================
    ================DMB======================
                    ...          X    X
                    ...          |    |
                    LDR   [D]----+    |  
                    ...               |
                    STR   [E]---------+
                    ...


As mentioned in my original email, the LSE atomic is a single instruction so
we can give it acq/rel semantics and not worry about any accesses to [A], [B],
[D], or [E] being reordered relative to that atomic:

                    ...
        +---------- LDR   [A]
        |           ...
        |           ...
        |    +----- STR   [B]
        |    |      ...
        X    X      ...
    ================LDADDAL [C]================
                    ...          X    X
                    ...          |    |
                    LDR   [D]----+    |  
                    ...               |
                    STR   [E]---------+
                    ...


So, makes sense that Linux uses acq/rel and no DMB for LSE, but Linux (and Xen)
are forced to use rel semantics and a DMB for the LL/SC case.

Anyway, point (2) from my earlier email is the one that's potentially more
concerning as we only have rel semantics and no DMB on the Linux LL/SC version
of atomic_cmpxchg(), in contrast to the existing Xen LL/SC implementation being
sandwiched between two DMBs and the Linux LSE version having acq/rel semantics.

Cheers,
Ash.

Re: [RFC PATCH v2 00/15] xen/arm: port Linux LL/SC and LSE atomics helpers to Xen
Posted by Julien Grall 3 years, 2 months ago

On 17/12/2020 15:37, Ash Wilding wrote:
> Hi Julien,

Hi,

First of all, apologies for the very late reply.

> Thanks for taking a look at the patches and providing feedback. I've seen your
> other comments and will reply to those separately when I get a chance (maybe at
> the weekend or over the Christmas break).
> 
> RE the differences in ordering semantics between Xen's and Linux's atomics
> helpers, please find my notes below.
> 
> Thoughts?

Thank you for the very detailed answer, it made a lot easier to 
understand the differences!

I think it would be good to have some (if not all) the content in 
Documents for future reference.

[...]

> The tables below use format AAA/BBB/CCC/DDD/EEE, where:
> 
>   - AAA is the memory barrier before the operation
>   - BBB is the acquire semantics of the atomic operation
>   - CCC is the release semantics of the atomic operation
>   - DDD is whether the asm() block clobbers memory
>   - EEE is the memory barrier after the operation
> 
> For example, ---/---/rel/mem/dmb would mean:
> 
>   - No memory barrier before the operation
>   - The atomic does *not* have acquire semantics
>   - The atomic *does* have release semantics
>   - The asm() block clobbers memory
>   - There is a DMB memory barrier after the atomic operation
> 
> 
>      arm64 LL/SC
>      ===========
> 
>          Xen Function            Xen                     Linux                   Inconsistent
>          ============            ===                     =====                   ============
> 
>          atomic_add              ---/---/---/---/---     ---/---/---/---/---     ---
>          atomic_add_return       ---/---/rel/mem/dmb     ---/---/rel/mem/dmb     --- (1)
>          atomic_sub              ---/---/---/---/---     ---/---/---/---/---     ---
>          atomic_sub_return       ---/---/rel/mem/dmb     ---/---/rel/mem/dmb     --- (1)
>          atomic_and              ---/---/---/---/---     ---/---/---/---/---     ---
>          atomic_cmpxchg          dmb/---/---/---/dmb     ---/---/rel/mem/---     YES (2)

If I am not mistaken, Linux is implementing atomic_cmpxchg() with the 
*_mb() version. So the semantic would be:

---/---/rel/mem/dmb

>          atomic_xchg             ---/---/rel/mem/dmb     ---/acq/rel/mem/dmb     YES (3)

 From Linux:

#define __XCHG_CASE(w, sfx, name, sz, mb, nop_lse, acq, acq_lse, rel, 
cl)       \

[...]

         /* LL/SC */ 
          \
         "       prfm    pstl1strm, %2\n" 
          \
         "1:     ld" #acq "xr" #sfx "\t%" #w "0, %2\n" 
          \
         "       st" #rel "xr" #sfx "\t%w1, %" #w "3, %2\n" 
          \
         "       cbnz    %w1, 1b\n" 
          \
         "       " #mb, 
          \

[...]

__XCHG_CASE(w,  ,  mb_, 32, dmb ish, nop,  , a, l, "memory")

So I think the Linux semantic would be:

---/---/rel/mem/dmb

Therefore there would be no inconsistency between Xen and Linux.

> 
> (1) It's actually interesting to me that Linux does it this way. As with the
>      LSE atomics below, I'd have expected acq/rel semantics and ditch the DMB.

I have done some digging. The original implementation of atomic_{sub, 
add}_return was actually using acq/rel semantics up until Linux 3.14. 
But this was reworked by 8e86f0b409a4 "arm64: atomics: fix use of 
acquire + release for full barrier semantics".

>      Unless I'm missing something where there is a concern around taking an IRQ
>      between the LDAXR and the STLXR, which can't happen in the LSE atomic case
>      since it's a single instruction. But the exclusive monitor is cleared on
>      exception return in AArch64 so I'm struggling to see what that potential
>      issue may be. Regardless, Linux and Xen are consistent so we're OK ;-)

The commit I pointed above contains a lot of details on the issue. For 
convenience, I copied the most relevant bits below:

"
On arm64, these operations have been incorrectly implemented as follows:

             // A, B, C are independent memory locations

             <Access [A]>

             // atomic_op (B)
     1:      ldaxr   x0, [B]         // Exclusive load with acquire
             <op(B)>
             stlxr   w1, x0, [B]     // Exclusive store with release
             cbnz    w1, 1b

             <Access [C]>

The assumption here being that two half barriers are equivalent to a
full barrier, so the only permitted ordering would be A -> B -> C
(where B is the atomic operation involving both a load and a store).

Unfortunately, this is not the case by the letter of the architecture
and, in fact, the accesses to A and C are permitted to pass their
nearest half barrier resulting in orderings such as Bl -> A -> C -> Bs
or Bl -> C -> A -> Bs (where Bl is the load-acquire on B and Bs is the
store-release on B). This is a clear violation of the full barrier
requirement.
"

> (2) The Linux version uses either STLXR with rel semantics if the comparison
>      passes, or DMB if the comparison fails.

I think the DMB is only on the success path and there is no barrier on 
the failure path. The commit 4e39715f4b5c "arm64: cmpxchg: avoid memory 
barrier on comparison failure" seems to corroborate that.

> This is weaker than Xen's version,
>      which is quite blunt in always wrapping the operation between two DMBs. This
>      may be a holdover from Xen's arm32 versions being ported to arm64, as we
>      didn't support acq/rel semantics on LDREX and STREX in Armv7-A? Regardless,

The atomic helpers used in Xen were originally taken from Linux 3.14. 
Back then, atomic_cmpxchg() were using the two full barriers. This was 
introduced by the commit I pointed out in (1).

This was then optimized with commit 4e39715f4b5c "arm64: cmpxchg: avoid 
memory barrier on comparison failure".

>      this is quite a big discrepancy and I've not yet given it enough thought to
>      determine whether it would actually cause an issue. My feeling is that the
>      Linux LL/SC atomic_cmpxchg() should have have acq semantics on the LL, but
>      like you said these helpers are well tested so I'd be surprised if there
>      is a bug. See (5) below though, where the Linux LSE atomic_cmpxchg() *does*
>      have acq semantics.

If my understanding is correct the semantics would be (xen vs Linux):

  - failure: dmb/---/---/---/dmb     ---/---/rel/mem/---
  - success: dmb/---/---/---/dmb     ---/---/rel/mem/dmb

I think the success path is not going to be a problem. But we would need 
to check if all the callers expect a full barrier for the failure path 
(I would hope not).

> 
> (3) The Linux version just adds acq semantics to the LL, so we're OK here.
> 
> 
>      arm64 LSE (comparison to Xen's LL/SC)
>      =====================================
> 
>          Xen Function            Xen                     Linux                   Inconsistent
>          ============            ===                     =====                   ============
> 
>          atomic_add              ---/---/---/---/---     ---/---/---/---/---     ---
>          atomic_add_return       ---/---/rel/mem/dmb     ---/acq/rel/mem/---     YES (4)
>          atomic_sub              ---/---/---/---/---     ---/---/---/---/---     ---
>          atomic_sub_return       ---/---/rel/mem/dmb     ---/acq/rel/mem/---     YES (4)
>          atomic_and              ---/---/---/---/---     ---/---/---/---/---     ---
>          atomic_cmpxchg          dmb/---/---/---/dmb     ---/acq/rel/mem/---     YES (5)
>          atomic_xchg             ---/---/rel/mem/dmb     ---/acq/rel/mem/---     YES (4)
> 
> (4) As noted in (1), this is how I would have expected Linux's LL/SC atomics to
>      work too. I don't think this discrepancy will cause any issues.
IIUC, the LSE implementation would be using a single instruction that 
has both the acquire and release semantics. Therefore, it would act as a 
full barrier.

On the other hand, in the LL/SC implementation, the acquire and release 
semantics is happening with two different instruction. Therefore, they 
don't act as a full barrier.

So I think the memory ordering is going to be equivalent between Xen and 
the Linux LSE implementation.

> 
> (5) As with (2) above, this is quite a big discrepancy to Xen. However at least
>      this version has acq semantics unlike the LL/SC version in (2), so I'm more
>      confident that there won't be regressions going from Xen LL/SC to Linux LSE
>      version of atomic_cmpxchg().
Are they really different? In both cases, the helper will act as a full 
barrier. The main difference is how this ordering is achieved.

> 
> 
>      arm32 LL/SC
>      ===========
> 
>          Xen Function            Xen                     Linux                   Inconsistent
>          ============            ===                     =====                   ============
> 
>          atomic_add              ---/---/---/---/---     ---/---/---/---/---     ---
>          atomic_add_return       dmb/---/---/---/dmb     XXX/XXX/XXX/XXX/XXX     YES (6)
>          atomic_sub              ---/---/---/---/---     ---/---/---/---/---     ---
>          atomic_sub_return       dmb/---/---/---/dmb     XXX/XXX/XXX/XXX/XXX     YES (6)
>          atomic_and              ---/---/---/---/---     ---/---/---/---/---     ---
>          atomic_cmpxchg          dmb/---/---/---/dmb     XXX/XXX/XXX/XXX/XXX     YES (6)
>          atomic_xchg             dmb/---/---/---/dmb     XXX/XXX/XXX/XXX/XXX     YES (6)
> 
> (6) Linux only provides relaxed variants of these functions, such as
>      atomic_add_return_relaxed() and atomic_xchg_relaxed(). Patches #13 and #14
>      in the series add the stricter versions expected by Xen, wrapping calls to
>      Linux's relaxed variants inbetween two calls to smb_mb(). This makes them
>      consistent with Xen's existing helpers, though is quite blunt.

Linux will do the same when the fully ordered version is not implemented 
(see include/linux/atomic-fallback.h).

>  It is worth
>      noting that Armv8-A AArch32 does support acq/rel semantics on exclusive
>      accesses, with LDAEX and STLEX, so I could imagine us introducing a new
>      arm32 hwcap to detect whether we're on actual Armv7-A hardware or Armv8-A
>      AArch32, then swap to lighterweight STLEX versions of these helpers rather
>      than the heavyweight double DMB versions. Whether that would actually give
>      measurable performance improvements is another story!

That's good to know! So far, I haven't heard anyone using Xen 32-bit on 
Armv8. I would wait until there is a request before introducing a 3rd 
(4th if counting the ll/sc as 2) implementation for the atomics helpers.

That said, the 32-bit port is meant to only be supported on Armv7. It 
should boot on Armv8, but there is no promise it will be fully 
functional nor stable.

Overall, it looks like to me that re-syncing the atomic implementation 
with Linux should not be a major problem.

We are in the middle of 4.15 freeze, I will try to go through the series 
ASAP.

Cheers,

-- 
Julien Grall