[PATCH 0/8] Fix missing memory barriers on ARM
Posted by Paolo Bonzini 1 year, 1 month ago
This series fixes more instances of the problem fixed by commits
5710a3e09f9b ("async: use explicit memory barriers", 2020-04-09) and
7455ff1aa015 ("aio_wait_kick: add missing memory barrier", 2022-06-24).
This is an interesting case where ARM's memory ordering is somewhat 
stronger than what you would expect.

On ARM, seqcst loads and stores (which QEMU does not use) are compiled
respectively as LDAR and STLR instructions.  Even though STLR is also
used for store-release operations, STLR followed by LDAR provides
store-against-load ordering, which is stronger than a store-release.
Compare this to ARMv7, where store-release is DMB+STR and store-seqcst
is DMB+STR+DMB.

This means that on ARM a sequence of

  qatomic_store_release(&y, ...);         // STLR
  a = qatomic_load_acquire(&x);           // LDAR

provides stronger ordering at the processor level than the two MOV
instructions you'd get on x86.

Likewise, on ARM sequentially consistent read-modify-write operations only
need to use LDAXR and STLXR respectively for the load and the store, which
is weaker than the LOCK prefix used on x86.
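
To make this concrete, here is a sketch of what such a read-modify-write
might compile to (illustrative only; "counter" is just an example
variable, LSE atomics are assumed to be disabled, and the exact code is
up to the compiler):

   /* Roughly what qatomic_fetch_inc(&counter) becomes on ARMv8:
    *
    *   retry: LDAXR  w1, [x0]        // load-acquire exclusive
    *          ADD    w2, w1, #1
    *          STLXR  w3, w2, [x0]    // store-release exclusive
    *          CBNZ   w3, retry
    *
    * Note the absence of a trailing DMB: a later relaxed LDR is not
    * ordered against the STLXR above.
    */
   old = qatomic_fetch_inc(&counter);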

In a strange twist of events, however, the _stronger_ semantics
of the ARM instructions can end up causing bugs on ARM, not on x86.
The problems occur when seqcst atomics are mixed with relaxed atomics.
Here is how the two are compiled on ARM:

                   load              store
   relaxed         LDR               STR
   seqcst          LDAR              STLR

QEMU's atomics try to bridge the Linux API (that most of the developers
are familiar with) and the C11 API, and the two have a substantial
difference:

- in Linux, strongly-ordered atomics such as atomic_add_return() affect
  the global ordering of _all_ memory operations, including for example
  READ_ONCE()/WRITE_ONCE()

- in C11, sequentially consistent atomics (except for seqcst fences)
  only affect the ordering of sequentially consistent operations.
  In particular, since relaxed loads are done with LDR on ARM, they are
  not ordered against seqcst stores (which are done with STLR); see
  the sketch below.
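
In the following litmus-style sketch (x and y initially 0, r1/r2 are
per-thread locals), a developer used to the Linux API would expect
r1 == 0 && r2 == 0 to be impossible, because atomic_add_return() and
friends are fully ordered there.  Under C11 semantics that outcome is
allowed, and on ARM it is observable because the relaxed loads are
plain LDRs:

   /* thread 1 */                    /* thread 2 */
   qatomic_fetch_inc(&y);            qatomic_fetch_inc(&x);
   r1 = qatomic_read(&x);            r2 = qatomic_read(&y);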

QEMU implements high-level synchronization primitives with the idea that
the primitives contain the necessary memory barriers, and the callers can
use relaxed atomics (qatomic_read/qatomic_set) or even regular accesses.
This is very much incompatible with the C11 view that seqcst accesses
are only ordered against other seqcst accesses, and requires using
seqcst fences as in the following example:

   qatomic_set(&y, 1);            qatomic_set(&x, 1);
   smp_mb();                      smp_mb();
   ... qatomic_read(&x) ...       ... qatomic_read(&y) ...

Bugs ensue when a qatomic_*() read-modify-write operation is used instead
of one or both stores, for example:

   qatomic_<rmw>(&y, ...);
   smp_mb();
   ... qatomic_read(&x) ...

Developers who are more familiar with the Linux API may be tempted
to omit the smp_mb(), and that's exactly what yours truly did in
qemu_event_set() and qemu_event_reset().  After a27dd2de68f3 ("KVM:
keep track of running ioctls", 2023-01-11), this caused hangs due to
threads sleeping forever in qemu_event_wait().
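
A much-simplified sketch of the resulting bug class (this is not the
actual QemuEvent code, and ev->waiters is a made-up field): the setter
does a seqcst RMW and then checks for waiters with a relaxed load,
while the waiter registers itself, runs smp_mb(), and re-checks
ev->value before sleeping.  Without a barrier on the setter's side, its
load of ev->waiters can be ordered before its store to ev->value, so
the wake-up is skipped and the waiter sleeps forever:

   /* setter (buggy) */
   qatomic_or(&ev->value, EV_SET);    /* LDAXR/STLXR, no trailing DMB */
                                      /* missing smp_mb__after_rmw()  */
   if (qatomic_read(&ev->waiters)) {  /* relaxed LDR, may still see 0 */
       qemu_futex_wake(ev, INT_MAX);  /* ...so this wake-up is lost   */
   }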

This _could_ also have been the cause of occasional hangs of rcutorture,
though I have not observed them personally.

(As an aside, GCC's older __sync_* builtins *did* provide a full barrier
between the RMW operation and loads on either side of the operation.  The
difference between seqcst C11 atomics and __sync_* atomics is exactly
an extra memory barrier after the STLXR instruction.)
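
For instance, on ARMv8 the difference would show up roughly as follows
(illustrative; the exact output depends on the compiler and flags):

   __atomic_fetch_add(&y, 1, __ATOMIC_SEQ_CST); /* LDAXR/STLXR loop only    */
   __sync_fetch_and_add(&y, 1);                 /* same loop, plus a full   */
                                                /* barrier (DMB) at the end */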

In order to fix this, while avoiding worse performance from having two
back-to-back memory barriers on x86, patch 1 introduces optimized
memory barriers smp_mb__before_rmw() and smp_mb__after_rmw().  The usage
is similar to Linux's smp_mb__before/after_atomic(), but the name is
different because they affect _all_ RMW operations.  On Linux, instead,
they are not needed around those RMW operations that return the old value.
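
With these, the problematic pattern above becomes (a sketch of the
intended usage; the new barrier is meant to be essentially free on x86,
where the RMW is already fully ordered, and a full barrier on ARM):

   qatomic_<rmw>(&y, ...);
   smp_mb__after_rmw();
   ... qatomic_read(&x) ...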

The remaining patches add them everywhere they are needed.  In the
case of QemuEvent (patches 2-3), I reviewed the algorithm thoroughly,
dropping barriers that were not necessary and killing optimizations that
I wasn't entirely sure about.  For the other cases, instead, the changes
are minimal.

Note: I have a follow-up set of patches that gets rid of atomic_mb_read()
completely; atomic_mb_set(), instead, can remain and mimic Linux's
smp_store_mb() operation.  A glimpse of these changes is already visible
in patches 7 and 8.
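
For reference, Linux's generic fallback for smp_store_mb() is
essentially a plain store followed by a full barrier, so the idea would
be for atomic_mb_set() to keep the same shape (a sketch, not the actual
follow-up patches):

   /* Linux's generic fallback, essentially: */
   #define smp_store_mb(var, value) \
       do { WRITE_ONCE(var, value); smp_mb(); } while (0)

   /* and the QEMU counterpart would then be, roughly: */
   #define atomic_mb_set(ptr, val) \
       do { qatomic_set(ptr, val); smp_mb(); } while (0)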

Thanks to Emanuele Esposito and Gavin Shan for help debugging and
testing the changes!

Paolo

Paolo Bonzini (8):
  qatomic: add smp_mb__before/after_rmw()
  qemu-thread-posix: cleanup, fix, document QemuEvent
  qemu-thread-win32: cleanup, fix, document QemuEvent
  edu: add smp_mb__after_rmw()
  util/async: add smp_mb__after_rmw() around BH enqueue/dequeue
  aio-wait: switch to smp_mb__after_rmw()
  qemu-coroutine-lock: add smp_mb__after_rmw()
  physmem: add missing memory barrier

 docs/devel/atomics.rst     | 26 ++++++++++---
 hw/misc/edu.c              |  5 +++
 include/block/aio-wait.h   |  2 +-
 include/qemu/atomic.h      | 17 ++++++++-
 softmmu/physmem.c          |  3 ++
 util/async.c               | 13 ++++---
 util/qemu-coroutine-lock.c |  9 ++++-
 util/qemu-thread-posix.c   | 64 +++++++++++++++++++++----------
 util/qemu-thread-win32.c   | 78 ++++++++++++++++++++++++++------------
 9 files changed, 160 insertions(+), 57 deletions(-)

-- 
2.39.1
Re: [PATCH 0/8] Fix missing memory barriers on ARM
Posted by David Hildenbrand 1 year, 1 month ago
On 03.03.23 18:19, Paolo Bonzini wrote:
> This series fixes more instances of the problem fixed by commits
> 5710a3e09f9b ("async: use explicit memory barriers", 2020-04-09) and
> 7455ff1aa015 ("aio_wait_kick: add missing memory barrier", 2022-06-24).
> This is an interesting case where ARM's memory ordering is somewhat
> stronger than what you would expect.
> 
> On ARM, seqcst loads and stores (which QEMU does not use) are compiled
> respectively as LDAR and STLR instructions.  Even though STLR is also
> used for store-release operations, STLR followed by LDAR provides
> store-against-load ordering, which is stronger than a store-release.
> Compare this to ARMv7, where store-release is DMB+STR and store-seqcst
> is DMB+STR+DMB.
> 
> This means that on ARM a sequence of
> 
>    qatomic_store_release(&y, ...);         // STLR
>    a = qatomic_load_acquire(&x);           // LDAR
> 
> provides stronger ordering at the processor level than the two MOV
> instructions you'd get on x86.
> 
> Likewise, on ARM sequentially consistent read-modify-write operations only
> need to use LDAXR and STLXR respectively for the load and the store, which
> is weaker than the LOCK prefix used on x86.
> 
> In a strange twist of events, however, the _stronger_ semantics
> of the ARM instructions can end up causing bugs on ARM, not on x86.
> The problems occur when seqcst atomics are mixed with relaxed atomics.
> Here is how the two are compiled on ARM:
> 
>                     load              store
>     relaxed         LDR               STR
>     seqcst          LDAR              STLR
> 
> QEMU's atomics try to bridge the Linux API (that most of the developers
> are familiar with) and the C11 API, and the two have a substantial
> difference:
> 
> - in Linux, strongly-ordered atomics such as atomic_add_return() affect
>    the global ordering of _all_ memory operations, including for example
>    READ_ONCE()/WRITE_ONCE()
> 
> - in C11, sequentially consistent atomics (except for seqcst fences)
>    only affect the ordering of sequentially consistent operations.
>    In particular, since relaxed loads are done with LDR on ARM, they are
>    not ordered against seqcst stores (which are done with STLR).
> 
> QEMU implements high-level synchronization primitives with the idea that
> the primitives contain the necessary memory barriers, and the callers can
> use relaxed atomics (qatomic_read/qatomic_set) or even regular accesses.
> This is very much incompatible with the C11 view that seqcst accesses
> are only ordered against other seqcst accesses, and requires using
> seqcst fences as in the following example:
> 
>     qatomic_set(&y, 1);            qatomic_set(&x, 1);
>     smp_mb();                      smp_mb();
>     ... qatomic_read(&x) ...       ... qatomic_read(&y) ...
> 
> Bugs ensue when a qatomic_*() read-modify write operation is used instead
> of one or both stores, for example:
> 
>     qatomic_<rmw>(&y, ...);
>     smp_mb();
>     ... qatomic_read(&x) ...
> 
> Developers that are more familiar with the Linux API may be tempted
> to omit the smp_mb() and that's exactly what yours truly did in
> qemu_event_set() and qemu_event_reset().  After a27dd2de68f3 ("KVM:
> keep track of running ioctls", 2023-01-11), this caused hangs due to
> threads sleeping forever in qemu_event_wait().
> 
> This _could_ also have been the cause of occasional hangs of rcutorture,
> though I have not observed them personally.
> 
> (As an aside, GCC's older __sync_* builtins *did* provide a full barrier
> between the RMW operation and loads on the side of the operation.  The
> difference between seqcst C11 atomics and __sync_* atomics is exactly
> an extra memory barrier after the STLXR instruction).
> 
> In order to fix this, while avoiding worse performance from having two
> back-to-back memory barriers on x86, patch 1 introduces optimized
> memory barriers smp_mb__before_rmw() and smp_mb__after_rmw().  The usage
> is similar to Linux's smp_mb__before/after_atomic(), but the name is
> different because they affect _all_ RMW operations.  On Linux, instead,
> they are not needed around those RMW operations that return the old value.
> 
> The remaining patches add them everywhere they are needed.  In the
> case of QemuEvent (patches 2-3), I reviewed the algorithm thoroughly,
> dropping barriers that were not necessary and killing optimizations that
> I wasn't entirely sure about.  For the other cases, instead, the changes
> are minimal.
> 
> Note: I have a follow-up set of patches that gets rid completely of
> atomic_mb_read(); atomic_mb_set() instead can remain and mimic Linux's
> smp_store_mb() operation.  A glimpse of these changes is already visible
> in patches 7 and 8.

That sounds interesting. I was briefly confused about atomic_mb_* ...


... do we want to add some Fixes: tags or is it too hard to come up with 
something reasonable?

-- 
Thanks,

David / dhildenb
Re: [PATCH 0/8] Fix missing memory barriers on ARM
Posted by Paolo Bonzini 1 year, 1 month ago
On 3/6/23 14:35, David Hildenbrand wrote:
>> Note: I have a follow-up set of patches that gets rid completely of
>> atomic_mb_read(); atomic_mb_set() instead can remain and mimic Linux's
>> smp_store_mb() operation.  A glimpse of these changes is already visible
>> in patches 7 and 8.
> 
> That sounds interesting. I was briefly confused about atomic_mb_* ...
> 
> ... do we want to add some Fixes: tags or is it too hard to come up with 
> something reasonable?

The Fixes tag would often point to

     commit a0aa44b488b3601415d55041e4619aef5f3a4ba8
     Author: Alex Bennée <alex.bennee@linaro.org>
     Date:   Thu Jan 28 10:15:17 2016 +0000

     include/qemu/atomic.h: default to __atomic functions
     
     The __atomic primitives have been available since GCC 4.7 and provide
     a richer interface for describing memory ordering requirements. As a
     bonus by using the primitives instead of hand-rolled functions we can
     use tools such as the ThreadSanitizer which need the use of well
     defined APIs for its analysis.
     
     If we have __ATOMIC defines we exclusively use the __atomic primitives
     for all our atomic access. Otherwise we fall back to the mixture of
     __sync and hand-rolled barrier cases.

(for which I would have sworn I was the sole perpetrator, not just the
committer).  But it pre-dates some of the code that is being fixed, so
I am not sure it makes sense to add it.

Paolo