There is no implicit memory barrier in qatomic_fetch_or() and
qatomic_fetch_and() on ARM systems. Add an explicit
smp_mb__after_rmw() to match the intended semantics.
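As a self-contained C11 analogue of the resulting pattern (illustration
only, not QEMU code: the flag, variable and function names below are made
up), the explicit fence plays the role that smp_mb__after_rmw() plays
after the qatomic_* read-modify-write:

    #include <stdatomic.h>

    enum { PENDING = 1 };              /* made-up stand-in for BH_PENDING */

    static atomic_uint flags;

    static void schedule_once(void)
    {
        unsigned old = atomic_fetch_or(&flags, PENDING);   /* the RMW */
        atomic_thread_fence(memory_order_seq_cst);         /* "smp_mb__after_rmw()" */
        if (!(old & PENDING)) {
            /* loads and stores here are ordered after the flag update
             * on every host, not just on x86 */
        }
    }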
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
util/async.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/util/async.c b/util/async.c
index 0657b7539777..6129f2c991cb 100644
--- a/util/async.c
+++ b/util/async.c
@@ -74,13 +74,15 @@ static void aio_bh_enqueue(QEMUBH *bh, unsigned new_flags)
unsigned old_flags;
/*
- * The memory barrier implicit in qatomic_fetch_or makes sure that:
- * 1. idle & any writes needed by the callback are done before the
- * locations are read in the aio_bh_poll.
+ * The memory barrier makes sure that:
+ * 1. any writes needed by the callback are visible from the callback
+ * after aio_bh_dequeue() returns bh.
* 2. ctx is loaded before the callback has a chance to execute and bh
* could be freed.
*/
old_flags = qatomic_fetch_or(&bh->flags, BH_PENDING | new_flags);
+ smp_mb__after_rmw();
+
if (!(old_flags & BH_PENDING)) {
QSLIST_INSERT_HEAD_ATOMIC(&ctx->bh_list, bh, next);
}
@@ -107,14 +109,15 @@ static QEMUBH *aio_bh_dequeue(BHList *head, unsigned *flags)
QSLIST_REMOVE_HEAD(head, next);
/*
- * The qatomic_and is paired with aio_bh_enqueue(). The implicit memory
- * barrier ensures that the callback sees all writes done by the scheduling
+ * The memory barrier is paired with aio_bh_enqueue() and it
+ * ensures that the callback sees all writes done by the scheduling
* thread. It also ensures that the scheduling thread sees the cleared
* flag before bh->cb has run, and thus will call aio_notify again if
* necessary.
*/
*flags = qatomic_fetch_and(&bh->flags,
~(BH_PENDING | BH_SCHEDULED | BH_IDLE));
+ smp_mb__after_rmw();
return bh;
}
--
2.39.1
On 3/3/23 09:19, Paolo Bonzini wrote:
> There is no implicit memory barrier in qatomic_fetch_or() and
> qatomic_fetch_and() on ARM systems. Add an explicit
> smp_mb__after_rmw() to match the intended semantics.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
> util/async.c | 13 ++++++++-----
> 1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/util/async.c b/util/async.c
> index 0657b7539777..6129f2c991cb 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -74,13 +74,15 @@ static void aio_bh_enqueue(QEMUBH *bh, unsigned new_flags)
> unsigned old_flags;
>
> /*
> - * The memory barrier implicit in qatomic_fetch_or makes sure that:
> - * 1. idle & any writes needed by the callback are done before the
> - * locations are read in the aio_bh_poll.
> + * The memory barrier makes sure that:
> + * 1. any writes needed by the callback are visible from the callback
> + * after aio_bh_dequeue() returns bh.
> * 2. ctx is loaded before the callback has a chance to execute and bh
> * could be freed.
> */
> old_flags = qatomic_fetch_or(&bh->flags, BH_PENDING | new_flags);
> + smp_mb__after_rmw();
> +
> if (!(old_flags & BH_PENDING)) {
> QSLIST_INSERT_HEAD_ATOMIC(&ctx->bh_list, bh, next);
> }
> @@ -107,14 +109,15 @@ static QEMUBH *aio_bh_dequeue(BHList *head, unsigned *flags)
> QSLIST_REMOVE_HEAD(head, next);
>
> /*
> - * The qatomic_and is paired with aio_bh_enqueue(). The implicit memory
> - * barrier ensures that the callback sees all writes done by the scheduling
> + * The memory barrier is paired with aio_bh_enqueue() and it
> + * ensures that the callback sees all writes done by the scheduling
> * thread. It also ensures that the scheduling thread sees the cleared
> * flag before bh->cb has run, and thus will call aio_notify again if
> * necessary.
> */
Is it really paired with aio_bh_enqueue?
Seems like the enqueue barrier really is for aio_bh_poll, and the dequeue barrier is for
the callback.
While the comments seem off, the code seems correct. If you agree:
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
r~
On 3/5/23 20:32, Richard Henderson wrote:
> On 3/3/23 09:19, Paolo Bonzini wrote:
>> unsigned old_flags;
>> /*
>> * The memory barrier makes sure that:
>> * 1. any writes needed by the callback are visible from the callback
>> * after aio_bh_dequeue() returns bh.
>> * 2. ctx is loaded before the callback has a chance to execute and bh
>> * could be freed.
>> */
>> old_flags = qatomic_fetch_or(&bh->flags, BH_PENDING | new_flags);
>> + smp_mb__after_rmw();
>> +
>> if (!(old_flags & BH_PENDING)) {
>> QSLIST_INSERT_HEAD_ATOMIC(&ctx->bh_list, bh, next);
>> }
>> @@ -107,14 +109,15 @@ static QEMUBH *aio_bh_dequeue(BHList *head,
>> unsigned *flags)
>> QSLIST_REMOVE_HEAD(head, next);
>> /*
>> * The memory barrier is paired with aio_bh_enqueue() and it
>> * ensures that the callback sees all writes done by the scheduling
>> * thread. It also ensures that the scheduling thread sees the cleared
>> * flag before bh->cb has run, and thus will call aio_notify again if
>> * necessary.
>> */
>
> Is it really paired with aio_bh_enqueue?
>
> Seems like the enqueue barrier really is for aio_bh_poll, and the
> dequeue barrier is for the callback.
The documentation has been quite obsolete since the introduction of
bh_list. The logic is as follows:
  aio_bh_enqueue()
      load bh->ctx
      set PENDING flag                                      // (0)
      if flag was not set
          // bh becomes visible to dequeuing thread here:
          insert into bh_list                               // (1)
      aio_notify
          // Write bh->flags and bh_list before ctx->notified
          smp_wmb()                                         // (2)
          set notified to true
          // Write notified before reading notify_me
          smp_mb()                                          // (3)
          if notify_me then event_notifier_set()
On the dequeue side it is tricky to see why all the barriers are
needed; it boils down to the busy-wait polling done at the beginning
of aio_poll():
  aio_poll()
      compute approximate timeout (*unordered* read of bh_list)
      enable notify_me
      // Write notify_me before reading notified
      smp_mb()                                  // synchronizes with (3)
      if notified then timeout = 0
      ctx->fdmon_ops->wait(timeout)
      disable notify_me
      aio_notify_accept()
          set notified to false
          // Write ctx->notified before reading e.g. bh_list
          smp_mb()                              // synchronizes with (2)
      aio_bh_poll()
          QSLIST_MOVE_ATOMIC                    // synchronizes with (1)
          aio_bh_dequeue()
              remove from head
              clear PENDING/SCHEDULED/IDLE      // synchronizes with (0)
              if flags were set
                  aio_bh_call()
That is:

for synchronization point (0)

- the main function here is to ensure that aio_bh_dequeue() removes the
  bh from the list before the PENDING flag is cleared and, vice versa,
  that an enqueuer which sets PENDING does not touch the list until it
  has observed the cleared flag; otherwise the two sides can clobber
  the list. For this, it is enough that qatomic_fetch_and() is
  "release" and qatomic_fetch_or() is "acquire" (and they are); a C11
  sketch of this pairing follows below.
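Here is a self-contained C11 sketch of just this pairing (illustration
only, not the QEMU code: the two-field struct bh, dequeue_side() and
enqueue_side() are invented names, and concurrent enqueuers, which are
point (1), are ignored):

    #include <stdatomic.h>
    #include <stddef.h>

    enum { PENDING = 1 };

    struct bh {
        atomic_uint flags;
        struct bh *next;              /* stands in for the bh_list linkage */
    };

    /* Dequeue side: finish touching the list, then clear PENDING with a
     * "release" RMW so the list manipulation happens-before whoever
     * re-sets the flag. */
    static void dequeue_side(struct bh *bh)
    {
        bh->next = NULL;              /* remove from the list */
        atomic_fetch_and_explicit(&bh->flags, ~(unsigned)PENDING,
                                  memory_order_release);
    }

    /* Enqueue side: set PENDING with an "acquire" RMW; if the flag was
     * clear, the dequeuer's list updates are guaranteed to be visible,
     * so it is safe to link the bh back in. */
    static int enqueue_side(struct bh *bh, struct bh *head)
    {
        unsigned old = atomic_fetch_or_explicit(&bh->flags, PENDING,
                                                memory_order_acquire);
        if (!(old & PENDING)) {
            bh->next = head;
            return 1;                 /* caller inserts bh into the list */
        }
        return 0;
    }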
for synchronization point (1)

- the bottom half machinery between aio_bh_enqueue() and aio_bh_poll()
  only needs release-to-acquire synchronization, and that is provided
  by (1). This is also where ordering ensures that bh->ctx is loaded
  before the callback executes, which could free bh (see the sketch
  below).

- between enqueuers, setting the PENDING flag only needs to be done
  before inserting into bh_list, to avoid double insertion (and of
  course it needs to be done atomically); for this purpose,
  QSLIST_INSERT_HEAD_ATOMIC needs to be "release" (which it is).

Also, the call to aio_notify() in aio_bh_enqueue() is unconditional, so
aio_bh_dequeue() need not protect against missed calls to aio_notify();
aio_notify() only synchronizes with aio_poll() and aio_notify_accept().
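To make point (1) concrete, here is a self-contained C11 sketch of the
release-publish / acquire-drain idiom (illustration only: this is not
the QSLIST implementation, and struct node, publish() and drain() are
made-up names):

    #include <stdatomic.h>
    #include <stddef.h>

    struct node {
        void (*cb)(void *);
        void *ctx;                    /* must be visible before cb runs */
        struct node *next;
    };

    static _Atomic(struct node *) list_head;

    /* Enqueue side: fill in the node, then publish it with a release CAS
     * so that everything written to it happens-before it becomes
     * reachable from the head. */
    static void publish(struct node *n)
    {
        struct node *old = atomic_load_explicit(&list_head, memory_order_relaxed);
        do {
            n->next = old;
        } while (!atomic_compare_exchange_weak_explicit(&list_head, &old, n,
                                                        memory_order_release,
                                                        memory_order_relaxed));
    }

    /* Poll side: grab the whole list with an acquire exchange (the role
     * played by QSLIST_MOVE_ATOMIC), so n->cb and n->ctx are guaranteed
     * visible, and read ->next before running the callback, which may
     * free the node. */
    static void drain(void)
    {
        struct node *n = atomic_exchange_explicit(&list_head, NULL,
                                                  memory_order_acquire);
        while (n) {
            struct node *next = n->next;
            n->cb(n->ctx);
            n = next;
        }
    }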
for synchronization point (2)

- this cannot be just a smp_rmb(), because the timeout that was
  computed earlier was not ordered against either notified or
  notify_me.

- it is a smp_mb() for generality; as far as bh_list is concerned it
  could be smp_mb__before_rmw().

Synchronization point (3) is pretty mundane in comparison.
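For the record, point (3) is the classic store-buffering pattern, and it
is the one place where plain release/acquire would not be enough. A
self-contained C11 litmus sketch (illustration only; the variables stand
in for ctx->notified and ctx->notify_me, and the two functions model
aio_notify() and the prologue of aio_poll()):

    #include <stdatomic.h>

    static atomic_int notified;        /* stands in for ctx->notified */
    static atomic_int notify_me;       /* stands in for ctx->notify_me */

    static int notifier_side(void)     /* models aio_notify() */
    {
        atomic_store_explicit(&notified, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);     /* smp_mb(), (3) */
        return atomic_load_explicit(&notify_me, memory_order_relaxed);
    }

    static int poller_side(void)       /* models the prologue of aio_poll() */
    {
        atomic_store_explicit(&notify_me, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);     /* smp_mb() */
        return atomic_load_explicit(&notified, memory_order_relaxed);
    }

If the two sides race, the outcome where both functions return 0 is
forbidden by the two fences, so either the poller sees notified and uses
a zero timeout, or the notifier sees notify_me and calls
event_notifier_set().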
So I'm dropping this patch; I have prepared a documentation patch instead.
Paolo