There is no implicit memory barrier in qatomic_fetch_or() and
qatomic_fetch_and() on ARM systems. Add an explicit
smp_mb__after_rmw() to match the intended semantics.
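As a self-contained C11 analogue of the resulting pattern (illustration
only, not QEMU code: the flag, variable and function names below are made
up), the explicit fence plays the role that smp_mb__after_rmw() plays
after the qatomic_* read-modify-write:

    #include <stdatomic.h>

    enum { PENDING = 1 };              /* made-up stand-in for BH_PENDING */

    static atomic_uint flags;

    static void schedule_once(void)
    {
        unsigned old = atomic_fetch_or(&flags, PENDING);   /* the RMW */
        atomic_thread_fence(memory_order_seq_cst);         /* "smp_mb__after_rmw()" */
        if (!(old & PENDING)) {
            /* loads and stores here are ordered after the flag update
             * on every host, not just on x86 */
        }
    }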
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
util/async.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/util/async.c b/util/async.c
index 0657b7539777..6129f2c991cb 100644
--- a/util/async.c
+++ b/util/async.c
@@ -74,13 +74,15 @@ static void aio_bh_enqueue(QEMUBH *bh, unsigned new_flags)
unsigned old_flags;
/*
- * The memory barrier implicit in qatomic_fetch_or makes sure that:
- * 1. idle & any writes needed by the callback are done before the
- * locations are read in the aio_bh_poll.
+ * The memory barrier makes sure that:
+ * 1. any writes needed by the callback are visible from the callback
+ * after aio_bh_dequeue() returns bh.
* 2. ctx is loaded before the callback has a chance to execute and bh
* could be freed.
*/
old_flags = qatomic_fetch_or(&bh->flags, BH_PENDING | new_flags);
+ smp_mb__after_rmw();
+
if (!(old_flags & BH_PENDING)) {
QSLIST_INSERT_HEAD_ATOMIC(&ctx->bh_list, bh, next);
}
@@ -107,14 +109,15 @@ static QEMUBH *aio_bh_dequeue(BHList *head, unsigned *flags)
QSLIST_REMOVE_HEAD(head, next);
/*
- * The qatomic_and is paired with aio_bh_enqueue(). The implicit memory
- * barrier ensures that the callback sees all writes done by the scheduling
+ * The memory barrier is paired with aio_bh_enqueue() and it
+ * ensures that the callback sees all writes done by the scheduling
* thread. It also ensures that the scheduling thread sees the cleared
* flag before bh->cb has run, and thus will call aio_notify again if
* necessary.
*/
*flags = qatomic_fetch_and(&bh->flags,
~(BH_PENDING | BH_SCHEDULED | BH_IDLE));
+ smp_mb__after_rmw();
return bh;
}
--
2.39.1
On 3/3/23 09:19, Paolo Bonzini wrote:
> There is no implicit memory barrier in qatomic_fetch_or() and
> qatomic_fetch_and() on ARM systems. Add an explicit
> smp_mb__after_rmw() to match the intended semantics.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
> util/async.c | 13 ++++++++-----
> 1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/util/async.c b/util/async.c
> index 0657b7539777..6129f2c991cb 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -74,13 +74,15 @@ static void aio_bh_enqueue(QEMUBH *bh, unsigned new_flags)
> unsigned old_flags;
>
> /*
> - * The memory barrier implicit in qatomic_fetch_or makes sure that:
> - * 1. idle & any writes needed by the callback are done before the
> - * locations are read in the aio_bh_poll.
> + * The memory barrier makes sure that:
> + * 1. any writes needed by the callback are visible from the callback
> + * after aio_bh_dequeue() returns bh.
> * 2. ctx is loaded before the callback has a chance to execute and bh
> * could be freed.
> */
> old_flags = qatomic_fetch_or(&bh->flags, BH_PENDING | new_flags);
> + smp_mb__after_rmw();
> +
> if (!(old_flags & BH_PENDING)) {
> QSLIST_INSERT_HEAD_ATOMIC(&ctx->bh_list, bh, next);
> }
> @@ -107,14 +109,15 @@ static QEMUBH *aio_bh_dequeue(BHList *head, unsigned *flags)
> QSLIST_REMOVE_HEAD(head, next);
>
> /*
> - * The qatomic_and is paired with aio_bh_enqueue(). The implicit memory
> - * barrier ensures that the callback sees all writes done by the scheduling
> + * The memory barrier is paired with aio_bh_enqueue() and it
> + * ensures that the callback sees all writes done by the scheduling
> * thread. It also ensures that the scheduling thread sees the cleared
> * flag before bh->cb has run, and thus will call aio_notify again if
> * necessary.
> */
Is it really paired with aio_bh_enqueue?
Seems like the enqueue barrier really is for aio_bh_poll, and the dequeue barrier is for
the callback.
While the comments seem off, the code seems correct. If you agree:
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
r~
On 3/5/23 20:32, Richard Henderson wrote:
> On 3/3/23 09:19, Paolo Bonzini wrote:
>> unsigned old_flags;
>> /*
>> * The memory barrier makes sure that:
>> * 1. any writes needed by the callback are visible from the callback
>> * after aio_bh_dequeue() returns bh.
>> * 2. ctx is loaded before the callback has a chance to execute and bh
>> * could be freed.
>> */
>> old_flags = qatomic_fetch_or(&bh->flags, BH_PENDING | new_flags);
>> + smp_mb__after_rmw();
>> +
>> if (!(old_flags & BH_PENDING)) {
>> QSLIST_INSERT_HEAD_ATOMIC(&ctx->bh_list, bh, next);
>> }
>> @@ -107,14 +109,15 @@ static QEMUBH *aio_bh_dequeue(BHList *head,
>> unsigned *flags)
>> QSLIST_REMOVE_HEAD(head, next);
>> /*
>> * The memory barrier is paired with aio_bh_enqueue() and it
>> * ensures that the callback sees all writes done by the scheduling
>> * thread. It also ensures that the scheduling thread sees the cleared
>> * flag before bh->cb has run, and thus will call aio_notify again if
>> * necessary.
>> */
>
> Is it really paired with aio_bh_enqueue?
>
> Seems like the enqueue barrier really is for aio_bh_poll, and the
> dequeue barrier is for the callback.
The documentation has been quite obsolete since the introduction of
bh_list. The logic is as follows:
  aio_bh_enqueue()
      load bh->ctx
      set PENDING flag                                      // (0)
      if flag was not set
          // bh becomes visible to dequeuing thread here:
          insert into bh_list                               // (1)
      aio_notify
          // Write bh->flags and bh_list before ctx->notified
          smp_wmb()                                         // (2)
          set notified to true
          // Write notified before reading notify_me
          smp_mb()                                          // (3)
          if notify_me then event_notifier_set()
On the dequeue side it is tricky to see why all the barriers are
needed; it boils down to the busy-wait polling done at the beginning
of aio_poll():
  aio_poll()
      compute approximate timeout (*unordered* read of bh_list)
      enable notify_me
      // Write notify_me before reading notified
      smp_mb()                                  // synchronizes with (3)
      if notified then timeout = 0
      ctx->fdmon_ops->wait(timeout)
      disable notify_me
      aio_notify_accept()
          set notified to false
          // Write ctx->notified before reading e.g. bh_list
          smp_mb()                              // synchronizes with (2)
      aio_bh_poll()
          QSLIST_MOVE_ATOMIC                    // synchronizes with (1)
          aio_bh_dequeue()
              remove from head
              clear PENDING/SCHEDULED/IDLE      // synchronizes with (0)
              if flags were set
                  aio_bh_call()
That is:

for synchronization point (0)

- the main function here is to ensure that aio_bh_dequeue() removes the
  bh from the list before the PENDING flag is cleared and, vice versa,
  that an enqueuer which sets PENDING does not touch the list until it
  has observed the cleared flag; otherwise the two sides can clobber
  the list. For this, it is enough that qatomic_fetch_and() is
  "release" and qatomic_fetch_or() is "acquire" (and they are); a C11
  sketch of this pairing follows below.
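Here is a self-contained C11 sketch of just this pairing (illustration
only, not the QEMU code: the two-field struct bh, dequeue_side() and
enqueue_side() are invented names, and concurrent enqueuers, which are
point (1), are ignored):

    #include <stdatomic.h>
    #include <stddef.h>

    enum { PENDING = 1 };

    struct bh {
        atomic_uint flags;
        struct bh *next;              /* stands in for the bh_list linkage */
    };

    /* Dequeue side: finish touching the list, then clear PENDING with a
     * "release" RMW so the list manipulation happens-before whoever
     * re-sets the flag. */
    static void dequeue_side(struct bh *bh)
    {
        bh->next = NULL;              /* remove from the list */
        atomic_fetch_and_explicit(&bh->flags, ~(unsigned)PENDING,
                                  memory_order_release);
    }

    /* Enqueue side: set PENDING with an "acquire" RMW; if the flag was
     * clear, the dequeuer's list updates are guaranteed to be visible,
     * so it is safe to link the bh back in. */
    static int enqueue_side(struct bh *bh, struct bh *head)
    {
        unsigned old = atomic_fetch_or_explicit(&bh->flags, PENDING,
                                                memory_order_acquire);
        if (!(old & PENDING)) {
            bh->next = head;
            return 1;                 /* caller inserts bh into the list */
        }
        return 0;
    }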
for synchronization point (1)

- the bottom half machinery between aio_bh_enqueue() and aio_bh_poll()
  only needs release-to-acquire synchronization, and that is provided
  by (1). This is also where ordering ensures that bh->ctx is loaded
  before the callback executes, which could free bh (see the sketch
  below).

- between enqueuers, setting the PENDING flag only needs to be done
  before inserting into bh_list, to avoid double insertion (and of
  course it needs to be done atomically); for this purpose,
  QSLIST_INSERT_HEAD_ATOMIC needs to be "release" (which it is).

Also, the call to aio_notify() in aio_bh_enqueue() is unconditional, so
aio_bh_dequeue() need not protect against missed calls to aio_notify();
aio_notify() only synchronizes with aio_poll() and aio_notify_accept().
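To make point (1) concrete, here is a self-contained C11 sketch of the
release-publish / acquire-drain idiom (illustration only: this is not
the QSLIST implementation, and struct node, publish() and drain() are
made-up names):

    #include <stdatomic.h>
    #include <stddef.h>

    struct node {
        void (*cb)(void *);
        void *ctx;                    /* must be visible before cb runs */
        struct node *next;
    };

    static _Atomic(struct node *) list_head;

    /* Enqueue side: fill in the node, then publish it with a release CAS
     * so that everything written to it happens-before it becomes
     * reachable from the head. */
    static void publish(struct node *n)
    {
        struct node *old = atomic_load_explicit(&list_head, memory_order_relaxed);
        do {
            n->next = old;
        } while (!atomic_compare_exchange_weak_explicit(&list_head, &old, n,
                                                        memory_order_release,
                                                        memory_order_relaxed));
    }

    /* Poll side: grab the whole list with an acquire exchange (the role
     * played by QSLIST_MOVE_ATOMIC), so n->cb and n->ctx are guaranteed
     * visible, and read ->next before running the callback, which may
     * free the node. */
    static void drain(void)
    {
        struct node *n = atomic_exchange_explicit(&list_head, NULL,
                                                  memory_order_acquire);
        while (n) {
            struct node *next = n->next;
            n->cb(n->ctx);
            n = next;
        }
    }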
for synchronization point (2)

- this cannot be just a smp_rmb(), because the timeout that was
  computed earlier was not ordered against either notified or
  notify_me.

- it is a smp_mb() for generality; as far as bh_list is concerned it
  could be smp_mb__before_rmw().

Synchronization point (3) is pretty mundane in comparison.
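For the record, point (3) is the classic store-buffering pattern, and it
is the one place where plain release/acquire would not be enough. A
self-contained C11 litmus sketch (illustration only; the variables stand
in for ctx->notified and ctx->notify_me, and the two functions model
aio_notify() and the prologue of aio_poll()):

    #include <stdatomic.h>

    static atomic_int notified;        /* stands in for ctx->notified */
    static atomic_int notify_me;       /* stands in for ctx->notify_me */

    static int notifier_side(void)     /* models aio_notify() */
    {
        atomic_store_explicit(&notified, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);     /* smp_mb(), (3) */
        return atomic_load_explicit(&notify_me, memory_order_relaxed);
    }

    static int poller_side(void)       /* models the prologue of aio_poll() */
    {
        atomic_store_explicit(&notify_me, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);     /* smp_mb() */
        return atomic_load_explicit(&notified, memory_order_relaxed);
    }

If the two sides race, the outcome where both functions return 0 is
forbidden by the two fences, so either the poller sees notified and uses
a zero timeout, or the notifier sees notify_me and calls
event_notifier_set().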
So I'm dropping this patch; I have prepared a documentation patch instead.
Paolo