Commit b7a745d added a qemu_bh_cancel call to the completion function
as an optimization to prevent it from unnecessarily rescheduling itself.

This completion function is scheduled from worker_thread, after setting
the state of a ThreadPoolElement to THREAD_DONE.

This was considered to be safe, as the completion function restarts the
loop just after the call to qemu_bh_cancel. But, under certain access
patterns and scheduling conditions, the loop may wrongly use a
pre-fetched elem->state value, reading it as THREAD_QUEUED, and ending
the completion function without having processed a pending TPE linked at
pool->head:

         worker thread             |            I/O thread
------------------------------------------------------------------------
                                   | speculatively read req->state
req->state = THREAD_DONE;          |
qemu_bh_schedule(p->completion_bh) |
  bh->scheduled = 1;               |
                                   | qemu_bh_cancel(p->completion_bh)
                                   |   bh->scheduled = 0;
                                   | if (req->state == THREAD_DONE)
                                   |   // sees THREAD_QUEUED

The source of the misunderstanding was that qemu_bh_cancel is now being
used by the _consumer_ rather than the producer, and therefore now needs
to have acquire semantics just like e.g. aio_bh_poll.

In some situations, if there are no other independent requests in the
same aio context that could eventually trigger the scheduling of the
completion function, the omitted TPE and all operations pending on it
will get stuck forever.

Signed-off-by: Sergio Lopez <slp@redhat.com>
---
 util/async.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/util/async.c b/util/async.c
index 355af73ee7..0e1bd8780a 100644
--- a/util/async.c
+++ b/util/async.c
@@ -174,7 +174,7 @@ void qemu_bh_schedule(QEMUBH *bh)
  */
 void qemu_bh_cancel(QEMUBH *bh)
 {
-    bh->scheduled = 0;
+    atomic_mb_set(&bh->scheduled, 0);
 }

 /* This func is async.The bottom half will do the delete action at the finial
--
2.13.6
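
The handshake described in the commit message can be modelled outside
QEMU with C11 atomics. The sketch below is only an illustration under
assumptions: every demo_* name is hypothetical, and atomic_exchange
stands in for the full-barrier store that the patch introduces. The
property it checks is that, whichever exchange on the flag comes first,
either the consumer observes the finished state or the producer learns
that it has to notify again. The check is single-threaded, so it walks
the two possible orders of the exchanges but does not reproduce the
hardware reordering itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the ThreadPoolElement states. */
enum demo_state { DEMO_QUEUED, DEMO_ACTIVE, DEMO_DONE };

struct demo_bh   { atomic_int scheduled; };  /* models bh->scheduled */
struct demo_elem { atomic_int state; };      /* models req->state    */

/* Producer: publish the result, then schedule the bottom half.
 * Returns true when the flag was clear, i.e. when the producer
 * has to wake the consumer up again. */
bool demo_worker_finish(struct demo_elem *e, struct demo_bh *bh)
{
    atomic_store_explicit(&e->state, DEMO_DONE, memory_order_release);
    return atomic_exchange(&bh->scheduled, 1) == 0;
}

/* Consumer: clear the flag with a sequentially consistent exchange
 * (assumed shape of the patched cancel), then re-read the state.
 * Returns true when the finished element was observed. */
bool demo_completion_pass(struct demo_elem *e, struct demo_bh *bh)
{
    (void)atomic_exchange(&bh->scheduled, 0);
    return atomic_load_explicit(&e->state, memory_order_acquire) == DEMO_DONE;
}

/* Walk both orders of the two exchanges. Returns 0 when, in each
 * order, the element is either seen or re-notified. */
int demo_check_orderings(void)
{
    struct demo_elem e;
    struct demo_bh bh;
    bool seen, notify;

    /* Order A: the producer's exchange comes first. */
    atomic_init(&e.state, DEMO_QUEUED);
    atomic_init(&bh.scheduled, 1);   /* completion function is running */
    notify = demo_worker_finish(&e, &bh);
    seen = demo_completion_pass(&e, &bh);
    if (!(seen || notify)) {
        return 1;
    }

    /* Order B: the consumer's exchange comes first. */
    atomic_store(&e.state, DEMO_QUEUED);
    atomic_store(&bh.scheduled, 1);
    seen = demo_completion_pass(&e, &bh);
    notify = demo_worker_finish(&e, &bh);
    if (!(seen || notify)) {
        return 2;
    }
    return 0;
}
```

Calling demo_check_orderings() from a test program should return 0. With
a plain store in the consumer instead of the exchange, nothing orders
that store before the following load, which is the gap the patch closes.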

On Wed, Nov 08, 2017 at 07:34:47AM +0100, Sergio Lopez wrote:
> Commit b7a745d added a qemu_bh_cancel call to the completion function
> as an optimization to prevent it from unnecessarily rescheduling itself.
>
> This completion function is scheduled from worker_thread, after setting
> the state of a ThreadPoolElement to THREAD_DONE.
>
> This was considered to be safe, as the completion function restarts the
> loop just after the call to qemu_bh_cancel. But, under certain access
> patterns and scheduling conditions, the loop may wrongly use a
> pre-fetched elem->state value, reading it as THREAD_QUEUED, and ending
> the completion function without having processed a pending TPE linked at
> pool->head:
>
>          worker thread             |            I/O thread
> ------------------------------------------------------------------------
>                                    | speculatively read req->state
> req->state = THREAD_DONE;          |
> qemu_bh_schedule(p->completion_bh) |
>   bh->scheduled = 1;               |
>                                    | qemu_bh_cancel(p->completion_bh)
>                                    |   bh->scheduled = 0;
>                                    | if (req->state == THREAD_DONE)
>                                    |   // sees THREAD_QUEUED
>
> The source of the misunderstanding was that qemu_bh_cancel is now being
> used by the _consumer_ rather than the producer, and therefore now needs
> to have acquire semantics just like e.g. aio_bh_poll.
>
> In some situations, if there are no other independent requests in the
> same aio context that could eventually trigger the scheduling of the
> completion function, the omitted TPE and all operations pending on it
> will get stuck forever.
>
> Signed-off-by: Sergio Lopez <slp@redhat.com>
> ---
>  util/async.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Thanks, applied to my block tree:
https://github.com/stefanha/qemu/commits/block

Stefan

On 08.11.2017 09:34, Sergio Lopez wrote:
> Commit b7a745d added a qemu_bh_cancel call to the completion function
> as an optimization to prevent it from unnecessarily rescheduling itself.
>
> This completion function is scheduled from worker_thread, after setting
> the state of a ThreadPoolElement to THREAD_DONE.
>

Great! We are seeing the same problem, and I was describing my fix, when
I came across your patch :)

> This was considered to be safe, as the completion function restarts the
> loop just after the call to qemu_bh_cancel. But, under certain access
> patterns and scheduling conditions, the loop may wrongly use a
> pre-fetched elem->state value, reading it as THREAD_QUEUED, and ending
> the completion function without having processed a pending TPE linked at
> pool->head:

I'm not quite sure that the pre-fetched is involved in this issue,
because pre-fetch reading a certain addresses should be invalidated by
write on another core to the same addresses. In our case write
req->state = THREAD_DONE should invalidate read req->state == THREAD_DONE.
I am inclined to think that there is a memory-reordering read with
write. It's a very real case for x86 and I don't see the reasons which
can prevent it:

.text:000000000060E21E loc_60E21E:                 ; CODE XREF: .text:000000000060E2F4j
.text:000000000060E21E         mov     rbx, [r12+98h]
.text:000000000060E226         test    rbx, rbx
.text:000000000060E229         jnz     short loc_60E238
.text:000000000060E22B         jmp     short exit_0
.text:000000000060E22B ; ---------------------------------------------------------------------------
.text:000000000060E22D         align 10h
.text:000000000060E230 loc_60E230:                 ; CODE XREF: .text:000000000060E240j
.text:000000000060E230         test    rbp, rbp
.text:000000000060E233         jz      short exit_0
.text:000000000060E235
.text:000000000060E235 loc_60E235:                 ; CODE XREF: .text:000000000060E289j
.text:000000000060E235         mov     rbx, rbp
.text:000000000060E238
.text:000000000060E238 loc_60E238:                 ; CODE XREF: .text:000000000060E229j
.text:000000000060E238         cmp     [rbx+ThreadPoolElement.state], 2 ; THREAD_DONE
.text:000000000060E23C         mov     rbp, [rbx+ThreadPoolElement.all.link_next]
.text:000000000060E240         jnz     short loc_60E230
.text:000000000060E242         mov     r15d, [rbx+ThreadPoolElement.ret]
.text:000000000060E246         mov     r13, [rbx+ThreadPoolElement.common.opaque]
.text:000000000060E24A         nop
.text:000000000060E24B         lea     rax, trace_events_enabled_count
.text:000000000060E252         mov     eax, [rax]
.text:000000000060E254         test    eax, eax
.text:000000000060E256         mov     rax, rbp
.text:000000000060E259         jnz     loc_60E2F9
...
.text:000000000060E2BC loc_60E2BC:                 ; CODE XREF: .text:000000000060E27Cj
.text:000000000060E2BC         mov     rdi, [r12+8]
.text:000000000060E2C1         call    qemu_bh_schedule
.text:000000000060E2C6         mov     rdi, [r12]
.text:000000000060E2CA         call    aio_context_release
.text:000000000060E2CF         mov     esi, [rbx+44h]
.text:000000000060E2D2         mov     rdi, [rbx+18h]
.text:000000000060E2D6         call    qword ptr [rbx+10h]
.text:000000000060E2D9         mov     rdi, [r12]
.text:000000000060E2DD         call    aio_context_acquire
.text:000000000060E2E2         mov     rdi, [r12+8]
.text:000000000060E2E7         call    qemu_bh_cancel
.text:000000000060E2EC         mov     rdi, rbx
.text:000000000060E2EF         call    qemu_aio_unref
.text:000000000060E2F4         jmp     loc_60E21E

The read (req->state == THREAD_DONE) can be reordered with
qemu_bh_cancel(p->completion_bh) and then we get the same picture:

         worker thread             |            I/O thread
------------------------------------------------------------------------
                                   | reordered read req->state
req->state = THREAD_DONE;          |
qemu_bh_schedule(p->completion_bh) |
  bh->scheduled = 1;               |
                                   | qemu_bh_cancel(p->completion_bh)
                                   |   bh->scheduled = 0;
                                   | if (req->state == THREAD_DONE)
                                   |   // sees THREAD_QUEUED

>
>          worker thread             |            I/O thread
> ------------------------------------------------------------------------
>                                    | speculatively read req->state
> req->state = THREAD_DONE;          |
> qemu_bh_schedule(p->completion_bh) |
>   bh->scheduled = 1;               |
>                                    | qemu_bh_cancel(p->completion_bh)
>                                    |   bh->scheduled = 0;
>                                    | if (req->state == THREAD_DONE)
>                                    |   // sees THREAD_QUEUED
>
> The source of the misunderstanding was that qemu_bh_cancel is now being
> used by the _consumer_ rather than the producer, and therefore now needs
> to have acquire semantics just like e.g. aio_bh_poll.
>
> In some situations, if there are no other independent requests in the
> same aio context that could eventually trigger the scheduling of the
> completion function, the omitted TPE and all operations pending on it
> will get stuck forever.
>
> Signed-off-by: Sergio Lopez <slp@redhat.com>
> ---
>  util/async.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/util/async.c b/util/async.c
> index 355af73ee7..0e1bd8780a 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -174,7 +174,7 @@ void qemu_bh_schedule(QEMUBH *bh)
>   */
>  void qemu_bh_cancel(QEMUBH *bh)
>  {
> -    bh->scheduled = 0;
> +    atomic_mb_set(&bh->scheduled, 0);

But in the end, the patch looks correct. atomic_mb_set() is xchg:

#if defined(__i386__) || defined(__x86_64__) || defined(__s390x__)
#define atomic_mb_set(ptr, i)  ((void)atomic_xchg(ptr, i))

Reads and writes cannot be reordered with locked instructions, so it
should protect from reordering.

>  }
>
>  /* This func is async.The bottom half will do the delete action at the finial
>
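
To make the "atomic_mb_set() is xchg" remark concrete, here is a rough
sketch of what such a primitive can look like when written with the
GCC/Clang __atomic builtins. The demo_* names are hypothetical and this
is not QEMU's definition beyond the one line quoted above. The first
macro is an exchange whose result is discarded; the second is the other
common shape for a store with a trailing full barrier, namely a store
followed by a sequentially consistent fence.

```c
#include <assert.h>

/* Hypothetical stand-in: store with full-barrier semantics, written as
 * an exchange whose old value is thrown away. */
#define demo_mb_set(ptr, val) \
    ((void)__atomic_exchange_n((ptr), (val), __ATOMIC_SEQ_CST))

/* Hypothetical alternative: release store followed by a full fence. */
#define demo_mb_set_fence(ptr, val)                       \
    do {                                                  \
        __atomic_store_n((ptr), (val), __ATOMIC_RELEASE); \
        __atomic_thread_fence(__ATOMIC_SEQ_CST);          \
    } while (0)

/* Exercise both macros on a local flag; returns 1 when each store
 * left the expected value behind. */
int demo_mb_set_roundtrip(void)
{
    int scheduled = 1;
    int after_xchg, after_fence;

    demo_mb_set(&scheduled, 0);
    after_xchg = __atomic_load_n(&scheduled, __ATOMIC_ACQUIRE);

    demo_mb_set_fence(&scheduled, 1);
    after_fence = __atomic_load_n(&scheduled, __ATOMIC_ACQUIRE);

    return after_xchg == 0 && after_fence == 1;
}
```

Either form is meant to keep a later load, such as the re-read of
req->state, from being satisfied before the store to the flag becomes
visible. The round-trip function only checks the stored values, not the
ordering.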

On Wed, Nov 8, 2017 at 2:50 PM, Pavel Butsykin <pbutsykin@virtuozzo.com> wrote:
> On 08.11.2017 09:34, Sergio Lopez wrote:
>> This was considered to be safe, as the completion function restarts the
>> loop just after the call to qemu_bh_cancel. But, under certain access
>> patterns and scheduling conditions, the loop may wrongly use a
>> pre-fetched elem->state value, reading it as THREAD_QUEUED, and ending
>> the completion function without having processed a pending TPE linked at
>> pool->head:
>
>
> I'm not quite sure that the pre-fetched is involved in this issue,
> because pre-fetch reading a certain addresses should be invalidated by
> write on another core to the same addresses. In our case write
> req->state = THREAD_DONE should invalidate read req->state == THREAD_DONE.
> I am inclined to think that there is a memory-reordering read with
> write. It's a very real case for x86 and I don't see the reasons which
> can prevent it:
>

Yes, you're right. This is actually a memory reordering issue. I'm
going to rewrite that paragraph.

Thanks Pavel.

On 08/11/2017 15:10, Sergio Lopez wrote:
>> I'm not quite sure that the pre-fetched is involved in this issue,
>> because pre-fetch reading a certain addresses should be invalidated by
>> write on another core to the same addresses. In our case write
>> req->state = THREAD_DONE should invalidate read req->state == THREAD_DONE.
>> I am inclined to think that there is a memory-reordering read with
>> write. It's a very real case for x86 and I don't see the reasons which
>> can prevent it:
>>
> Yes, you're right. This is actually a memory reordering issue. I'm
> going to rewrite that paragraph.

Well, memory reordering _is_ caused by speculative prefetching, delayed
cache invalidation (store buffers), and so on.

But it's probably better indeed to replace "pre-fetched" with
"outdated". Whoever commits the patch can do the substitution (I can too).

Paolo

On Wed, Nov 8, 2017 at 3:15 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 08/11/2017 15:10, Sergio Lopez wrote:
>>> I'm not quite sure that the pre-fetched is involved in this issue,
>>> because pre-fetch reading a certain addresses should be invalidated by
>>> write on another core to the same addresses. In our case write
>>> req->state = THREAD_DONE should invalidate read req->state == THREAD_DONE.
>>> I am inclined to think that there is a memory-reordering read with
>>> write. It's a very real case for x86 and I don't see the reasons which
>>> can prevent it:
>>>
>> Yes, you're right. This is actually a memory reordering issue. I'm
>> going to rewrite that paragraph.
>
> Well, memory reordering _is_ caused by speculative prefetching, delayed
> cache invalidation (store buffers), and so on.
>
> But it's probably better indeed to replace "pre-fetched" with
> "outdated". Whoever commits the patch can do the substitution (I can too).
>

Alternatively, if we want to explicitly mention the memory barrier, we
can replace the third paragraph with something like this:

<snip>
This was considered to be safe, as the completion function restarts the
loop just after the call to qemu_bh_cancel. But, as this loop lacks a HW
memory barrier, the read of req->state may actually happen _before_ the
call, seeing it still as THREAD_QUEUED, and ending the completion
function without having processed a pending TPE linked at pool->head:
</snip>

---
Sergio

On 08.11.2017 17:24, Sergio Lopez wrote:
> On Wed, Nov 8, 2017 at 3:15 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 08/11/2017 15:10, Sergio Lopez wrote:
>>>> I'm not quite sure that the pre-fetched is involved in this issue,
>>>> because pre-fetch reading a certain addresses should be invalidated by
>>>> write on another core to the same addresses. In our case write
>>>> req->state = THREAD_DONE should invalidate read req->state == THREAD_DONE.
>>>> I am inclined to think that there is a memory-reordering read with
>>>> write. It's a very real case for x86 and I don't see the reasons which
>>>> can prevent it:
>>>>
>>> Yes, you're right. This is actually a memory reordering issue. I'm
>>> going to rewrite that paragraph.
>>
>> Well, memory reordering _is_ caused by speculative prefetching, delayed
>> cache invalidation (store buffers), and so on.
>>
>> But it's probably better indeed to replace "pre-fetched" with
>> "outdated". Whoever commits the patch can do the substitution (I can too).
>>
>
> Alternatively, if we want to explicitly mention the memory barrier, we
> can replace the third paragraph with something like this:
>
> <snip>
> This was considered to be safe, as the completion function restarts the
> loop just after the call to qemu_bh_cancel. But, as this loop lacks a HW
> memory barrier, the read of req->state may actually happen _before_ the
> call, seeing it still as THREAD_QUEUED, and ending the completion
> function without having processed a pending TPE linked at pool->head:
> </snip>

Yes, that's better. Thank you.

> ---
> Sergio
>

On Wed, Nov 08, 2017 at 05:32:23PM +0300, Pavel Butsykin wrote:
> On 08.11.2017 17:24, Sergio Lopez wrote:
> > On Wed, Nov 8, 2017 at 3:15 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > > On 08/11/2017 15:10, Sergio Lopez wrote:
> > > > > I'm not quite sure that the pre-fetched is involved in this issue,
> > > > > because pre-fetch reading a certain addresses should be invalidated by
> > > > > write on another core to the same addresses. In our case write
> > > > > req->state = THREAD_DONE should invalidate read req->state == THREAD_DONE.
> > > > > I am inclined to think that there is a memory-reordering read with
> > > > > write. It's a very real case for x86 and I don't see the reasons which
> > > > > can prevent it:
> > > > >
> > > > Yes, you're right. This is actually a memory reordering issue. I'm
> > > > going to rewrite that paragraph.
> > >
> > > Well, memory reordering _is_ caused by speculative prefetching, delayed
> > > cache invalidation (store buffers), and so on.
> > >
> > > But it's probably better indeed to replace "pre-fetched" with
> > > "outdated". Whoever commits the patch can do the substitution (I can too).
> > >
> > >
> > Alternatively, if we want to explicitly mention the memory barrier, we
> > can replace the third paragraph with something like this:
> >
> > <snip>
> > This was considered to be safe, as the completion function restarts the
> > loop just after the call to qemu_bh_cancel. But, as this loop lacks a HW
> > memory barrier, the read of req->state may actually happen _before_ the
> > call, seeing it still as THREAD_QUEUED, and ending the completion
> > function without having processed a pending TPE linked at pool->head:
> > </snip>
>
> Yes, that's better. Thank you.

I have updated the commit description and sent an updated pull request
for QEMU 2.11-rc1.

Stefan

On 08.11.2017 17:15, Paolo Bonzini wrote:
> On 08/11/2017 15:10, Sergio Lopez wrote:
>>> I'm not quite sure that the pre-fetched is involved in this issue,
>>> because pre-fetch reading a certain addresses should be invalidated by
>>> write on another core to the same addresses. In our case write
>>> req->state = THREAD_DONE should invalidate read req->state == THREAD_DONE.
>>> I am inclined to think that there is a memory-reordering read with
>>> write. It's a very real case for x86 and I don't see the reasons which
>>> can prevent it:
>>>
>> Yes, you're right. This is actually a memory reordering issue. I'm
>> going to rewrite that paragraph.
>
> Well, memory reordering _is_ caused by speculative prefetching, delayed
> cache invalidation (store buffers), and so on.

What do you mean? If we are speaking about x86, then a write on another
core (like req->state = THREAD_DONE in this issue) should invalidate the
prefetched read (req->state == THREAD_DONE), and this is prevented in
hardware. The prefetch is locked to the L1; when another CPU invalidates
the cache lines, the prefetch is invalidated as well (as far as I
understand it).

> But it's probably better indeed to replace "pre-fetched" with
> "outdated". Whoever commits the patch can do the substitution (I can too).
>
> Paolo
>