[Qemu-devel] [PATCH 14/17] block: optimize access to reqs_lock

Posted by Paolo Bonzini 8 years, 9 months ago
Hot path reqs_lock critical sections are very small; the only large critical
sections happen when a request waits for serialising requests, and these
should never happen in usual circumstances.

We do not want these small critical sections to yield in any case,
which calls for using a spinlock while writing the list.  The reqs_lock
is still used to protect the individual requests' CoQueue and to
serialize removals against concurrent walks of the request list; for
the latter, qemu_co_mutex_lock_unlock is enough and stays cheap when
there is no concurrent walker.

The reqs_lock is also used to protect the flush generation counts, but
that's unrelated.
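
For illustration, here is a minimal sketch of the reader side that the
removal path synchronizes with (not part of this patch; the function name
and body are hypothetical, only the locking pattern matters).  A walker
holds reqs_lock while traversing tracked_requests, so the
qemu_co_mutex_lock_unlock() in tracked_request_end() cannot return until
the walk has finished:

    /* Hypothetical walker, for illustration only. */
    static void coroutine_fn example_walk_tracked_requests(BlockDriverState *bs)
    {
        BdrvTrackedRequest *req;

        qemu_co_mutex_lock(&bs->reqs_lock);
        QLIST_FOREACH(req, &bs->tracked_requests, list) {
            /* Inspect req; a waiter would park itself on req->wait_queue
             * here, dropping reqs_lock while it sleeps.
             */
        }
        qemu_co_mutex_unlock(&bs->reqs_lock);
    }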

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.c                   |  1 +
 block/io.c                | 25 ++++++++++++++++++++-----
 include/block/block_int.h | 11 ++++++++---
 3 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/block.c b/block.c
index 3b2ed29..7ba6afe 100644
--- a/block.c
+++ b/block.c
@@ -234,6 +234,7 @@ BlockDriverState *bdrv_new(void)
         QLIST_INIT(&bs->op_blockers[i]);
     }
     notifier_with_return_list_init(&bs->before_write_notifiers);
+    qemu_spin_init(&bs->reqs_list_write_lock);
     qemu_co_mutex_init(&bs->reqs_lock);
     bs->refcnt = 1;
     bs->aio_context = qemu_get_aio_context();
diff --git a/block/io.c b/block/io.c
index 7af9d47..476807d 100644
--- a/block/io.c
+++ b/block/io.c
@@ -374,14 +374,29 @@ void bdrv_drain_all(void)
  */
 static void tracked_request_end(BdrvTrackedRequest *req)
 {
+    BlockDriverState *bs = req->bs;
+
     if (req->serialising) {
-        atomic_dec(&req->bs->serialising_in_flight);
+        atomic_dec(&bs->serialising_in_flight);
     }
 
-    qemu_co_mutex_lock(&req->bs->reqs_lock);
+    /* Note that there can be a concurrent visit while we remove the request
+     * from the list, so we need to ...
+     */
+    qemu_spin_lock(&bs->reqs_list_write_lock);
     QLIST_REMOVE(req, list);
+    qemu_spin_unlock(&bs->reqs_list_write_lock);
+
+    /* ... wait for it to end before we leave.  qemu_co_mutex_lock_unlock
+     * avoids cacheline bouncing in the common case of no concurrent
+     * reader.
+     */
+    qemu_co_mutex_lock_unlock(&bs->reqs_lock);
+
+    /* Now no coroutine can add itself to the wait queue, so it is
+     * safe to call qemu_co_queue_restart_all outside the reqs_lock.
+     */
     qemu_co_queue_restart_all(&req->wait_queue);
-    qemu_co_mutex_unlock(&req->bs->reqs_lock);
 }
 
 /**
@@ -406,9 +421,9 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
 
     qemu_co_queue_init(&req->wait_queue);
 
-    qemu_co_mutex_lock(&bs->reqs_lock);
+    qemu_spin_lock(&bs->reqs_list_write_lock);
     QLIST_INSERT_HEAD(&bs->tracked_requests, req, list);
-    qemu_co_mutex_unlock(&bs->reqs_lock);
+    qemu_spin_unlock(&bs->reqs_list_write_lock);
 }
 
 static void mark_request_serialising(BdrvTrackedRequest *req, uint64_t align)
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 42b49f5..b298de8 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -78,9 +78,10 @@ typedef struct BdrvTrackedRequest {
 
     QLIST_ENTRY(BdrvTrackedRequest) list;
     Coroutine *co; /* owner, used for deadlock detection */
-    CoQueue wait_queue; /* coroutines blocked on this request */
-
     struct BdrvTrackedRequest *waiting_for;
+
+    /* Protected by BlockDriverState's reqs_lock.  */
+    CoQueue wait_queue; /* coroutines blocked on this request */
 } BdrvTrackedRequest;
 
 struct BlockDriver {
@@ -626,11 +627,15 @@ struct BlockDriverState {
     int quiesce_counter;
     unsigned int write_gen;               /* Current data generation */
 
-    /* Protected by reqs_lock.  */
+    /* Writes are protected by reqs_list_write_lock.  Walks of the list
+     * take reqs_lock, so that removals can synchronize with them.
+     */
     QLIST_HEAD(, BdrvTrackedRequest) tracked_requests;
     CoQueue flush_queue;                  /* Serializing flush queue */
     bool active_flush_req;                /* Flush request in flight? */
     unsigned int flushed_gen;             /* Flushed write generation */
+
+    QemuSpin reqs_list_write_lock;
     CoMutex reqs_lock;
 };
 
-- 
2.9.3



Re: [Qemu-devel] [Qemu-block] [PATCH 14/17] block: optimize access to reqs_lock
Posted by Stefan Hajnoczi 8 years, 9 months ago
On Thu, Apr 20, 2017 at 02:00:55PM +0200, Paolo Bonzini wrote:
> Hot path reqs_lock critical sections are very small; the only large critical
> sections happen when a request waits for serialising requests, and these
> should never happen in usual circumstances.
> 
> We do not want these small critical sections to yield in any case,
> which calls for using a spinlock while writing the list.

Is this patch purely an optimization?

I'm hesitant about using spinlocks in userspace.  There are cases where
the thread is descheduled that are beyond our control.  Nested virt will
probably make things worse.  People have been optimizing and trying
paravirt approaches to kernel spinlocks for these reasons for years.

Isn't a futex-based lock efficient enough?  That way we don't hog the
CPU when there is contention.

Also, there are no performance results included in this patch that
justify the spinlock.
Re: [Qemu-devel] [Qemu-block] [PATCH 14/17] block: optimize access to reqs_lock
Posted by Paolo Bonzini 8 years, 9 months ago

On 04/05/2017 16:59, Stefan Hajnoczi wrote:
> On Thu, Apr 20, 2017 at 02:00:55PM +0200, Paolo Bonzini wrote:
>> Hot path reqs_lock critical sections are very small; the only large critical
>> sections happen when a request waits for serialising requests, and these
>> should never happen in usual circumstances.
>>
>> We do not want these small critical sections to yield in any case,
>> which calls for using a spinlock while writing the list.
> 
> Is this patch purely an optimization?

Yes, it is, and pretty much a no-op until we have true multiqueue.  But
I expect it to have a significant effect for multiqueue.

> I'm hesitant about using spinlocks in userspace.  There are cases where
> the thread is descheduled that are beyond our control.  Nested virt will
> probably make things worse.  People have been optimizing and trying
> paravirt approaches to kernel spinlocks for these reasons for years.

This is true, but here we're talking about a 5-10 instruction window for
preemption; it matches the usage of spinlocks in other parts of QEMU.
The long critical sections, which only happen in combination with
copy-on-read or RMW (large logical block sizes on the host), take the
CoMutex.

On one hand it's true that the more you nest, the more things get worse.
 On the other hand there can only ever be contention with multiqueue,
and the multiqueue scenarios are going to use pinning.

> Isn't a futex-based lock efficient enough?  That way we don't hog the
> CPU when there is contention.

It is efficient when there is no contention, but when there is, the
latency goes up by several orders of magnitude.

Paolo

> Also, there are no performance results included in this patch that
> justify the spinlock.
> 

Re: [Qemu-devel] [Qemu-block] [PATCH 14/17] block: optimize access to reqs_lock
Posted by Stefan Hajnoczi 8 years, 9 months ago
On Thu, May 04, 2017 at 06:06:39PM +0200, Paolo Bonzini wrote:
> On 04/05/2017 16:59, Stefan Hajnoczi wrote:
> > On Thu, Apr 20, 2017 at 02:00:55PM +0200, Paolo Bonzini wrote:
> >> Hot path reqs_lock critical sections are very small; the only large critical
> >> sections happen when a request waits for serialising requests, and these
> >> should never happen in usual circumstances.
> >>
> >> We do not want these small critical sections to yield in any case,
> >> which calls for using a spinlock while writing the list.
> > 
> > Is this patch purely an optimization?
> 
> Yes, it is, and pretty much a no-op until we have true multiqueue.  But
> I expect it to have a significant effect for multiqueue.
> 
> > I'm hesitant about using spinlocks in userspace.  There are cases where
> > the thread is descheduled that are beyond our control.  Nested virt will
> > probably make things worse.  People have been optimizing and trying
> > paravirt approaches to kernel spinlocks for these reasons for years.
> 
> This is true, but here we're talking about a 5-10 instruction window for
> preemption; it matches the usage of spinlocks in other parts of QEMU.

Only util/qht.c uses spinlocks, it's not a widely used primitive.

> The long critical sections, which only happen in combination with
> copy-on-read or RMW (large logical block sizes on the host), take the
> CoMutex.
> 
> On one hand it's true that the more you nest, the more things get worse.
>  On the other hand there can only ever be contention with multiqueue,
> and the multiqueue scenarios are going to use pinning.
> 
> > Isn't a futex-based lock efficient enough?  That way we don't hog the
> > CPU when there is contention.
> 
> It is efficient when there is no contention, but when there is, the
> latency goes up by several orders of magnitude.

Doesn't glibc spin for a while before waiting on the futex?  i.e. the
best of both worlds.

Stefan
Re: [Qemu-devel] [Qemu-block] [PATCH 14/17] block: optimize access to reqs_lock
Posted by Paolo Bonzini 8 years, 9 months ago

On 05/05/2017 12:25, Stefan Hajnoczi wrote:
> On Thu, May 04, 2017 at 06:06:39PM +0200, Paolo Bonzini wrote:
>> On 04/05/2017 16:59, Stefan Hajnoczi wrote:
>>> On Thu, Apr 20, 2017 at 02:00:55PM +0200, Paolo Bonzini wrote:
>>>> Hot path reqs_lock critical sections are very small; the only large critical
>>>> sections happen when a request waits for serialising requests, and these
>>>> should never happen in usual circumstances.
>>>>
>>>> We do not want these small critical sections to yield in any case,
>>>> which calls for using a spinlock while writing the list.
>>>
>>> Is this patch purely an optimization?
>>
>> Yes, it is, and pretty much a no-op until we have true multiqueue.  But
>> I expect it to have a significant effect for multiqueue.
>>
>>> I'm hesitant about using spinlocks in userspace.  There are cases where
>>> the thread is descheduled that are beyond our control.  Nested virt will
>>> probably make things worse.  People have been optimizing and trying
>>> paravirt approaches to kernel spinlocks for these reasons for years.
>>
>> This is true, but here we're talking about a 5-10 instruction window for
>> preemption; it matches the usage of spinlocks in other parts of QEMU.
> 
> Only util/qht.c uses spinlocks, it's not a widely used primitive.

Right, but the idea is the same---very short, heavily exercised,
performance-critical sections use spinlocks.  (util/qht.c is used
heavily in TCG mode.)

>> It is efficient when there is no contention, but when there is, the
>> latency goes up by several orders of magnitude.
> 
> Doesn't glibc spin for a while before waiting on the futex?  i.e. the
> best of both worlds.

You have to specify that manually with pthread_mutexattr_settype(...,
PTHREAD_MUTEX_ADAPTIVE_NP).  It is not enabled by default because IIUC
the adaptive one doesn't support pthread_mutex_timedlock.
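
For reference, a minimal sketch of opting in to the adaptive type
(assuming glibc, where PTHREAD_MUTEX_ADAPTIVE_NP is a non-portable
extension guarded by _GNU_SOURCE; the helper name is made up):

    #define _GNU_SOURCE
    #include <pthread.h>

    static int init_adaptive_mutex(pthread_mutex_t *m)
    {
        pthread_mutexattr_t attr;
        int ret;

        pthread_mutexattr_init(&attr);
        /* Spin briefly on contention before falling back to the futex. */
        ret = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
        if (ret == 0) {
            ret = pthread_mutex_init(m, &attr);
        }
        pthread_mutexattr_destroy(&attr);
        return ret;
    }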

Paolo

Re: [Qemu-devel] [Qemu-block] [PATCH 14/17] block: optimize access to reqs_lock
Posted by Stefan Hajnoczi 8 years, 9 months ago
On Fri, May 05, 2017 at 12:45:38PM +0200, Paolo Bonzini wrote:
> 
> 
> On 05/05/2017 12:25, Stefan Hajnoczi wrote:
> > On Thu, May 04, 2017 at 06:06:39PM +0200, Paolo Bonzini wrote:
> >> On 04/05/2017 16:59, Stefan Hajnoczi wrote:
> >>> On Thu, Apr 20, 2017 at 02:00:55PM +0200, Paolo Bonzini wrote:
> >>>> Hot path reqs_lock critical sections are very small; the only large critical
> >>>> sections happen when a request waits for serialising requests, and these
> >>>> should never happen in usual circumstances.
> >>>>
> >>>> We do not want these small critical sections to yield in any case,
> >>>> which calls for using a spinlock while writing the list.
> >>>
> >>> Is this patch purely an optimization?
> >>
> >> Yes, it is, and pretty much a no-op until we have true multiqueue.  But
> >> I expect it to have a significant effect for multiqueue.
> >>
> >>> I'm hesitant about using spinlocks in userspace.  There are cases where
> >>> the thread is descheduled that are beyond our control.  Nested virt will
> >>> probably make things worse.  People have been optimizing and trying
> >>> paravirt approaches to kernel spinlocks for these reasons for years.
> >>
> >> This is true, but here we're talking about a 5-10 instruction window for
> >> preemption; it matches the usage of spinlocks in other parts of QEMU.
> > 
> > Only util/qht.c uses spinlocks, it's not a widely used primitive.
> 
> Right, but the idea is the same---very short, heavy and
> performance-critical cases use spinlocks.  (util/qht.c is used heavily
> in TCG mode).
> 
> >> It is efficient when there is no contention, but when there is, the
> >> latency goes up by several orders of magnitude.
> > 
> > Doesn't glibc spin for a while before waiting on the futex?  i.e. the
> > best of both worlds.
> 
> You have to specify that manually with pthread_mutexattr_settype(...,
> PTHREAD_MUTEX_ADAPTIVE_NP).  It is not enabled by default because IIUC
> the adaptive one doesn't support pthread_mutex_timedlock.

If you want to use spinlocks in QEMU please document strict rules that
ensure they will be used safely.  For example, I'd be comfortable with:

Spinlock regions may only call functions in the same source file or
functions explicitly documented as spinlock-safe.  Only memory accesses
and computation may be performed; do not make system calls or invoke
library functions.
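
For concreteness, a sketch of what would and would not be allowed under
these rules (reusing this series' reqs_list_write_lock as the example):

    /* OK: only memory accesses between lock and unlock. */
    qemu_spin_lock(&bs->reqs_list_write_lock);
    QLIST_REMOVE(req, list);
    qemu_spin_unlock(&bs->reqs_list_write_lock);

    /* Not OK: library call inside the region. */
    qemu_spin_lock(&bs->reqs_list_write_lock);
    error_report("removing request");   /* may allocate and do I/O */
    qemu_spin_unlock(&bs->reqs_list_write_lock);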

Stefan
Re: [Qemu-devel] [Qemu-block] [PATCH 14/17] block: optimize access to reqs_lock
Posted by Paolo Bonzini 8 years, 9 months ago

On 08/05/2017 18:21, Stefan Hajnoczi wrote:
> If you want to use spinlocks in QEMU please document strict rules that
> ensure they will be used safely.  For example, I'd be comfortable with:
> 
> Spinlock regions may only call functions in the same source file or
> functions explicitly documented as spinlock-safe.  Only memory accesses
> and computation may be performed; do not make system calls or invoke
> library functions.

Sounds good.  But for now I'll just drop these patches in the interest
of simplicity.

Paolo