[PATCH 2/7] aio-posix: move RCU_READ_LOCK() into run_poll_handlers()

Posted by Stefan Hajnoczi 5 years, 8 months ago
Now that run_poll_handlers_once() is only called by run_poll_handlers(),
we can improve the CPU time profile by moving the expensive
RCU_READ_LOCK() out of the polling loop.

This reduces run_poll_handlers() from 40% CPU to 10% CPU in perf's
sampling profiler output.
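
Why the hoisting helps: only the outermost rcu_read_lock() pays for a
memory barrier; nested calls just bump a per-thread depth counter. A
minimal sketch of that reader-side model (illustrative only, not the
actual implementation in include/qemu/rcu.h, which differs in detail):

    #include <stdatomic.h>

    static _Atomic unsigned long rcu_gp_ctr = 1; /* global grace period counter */

    struct rcu_reader_model {
        _Atomic unsigned long ctr; /* snapshot of rcu_gp_ctr; 0 = quiescent */
        unsigned depth;            /* nesting level */
    };

    static _Thread_local struct rcu_reader_model reader;

    static inline void rcu_read_lock_model(void)
    {
        if (reader.depth++ > 0) {
            return;                /* nested: a plain counter increment */
        }
        /* Outermost lock: publish our snapshot, then fence so that
         * RCU-protected loads cannot be reordered before it.  This
         * fence is the expensive part hoisted out of the polling loop.
         */
        atomic_store_explicit(&reader.ctr,
                              atomic_load_explicit(&rcu_gp_ctr,
                                                   memory_order_relaxed),
                              memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);
    }

    static inline void rcu_read_unlock_model(void)
    {
        if (--reader.depth > 0) {
            return;                /* still nested: counter decrement only */
        }
        atomic_store_explicit(&reader.ctr, 0, memory_order_release);
    }

With RCU_READ_LOCK_GUARD() hoisted into run_poll_handlers(), the
rcu_read_lock() calls made by the ->io_poll() handlers all take the
cheap nested path, so the fence is paid once per polling session rather
than once per loop iteration.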

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 util/aio-posix.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/util/aio-posix.c b/util/aio-posix.c
index 029f146455..38c51f5d8f 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -583,16 +583,6 @@ static bool run_poll_handlers_once(AioContext *ctx, int64_t *timeout)
     bool progress = false;
     AioHandler *node;
 
-    /*
-     * Optimization: ->io_poll() handlers often contain RCU read critical
-     * sections and we therefore see many rcu_read_lock() -> rcu_read_unlock()
-     * -> rcu_read_lock() -> ... sequences with expensive memory
-     * synchronization primitives.  Make the entire polling loop an RCU
-     * critical section because nested rcu_read_lock()/rcu_read_unlock() calls
-     * are cheap.
-     */
-    RCU_READ_LOCK_GUARD();
-
     QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
         if (!QLIST_IS_INSERTED(node, node_deleted) && node->io_poll &&
             aio_node_check(ctx, node->is_external) &&
@@ -636,6 +626,16 @@ static bool run_poll_handlers(AioContext *ctx, int64_t max_ns, int64_t *timeout)
 
     trace_run_poll_handlers_begin(ctx, max_ns, *timeout);
 
+    /*
+     * Optimization: ->io_poll() handlers often contain RCU read critical
+     * sections and we therefore see many rcu_read_lock() -> rcu_read_unlock()
+     * -> rcu_read_lock() -> ... sequences with expensive memory
+     * synchronization primitives.  Make the entire polling loop an RCU
+     * critical section because nested rcu_read_lock()/rcu_read_unlock() calls
+     * are cheap.
+     */
+    RCU_READ_LOCK_GUARD();
+
     start_time = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
     do {
         progress = run_poll_handlers_once(ctx, timeout);
-- 
2.24.1

Re: [PATCH 2/7] aio-posix: move RCU_READ_LOCK() into run_poll_handlers()
Posted by Paolo Bonzini 5 years, 8 months ago
On 05/03/20 18:08, Stefan Hajnoczi wrote:
>  
> +    /*
> +     * Optimization: ->io_poll() handlers often contain RCU read critical
> +     * sections and we therefore see many rcu_read_lock() -> rcu_read_unlock()
> +     * -> rcu_read_lock() -> ... sequences with expensive memory
> +     * synchronization primitives.  Make the entire polling loop an RCU
> +     * critical section because nested rcu_read_lock()/rcu_read_unlock() calls
> +     * are cheap.
> +     */
> +    RCU_READ_LOCK_GUARD();
> +

Looks good, but I suggest that you compile with --enable-membarrier as
that makes RCU critical sections basically free.

Paolo
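
For context, --enable-membarrier switches QEMU's RCU to an asymmetric
barrier scheme: readers replace the heavyweight fence with a plain
compiler barrier, and the writer forces a full barrier onto every
thread via the membarrier(2) syscall when it needs one. A rough sketch
of the idea (illustrative only; QEMU's actual wiring lives in
include/qemu/sys_membarrier.h, and the command used depends on kernel
support):

    #define _GNU_SOURCE
    #include <linux/membarrier.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* There is no glibc wrapper, so go through syscall(2). */
    static int membarrier(int cmd, unsigned int flags)
    {
        return syscall(__NR_membarrier, cmd, flags);
    }

    /* Reader side: the memory barrier degrades to a compiler barrier. */
    static inline void smp_mb_placeholder(void)
    {
        __asm__ __volatile__("" ::: "memory");
    }

    /* Writer side (e.g. in synchronize_rcu()): force a full barrier on
     * every thread of the process.  This is where the cost moves to. */
    static void smp_mb_global(void)
    {
        membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
    }

    int main(void)
    {
        /* The expedited command requires prior registration. */
        membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0);
        smp_mb_placeholder();  /* reader-side "barrier": free */
        smp_mb_global();       /* writer-side barrier: syscall + IPIs */
        return 0;
    }

The tradeoff is that grace periods now cost a syscall that disturbs
every thread in the process, which is one plausible reason a
busy-polling workload might not benefit, as the follow-up below shows.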


Re: [PATCH 2/7] aio-posix: move RCU_READ_LOCK() into run_poll_handlers()
Posted by Stefan Hajnoczi 5 years, 8 months ago
On Thu, Mar 05, 2020 at 06:15:36PM +0100, Paolo Bonzini wrote:
> On 05/03/20 18:08, Stefan Hajnoczi wrote:
> >  
> > +    /*
> > +     * Optimization: ->io_poll() handlers often contain RCU read critical
> > +     * sections and we therefore see many rcu_read_lock() -> rcu_read_unlock()
> > +     * -> rcu_read_lock() -> ... sequences with expensive memory
> > +     * synchronization primitives.  Make the entire polling loop an RCU
> > +     * critical section because nested rcu_read_lock()/rcu_read_unlock() calls
> > +     * are cheap.
> > +     */
> > +    RCU_READ_LOCK_GUARD();
> > +
> 
> Looks good, but I suggest that you compile with --enable-membarrier as
> that makes RCU critical sections basically free.

Interesting, --enable-membarrier decreases performance from 105k to 97k
IOPS in the NVMe latency benchmark that I'm running.

Stefan