[PATCH 0/7] aio-posix: polling scalability improvements
Posted by Stefan Hajnoczi 5 years, 8 months ago
A guest with 100 virtio-blk-pci,num-queues=32 devices reaches only 10k IOPS,
while a guest with a single device reaches 105k IOPS
(fio rw=randread,bs=4k,iodepth=1,ioengine=libaio).
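
(For reference, a hypothetical fio invocation matching those parameters
might look like the following; the target device path is a placeholder:)

  fio --name=bench --filename=/dev/vdb --direct=1 \
      --rw=randread --bs=4k --iodepth=1 --ioengine=libaio \
      --runtime=30 --time_based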

The bottleneck is that aio_poll() userspace polling iterates over all
AioHandlers to invoke their ->io_poll() callbacks.  All AioHandlers are polled
even if only one of them was recently active.  Therefore a guest with many
disks is slower than a guest with a single disk even when the workload only
accesses a single disk.
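
To see why this is O(n), here is a minimal model of the loop (a simplified
stand-in for run_poll_handlers_once() in util/aio-posix.c; the types are
illustrative, not QEMU's actual AioHandler definitions):

  #include <stdbool.h>
  #include <stddef.h>

  /* Each handler exposes an io_poll() callback that returns true when it
   * found work.  The poll loop visits every handler on every iteration,
   * so one busy handler among N idle ones still costs O(N) per pass. */
  typedef struct Handler {
      bool (*io_poll)(void *opaque);
      void *opaque;
  } Handler;

  static bool poll_all_handlers_once(Handler *handlers, size_t n)
  {
      bool progress = false;

      for (size_t i = 0; i < n; i++) {
          if (handlers[i].io_poll(handlers[i].opaque)) {
              progress = true;
          }
      }
      return progress;
  }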

This patch series solves this scalability problem so that IOPS is unaffected by
the number of devices.  The trick is to poll only AioHandlers that were
recently active so that userspace polling scales well.
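
One way to picture the idea (an illustrative sketch, not the code from this
series): handlers that report progress stay on a short poll list, and
handlers that stay idle past a deadline are unlinked so later passes skip
them:

  #include <stdbool.h>
  #include <stdint.h>

  typedef struct PollHandler {
      bool (*io_poll)(void *opaque);
      void *opaque;
      int64_t idle_deadline_ns;      /* stop polling if idle past this */
      struct PollHandler *next;
  } PollHandler;

  static bool poll_active_handlers(PollHandler **list, int64_t now_ns)
  {
      bool progress = false;
      PollHandler **p = list;

      while (*p) {
          PollHandler *h = *p;

          if (h->io_poll(h->opaque)) {
              /* Activity: keep this handler in the poll set a while. */
              h->idle_deadline_ns = now_ns + 100000; /* e.g. 100 us */
              progress = true;
              p = &h->next;
          } else if (now_ns >= h->idle_deadline_ns) {
              *p = h->next;          /* idle too long: drop from poll set */
          } else {
              p = &h->next;
          }
      }
      return progress;
  }

With only the recently active handlers on the list, the cost per polling
iteration tracks the number of busy devices rather than the total number
of devices.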

Unfortunately it's not possible to accomplish this with the existing epoll(7)
fd monitoring implementation.  This patch series adds a Linux io_uring fd
monitoring implementation.  The critical feature is that io_uring can check the
readiness of file descriptors through userspace polling.  This makes it
possible to safely poll a subset of AioHandlers from userspace without risk of
starving the other AioHandlers.
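
Concretely, io_uring's completion queue is a memory-mapped ring that
userspace can inspect without a system call.  A rough sketch using liburing
(error handling omitted; not the fdmon-io_uring.c code itself):

  #include <liburing.h>
  #include <poll.h>

  /* Arm kernel-side monitoring of an fd with a POLL_ADD request. */
  static void watch_fd(struct io_uring *ring, int fd, void *user_data)
  {
      struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

      io_uring_prep_poll_add(sqe, fd, POLLIN);
      io_uring_sqe_set_data(sqe, user_data);
      io_uring_submit(ring);  /* one syscall to arm; none needed to check */
  }

  /* Syscall-free readiness check: peek at the mmap'ed completion ring. */
  static void *check_ready(struct io_uring *ring)
  {
      struct io_uring_cqe *cqe;

      if (io_uring_peek_cqe(ring, &cqe) == 0) {
          void *user_data = io_uring_cqe_get_data(cqe);
          io_uring_cqe_seen(ring, cqe);
          return user_data;   /* this handler's fd became ready */
      }
      return NULL;            /* nothing ready; keep polling the hot set */
  }

Because the readiness check is just a few loads from shared memory, the
poll loop can spin over its small set of recently active handlers and
still notice, on every iteration, when any other monitored fd becomes
ready.  (POLL_ADD completions are one-shot here, so a real implementation
re-arms monitoring after consuming each completion.)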

Stefan Hajnoczi (7):
  aio-posix: completely stop polling when disabled
  aio-posix: move RCU_READ_LOCK() into run_poll_handlers()
  aio-posix: extract ppoll(2) and epoll(7) fd monitoring
  aio-posix: simplify FDMonOps->update() prototype
  aio-posix: add io_uring fd monitoring implementation
  aio-posix: support userspace polling of fd monitoring
  aio-posix: remove idle poll handlers to improve scalability

 MAINTAINERS           |   2 +
 configure             |   5 +
 include/block/aio.h   |  70 ++++++-
 util/Makefile.objs    |   3 +
 util/aio-posix.c      | 449 ++++++++++++++----------------------------
 util/aio-posix.h      |  81 ++++++++
 util/fdmon-epoll.c    | 155 +++++++++++++++
 util/fdmon-io_uring.c | 332 +++++++++++++++++++++++++++++++
 util/fdmon-poll.c     | 107 ++++++++++
 util/trace-events     |   2 +
 10 files changed, 898 insertions(+), 308 deletions(-)
 create mode 100644 util/aio-posix.h
 create mode 100644 util/fdmon-epoll.c
 create mode 100644 util/fdmon-io_uring.c
 create mode 100644 util/fdmon-poll.c

-- 
2.24.1

Re: [PATCH 0/7] aio-posix: polling scalability improvements
Posted by Stefan Hajnoczi 5 years, 8 months ago
On Thu, Mar 05, 2020 at 05:07:59PM +0000, Stefan Hajnoczi wrote:
> [...]

Thanks, applied to my block tree:
https://github.com/stefanha/qemu/commits/block

Stefan