[RFC patch v1 0/3] qemu-file writing performance improving

Denis Plotnikov posted 3 patches 4 years ago
Test FreeBSD passed
Test docker-quick@centos7 failed
Test checkpatch passed
Test asan failed
Test docker-mingw@fedora passed
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/1586776334-641239-1-git-send-email-dplotnikov@virtuozzo.com
Maintainers: "Dr. David Alan Gilbert" <dgilbert@redhat.com>, Juan Quintela <quintela@redhat.com>
include/qemu/typedefs.h |   2 +
migration/qemu-file.c   | 479 +++++++++++++++++++++++++++++++++++++++++-------
migration/qemu-file.h   |   9 +
migration/savevm.c      |  38 +++-
4 files changed, 456 insertions(+), 72 deletions(-)
[RFC patch v1 0/3] qemu-file writing performance improving
Posted by Denis Plotnikov 4 years ago
Problem description: qcow2 internal snapshot saving time is too long on HDD (~25 sec)

When a qcow2 image is placed on a regular HDD and the image is opened with
O_DIRECT, the snapshot saving time is around 26 sec.
The snapshot saving time can be made 4 times shorter.
This patch series proposes a way to achieve that.

Why is the saving time ~25 sec?

There are three contributing factors:
1. the qemu-file iov limit (currently 64)
2. direct qemu_fflush calls, inducing disk writes
3. RAM data copying and synchronous disk writes

While 1 and 2 are quite clear, the 3rd needs some explanation:

Internal snapshots use qemu-file as an interface to store the data with
stream semantics.
qemu-file avoids data copying when possible (mostly for RAM data)
and uses iovectors to propagate the data to the underlying block driver state.
When qcow2 is opened with O_DIRECT, this is suboptimal.

This is what happens: on writing, when the iovec array goes from qemu-file
to bdrv (here and below by bdrv I mean qcow2 with a POSIX backend opened with
O_DIRECT), the bdrv checks that every iovector is base- and size-aligned; if
that is not the case, the data is copied to an internal buffer and a
synchronous pwrite is issued. If the iovectors are aligned, io_submit is used.

In our case, the snapshot almost always induces pwrite, since the iovectors
in the array are never all aligned: a short iovector (an 8-byte RAM-page
delimiter) is added after each RAM-page iovector.
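
To make the alignment point concrete, below is a minimal standalone sketch of
the kind of check that picks between the two paths (illustrative code, not
QEMU's actual block-layer code; the 4 KiB block size and the names are
assumptions):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/uio.h>

#define BLOCK_ALIGN 4096u   /* assumed O_DIRECT alignment, for illustration */

/* The zero-copy AIO path is possible only if every iovec base and size is
 * aligned; a single unaligned entry forces the bounce-buffer + synchronous
 * pwrite path. */
static bool iov_is_aligned(const struct iovec *iov, int iovcnt)
{
    for (int i = 0; i < iovcnt; i++) {
        if ((uintptr_t)iov[i].iov_base % BLOCK_ALIGN != 0 ||
            iov[i].iov_len % BLOCK_ALIGN != 0) {
            return false;
        }
    }
    return true;
}

int main(void)
{
    static char page[4096] __attribute__((aligned(4096)));  /* one RAM page */
    uint64_t delimiter = 0;                                  /* 8-byte marker */
    struct iovec req[2] = {
        { .iov_base = page,       .iov_len = sizeof(page) },
        { .iov_base = &delimiter, .iov_len = sizeof(delimiter) },
    };

    /* The 8-byte delimiter fails the size check, so the whole request would
     * be bounced and written synchronously. */
    printf("aligned: %s\n", iov_is_aligned(req, 2) ? "yes" : "no");
    return 0;
}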

So in this case the qemu-file code:
1. doesn't avoid RAM copying
2. works fully synchronously

How to improve the snapshot time:

1. easy way: increase the iov limit to IOV_MAX (1024).
This reduces the frequency of synchronous writes.
My test showed that with iov limit = IOV_MAX the snapshot time is *~12 sec*.

2. complex way: do the writes asynchronously.
Introduce base- and size-aligned buffers, write the data only when
a buffer is full, write that buffer asynchronously, and meanwhile fill another
buffer with snapshot data (see the sketch below).
My test showed that this complex way brings the snapshot time down to *~6 sec*,
2 times better than just increasing the iov limit.
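
To show what the complex way boils down to, here is a minimal standalone
double-buffering sketch (this is NOT the implementation from the series;
POSIX AIO, the 1 MiB buffer size, and all names are assumptions made for
illustration):

#include <aio.h>
#include <errno.h>
#include <string.h>

#define BUF_SIZE (1u << 20)   /* illustrative 1 MiB staging buffer */

struct staging {
    char *data;        /* allocated block-aligned, e.g. with posix_memalign() */
    size_t used;
    struct aiocb cb;
    int in_flight;
};

/* Submit a full, aligned buffer without blocking the producer. */
static int buf_submit(struct staging *s, int fd, off_t offset)
{
    memset(&s->cb, 0, sizeof(s->cb));
    s->cb.aio_fildes = fd;
    s->cb.aio_buf    = s->data;
    s->cb.aio_nbytes = BUF_SIZE;
    s->cb.aio_offset = offset;
    if (aio_write(&s->cb) < 0) {
        return -errno;
    }
    s->in_flight = 1;
    return 0;
}

/* Wait for a previously submitted buffer before reusing it. */
static int buf_wait(struct staging *s)
{
    const struct aiocb *list[1] = { &s->cb };

    if (!s->in_flight) {
        return 0;
    }
    while (aio_error(&s->cb) == EINPROGRESS) {
        aio_suspend(list, 1, NULL);
    }
    s->in_flight = 0;
    s->used = 0;
    return aio_return(&s->cb) < 0 ? -errno : 0;
}

/* Producer side: append snapshot bytes to the current buffer; when it fills
 * up, submit it and switch to the other buffer, waiting only if that one is
 * still in flight.  (Simplified: assumes len never crosses a buffer boundary
 * and ignores the final partial buffer.) */
static int put_bytes(struct staging buf[2], int *cur, int fd, off_t *offset,
                     const void *src, size_t len)
{
    struct staging *s = &buf[*cur];

    memcpy(s->data + s->used, src, len);
    s->used += len;
    if (s->used == BUF_SIZE) {
        int ret = buf_submit(s, fd, *offset);
        *offset += BUF_SIZE;
        *cur ^= 1;
        return ret ? ret : buf_wait(&buf[*cur]);
    }
    return 0;
}

Since what reaches the kernel is always a full, block-aligned buffer, the
O_DIRECT fast path applies, and the producer only stalls when both buffers
are busy.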

The series proposes how to improve the snapshot performance in the complex way,
allowing asynchronous writes to be used when needed.

This is an RFC series, as I'm not confident that I fully understand all
qemu-file use cases. I tried to make the series in a safe way so as not to
break anything related to qemu-file usage in other places, like migration.

All comments are *VERY* appreciated!

Thanks,
Denis

Denis Plotnikov (3):
  qemu-file: introduce current buffer
  qemu-file: add buffered mode
  migration/savevm: use qemu-file buffered mode for non-cached bdrv

 include/qemu/typedefs.h |   2 +
 migration/qemu-file.c   | 479 +++++++++++++++++++++++++++++++++++++++++-------
 migration/qemu-file.h   |   9 +
 migration/savevm.c      |  38 +++-
 4 files changed, 456 insertions(+), 72 deletions(-)

-- 
1.8.3.1


Re: [RFC patch v1 0/3] qemu-file writing performance improving
Posted by Denis Plotnikov 4 years ago
Ping!



Re: [RFC patch v1 0/3] qemu-file writing performance improving
Posted by no-reply@patchew.org 4 years ago
Patchew URL: https://patchew.org/QEMU/1586776334-641239-1-git-send-email-dplotnikov@virtuozzo.com/



Hi,

This series failed the asan build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
export ARCH=x86_64
make docker-image-fedora V=1 NETWORK=1
time make docker-test-debug@fedora TARGET_LIST=x86_64-softmmu J=14 NETWORK=1
=== TEST SCRIPT END ===

/tmp/qemu-test/src/migration/qemu-file.c:415: undefined reference to `aio_task_pool_start_task'
/usr/bin/ld: migration/qemu-file.o: in function `qemu_file_switch_current_buf':
/tmp/qemu-test/src/migration/qemu-file.c:380: undefined reference to `aio_task_pool_wait_slot'
clang-8: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [/tmp/qemu-test/src/rules.mak:124: tests/test-vmstate] Error 1
make: *** Waiting for unfinished jobs....
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 664, in <module>
---
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', '--label', 'com.qemu.instance.uuid=305548e9be434ddeb3a06c706e9dab1a', '-u', '1001', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=x86_64-softmmu', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 'SHOW_ENV=', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', '/home/patchew/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', '/var/tmp/patchew-tester-tmp-aounzkok/src/docker-src.2020-04-13-09.44.50.27035:/var/tmp/qemu:z,ro', 'qemu:fedora', '/var/tmp/qemu/run', 'test-debug']' returned non-zero exit status 2.
filter=--filter=label=com.qemu.instance.uuid=305548e9be434ddeb3a06c706e9dab1a
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-aounzkok/src'
make: *** [docker-run-test-debug@fedora] Error 2

real    6m39.403s
user    0m9.069s


The full log is available at
http://patchew.org/logs/1586776334-641239-1-git-send-email-dplotnikov@virtuozzo.com/testing.asan/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
Re: [RFC patch v1 0/3] qemu-file writing performance improving
Posted by Denis V. Lunev 4 years ago
On 4/13/20 2:12 PM, Denis Plotnikov wrote:
> Problem description: qcow2 internal snapshot saving time is too long on HDD (~25 sec)
>
> When a qcow2 image is placed on a regular HDD and the image is opened with
> O_DIRECT, the snapshot saving time is around 26 sec.
> The snapshot saving time can be made 4 times shorter.
> This patch series proposes a way to achieve that.
>
> Why is the saving time ~25 sec?
>
> There are three contributing factors:
> 1. the qemu-file iov limit (currently 64)
> 2. direct qemu_fflush calls, inducing disk writes
in a non-aligned way, which further results in READ-MODIFY-WRITE
operations at the beginning and at the end of the written data.
With synchronous operations this slows down the process a lot!
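(For illustration only, assuming a 4 KiB alignment requirement: an unaligned
flush covering bytes 100..10000 of the file touches three 4 KiB blocks, but
the first and last blocks are only partially covered, so they have to be read,
merged with the new bytes, and written back before the write can complete.)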

> 3. RAM data copying and synchronous disk writes
>
> While 1 and 2 are quite clear, the 3rd needs some explanation:
>
> Internal snapshots use qemu-file as an interface to store the data with
> stream semantics.
> qemu-file avoids data copying when possible (mostly for RAM data)
> and uses iovectors to propagate the data to the underlying block driver state.
> When qcow2 is opened with O_DIRECT, this is suboptimal.
>
> This is what happens: on writing, when the iovec array goes from qemu-file
> to bdrv (here and below by bdrv I mean qcow2 with a POSIX backend opened with
> O_DIRECT), the bdrv checks that every iovector is base- and size-aligned; if
> that is not the case, the data is copied to an internal buffer and a
> synchronous pwrite is issued. If the iovectors are aligned, io_submit is used.
>
> In our case, the snapshot almost always induces pwrite, since the iovectors
> in the array are never all aligned: a short iovector (an 8-byte RAM-page
> delimiter) is added after each RAM-page iovector.
>
> So in this case the qemu-file code:
> 1. doesn't avoid RAM copying
> 2. works fully synchronously
>
> How to improve the snapshot time:
>
> 1. easy way: increase the iov limit to IOV_MAX (1024).
> This reduces the frequency of synchronous writes.
> My test showed that with iov limit = IOV_MAX the snapshot time is *~12 sec*.
>
> 2. complex way: do the writes asynchronously.
> Introduce base- and size-aligned buffers, write the data only when
> a buffer is full, write that buffer asynchronously, and meanwhile fill another
> buffer with snapshot data.
> My test showed that this complex way brings the snapshot time down to *~6 sec*,
> 2 times better than just increasing the iov limit.

We also align the written data, as flush operations to the disk
are not mandatory.
