[Qemu-devel] [PATCH v4 0/5] qcow2: async handling of fragmented io

Vladimir Sementsov-Ogievskiy posted 5 patches 6 years, 2 months ago
git fetch https://github.com/patchew-project/qemu tags/patchew/20190816153015.447957-1-vsementsov@virtuozzo.com
Maintainers: Kevin Wolf <kwolf@redhat.com>, Markus Armbruster <armbru@redhat.com>, Max Reitz <mreitz@redhat.com>, Eric Blake <eblake@redhat.com>
There is a newer version of this series
[Qemu-devel] [PATCH v4 0/5] qcow2: async handling of fragmented io
Posted by Vladimir Sementsov-Ogievskiy 6 years, 2 months ago
Hi all!

Here is an asynchronous scheme for handling fragmented qcow2
reads and writes. Both the qcow2 read and write functions loop through
sequential portions of data. This series aims to parallelize these
loop iterations.
It improves performance for fragmented qcow2 images; I've tested it
as described below.
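
The transformation looks roughly like the sketch below: instead of the
request coroutine performing each sequential portion itself, it packs the
portion's parameters into a task and feeds it to a bounded pool of worker
coroutines (the aio_task API added in patch 2; its shape is sketched after
the v2 notes below). Here find_next_portion() and do_one_portion() are
hypothetical stand-ins for the real qcow2 cluster-mapping and I/O code, and
the real code only creates the pool when a request is actually fragmented,
so this illustrates the pattern rather than the literal patch code:

    typedef struct ReadTask {
        AioTask task;           /* embedded; the pool only sees this */
        BlockDriverState *bs;
        uint64_t offset;
        uint64_t bytes;
        QEMUIOVector *qiov;
        size_t qiov_offset;
    } ReadTask;

    static int coroutine_fn read_task_entry(AioTask *task)
    {
        ReadTask *t = container_of(task, ReadTask, task);

        return do_one_portion(t->bs, t->offset, t->bytes,
                              t->qiov, t->qiov_offset);
    }

    static int coroutine_fn preadv_parallel(BlockDriverState *bs,
                                            uint64_t offset, uint64_t bytes,
                                            QEMUIOVector *qiov)
    {
        AioTaskPool *pool = aio_task_pool_new(QCOW2_MAX_WORKERS);
        size_t qiov_offset = 0;
        int ret;

        while (bytes != 0) {
            uint64_t cur_bytes = find_next_portion(bs, offset, bytes);
            ReadTask *t = g_new(ReadTask, 1);

            *t = (ReadTask) {
                .task.func   = read_task_entry,
                .bs          = bs,
                .offset      = offset,
                .bytes       = cur_bytes,
                .qiov        = qiov,
                .qiov_offset = qiov_offset,
            };
            /*
             * Blocks while QCOW2_MAX_WORKERS tasks are already in
             * flight; the pool frees the task when it completes.
             */
            aio_task_pool_start_task(pool, t);

            offset += cur_bytes;
            qiov_offset += cur_bytes;
            bytes -= cur_bytes;
        }

        aio_task_pool_wait_all(pool);
        ret = aio_task_pool_status(pool);
        aio_task_pool_free(pool);

        return ret;
    }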

v4 [perf results not updated]:
01: new patch. Unrelated, but 026 needs to be fixed before this series so
    that the series can then fix it correctly :)
02: - use coroutine_fn where appropriate (i.e. in aio_task_pool_new too)
    - add Max's r-b
03,04: add Max's r-b
05: fix 026 output

v3 (by Max's comments) [perf results not updated]:

01: - use coroutine_fn where appropriate (!)
    - add aio_task_pool_free
    - add some comments
    - move header to include/block
    - s/wait_done/waiting
02: - Rewrite note about decryption in guest buffers [thx to Eric]
    - separate g_assert_not_reached for QCOW2_CLUSTER_ZERO_*
    - drop return after g_assert_not_reached
03: - drop bytes_done and correctly use qiov_offset
    - fix comment
04: - move QCOW2_MAX_WORKERS to block/qcow2.h
    - initialize ret in qcow2_co_preadv_part
Based-on: https://github.com/stefanha/qemu/commits/block


v2: changed a lot, as
 1. a lot of preparation around locks, hd_qiovs and threads for encryption
    is done
 2. I decided to create a separate file with an async request handling API,
    to reuse it for backup, stream and copy-on-read to improve their
    performance too (the API shape is sketched below). Mirror and qemu-img
    convert have their own async request handling; maybe we'll finally be
    able to merge all this similar code into one feature.
    Note that not all of the API calls are used in qcow2; some will be
    needed in following steps to parallelize other I/O loops.
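
For context, the async request handling API from patch 2 has roughly the
following shape (paraphrased, not copied; include/block/aio_task.h in the
patch itself is authoritative):

    typedef struct AioTaskPool AioTaskPool;
    typedef struct AioTask AioTask;
    typedef int coroutine_fn (*AioTaskFunc)(AioTask *task);

    struct AioTask {
        AioTaskPool *pool;      /* filled in by the pool code */
        AioTaskFunc func;
        int ret;
    };

    /* create a pool running at most max_busy_tasks tasks in parallel */
    AioTaskPool *coroutine_fn aio_task_pool_new(int max_busy_tasks);
    void aio_task_pool_free(AioTaskPool *pool);

    /* error code of the first failed task, or 0 if none failed so far */
    int aio_task_pool_status(AioTaskPool *pool);

    /* start @task, waiting for a free slot if the pool is full */
    void coroutine_fn aio_task_pool_start_task(AioTaskPool *pool,
                                               AioTask *task);

    /* wait until all started tasks have finished */
    void coroutine_fn aio_task_pool_wait_all(AioTaskPool *pool);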

About testing:

I have four 4G qcow2 images (with the default 64k cluster size) on my SSD
(the /tmp results below are from copies on an HDD):
t-seq.qcow2 - written sequentially
t-reverse.qcow2 - filled by writing 64k portions from the end to the start
t-rand.qcow2 - filled by writing 64k portions (aligned) in random order
t-part-rand.qcow2 - filled by shuffling the order of the 64k writes within
each 1m area
(see the image generation source code at the end for details)

and I've done several runs like the following (sequential I/O in 1 MiB chunks):

    out=/tmp/block; echo > $out
    cat /tmp/files | while read file; do
        for wr in {"","-w"}; do
            echo "$file" $wr
            ./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none $wr "$file" \
                | grep 'Run completed in' | awk '{print $4}' >> $out
        done
    done


short info about the parameters:
  -w - do writes (otherwise do reads)
  -c - number of blocks
  -s - block size
  -t none - disable the cache
  -n - use native AIO
  -d 1 - don't use the parallel requests provided by qemu-img bench itself
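
(So each run transfers 4096 * 1m = 4G of data, i.e. exactly one pass over
the whole image.)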

results (run time in seconds; lower is better):

    +---------------------------+---------+---------+
    | file                      | master  | async   |
    +---------------------------+---------+---------+
    | /ssd/t-part-rand.qcow2    | 14.671  | 9.193   |
    +---------------------------+---------+---------+
    | /ssd/t-part-rand.qcow2 -w | 11.434  | 8.621   |
    +---------------------------+---------+---------+
    | /ssd/t-rand.qcow2         | 20.421  | 10.05   |
    +---------------------------+---------+---------+
    | /ssd/t-rand.qcow2 -w      | 11.097  | 8.915   |
    +---------------------------+---------+---------+
    | /ssd/t-reverse.qcow2      | 17.515  | 9.407   |
    +---------------------------+---------+---------+
    | /ssd/t-reverse.qcow2 -w   | 11.255  | 8.649   |
    +---------------------------+---------+---------+
    | /ssd/t-seq.qcow2          | 9.081   | 9.072   |
    +---------------------------+---------+---------+
    | /ssd/t-seq.qcow2 -w       | 8.761   | 8.747   |
    +---------------------------+---------+---------+
    | /tmp/t-part-rand.qcow2    | 41.179  | 41.37   |
    +---------------------------+---------+---------+
    | /tmp/t-part-rand.qcow2 -w | 54.097  | 55.323  |
    +---------------------------+---------+---------+
    | /tmp/t-rand.qcow2         | 711.899 | 514.339 |
    +---------------------------+---------+---------+
    | /tmp/t-rand.qcow2 -w      | 546.259 | 642.114 |
    +---------------------------+---------+---------+
    | /tmp/t-reverse.qcow2      | 86.065  | 96.522  |
    +---------------------------+---------+---------+
    | /tmp/t-reverse.qcow2 -w   | 46.557  | 48.499  |
    +---------------------------+---------+---------+
    | /tmp/t-seq.qcow2          | 33.804  | 33.862  |
    +---------------------------+---------+---------+
    | /tmp/t-seq.qcow2 -w       | 34.299  | 34.233  |
    +---------------------------+---------+---------+


The performance gain is obvious, especially for reads and especially on the
SSD. On the HDD there is a degradation in the reverse case, but that is the
most unlikely access pattern in practice and seems not critical.

How the images are generated:

    ==== gen-writes ======
    #!/usr/bin/env python3
    import random
    import sys

    size = 4 * 1024 * 1024 * 1024   # total image size: 4 GiB
    block = 64 * 1024               # write granularity: 64k
    block2 = 1024 * 1024            # shuffle area for part-rand: 1m

    arg = sys.argv[1]

    if arg in ('rand', 'reverse', 'seq'):
        writes = list(range(0, size, block))

    if arg == 'rand':
        random.shuffle(writes)
    elif arg == 'reverse':
        writes.reverse()
    elif arg == 'part-rand':
        writes = []
        for off in range(0, size, block2):
            wr = list(range(off, off + block2, block))
            random.shuffle(wr)
            writes.extend(wr)
    elif arg != 'seq':
        sys.exit(1)

    for w in writes:
        print('write -P 0xff {} {}'.format(w, block))

    print('q')
    ==========================

    ===== gen-test-images.sh =====
    #!/bin/bash

    IMG_PATH=/ssd

    for name in seq reverse rand part-rand; do
        IMG=$IMG_PATH/t-$name.qcow2
        echo creating $IMG ...
        rm -f $IMG
        qemu-img create -f qcow2 $IMG 4G
        gen-writes $name | qemu-io $IMG
    done
    ==============================


Vladimir Sementsov-Ogievskiy (5):
  tests/qemu-iotests: Fix qemu-io related output in 026.out.nocache
  block: introduce aio task pool
  block/qcow2: refactor qcow2_co_preadv_part
  block/qcow2: refactor qcow2_co_pwritev_part
  block/qcow2: introduce parallel subrequest handling in read and write

 qapi/block-core.json               |   2 +-
 block/qcow2.h                      |   3 +
 include/block/aio_task.h           |  54 ++++
 block/aio_task.c                   | 124 ++++++++
 block/qcow2.c                      | 461 +++++++++++++++++++----------
 block/Makefile.objs                |   2 +
 block/trace-events                 |   1 +
 tests/qemu-iotests/026.out         |  18 +-
 tests/qemu-iotests/026.out.nocache | 188 ++++++------
 9 files changed, 591 insertions(+), 262 deletions(-)
 create mode 100644 include/block/aio_task.h
 create mode 100644 block/aio_task.c

-- 
2.18.0


Re: [Qemu-devel] [PATCH v4 0/5] qcow2: async handling of fragmented io
Posted by Vladimir Sementsov-Ogievskiy 6 years, 2 months ago
Pinging, as Stefan's branch has been merged into master and this series is now based on master.

16.08.2019 18:30, Vladimir Sementsov-Ogievskiy wrote:
> [...]


-- 
Best regards,
Vladimir
Re: [Qemu-devel] [PATCH v4 0/5] qcow2: async handling of fragmented io
Posted by Max Reitz 6 years, 1 month ago
On 16.08.19 17:30, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
> 
> Here is an asynchronous scheme for handling fragmented qcow2
> reads and writes. Both the qcow2 read and write functions loop through
> sequential portions of data. This series aims to parallelize these
> loop iterations.
> It improves performance for fragmented qcow2 images; I've tested it
> as described below.

Thanks, I’ve changed two things:
- Replaced assert((x & (BDRV_SECTOR_SIZE - 1)) == 0) by
  assert(QEMU_IS_ALIGNED(x, BDRV_SECTOR_SIZE)) in patch 3 (conflict with
  “block: Use QEMU_IS_ALIGNED”), and
- Replaced the remaining instance of “qcow2_co_do_pwritev()” by
  “qcow2_co_pwritev_task()” in a comment in patch 4

and applied the series to my block branch:

https://git.xanclic.moe/XanClic/qemu/commits/branch/block

Max

Re: [Qemu-devel] [PATCH v4 0/5] qcow2: async handling of fragmented io
Posted by Vladimir Sementsov-Ogievskiy 6 years, 1 month ago
13.09.2019 11:58, Max Reitz wrote:
> [...]

Thank you!!!

-- 
Best regards,
Vladimir
Re: [Qemu-devel] [PATCH v4 0/5] qcow2: async handling of fragmented io
Posted by Max Reitz 6 years, 1 month ago
On 13.09.19 10:58, Max Reitz wrote:
> [...]

Unfortunately, I’ll have to unstage the series for now because the fix
to 026’s reference output isn’t stable.

When running the test in parallel (I can reproduce it with four
instances on my machine with two cores + HT), I get failures like:

026      fail       [15:21:09] [15:21:37]      (last: 18s)   output
mismatch (see 026.out.bad)
--- tests/qemu-iotests/026.out 2019-09-16 14:49:20.720410701 +0200
+++ tests/qemu-iotests/026.out.bad       2019-09-16 15:21:37.180711936 +0200
@@ -563,7 +563,7 @@
 qemu-io: Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device

-74 leaked clusters were found on the image.
+522 leaked clusters were found on the image.
 This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824

Failures: 026
Failed 1 of 1 iotests

Max

Re: [Qemu-devel] [PATCH v4 0/5] qcow2: async handling of fragmented io
Posted by Vladimir Sementsov-Ogievskiy 6 years, 1 month ago
16.09.2019 16:26, Max Reitz wrote:
> [...]
> 
> Unfortunately, I’ll have to unstage the series for now because the fix
> to 026’s reference output isn’t stable.
> 
> When running the test in parallel (I can reproduce it with four
> instances on my machine with two cores + HT), I get failures like:
> 
> 026      fail       [15:21:09] [15:21:37]      (last: 18s)   output
> mismatch (see 026.out.bad)
> --- tests/qemu-iotests/026.out 2019-09-16 14:49:20.720410701 +0200
> +++ tests/qemu-iotests/026.out.bad       2019-09-16 15:21:37.180711936 +0200
> @@ -563,7 +563,7 @@
>   qemu-io: Failed to flush the refcount block cache: No space left on device
>   write failed: No space left on device
> 
> -74 leaked clusters were found on the image.
> +522 leaked clusters were found on the image.
>   This means waste of disk space, but no harm to data.
>   Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
> 
> Failures: 026
> Failed 1 of 1 iotests
> 

Unfortunate enough :)

Hmm, I can't reproduce it, but I'm inclined to fix this by just filtering
out the information about leaked clusters in this test, as there is no
sense in tracking it on failure paths, especially keeping in mind the newly
introduced async handling of request parts.

-- 
Best regards,
Vladimir