Hi all!
Here is an asynchronous scheme for handling fragmented qcow2
reads and writes. Both the qcow2 read and write functions loop through
sequential portions of data. This series aims to parallelize these
loop iterations.
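
To illustrate the idea (this is not code from the series), here is a minimal
standalone sketch: instead of handling each contiguous portion of a fragmented
request one after another, every portion is dispatched to its own worker and
the request completes only when all portions have finished. Plain pthreads
stand in for the coroutines used in block/qcow2.c, and handle_portion() is a
hypothetical placeholder for the per-portion read or write.

=== parallel-portions.c (illustration only, not part of the series) ===
/* Illustrative only: pthreads stand in for QEMU coroutines. */
#include <pthread.h>
#include <stdio.h>

#define CLUSTER_SIZE (64 * 1024)
#define NPORTIONS 16           /* a 1 MiB request split into 64 KiB portions */

struct subrequest {
    size_t offset;             /* offset of this contiguous portion */
    size_t bytes;              /* length of the portion */
};

/* Hypothetical per-portion handler (one contiguous read or write). */
static void *handle_portion(void *opaque)
{
    struct subrequest *sr = opaque;
    printf("portion at offset %zu, %zu bytes\n", sr->offset, sr->bytes);
    return NULL;
}

int main(void)
{
    struct subrequest srs[NPORTIONS];
    pthread_t workers[NPORTIONS];

    for (int i = 0; i < NPORTIONS; i++) {
        srs[i].offset = (size_t)i * CLUSTER_SIZE;
        srs[i].bytes = CLUSTER_SIZE;
        /* The old sequential scheme would call handle_portion() here and
         * wait for it; the parallel scheme launches all portions first... */
        pthread_create(&workers[i], NULL, handle_portion, &srs[i]);
    }
    /* ...and only waits once they are all in flight. */
    for (int i = 0; i < NPORTIONS; i++) {
        pthread_join(workers[i], NULL);
    }
    return 0;
}

Compile with: cc -O2 parallel-portions.c -o parallel-portions -lpthread
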
It improves performance for fragmented qcow2 images; I've tested it
as follows:
I have four 4G qcow2 images (with the default 64k cluster size) on my SSD:
t-seq.qcow2 - sequentially written qcow2 image
t-reverse.qcow2 - filled by writing 64k portions from the end to the start
t-rand.qcow2 - filled by writing 64k portions (aligned) in random order
t-part-rand.qcow2 - filled by shuffling the order of 64k writes within each 1M chunk
(see the image-generation source code at the end for details)
and the test (sequential I/O in 1 MiB chunks):
test write:
    for t in /ssd/t-*; \
        do sync; echo 1 > /proc/sys/vm/drop_caches; echo === $t ===; \
        ./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none -w $t; \
    done
test read (same, just drop -w parameter):
    for t in /ssd/t-*; \
        do sync; echo 1 > /proc/sys/vm/drop_caches; echo === $t ===; \
        ./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none $t; \
    done
short info about parameters:
-w - do writes (otherwise do reads)
-c - count of blocks
-s - block size
-t none - disable cache
-n - native aio
-d 1 - don't use parallel requests provided by qemu-img bench itself
results (in seconds; lower is better):
+-----------+-----------+----------+-----------+----------+
| file      | wr before | wr after | rd before | rd after |
+-----------+-----------+----------+-----------+----------+
| seq       |     8.605 |    8.636 |     9.043 |    9.010 |
| reverse   |     9.934 |    8.654 |    17.162 |    8.662 |
| rand      |     9.983 |    8.687 |    19.775 |    9.010 |
| part-rand |     9.871 |    8.650 |    14.241 |    8.669 |
+-----------+-----------+----------+-----------+----------+
The performance gain is obvious, especially for reads.
How the images are generated:
=== gen-writes file ===
#!/usr/bin/env python
# Print a qemu-io command script that fills a 4G image with 64k writes
# in the order selected by sys.argv[1]: seq, reverse, rand or part-rand.
import random
import sys

size = 4 * 1024 * 1024 * 1024    # image size: 4G
block = 64 * 1024                # write size: 64k
block2 = 1024 * 1024             # shuffling granularity for part-rand: 1M

arg = sys.argv[1]

if arg in ('rand', 'reverse', 'seq'):
    writes = list(range(0, size, block))

    if arg == 'rand':
        random.shuffle(writes)
    elif arg == 'reverse':
        writes.reverse()
elif arg == 'part-rand':
    writes = []
    for off in range(0, size, block2):
        wr = list(range(off, off + block2, block))
        random.shuffle(wr)
        writes.extend(wr)
elif arg != 'seq':
    sys.exit(1)

for w in writes:
    print('write -P 0xff {} {}'.format(w, block))

print('q')
=== gen-test-images.sh file ===
#!/bin/bash

IMG_PATH=/ssd

for name in seq reverse rand part-rand; do
    IMG=$IMG_PATH/t-$name.qcow2
    echo creating $IMG ...
    rm -f $IMG
    qemu-img create -f qcow2 $IMG 4G
    gen-writes $name | qemu-io $IMG
done
Denis V. Lunev (1):
qcow2: move qemu_co_mutex_lock below decryption procedure
Vladimir Sementsov-Ogievskiy (6):
qcow2: bdrv_co_pwritev: move encryption code out of lock
qcow2: split out reading normal clusters from qcow2_co_preadv
qcow2: async scheme for qcow2_co_preadv
qcow2: refactor qcow2_co_pwritev: split out qcow2_co_do_pwritev
qcow2: refactor qcow2_co_pwritev locals scope
qcow2: async scheme for qcow2_co_pwritev
block/qcow2.c | 506 +++++++++++++++++++++++++++++--------
tests/qemu-iotests/026.out | 18 +-
tests/qemu-iotests/026.out.nocache | 20 +-
3 files changed, 415 insertions(+), 129 deletions(-)
--
2.11.1
On 2018-08-07 19:43, Vladimir Sementsov-Ogievskiy wrote:
> [...]
> -d 1 - don't use parallel requests provided by qemu-img bench itself

Hm, actually, why not?  And how does a guest behave?

If parallel requests on an SSD perform better, wouldn't a guest issue
parallel requests to the virtual device and thus to qcow2 anyway?

(I suppose the global qcow2 lock could be an issue here, but then your
benchmark should work even without -d 1.)

Max
16.08.2018 03:51, Max Reitz wrote:
> Hm, actually, why not?  And how does a guest behave?
>
> If parallel requests on an SSD perform better, wouldn't a guest issue
> parallel requests to the virtual device and thus to qcow2 anyway?

The guest knows nothing about qcow2 fragmentation, so this kind of
"asynchronization" can only be done at the qcow2 level.

However, if the guest does async I/O and sends a lot of parallel requests,
it behaves like qemu-img without the -d 1 option, and in that case the
parallel loop iterations in qcow2 don't matter as much. Still, I think
async parallel requests are better in general than sequential ones: if the
device has some unused capacity for parallelization, it will be utilized.
We already use this approach in mirror and qemu-img convert. In Virtuozzo
we also have backup improved by parallelizing its request loop. I think it
would be good to have some general code for such things in the future.

> (I suppose the global qcow2 lock could be an issue here, but then your
> benchmark should work even without -d 1.)
>
> Max

--
Best regards,
Vladimir
On 2018-08-16 15:58, Vladimir Sementsov-Ogievskiy wrote:
> The guest knows nothing about qcow2 fragmentation, so this kind of
> "asynchronization" can only be done at the qcow2 level.

Hm, yes.  I'm sorry, but without having looked closer at the series
(which is why I'm sorry in advance), I would suspect that the
performance improvement comes from us being able to send parallel
requests to an SSD.

So if you send large requests to an SSD, you may either send them in
parallel or sequentially, it doesn't matter.  But for small requests,
it's better to send them in parallel so the SSD always has requests in
its queue.

I would think this is where the performance improvement comes from.  But
I would also think that a guest OS knows this and it would also send
many requests in parallel so the virtual block device never runs out of
requests.

> However, if the guest does async I/O and sends a lot of parallel
> requests, it behaves like qemu-img without the -d 1 option, and in that
> case the parallel loop iterations in qcow2 don't matter as much. Still,
> I think async parallel requests are better in general than sequential
> ones: if the device has some unused capacity for parallelization, it
> will be utilized.

I agree that it probably doesn't make things worse performance-wise, but
it's always added complexity (see the diffstat), which is why I'm just
routinely asking how useful it is in practice. :-)

Anyway, I suspect there are indeed cases where a guest doesn't send many
requests in parallel but it makes sense for the qcow2 driver to
parallelize it.  That would be mainly when the guest reads seemingly
sequential data that is then fragmented in the qcow2 file.  So basically
what your benchmark is testing. :-)

Then, the guest could assume that there is no sense in parallelizing it
because the latency from the device is large enough, whereas in qemu
itself we always run dry and wait for different parts of the single
large request to finish.  So, yes, in that case, parallelization that's
internal to qcow2 would make sense.

Now another question is, does this negatively impact devices where
seeking is slow, i.e. HDDs?  Unfortunately I'm not home right now, so I
don't have access to an HDD to test myself...

> We already use this approach in mirror and qemu-img convert.

Indeed, but here you could always argue that this is just what guests
do, so we should, too.

> In Virtuozzo we also have backup improved by parallelizing its request
> loop. I think it would be good to have some general code for such
> things in the future.

Well, those are different things, I'd think.  Parallelization in
mirror/backup/convert is useful not just because of qcow2 issues, but
also because you have a volume to read from and a volume to write to, so
that's where parallelization gives you some pipelining.  And it gives
you buffers for latency spikes, I guess.

Max
17.08.2018 22:34, Max Reitz wrote:
> Now another question is, does this negatively impact devices where
> seeking is slow, i.e. HDDs?  Unfortunately I'm not home right now, so I
> don't have access to an HDD to test myself...

hdd:

+-----------+-----------+----------+-----------+----------+
| file      | wr before | wr after | rd before | rd after |
+-----------+-----------+----------+-----------+----------+
| seq       |    39.821 |   40.513 |    38.600 |   38.916 |
| reverse   |    60.320 |   57.902 |    98.223 |  111.717 |
| rand      |   614.826 |  580.452 |   672.600 |  465.120 |
| part-rand |    52.311 |   52.450 |    37.663 |   37.989 |
+-----------+-----------+----------+-----------+----------+

Hmm, ~14% degradation in the "reverse" read case (98.2 s -> 111.7 s),
strange magic...  However, the reverse pattern is nearly impossible in
practice.

--
Best regards,
Vladimir
On 2018-08-20 18:33, Vladimir Sementsov-Ogievskiy wrote:
> hdd:
>
> +-----------+-----------+----------+-----------+----------+
> | file      | wr before | wr after | rd before | rd after |
> +-----------+-----------+----------+-----------+----------+
> | seq       |    39.821 |   40.513 |    38.600 |   38.916 |
> | reverse   |    60.320 |   57.902 |    98.223 |  111.717 |
> | rand      |   614.826 |  580.452 |   672.600 |  465.120 |
> | part-rand |    52.311 |   52.450 |    37.663 |   37.989 |
> +-----------+-----------+----------+-----------+----------+
>
> Hmm, ~14% degradation in the "reverse" read case (98.2 s -> 111.7 s),
> strange magic...  However, the reverse pattern is nearly impossible in
> practice.

I tend to agree.  It's faster for random, and that's what matters more.

(Distinguishing between the cases in qcow2 seems like not so good of an
idea, and making it user-configurable is probably pointless because no
one will change the default.)

Max
On 08/17/2018 10:34 PM, Max Reitz wrote:
> Now another question is, does this negatively impact devices where
> seeking is slow, i.e. HDDs?  Unfortunately I'm not home right now, so I
> don't have access to an HDD to test myself...

There are different situations and different load patterns; for example,
there are cases where the guest executes a sequential read in a single
thread.  This looks obvious and dumb, but it certainly happens in real
life.  There is also the observation that Windows guests prefer long
requests: it is not unusual to see 4 MiB requests in the pipeline.

For such a load on a scattered file the performance difference should be
very big, even on an SSD, because without this the SSD will starve for
requests.  Here we are speaking in terms of latency, which will
definitely be higher in the sequential case.

Den
Ping. Finally, what about this?
07.08.2018 20:43, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
>
> Here is an asynchronous scheme for handling fragmented qcow2
> reads and writes.
> [...]
--
Best regards,
Vladimir