Hi all!
Here is an asynchronous scheme for handling fragmented qcow2
reads and writes. Both the qcow2 read and write functions loop through
sequential portions of data. This series aims to parallelize these
loop iterations.
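
To illustrate the idea (this is not code from the series), here is a minimal
standalone sketch: instead of handling each contiguous portion of a fragmented
request one after another, every portion is dispatched to its own worker and
the request completes only when all portions have finished. Plain pthreads
stand in for the coroutines used in block/qcow2.c, and handle_portion() is a
hypothetical placeholder for the per-portion read or write.

=== parallel-portions.c (illustration only, not part of the series) ===
/* Illustrative only: pthreads stand in for QEMU coroutines. */
#include <pthread.h>
#include <stdio.h>

#define CLUSTER_SIZE (64 * 1024)
#define NPORTIONS 16           /* a 1 MiB request split into 64 KiB portions */

struct subrequest {
    size_t offset;             /* offset of this contiguous portion */
    size_t bytes;              /* length of the portion */
};

/* Hypothetical per-portion handler (one contiguous read or write). */
static void *handle_portion(void *opaque)
{
    struct subrequest *sr = opaque;
    printf("portion at offset %zu, %zu bytes\n", sr->offset, sr->bytes);
    return NULL;
}

int main(void)
{
    struct subrequest srs[NPORTIONS];
    pthread_t workers[NPORTIONS];

    for (int i = 0; i < NPORTIONS; i++) {
        srs[i].offset = (size_t)i * CLUSTER_SIZE;
        srs[i].bytes = CLUSTER_SIZE;
        /* The old sequential scheme would call handle_portion() here and
         * wait for it; the parallel scheme launches all portions first... */
        pthread_create(&workers[i], NULL, handle_portion, &srs[i]);
    }
    /* ...and only waits once they are all in flight. */
    for (int i = 0; i < NPORTIONS; i++) {
        pthread_join(workers[i], NULL);
    }
    return 0;
}

Compile with: cc -O2 parallel-portions.c -o parallel-portions -lpthread
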
It improves performance for fragmented qcow2 images; I've tested it
as follows:
I have four 4G qcow2 images (with the default 64k cluster size) on my SSD:
t-seq.qcow2 - sequentially written qcow2 image
t-reverse.qcow2 - filled by writing 64k portions from the end to the start
t-rand.qcow2 - filled by writing 64k portions (aligned) in random order
t-part-rand.qcow2 - filled by shuffling the order of 64k writes within each 1M chunk
(see the image-generation source code at the end for details)
and the test (sequential I/O in 1 MiB chunks):
test write:
    for t in /ssd/t-*; \
        do sync; echo 1 > /proc/sys/vm/drop_caches; echo === $t ===; \
        ./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none -w $t; \
    done
test read (same, just drop -w parameter):
    for t in /ssd/t-*; \
        do sync; echo 1 > /proc/sys/vm/drop_caches; echo === $t ===; \
        ./qemu-img bench -c 4096 -d 1 -f qcow2 -n -s 1m -t none $t; \
    done
short info about parameters:
-w - do writes (otherwise do reads)
-c - count of blocks
-s - block size
-t none - disable cache
-n - native aio
-d 1 - don't use parallel requests provided by qemu-img bench itself
results (in seconds; lower is better):
+-----------+-----------+----------+-----------+----------+
| file      | wr before | wr after | rd before | rd after |
+-----------+-----------+----------+-----------+----------+
| seq       |     8.605 |    8.636 |     9.043 |    9.010 |
| reverse   |     9.934 |    8.654 |    17.162 |    8.662 |
| rand      |     9.983 |    8.687 |    19.775 |    9.010 |
| part-rand |     9.871 |    8.650 |    14.241 |    8.669 |
+-----------+-----------+----------+-----------+----------+
The performance gain is obvious, especially for reads.
How the images are generated:
=== gen-writes file ===
#!/usr/bin/env python
# Print a qemu-io command script that fills a 4G image with 64k writes
# in the order selected by sys.argv[1]: seq, reverse, rand or part-rand.
import random
import sys

size = 4 * 1024 * 1024 * 1024    # image size: 4G
block = 64 * 1024                # write size: 64k
block2 = 1024 * 1024             # shuffling granularity for part-rand: 1M

arg = sys.argv[1]

if arg in ('rand', 'reverse', 'seq'):
    writes = list(range(0, size, block))

    if arg == 'rand':
        random.shuffle(writes)
    elif arg == 'reverse':
        writes.reverse()
elif arg == 'part-rand':
    writes = []
    for off in range(0, size, block2):
        wr = list(range(off, off + block2, block))
        random.shuffle(wr)
        writes.extend(wr)
elif arg != 'seq':
    sys.exit(1)

for w in writes:
    print('write -P 0xff {} {}'.format(w, block))

print('q')
=== gen-test-images.sh file ===
#!/bin/bash

IMG_PATH=/ssd

for name in seq reverse rand part-rand; do
    IMG=$IMG_PATH/t-$name.qcow2
    echo creating $IMG ...
    rm -f $IMG
    qemu-img create -f qcow2 $IMG 4G
    gen-writes $name | qemu-io $IMG
done
Denis V. Lunev (1):
qcow2: move qemu_co_mutex_lock below decryption procedure
Vladimir Sementsov-Ogievskiy (6):
qcow2: bdrv_co_pwritev: move encryption code out of lock
qcow2: split out reading normal clusters from qcow2_co_preadv
qcow2: async scheme for qcow2_co_preadv
qcow2: refactor qcow2_co_pwritev: split out qcow2_co_do_pwritev
qcow2: refactor qcow2_co_pwritev locals scope
qcow2: async scheme for qcow2_co_pwritev
block/qcow2.c | 506 +++++++++++++++++++++++++++++--------
tests/qemu-iotests/026.out | 18 +-
tests/qemu-iotests/026.out.nocache | 20 +-
3 files changed, 415 insertions(+), 129 deletions(-)
--
2.11.1
On 2018-08-07 19:43, Vladimir Sementsov-Ogievskiy wrote:
> [...]
> -d 1 - don't use parallel requests provided by qemu-img bench itself

Hm, actually, why not?  And how does a guest behave?

If parallel requests on an SSD perform better, wouldn't a guest issue
parallel requests to the virtual device and thus to qcow2 anyway?

(I suppose the global qcow2 lock could be an issue here, but then your
benchmark should work even without -d 1.)

Max
16.08.2018 03:51, Max Reitz wrote:
> Hm, actually, why not?  And how does a guest behave?
>
> If parallel requests on an SSD perform better, wouldn't a guest issue
> parallel requests to the virtual device and thus to qcow2 anyway?

The guest knows nothing about qcow2 fragmentation, so this kind of
"asynchronization" can only be done at the qcow2 level.

However, if the guest does async I/O and sends a lot of parallel requests,
it behaves like qemu-img without the -d 1 option, and in that case the
parallel loop iterations in qcow2 don't matter as much. Still, I think
async parallel requests are better in general than sequential ones: if the
device has some unused capacity for parallelization, it will be utilized.
We already use this approach in mirror and qemu-img convert. In Virtuozzo
we also have backup improved by parallelizing its request loop. I think it
would be good to have some general code for such things in the future.

> (I suppose the global qcow2 lock could be an issue here, but then your
> benchmark should work even without -d 1.)
>
> Max

--
Best regards,
Vladimir
On 2018-08-16 15:58, Vladimir Sementsov-Ogievskiy wrote:
> The guest knows nothing about qcow2 fragmentation, so this kind of
> "asynchronization" can only be done at the qcow2 level.

Hm, yes.  I'm sorry, but without having looked closer at the series
(which is why I'm sorry in advance), I would suspect that the
performance improvement comes from us being able to send parallel
requests to an SSD.

So if you send large requests to an SSD, you may either send them in
parallel or sequentially, it doesn't matter.  But for small requests,
it's better to send them in parallel so the SSD always has requests in
its queue.

I would think this is where the performance improvement comes from.  But
I would also think that a guest OS knows this and it would also send
many requests in parallel so the virtual block device never runs out of
requests.

> However, if the guest does async I/O and sends a lot of parallel
> requests, it behaves like qemu-img without the -d 1 option, and in that
> case the parallel loop iterations in qcow2 don't matter as much. Still,
> I think async parallel requests are better in general than sequential
> ones: if the device has some unused capacity for parallelization, it
> will be utilized.

I agree that it probably doesn't make things worse performance-wise, but
it's always added complexity (see the diffstat), which is why I'm just
routinely asking how useful it is in practice. :-)

Anyway, I suspect there are indeed cases where a guest doesn't send many
requests in parallel but it makes sense for the qcow2 driver to
parallelize it.  That would be mainly when the guest reads seemingly
sequential data that is then fragmented in the qcow2 file.  So basically
what your benchmark is testing. :-)

Then, the guest could assume that there is no sense in parallelizing it
because the latency from the device is large enough, whereas in qemu
itself we always run dry and wait for different parts of the single
large request to finish.  So, yes, in that case, parallelization that's
internal to qcow2 would make sense.

Now another question is, does this negatively impact devices where
seeking is slow, i.e. HDDs?  Unfortunately I'm not home right now, so I
don't have access to an HDD to test myself...

> We already use this approach in mirror and qemu-img convert.

Indeed, but here you could always argue that this is just what guests
do, so we should, too.

> In Virtuozzo we also have backup improved by parallelizing its request
> loop. I think it would be good to have some general code for such
> things in the future.

Well, those are different things, I'd think.  Parallelization in
mirror/backup/convert is useful not just because of qcow2 issues, but
also because you have a volume to read from and a volume to write to, so
that's where parallelization gives you some pipelining.  And it gives
you buffers for latency spikes, I guess.

Max
17.08.2018 22:34, Max Reitz wrote:
> Now another question is, does this negatively impact devices where
> seeking is slow, i.e. HDDs?  Unfortunately I'm not home right now, so I
> don't have access to an HDD to test myself...

hdd:

+-----------+-----------+----------+-----------+----------+
| file      | wr before | wr after | rd before | rd after |
+-----------+-----------+----------+-----------+----------+
| seq       |    39.821 |   40.513 |    38.600 |   38.916 |
| reverse   |    60.320 |   57.902 |    98.223 |  111.717 |
| rand      |   614.826 |  580.452 |   672.600 |  465.120 |
| part-rand |    52.311 |   52.450 |    37.663 |   37.989 |
+-----------+-----------+----------+-----------+----------+

Hmm, ~14% degradation in the "reverse" read case (98.2 s -> 111.7 s),
strange magic...  However, the reverse pattern is nearly impossible in
practice.

--
Best regards,
Vladimir
On 2018-08-20 18:33, Vladimir Sementsov-Ogievskiy wrote:
> hdd:
>
> +-----------+-----------+----------+-----------+----------+
> | file      | wr before | wr after | rd before | rd after |
> +-----------+-----------+----------+-----------+----------+
> | seq       |    39.821 |   40.513 |    38.600 |   38.916 |
> | reverse   |    60.320 |   57.902 |    98.223 |  111.717 |
> | rand      |   614.826 |  580.452 |   672.600 |  465.120 |
> | part-rand |    52.311 |   52.450 |    37.663 |   37.989 |
> +-----------+-----------+----------+-----------+----------+
>
> Hmm, ~14% degradation in the "reverse" read case (98.2 s -> 111.7 s),
> strange magic...  However, the reverse pattern is nearly impossible in
> practice.

I tend to agree.  It's faster for random, and that's what matters more.

(Distinguishing between the cases in qcow2 seems like not so good of an
idea, and making it user-configurable is probably pointless because no
one will change the default.)

Max
On 08/17/2018 10:34 PM, Max Reitz wrote:
> Now another question is, does this negatively impact devices where
> seeking is slow, i.e. HDDs?  Unfortunately I'm not home right now, so I
> don't have access to an HDD to test myself...

There are different situations and different load patterns; for example,
there are cases where the guest executes a sequential read in a single
thread.  This looks obvious and dumb, but it certainly happens in real
life.  There is also the observation that Windows guests prefer long
requests: it is not unusual to see 4 MiB requests in the pipeline.

For such a load on a scattered file the performance difference should be
very big, even on an SSD, because without this the SSD will starve for
requests.  Here we are speaking in terms of latency, which will
definitely be higher in the sequential case.

Den
Ping. Finally, what about this?
07.08.2018 20:43, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
>
> Here is an asynchronous scheme for handling fragmented qcow2
> reads and writes.
> [...]
--
Best regards,
Vladimir