Hi,

This series builds on the previous RFC.  The workaround is now applied
regardless of AIO mode and filesystem, because we don’t know those
things for remote filesystems.  Furthermore, bdrv_co_get_self_request()
has been moved to block/io.c.

Applying the workaround unconditionally is fine from a performance
standpoint, because it should actually be dead code, thanks to patch 1
(the elephant in the room).  As far as I know, qcow2 is the only block
driver that submits zero writes as part of normal I/O (in
handle_alloc_space()), so that they can occur concurrently with other
write requests.  It still makes sense to keep the workaround in
file-posix, because we can’t really prevent other block drivers from
submitting zero writes as part of normal I/O in the future.

Anyway, let’s get to the elephant.

From input by XFS developers
(https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) it seems clear
that c8bb23cbdbe causes fundamental performance problems on XFS with
aio=native that cannot be fixed.  In other cases, c8bb23cbdbe improves
performance, or we wouldn’t have it.

In general, avoiding performance regressions is more important than
improving performance, unless the regressions are just a minor corner
case or insignificant when compared to the improvement.  The XFS
regression is no minor corner case, and it isn’t insignificant: Laurent
Vivier has found performance to decrease by as much as 88 % (on ppc64le,
fio in a guest with 4k blocks, iodepth=8: from 13.9 MB/s down to
1662 kB/s).

Thus, I believe we should revert the commit for now (and most
importantly, for 4.1.1).  We can think about reintroducing it for 5.0,
but that would require more extensive benchmarks on a variety of
systems, and we must see how subclusters change the picture.

I would have liked to do benchmarks myself before making this decision,
but as far as I’m informed, patches for 4.1.1 are to be collected on
Monday, so we need to be quick.
git-backport-diff against the RFC:

Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/4:[down] 'Revert "qcow2: skip writing zero buffers to empty COW areas"'
002/4:[----] [-C] 'block: Make wait/mark serialising requests public'
003/4:[down] 'block: Add bdrv_co_get_self_request()'
004/4:[0036] [FC] 'block/file-posix: Let post-EOF fallocate serialize'

Max Reitz (4):
  Revert "qcow2: skip writing zero buffers to empty COW areas"
  block: Make wait/mark serialising requests public
  block: Add bdrv_co_get_self_request()
  block/file-posix: Let post-EOF fallocate serialize

 qapi/block-core.json       |  4 +-
 block/qcow2.h              |  6 ---
 include/block/block_int.h  |  4 ++
 block/file-posix.c         | 38 +++++++++++++++++
 block/io.c                 | 42 +++++++++++++------
 block/qcow2-cluster.c      |  2 +-
 block/qcow2.c              | 86 --------------------------------------
 block/trace-events         |  1 -
 tests/qemu-iotests/060     |  7 +---
 tests/qemu-iotests/060.out |  5 +--
 10 files changed, 76 insertions(+), 119 deletions(-)

-- 
2.21.0
On 01.11.19 11:00, Max Reitz wrote:
> Hi,
>
> This series builds on the previous RFC.
>
> [...]
>
> The XFS regression is no minor corner case, and it isn’t insignificant.
> Laurent Vivier has found performance to decrease by as much as 88 % (on
> ppc64le, fio in a guest with 4k blocks, iodepth=8: 1662 kB/s from
> 13.9 MB/s).

Ah, crap.

I wanted to send this series as early today as possible to get as much
feedback as possible, so I’ve only started doing benchmarks now.

The obvious

$ qemu-img bench -t none -n -w -S 65536 test.qcow2

on XFS takes like 6 seconds on master, and like 50 to 80 seconds with
c8bb23cbdbe reverted.

So now on to guest tests...

(Well, and the question is still where the performance regression on
ppc64 comes from.)

Max

> [...]
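The benchmark above, spelled out end-to-end (a sketch: the image name,
its 16G size, and the cleanup are my additions, and the script skips
itself where qemu-img is not installed):

```shell
# -t none: cache=none, -n: aio=native, -w: write test,
# -S 65536: advance one 64 kB cluster per request, so every write
# allocates a fresh cluster and goes through the COW path.
CLUSTER=65536
if command -v qemu-img >/dev/null 2>&1; then
    qemu-img create -f qcow2 -o cluster_size=$CLUSTER test.qcow2 16G
    qemu-img bench -t none -n -w -S $CLUSTER test.qcow2
    rm -f test.qcow2
else
    echo "qemu-img not available, skipping"
fi
```

Run once on master and once with c8bb23cbdbe reverted to compare.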
01.11.2019 13:20, Max Reitz wrote:
> On 01.11.19 11:00, Max Reitz wrote:
>> [...]
>
> Ah, crap.
>
> I wanted to send this series as early today as possible to get as much
> feedback as possible, so I’ve only started doing benchmarks now.
>
> The obvious
>
> $ qemu-img bench -t none -n -w -S 65536 test.qcow2
>
> on XFS takes like 6 seconds on master, and like 50 to 80 seconds with
> c8bb23cbdbe reverted.
>
> So now on to guest tests...

Aha, that's very interesting)  What about aio-native, which should be
slowed down?  Could it be tested like this?

-- 
Best regards,
Vladimir
On 01.11.19 11:28, Vladimir Sementsov-Ogievskiy wrote:
> 01.11.2019 13:20, Max Reitz wrote:
>> [...]
>>
>> The obvious
>>
>> $ qemu-img bench -t none -n -w -S 65536 test.qcow2
>>
>> on XFS takes like 6 seconds on master, and like 50 to 80 seconds with
>> c8bb23cbdbe reverted.
>>
>> So now on to guest tests...
>
> Aha, that's very interesting) What about aio-native which should be
> slowed down?  Could it be tested like this?

That is aio=native (-n).

But so far I don’t see any significant difference in guest tests (i.e.,
fio --rw=write --bs=4k --iodepth=8 --runtime=1m --direct=1
--ioengine=libaio --thread --numjobs=16 --size=2G --time_based), neither
with 64 kB nor with 2 MB clusters.  (But only on XFS, I’ll have to see
about ext4 still.)

(Reverting c8bb23cbdbe makes it like 1 to 2 % faster.)

Max
01.11.2019 14:12, Max Reitz wrote:
> On 01.11.19 11:28, Vladimir Sementsov-Ogievskiy wrote:
>> [...]
>>
>> Aha, that's very interesting) What about aio-native which should be
>> slowed down?  Could it be tested like this?
>
> That is aio=native (-n).
>
> But so far I don’t see any significant difference in guest tests (i.e.,
> fio --rw=write --bs=4k --iodepth=8 --runtime=1m --direct=1
> --ioengine=libaio --thread --numjobs=16 --size=2G --time_based), neither
> with 64 kB nor with 2 MB clusters.  (But only on XFS, I’ll have to see
> about ext4 still.)

Hmm, this possibly mostly tests writes to already allocated clusters.
Does fio have an option to behave like qemu-img bench with -S 65536,
i.e. write once into each cluster?

> (Reverting c8bb23cbdbe makes it like 1 to 2 % faster.)

-- 
Best regards,
Vladimir
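For what it's worth, fio's sequential patterns accept a hole-size
suffix (rw=str:int), which can approximate "write once into each
cluster" from inside a guest.  A sketch (the job name, the /dev/vdb
device path, and the sizes are placeholders; it only prints the command
it would run):

```shell
# One 4 kB write per 64 kB cluster: with bs=4k, "write:$SKIP" makes
# fio skip SKIP bytes after each sequential write, so consecutive
# writes land 64 kB apart, i.e. one per fresh cluster.
CLUSTER=65536
BS=4096
SKIP=$((CLUSTER - BS))   # 61440 bytes of hole after each write
echo "fio --name=cow-alloc --filename=/dev/vdb --rw=write:$SKIP --bs=4k --iodepth=8 --direct=1 --ioengine=libaio --size=2G"
```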
On 01.11.19 12:16, Vladimir Sementsov-Ogievskiy wrote:
> 01.11.2019 14:12, Max Reitz wrote:
>> [...]
>>
>> But so far I don’t see any significant difference in guest tests (i.e.,
>> fio --rw=write --bs=4k --iodepth=8 --runtime=1m --direct=1
>> --ioengine=libaio --thread --numjobs=16 --size=2G --time_based), neither
>> with 64 kB nor with 2 MB clusters.  (But only on XFS, I’ll have to see
>> about ext4 still.)
>
> hmm, this possibly mostly tests writes to already allocated clusters.
> Has fio an option to behave like qemu-img bench with -S 65536, i.e.
> write once into each cluster?

Maybe, but is that a realistic depiction of whether this change is worth
it?  That is why I’m doing the guest test, to see whether it actually
has much impact on the guest.

(The thing is that Laurent and our QE have seen significant real-world
performance regressions on ppc64.)

Max
On 01.11.19 12:20, Max Reitz wrote:
> On 01.11.19 12:16, Vladimir Sementsov-Ogievskiy wrote:
>> [...]
>>
>> hmm, this possibly mostly tests writes to already allocated clusters.
>> Has fio an option to behave like qemu-img bench with -S 65536, i.e.
>> write once into each cluster?
>
> Maybe, but is that a realistic depiction of whether this change is worth
> it?  That is why I’m doing the guest test, to see whether it actually
> has much impact on the guest.

I’ve changed the above fio invocation to use --rw=randwrite and added
--fallocate=none.  The performance went down, but it went down both with
and without c8bb23cbdbe.

So on my XFS system (XFS on luks on SSD), I see:
- with c8bb23cbdbe:    26.0 - 27.9 MB/s
- without c8bb23cbdbe: 25.6 - 27 MB/s

On my ext4 system (native on SSD), I see:
- with:    39.4 - 41.5 MB/s
- without: 39.4 - 42.0 MB/s

So basically no difference for XFS, and really no difference for ext4.
(I ran these tests with 2 MB clusters.)

Max
01.11.2019 15:34, Max Reitz wrote:
> On 01.11.19 12:20, Max Reitz wrote:
>> [...]
>
> I’ve changed the above fio invocation to use --rw=randwrite and added
> --fallocate=none.  The performance went down, but it went down both with
> and without c8bb23cbdbe.
>
> So on my XFS system (XFS on luks on SSD), I see:
> - with c8bb23cbdbe:    26.0 - 27.9 MB/s
> - without c8bb23cbdbe: 25.6 - 27 MB/s
>
> On my ext4 system (native on SSD), I see:
> - with:    39.4 - 41.5 MB/s
> - without: 39.4 - 42.0 MB/s
>
> So basically no difference for XFS, and really no difference for ext4.
> (I ran these tests with 2 MB clusters.)

Hmm, I don't know.  To me it seems obvious that zeroing a 2M cluster is
slow, and this is proved by simple tests with qemu-img bench: fallocate
is faster than zeroing most of the cluster.

So, if some guest test doesn't show the difference, this means that
"small write into a new cluster" is effectively a rare case in this
test.  And this doesn't prove that it's always rare and insignificant.

I'm not sure that we have a real-world example that proves the necessity
of this optimization, or whether there was some original bug about low
performance which was fixed by this optimization.  Den, Anton, do we
have something about it?

-- 
Best regards,
Vladimir
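The simple qemu-img bench comparison referred to above could look like
this for 2 MB clusters (a sketch; the image name, the 32G size, and the
request count are my additions):

```shell
# 4 kB writes, one per empty 2 MB cluster: with handle_alloc_space()
# each request costs a fallocate() plus a 4 kB write; without it,
# qcow2 explicitly writes the ~2 MB of zeroed COW area per request.
# Run once on master and once with c8bb23cbdbe reverted.
CLUSTER=$((2 * 1024 * 1024))
if command -v qemu-img >/dev/null 2>&1; then
    qemu-img create -f qcow2 -o cluster_size=2M test.qcow2 32G
    # -s: 4 kB request size, -S: 2 MB step; -c 8000 keeps the
    # stepped offsets inside the 32G image
    qemu-img bench -c 8000 -t none -n -w -s 4096 -S $CLUSTER test.qcow2
    rm -f test.qcow2
else
    echo "qemu-img not available, skipping"
fi
```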
On 11/1/19 4:09 PM, Vladimir Sementsov-Ogievskiy wrote: > 01.11.2019 15:34, Max Reitz wrote: >> On 01.11.19 12:20, Max Reitz wrote: >>> On 01.11.19 12:16, Vladimir Sementsov-Ogievskiy wrote: >>>> 01.11.2019 14:12, Max Reitz wrote: >>>>> On 01.11.19 11:28, Vladimir Sementsov-Ogievskiy wrote: >>>>>> 01.11.2019 13:20, Max Reitz wrote: >>>>>>> On 01.11.19 11:00, Max Reitz wrote: >>>>>>>> Hi, >>>>>>>> >>>>>>>> This series builds on the previous RFC. The workaround is now applied >>>>>>>> unconditionally of AIO mode and filesystem because we don’t know those >>>>>>>> things for remote filesystems. Furthermore, bdrv_co_get_self_request() >>>>>>>> has been moved to block/io.c. >>>>>>>> >>>>>>>> Applying the workaround unconditionally is fine from a performance >>>>>>>> standpoint, because it should actually be dead code, thanks to patch 1 >>>>>>>> (the elephant in the room). As far as I know, there is no other block >>>>>>>> driver but qcow2 in handle_alloc_space() that would submit zero writes >>>>>>>> as part of normal I/O so it can occur concurrently to other write >>>>>>>> requests. It still makes sense to take the workaround for file-posix >>>>>>>> because we can’t really prevent that any other block driver will submit >>>>>>>> zero writes as part of normal I/O in the future. >>>>>>>> >>>>>>>> Anyway, let’s get to the elephant. >>>>>>>> >>>>>>>> From input by XFS developers >>>>>>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) it seems clear >>>>>>>> that c8bb23cbdbe causes fundamental performance problems on XFS with >>>>>>>> aio=native that cannot be fixed. In other cases, c8bb23cbdbe improves >>>>>>>> performance or we wouldn’t have it. >>>>>>>> >>>>>>>> In general, avoiding performance regressions is more important than >>>>>>>> improving performance, unless the regressions are just a minor corner >>>>>>>> case or insignificant when compared to the improvement. The XFS >>>>>>>> regression is no minor corner case, and it isn’t insignificant. 
Laurent >>>>>>>> Vivier has found performance to decrease by as much as 88 % (on ppc64le, >>>>>>>> fio in a guest with 4k blocks, iodepth=8: 1662 kB/s from 13.9 MB/s). >>>>>>> Ah, crap. >>>>>>> >>>>>>> I wanted to send this series as early today as possible to get as much >>>>>>> feedback as possible, so I’ve only started doing benchmarks now. >>>>>>> >>>>>>> The obvious >>>>>>> >>>>>>> $ qemu-img bench -t none -n -w -S 65536 test.qcow2 >>>>>>> >>>>>>> on XFS takes like 6 seconds on master, and like 50 to 80 seconds with >>>>>>> c8bb23cbdbe reverted. So now on to guest tests... >>>>>> Aha, that's very interesting) What about aio-native which should be slowed down? >>>>>> Could it be tested like this? >>>>> That is aio=native (-n). >>>>> >>>>> But so far I don’t see any significant difference in guest tests (i.e., >>>>> fio --rw=write --bs=4k --iodepth=8 --runtime=1m --direct=1 >>>>> --ioengine=libaio --thread --numjobs=16 --size=2G --time_based), neither >>>>> with 64 kB nor with 2 MB clusters. (But only on XFS, I’ll have to see >>>>> about ext4 still.) >>>> hmm, this possibly mostly tests writes to already allocated clusters. Has fio >>>> an option to behave like qemu-img bench with -S 65536, i.e. write once into >>>> each cluster? >>> Maybe, but is that a realistic depiction of whether this change is worth >>> it? That is why I’m doing the guest test, to see whether it actually >>> has much impact on the guest. >> I’ve changed the above fio invocation to use --rw=randwrite and added >> --fallocate=none. The performance went down, but it went down both with >> and without c8bb23cbdbe. >> >> So on my XFS system (XFS on luks on SSD), I see: >> - with c8bb23cbdbe: 26.0 - 27.9 MB/s >> - without c8bb23cbdbe: 25.6 - 27 MB/s >> >> On my ext4 system (native on SSD), I see: >> - with: 39.4 - 41.5 MB/s >> - without: 39.4 - 42.0 MB/s >> >> So basically no difference for XFS, and really no difference for ext4. >> (I ran these tests with 2 MB clusters.) >> > Hmm. I don't know. 
> For me it seems obvious that zeroing a 2 MB cluster is slow, and this is
> proved by simple tests with qemu-img bench: fallocate is faster than
> zeroing most of the cluster.
>
> So if some guest test doesn't show the difference, this means that
> "small write into a new cluster" is effectively a rare case in this
> test. And that doesn't prove that it's always rare and insignificant.
>
> I'm not sure we have a real-world example that proves the necessity of
> this optimization, or whether there was some original bug about low
> performance that was fixed by it. Den, Anton, do we have something
> about it?

Sorry, I have missed the beginning of the thread.

Which driver is used for the virtual disk - is cached or non-cached I/O
used in QEMU? We use non-cached by default, and this could make a
significant difference.

Max,

can you please share your domain.xml of the guest config and the fio
file for the guest? I will recheck to be 120% sure.

Thank you in advance,
Den
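[Editorial note: as a back-of-the-envelope check of Vladimir's point, a
small guest write that allocates a fresh cluster forces qcow2 (without
c8bb23cbdbe) to write explicit zeroes for the rest of the cluster,
assuming there is no backing data to copy; with the optimization, that
padding becomes a single write-zeroes request that file-posix can turn
into one fallocate() call. A rough sketch of the data volumes, using the
4 kB write size and the 64 kB / 2 MB cluster sizes discussed in this
thread:]

```python
# Rough sketch: bytes of literal zeroes qcow2 must write when a small
# guest write allocates a new cluster (no backing data), without the
# handle_alloc_space() optimization from c8bb23cbdbe. With it, this
# padding is a single write-zeroes request instead.
KiB = 1024
MiB = 1024 * KiB

def zero_padding(cluster_size, write_size):
    """Zero bytes needed to fill the head/tail of a fresh cluster."""
    return cluster_size - write_size

for cluster in (64 * KiB, 2 * MiB):
    pad = zero_padding(cluster, 4 * KiB)
    print(f"{cluster // KiB:>5} KiB cluster: {pad} bytes of zeroes "
          f"({pad // (4 * KiB)}x the guest write)")
```

So for a 2 MB cluster the zero padding is roughly 500 times the size of
the actual guest write, which is why replacing it with fallocate() can
matter so much in allocation-heavy workloads like qemu-img bench with
-S 65536.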
On 01.11.19 14:36, Denis Lunev wrote:

[...]

> sorry, I have missed the beginning of the thread.
>
> Which driver is used for virtual disk - cached or non-cached IO
> is used in QEMU? We use non-cached by default and this could
> make a difference significantly.

I’m using no cache; the above tests were done with aio=native. I’ve sent
another response with aio=threads numbers.

> Max,
>
> can you pls share your domain.xml of the guest config and
> fio file for guest. I will recheck to be 120% sure.
I’m running qemu directly as follows:

x86_64-softmmu/qemu-system-x86_64 \
    -serial stdio \
    -cdrom ~/tmp/arch.iso \
    -m 4096 \
    -enable-kvm \
    -drive \
    if=none,id=t,format=qcow2,file=test/test.qcow2,cache=none,aio=native \
    -device virtio-scsi \
    -device scsi-hd,drive=t \
    -net user \
    -net nic,model=rtl8139 \
    -smp 2,maxcpus=2,cores=1,threads=1,sockets=2 \
    -cpu SandyBridge \
    -nodefaults \
    -nographic

The full fio command line is:

fio --rw=randwrite --bs=4k --iodepth=8 --runtime=1m --direct=1 \
    --filename=/mnt/foo --name=job1 --ioengine=libaio --thread \
    --group_reporting --numjobs=16 --size=2G --time_based \
    --output=/tmp/fio_result --fallocate=none

Max
On 01.11.19 13:34, Max Reitz wrote:

[...]

> So basically no difference for XFS, and really no difference for ext4.
> (I ran these tests with 2 MB clusters.)
For 64 kB clusters, I have on XFS:
- with:    30.3 - 31.3 MB/s
- without: 27.4 - 27.9 MB/s

On ext4:
- with:    34.9 - 36.4 MB/s
- without: 33.8 - 38.0 MB/s

For good measure, 2 MB clusters with aio=threads (but still O_DIRECT) on
XFS:
- with:    25.7 - 27.0 MB/s
- without: 24.6 - 26.7 MB/s

On ext4:
- with:    16.7 - 19.3 MB/s
- without: 16.3 - 18.4 MB/s

(Note that the difference between ext4 and XFS is probably just because
these are two different systems with different SSDs.)

So there is little difference, if any significant at all (those were all
just three test runs). There does seem to be a bit of a drift for
aio=threads, but then again it’s also just much slower than aio=native
for ext4.

(Also note that I ran this on an unfixed kernel. Maybe the XFS fix will
slow things down, but I haven’t tested that yet.)

So unless there are realistic guest benchmarks for ext4 that say we
should keep the patch, I’m going to queue the revert for now (“now” =
4.1.1 and 4.2.0).

Max
On 01.11.19 14:30, Max Reitz wrote:

[...]

> So unless there are realistic guest benchmarks for ext4 that say we
> should keep the patch, I’m going to queue the revert for now (“now” =
> 4.1.1 and 4.2.0).

I found one case where the performance is significantly improved by
c8bb23cbdbe: In the cases so far I had XFS in the guest; now I used ext4,
and with aio=native (on the ext4 host with 2 MB clusters), the
performance goes from 63.9 - 65.0 MB/s to 75.7 - 76.4 MB/s (so +18 %).

The difference is smaller for 64 kB clusters, but still there at +13 %.
That’s probably the more important fact, because these are the default
settings, and this is probably about what would happen with 2 MB clusters
+ subclusters. (Patch 4 in this series doesn’t decrease the performance.)

This is a tough decision for me, because some people tell me “Let’s just
revert it, there are problems with it and we don’t quite know what good
it does in practice”, and others say “We have (not really practical)
benchmarks that show it does something good for our specific case”. And
all that while those two groups never seem to talk to each other
directly. So I suppose I’m going to have to make a decision.

I now know a case where c8bb23cbdbe is actually beneficial. I myself have
never seen c8bb23cbdbe decrease performance, but I know Laurent has seen
a drastic performance degradation, and he’s used it to bisect the XFS
problem to that commit, so it’s really real. But I haven’t seen it, and
as far as I know it really only happens on ppc64.

In light of this, I would prefer to revert c8bb23cbdbe for 4.1.1, and
keep it for 4.2.0. But I don’t know whether we can do that; all I know
is that I’m not going to find out in time for 4.1.1.

If we keep c8bb23cbdbe in any way, we need patches 2 through 4, that much
is clear. I believe we can think about the performance problem after
4.2.0. I would love to benchmark c8bb23cbdbe on a fixed kernel, but
there just isn’t time for that anymore.
I’m not a fan of keeping c8bb23cbdbe behind a configure switch. If it’s
beneficial, it should be there for everyone.

OK. Some may see this as a wrong decision, but someone needs to make one
now, so here goes:

ext4 is the default Linux FS for most distributions. As far as I can
tell from my own benchmarks, c8bb23cbdbe brings a significant performance
improvement for qcow2 images with the default configuration on this
default filesystem with aio=native, and doesn’t change much in any other
case.

What happens on ppc64 is a problem, but that’s a RHEL problem, because
it’s specific to XFS (and also ppc64). It also won’t be a regression in
4.2 compared to 4.1. Dave’s argument that fallocate() and AIO cannot mix
(at least on XFS) was good, but I couldn’t see any impact of that in my
benchmarks (maybe my benchmarks were just wrong).

So I think for upstream it’ll be best if I send a v2 which doesn’t touch
handle_alloc_space(), and instead just consists of patches 2 through 4.
(And CC it all to stable.) I think we still need to keep track of the
XFS/ppc64 issue and do more benchmarks, especially with the fixed XFS
driver.

tl;dr: The main arguments for reverting c8bb23cbdbe were (AFAIU):
- a general uneasy feeling about it
- theoretical arguments that it must be bad on XFS
- actual problems on ppc64/XFS
- “what good does it do in practice?”
- that subclusters would make it obsolete anyway

What I could see is:
- no impact on XFS in practice
- a significant practical benefit on ext4
- subclusters probably wouldn’t make it obsolete, because I can still
  see a +13 % improvement for 64 kB clusters (2 MB clusters +
  subclusters gives you 64 kB subclusters)

In addition, it needs to be considered that ext4 is the default FS for
most Linux distributions.

As such, I personally am not convinced of reverting this patch. Let’s
keep it, have patches 2 through 4 for 4.1.1 and 4.2.0, and think about
what to do for ppc64/XFS later.

Max
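[Editorial note: the percentage figures quoted in this thread can be
reproduced from the raw throughput numbers given above (Laurent's
13.9 MB/s down to 1662 kB/s regression on ppc64le/XFS, and the ext4
guest improvement from 63.9 - 65.0 MB/s to 75.7 - 76.4 MB/s; midpoints
are used for the ranges). A quick arithmetic check:]

```python
# Sanity-check the percentage figures quoted in the thread from the
# raw throughput numbers reported there.

def pct_change(before, after):
    """Percentage change from `before` to `after` (negative = slower)."""
    return (after - before) / before * 100

# Laurent's ppc64le/XFS regression: 13.9 MB/s down to 1662 kB/s.
drop = pct_change(13.9, 1.662)
print(f"XFS/ppc64le regression: {drop:.0f} %")   # about -88 %

# ext4 guest, aio=native, 2 MB clusters: 63.9-65.0 -> 75.7-76.4 MB/s
# (range midpoints).
gain = pct_change((63.9 + 65.0) / 2, (75.7 + 76.4) / 2)
print(f"ext4 improvement: +{gain:.0f} %")        # about +18 %
```

Both quoted figures (the 88 % regression and the +18 % improvement) match
the raw numbers, so the disagreement in the thread is genuinely about
which workload matters, not about the measurements themselves.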