[RFC 0/3] block/file-posix: Work around XFS bug

Max Reitz posted 3 patches 4 years, 6 months ago
Test asan failed
Test checkpatch passed
Test FreeBSD passed
Test docker-mingw@fedora passed
Test docker-clang@ubuntu passed
Test docker-quick@centos7 failed
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20191025095849.25283-1-mreitz@redhat.com
Maintainers: Kevin Wolf <kwolf@redhat.com>, Stefan Hajnoczi <stefanha@redhat.com>, Max Reitz <mreitz@redhat.com>, Fam Zheng <fam@euphon.net>
include/block/block_int.h |  3 +++
block/file-posix.c        | 46 +++++++++++++++++++++++++++++++++++++--
block/io.c                | 24 ++++++++++----------
3 files changed, 59 insertions(+), 14 deletions(-)
[RFC 0/3] block/file-posix: Work around XFS bug
Posted by Max Reitz 4 years, 6 months ago
Hi,

It seems to me that there is a bug in Linux’s XFS kernel driver, as
I’ve explained here:

https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html

In combination with our commit c8bb23cbdbe32f, this may lead to guest
data corruption when using qcow2 images on XFS with aio=native.

We can’t wait until the XFS kernel driver is fixed; we should work
around the problem ourselves.

This is an RFC for three reasons:
(1) I don’t know whether this is the right way to address the issue,
(2) Ideally, we should detect whether the XFS kernel driver is fixed and
    if so stop applying the workaround.
    I don’t know how we would go about this, so this series doesn’t do
    it.  (Hence it’s an RFC.)
(3) Perhaps it’s a bit of a layering violation to let the file-posix
    driver access and modify a BdrvTrackedRequest object.

As for how we can address the issue, I see three ways:
(1) The one presented in this series: On XFS with aio=native, we extend
    tracked requests for post-EOF fallocate() calls (i.e., write-zero
    operations) to reach until infinity (INT64_MAX in practice), mark
    them serializing and wait for other conflicting requests.  (A rough
    sketch of this idea follows after the list of options below.)

    Advantages:
    + Limits the impact to very specific cases
      (And that means it wouldn’t hurt too much to keep this workaround
      even when the XFS driver has been fixed)
    + Works around the bug where it happens, namely in file-posix

    Disadvantages:
    - A bit complex
    - A bit of a layering violation (should file-posix have access to
      tracked requests?)

(2) Always skip qcow2’s handle_alloc_space() on XFS.  The XFS bug only
    becomes visible due to that function: I don’t think qcow2 writes
    zeroes in any other I/O path, and raw images are fixed in size so
    post-EOF writes won’t happen.

    Advantages:
    + Maybe simpler, depending on how difficult it is to handle the
      layering violation
    + Also fixes the performance problem of handle_alloc_space() being
      slow on ppc64+XFS.

    Disadvantages:
    - Huge layering violation because qcow2 would need to know whether
      the image is stored on XFS or not.
    - We’d definitely want to skip this workaround when the XFS driver
      has been fixed, so we need some method to find out whether it has

(3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
    To my knowledge I’m the only one who has provided any benchmarks for
    this commit, and even then I was a bit skeptical because it performs
    well in some cases and badly in others.  I concluded that it’s
    probably worth it because the “some cases” are more likely to occur.

    Now we have this problem of corruption here (granted due to a bug in
    the XFS driver), and another report of massively degraded
    performance on ppc64
    (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
    private BZ; I hate that :-/  The report is about 40 % worse
    performance for an in-guest fio write benchmark.)

    So I have to ask what the justification for keeping c8bb23cbdbe32f
    is.  How much does it actually improve performance?  (On
    non-(ppc64+XFS) machines, obviously.)

    Advantages:
    + Trivial
    + No layering violations
    + We wouldn’t need to keep track of whether the kernel bug has been
      fixed or not
    + Fixes the ppc64+XFS performance problem

    Disadvantages:
    - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
      levels, whatever that means
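
To illustrate option (1), here is a rough sketch of what the file-posix
side could look like.  It assumes the
bdrv_mark_request_serialising()/bdrv_wait_serialising_requests() helpers
that patch 1 would export and an s->is_xfs flag as added by patch 2; the
exact hook point and field names are assumptions, so take it as a
sketch, not as the literal patch:

  /* block/file-posix.c (sketch only)
   *
   * Called for a write-zeroes request that starts at or beyond the
   * current EOF, before the fallocate() is issued.
   */
  static void coroutine_fn
  raw_serialize_post_eof_zeroes(BlockDriverState *bs,
                                BdrvTrackedRequest *req)
  {
      BDRVRawState *s = bs->opaque;

      if (!s->is_xfs || !s->use_linux_aio) {
          return;  /* only the affected configuration */
      }

      /*
       * Stretch the tracked request to "infinity" so that any data
       * write at a higher offset conflicts with it, then mark it
       * serializing and wait, so no data write can be in flight past
       * the fallocate() while it runs.
       */
      req->bytes = QEMU_ALIGN_DOWN(INT64_MAX, bs->bl.request_alignment)
                   - req->offset;

      bdrv_mark_request_serialising(req, bs->bl.request_alignment);
      bdrv_wait_serialising_requests(req);
  }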

So this is the main reason this is an RFC: What should we do?  Is (1)
really the best choice?


In any case, I’ve run the test case I showed in
https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01282.html
more than ten times with this series applied, and the installation
succeeded every time.  (Without this series, it fails roughly every
other time.)


Max Reitz (3):
  block: Make wait/mark serialising requests public
  block/file-posix: Detect XFS with CONFIG_FALLOCATE
  block/file-posix: Let post-EOF fallocate serialize

 include/block/block_int.h |  3 +++
 block/file-posix.c        | 46 +++++++++++++++++++++++++++++++++++++--
 block/io.c                | 24 ++++++++++----------
 3 files changed, 59 insertions(+), 14 deletions(-)

-- 
2.21.0


Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by no-reply@patchew.org 4 years, 6 months ago
Patchew URL: https://patchew.org/QEMU/20191025095849.25283-1-mreitz@redhat.com/



Hi,

This series failed the docker-quick@centos7 build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
make docker-image-centos7 V=1 NETWORK=1
time make docker-test-quick@centos7 SHOW_ENV=1 J=14 NETWORK=1
=== TEST SCRIPT END ===

  CC      block/sheepdog.o
  CC      block/accounting.o
/tmp/qemu-test/src/block/file-posix.c: In function 'raw_open_common':
/tmp/qemu-test/src/block/file-posix.c:671:5: error: implicit declaration of function 'platform_test_xfs_fd' [-Werror=implicit-function-declaration]
     if (platform_test_xfs_fd(s->fd)) {
     ^
/tmp/qemu-test/src/block/file-posix.c:671:5: error: nested extern declaration of 'platform_test_xfs_fd' [-Werror=nested-externs]
cc1: all warnings being treated as errors
make: *** [block/file-posix.o] Error 1
make: *** Waiting for unfinished jobs....
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 662, in <module>
---
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', '--label', 'com.qemu.instance.uuid=c74d47e9bfb24107b6e94885fa8a2151', '-u', '1003', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', '/home/patchew2/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', '/var/tmp/patchew-tester-tmp-ytazf4e4/src/docker-src.2019-10-25-20.11.53.7609:/var/tmp/qemu:z,ro', 'qemu:centos7', '/var/tmp/qemu/run', 'test-quick']' returned non-zero exit status 2.
filter=--filter=label=com.qemu.instance.uuid=c74d47e9bfb24107b6e94885fa8a2151
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-ytazf4e4/src'
make: *** [docker-run-test-quick@centos7] Error 2

real    2m32.235s
user    0m8.092s


The full log is available at
http://patchew.org/logs/20191025095849.25283-1-mreitz@redhat.com/testing.docker-quick@centos7/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
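
The build failure above comes from using platform_test_xfs_fd() without
the xfsprogs headers, which the CentOS 7 test image does not have.  For
illustration only, here is a hedged, self-contained sketch of an XFS
check that needs no xfsprogs headers, assuming detection via fstatfs()
and the XFS superblock magic; a respin of the series might well do
something different:

  #include <stdio.h>
  #include <sys/vfs.h>

  #ifndef XFS_SUPER_MAGIC
  #define XFS_SUPER_MAGIC 0x58465342  /* "XFSB", value as in linux/magic.h */
  #endif

  /* Return 1 if the file behind fd lives on XFS, 0 otherwise. */
  static int fd_is_on_xfs(int fd)
  {
      struct statfs sfs;

      if (fstatfs(fd, &sfs) < 0) {
          return 0;  /* treat errors as "not XFS" */
      }
      return sfs.f_type == XFS_SUPER_MAGIC;
  }

  int main(void)
  {
      printf("stdin on XFS: %s\n", fd_is_on_xfs(0) ? "yes" : "no");
      return 0;
  }
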
Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Peter Maydell 4 years, 6 months ago
On Fri, 25 Oct 2019 at 11:09, Max Reitz <mreitz@redhat.com> wrote:
>
> Hi,
>
> It seems to me that there is a bug in Linux’s XFS kernel driver, as
> I’ve explained here:
>
> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
>
> In combination with our commit c8bb23cbdbe32f, this may lead to guest
> data corruption when using qcow2 images on XFS with aio=native.

Have we reported that upstream to the xfs folks?

thanks
-- PMM

Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Max Reitz 4 years, 6 months ago
On 25.10.19 15:46, Peter Maydell wrote:
> On Fri, 25 Oct 2019 at 11:09, Max Reitz <mreitz@redhat.com> wrote:
>>
>> Hi,
>>
>> It seems to me that there is a bug in Linux’s XFS kernel driver, as
>> I’ve explained here:
>>
>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
>>
>> In combination with our commit c8bb23cbdbe32f, this may lead to guest
>> data corruption when using qcow2 images on XFS with aio=native.
> 
> Have we reported that upstream to the xfs folks?

I’ve created an RH BZ here:

https://bugzilla.redhat.com/show_bug.cgi?id=1765547

So at least some XFS folks are aware of it (and I trust them to inform
the others as necessary :-)).  (Eric Sandeen has been part of the
discussion for the last couple of days already.)

Max

Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Peter Maydell 4 years, 6 months ago
On Fri, 25 Oct 2019 at 15:16, Max Reitz <mreitz@redhat.com> wrote:
>
> On 25.10.19 15:46, Peter Maydell wrote:
> > On Fri, 25 Oct 2019 at 11:09, Max Reitz <mreitz@redhat.com> wrote:
> >>
> >> Hi,
> >>
> >> It seems to me that there is a bug in Linux’s XFS kernel driver, as
> >> I’ve explained here:
> >>
> >> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
> >>
> >> In combination with our commit c8bb23cbdbe32f, this may lead to guest
> >> data corruption when using qcow2 images on XFS with aio=native.
> >
> > Have we reported that upstream to the xfs folks?
>
> I’ve created an RH BZ here:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1765547

Currently "You are not authorized to access bug #1765547." for
anonymous browsers, just fyi.

thanks
-- PMM

Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Max Reitz 4 years, 6 months ago
On 25.10.19 16:17, Peter Maydell wrote:
> On Fri, 25 Oct 2019 at 15:16, Max Reitz <mreitz@redhat.com> wrote:
>>
>> On 25.10.19 15:46, Peter Maydell wrote:
>>> On Fri, 25 Oct 2019 at 11:09, Max Reitz <mreitz@redhat.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> It seems to me that there is a bug in Linux’s XFS kernel driver, as
>>>> I’ve explained here:
>>>>
>>>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
>>>>
>>>> In combination with our commit c8bb23cbdbe32f, this may lead to guest
>>>> data corruption when using qcow2 images on XFS with aio=native.
>>>
>>> Have we reported that upstream to the xfs folks?
>>
>> I’ve created an RH BZ here:
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=1765547
> 
> Currently "You are not authorized to access bug #1765547." for
> anonymous browsers, just fyi.

Err.  Oops.  That wasn’t my intention.  I hate that misfeature.

Thanks for telling me, I fixed it. :-)

Max

Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Peter Maydell 4 years, 6 months ago
On Fri, 25 Oct 2019 at 15:21, Max Reitz <mreitz@redhat.com> wrote:
>
> On 25.10.19 16:17, Peter Maydell wrote:
> > On Fri, 25 Oct 2019 at 15:16, Max Reitz <mreitz@redhat.com> wrote:
> >> I’ve created an RH BZ here:
> >>
> >> https://bugzilla.redhat.com/show_bug.cgi?id=1765547
> >
> > Currently "You are not authorized to access bug #1765547." for
> > anonymous browsers, just fyi.
>
> Err.  Oops.  That wasn’t my intention.  I hate that misfeature.
>
> Thanks for telling me, I fixed it. :-)

Yeah, that's a really well-written bug report; you definitely
don't want to hide it away :-) Thanks for making it public.

-- PMM

Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Nir Soffer 4 years, 6 months ago
On Fri, Oct 25, 2019 at 1:11 PM Max Reitz <mreitz@redhat.com> wrote:
>
> Hi,
>
> It seems to me that there is a bug in Linux’s XFS kernel driver, as
> I’ve explained here:
>
> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
>
> In combination with our commit c8bb23cbdbe32f, this may lead to guest
> data corruption when using qcow2 images on XFS with aio=native.
>
> We can’t wait until the XFS kernel driver is fixed, we should work
> around the problem ourselves.
>
> This is an RFC for two reasons:
> (1) I don’t know whether this is the right way to address the issue,
> (2) Ideally, we should detect whether the XFS kernel driver is fixed and
>     if so stop applying the workaround.
>     I don’t know how we would go about this, so this series doesn’t do
>     it.  (Hence it’s an RFC.)
> (3) Perhaps it’s a bit of a layering violation to let the file-posix
>     driver access and modify a BdrvTrackedRequest object.
>
> As for how we can address the issue, I see three ways:
> (1) The one presented in this series: On XFS with aio=native, we extend
>     tracked requests for post-EOF fallocate() calls (i.e., write-zero
>     operations) to reach until infinity (INT64_MAX in practice), mark
>     them serializing and wait for other conflicting requests.
>
>     Advantages:
>     + Limits the impact to very specific cases
>       (And that means it wouldn’t hurt too much to keep this workaround
>       even when the XFS driver has been fixed)
>     + Works around the bug where it happens, namely in file-posix
>
>     Disadvantages:
>     - A bit complex
>     - A bit of a layering violation (should file-posix have access to
>       tracked requests?)
>
> (2) Always skip qcow2’s handle_alloc_space() on XFS.  The XFS bug only
>     becomes visible due to that function: I don’t think qcow2 writes
>     zeroes in any other I/O path, and raw images are fixed in size so
>     post-EOF writes won’t happen.
>
>     Advantages:
>     + Maybe simpler, depending on how difficult it is to handle the
>       layering violation
>     + Also fixes the performance problem of handle_alloc_space() being
>       slow on ppc64+XFS.
>
>     Disadvantages:
>     - Huge layering violation because qcow2 would need to know whether
>       the image is stored on XFS or not.
>     - We’d definitely want to skip this workaround when the XFS driver
>       has been fixed, so we need some method to find out whether it has
>
> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>     To my knowledge I’m the only one who has provided any benchmarks for
>     this commit, and even then I was a bit skeptical because it performs
>     well in some cases and bad in others.  I concluded that it’s
>     probably worth it because the “some cases” are more likely to occur.
>
>     Now we have this problem of corruption here (granted due to a bug in
>     the XFS driver), and another report of massively degraded
>     performance on ppc64
>     (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>     private BZ; I hate that :-/  The report is about 40 % worse
>     performance for an in-guest fio write benchmark.)
>
>     So I have to ask the question about what the justification for
>     keeping c8bb23cbdbe32f is.  How much does performance increase with
>     it actually?  (On non-(ppc64+XFS) machines, obviously)
>
>     Advantages:
>     + Trivial
>     + No layering violations
>     + We wouldn’t need to keep track of whether the kernel bug has been
>       fixed or not
>     + Fixes the ppc64+XFS performance problem
>
>     Disadvantages:
>     - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>       levels, whatever that means

Correctness is more important than performance, so this is my
preference as a user.

Nir

> So this is the main reason this is an RFC: What should we do?  Is (1)
> really the best choice?
>
>
> In any case, I’ve ran the test case I showed in
> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01282.html
> more than ten times with this series applied and the installation
> succeeded every time.  (Without this series, it fails like every other
> time.)
>
>
> Max Reitz (3):
>   block: Make wait/mark serialising requests public
>   block/file-posix: Detect XFS with CONFIG_FALLOCATE
>   block/file-posix: Let post-EOF fallocate serialize
>
>  include/block/block_int.h |  3 +++
>  block/file-posix.c        | 46 +++++++++++++++++++++++++++++++++++++--
>  block/io.c                | 24 ++++++++++----------
>  3 files changed, 59 insertions(+), 14 deletions(-)
>
> --
> 2.21.0
>
>

Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Vladimir Sementsov-Ogievskiy 4 years, 6 months ago
26.10.2019 20:37, Nir Soffer wrote:
> On Fri, Oct 25, 2019 at 1:11 PM Max Reitz <mreitz@redhat.com> wrote:
>>
>> Hi,
>>
>> It seems to me that there is a bug in Linux’s XFS kernel driver, as
>> I’ve explained here:
>>
>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
>>
>> In combination with our commit c8bb23cbdbe32f, this may lead to guest
>> data corruption when using qcow2 images on XFS with aio=native.
>>
>> We can’t wait until the XFS kernel driver is fixed, we should work
>> around the problem ourselves.
>>
>> This is an RFC for two reasons:
>> (1) I don’t know whether this is the right way to address the issue,
>> (2) Ideally, we should detect whether the XFS kernel driver is fixed and
>>      if so stop applying the workaround.
>>      I don’t know how we would go about this, so this series doesn’t do
>>      it.  (Hence it’s an RFC.)
>> (3) Perhaps it’s a bit of a layering violation to let the file-posix
>>      driver access and modify a BdrvTrackedRequest object.
>>
>> As for how we can address the issue, I see three ways:
>> (1) The one presented in this series: On XFS with aio=native, we extend
>>      tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>      operations) to reach until infinity (INT64_MAX in practice), mark
>>      them serializing and wait for other conflicting requests.
>>
>>      Advantages:
>>      + Limits the impact to very specific cases
>>        (And that means it wouldn’t hurt too much to keep this workaround
>>        even when the XFS driver has been fixed)
>>      + Works around the bug where it happens, namely in file-posix
>>
>>      Disadvantages:
>>      - A bit complex
>>      - A bit of a layering violation (should file-posix have access to
>>        tracked requests?)
>>
>> (2) Always skip qcow2’s handle_alloc_space() on XFS.  The XFS bug only
>>      becomes visible due to that function: I don’t think qcow2 writes
>>      zeroes in any other I/O path, and raw images are fixed in size so
>>      post-EOF writes won’t happen.
>>
>>      Advantages:
>>      + Maybe simpler, depending on how difficult it is to handle the
>>        layering violation
>>      + Also fixes the performance problem of handle_alloc_space() being
>>        slow on ppc64+XFS.
>>
>>      Disadvantages:
>>      - Huge layering violation because qcow2 would need to know whether
>>        the image is stored on XFS or not.
>>      - We’d definitely want to skip this workaround when the XFS driver
>>        has been fixed, so we need some method to find out whether it has
>>
>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>      To my knowledge I’m the only one who has provided any benchmarks for
>>      this commit, and even then I was a bit skeptical because it performs
>>      well in some cases and bad in others.  I concluded that it’s
>>      probably worth it because the “some cases” are more likely to occur.
>>
>>      Now we have this problem of corruption here (granted due to a bug in
>>      the XFS driver), and another report of massively degraded
>>      performance on ppc64
>>      (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>      private BZ; I hate that :-/  The report is about 40 % worse
>>      performance for an in-guest fio write benchmark.)
>>
>>      So I have to ask the question about what the justification for
>>      keeping c8bb23cbdbe32f is.  How much does performance increase with
>>      it actually?  (On non-(ppc64+XFS) machines, obviously)
>>
>>      Advantages:
>>      + Trivial
>>      + No layering violations
>>      + We wouldn’t need to keep track of whether the kernel bug has been
>>        fixed or not
>>      + Fixes the ppc64+XFS performance problem
>>
>>      Disadvantages:
>>      - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>        levels, whatever that means
> 
> Correctness is more important than performance, so this is my
> preference as a user.
> 

Hmm, still, it is XFS that is incorrect, not QEMU.  This bug may be
triggered by other software, or by another scenario in QEMU (not sure).

> 
>> So this is the main reason this is an RFC: What should we do?  Is (1)
>> really the best choice?
>>
>>
>> In any case, I’ve ran the test case I showed in
>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01282.html
>> more than ten times with this series applied and the installation
>> succeeded every time.  (Without this series, it fails like every other
>> time.)
>>
>>
>> Max Reitz (3):
>>    block: Make wait/mark serialising requests public
>>    block/file-posix: Detect XFS with CONFIG_FALLOCATE
>>    block/file-posix: Let post-EOF fallocate serialize
>>
>>   include/block/block_int.h |  3 +++
>>   block/file-posix.c        | 46 +++++++++++++++++++++++++++++++++++++--
>>   block/io.c                | 24 ++++++++++----------
>>   3 files changed, 59 insertions(+), 14 deletions(-)
>>
>> --
>> 2.21.0
>>
>>


-- 
Best regards,
Vladimir
Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Max Reitz 4 years, 6 months ago
On 26.10.19 19:52, Vladimir Sementsov-Ogievskiy wrote:
> 26.10.2019 20:37, Nir Soffer wrote:
>> On Fri, Oct 25, 2019 at 1:11 PM Max Reitz <mreitz@redhat.com> wrote:
>>>
>>> Hi,
>>>
>>> It seems to me that there is a bug in Linux’s XFS kernel driver, as
>>> I’ve explained here:
>>>
>>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
>>>
>>> In combination with our commit c8bb23cbdbe32f, this may lead to guest
>>> data corruption when using qcow2 images on XFS with aio=native.
>>>
>>> We can’t wait until the XFS kernel driver is fixed, we should work
>>> around the problem ourselves.
>>>
>>> This is an RFC for two reasons:
>>> (1) I don’t know whether this is the right way to address the issue,
>>> (2) Ideally, we should detect whether the XFS kernel driver is fixed and
>>>      if so stop applying the workaround.
>>>      I don’t know how we would go about this, so this series doesn’t do
>>>      it.  (Hence it’s an RFC.)
>>> (3) Perhaps it’s a bit of a layering violation to let the file-posix
>>>      driver access and modify a BdrvTrackedRequest object.
>>>
>>> As for how we can address the issue, I see three ways:
>>> (1) The one presented in this series: On XFS with aio=native, we extend
>>>      tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>>      operations) to reach until infinity (INT64_MAX in practice), mark
>>>      them serializing and wait for other conflicting requests.
>>>
>>>      Advantages:
>>>      + Limits the impact to very specific cases
>>>        (And that means it wouldn’t hurt too much to keep this workaround
>>>        even when the XFS driver has been fixed)
>>>      + Works around the bug where it happens, namely in file-posix
>>>
>>>      Disadvantages:
>>>      - A bit complex
>>>      - A bit of a layering violation (should file-posix have access to
>>>        tracked requests?)
>>>
>>> (2) Always skip qcow2’s handle_alloc_space() on XFS.  The XFS bug only
>>>      becomes visible due to that function: I don’t think qcow2 writes
>>>      zeroes in any other I/O path, and raw images are fixed in size so
>>>      post-EOF writes won’t happen.
>>>
>>>      Advantages:
>>>      + Maybe simpler, depending on how difficult it is to handle the
>>>        layering violation
>>>      + Also fixes the performance problem of handle_alloc_space() being
>>>        slow on ppc64+XFS.
>>>
>>>      Disadvantages:
>>>      - Huge layering violation because qcow2 would need to know whether
>>>        the image is stored on XFS or not.
>>>      - We’d definitely want to skip this workaround when the XFS driver
>>>        has been fixed, so we need some method to find out whether it has
>>>
>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>>      To my knowledge I’m the only one who has provided any benchmarks for
>>>      this commit, and even then I was a bit skeptical because it performs
>>>      well in some cases and bad in others.  I concluded that it’s
>>>      probably worth it because the “some cases” are more likely to occur.
>>>
>>>      Now we have this problem of corruption here (granted due to a bug in
>>>      the XFS driver), and another report of massively degraded
>>>      performance on ppc64
>>>      (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>>      private BZ; I hate that :-/  The report is about 40 % worse
>>>      performance for an in-guest fio write benchmark.)
>>>
>>>      So I have to ask the question about what the justification for
>>>      keeping c8bb23cbdbe32f is.  How much does performance increase with
>>>      it actually?  (On non-(ppc64+XFS) machines, obviously)
>>>
>>>      Advantages:
>>>      + Trivial
>>>      + No layering violations
>>>      + We wouldn’t need to keep track of whether the kernel bug has been
>>>        fixed or not
>>>      + Fixes the ppc64+XFS performance problem
>>>
>>>      Disadvantages:
>>>      - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>>        levels, whatever that means
>>
>> Correctness is more important than performance, so this is my
>> preference as a user.
>>
> 
> Hmm, still, incorrect is XFS, not Qemu. This bug may be triggered by another
> software, or may be another scenario in Qemu (not sure).

Apart from qcow2, raw, and the filters, I don’t see any image format
creating zero-writes except for the parallels format, which does so
before writes can happen to that area.  So with parallels, it’s
impossible to get a data write behind an ongoing zero-write.

Raw and filters do not initiate zero-writes on their own.  So there must
be something else that requests them.  Block jobs and guests are limited
to generally fixed-size disks, so I don’t see a way to generate
zero-writes beyond the EOF there.

Which leaves us with qcow2.  To my knowledge the only place where qcow2
generates zero-writes in an I/O path (so they can occur concurrently with
data writes) is in handle_alloc_space().  Therefore, dropping it should
remove the only place where the XFS bug can be triggered.


That the bug is in XFS and not in qemu is a good argument from a moral
standpoint, but it isn’t very pragmatic or reasonable from a technical
standpoint.  There is no fix for XFS yet, so right now the situation is
that there is a bug that causes guest data loss, that qemu can prevent
it, and that there is no other workaround or fix.

So we must work around it.

Of course, that doesn’t mean that we have to do everything in our power
to implement a workaround no matter the cost.  We do indeed have to
balance between how far we’re willing to go to fix it and how much we
impact non-XFS workloads.


Honestly, I caught myself saying “Well, that’s too bad for XFS over
gluster if XFS is broken”.  But that just isn’t a reasonable thing to say.

I suppose what we could do is drop the is_xfs check in patch 3.  The
only cost is that data writes can no longer run concurrently with a
zero-write that goes past the EOF.  And as I’ve claimed above, this only
affects handle_alloc_space() from qcow2.  Would it be so bad if we
simply did that on every filesystem?

Max

Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Stefan Hajnoczi 4 years, 6 months ago
On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
> As for how we can address the issue, I see three ways:
> (1) The one presented in this series: On XFS with aio=native, we extend
>     tracked requests for post-EOF fallocate() calls (i.e., write-zero
>     operations) to reach until infinity (INT64_MAX in practice), mark
>     them serializing and wait for other conflicting requests.
> 
>     Advantages:
>     + Limits the impact to very specific cases
>       (And that means it wouldn’t hurt too much to keep this workaround
>       even when the XFS driver has been fixed)
>     + Works around the bug where it happens, namely in file-posix
> 
>     Disadvantages:
>     - A bit complex
>     - A bit of a layering violation (should file-posix have access to
>       tracked requests?)

Your patch series is reasonable.  I don't think it's too bad.

The main question is how to detect the XFS fix once it ships.  XFS
already has a ton of ioctls, so maybe they don't mind adding a
feature/quirk bit map ioctl for publishing information about bug fixes
to userspace.  I didn't see another obvious way of doing it; maybe a
mount option that the kernel automatically sets and that gets reported
to userspace?

If we imagine that XFS will not provide a mechanism to detect the
presence of the fix, then could we ask QEMU package maintainers to
./configure --disable-xfs-fallocate-beyond-eof-workaround at some point
in the future when their distro has been shipping a fixed kernel for a
while?  It's ugly because it doesn't work if the user installs an older
custom-built kernel on the host.  But at least it will cover 98% of
users...

> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>     To my knowledge I’m the only one who has provided any benchmarks for
>     this commit, and even then I was a bit skeptical because it performs
>     well in some cases and bad in others.  I concluded that it’s
>     probably worth it because the “some cases” are more likely to occur.
> 
>     Now we have this problem of corruption here (granted due to a bug in
>     the XFS driver), and another report of massively degraded
>     performance on ppc64
>     (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>     private BZ; I hate that :-/  The report is about 40 % worse
>     performance for an in-guest fio write benchmark.)
> 
>     So I have to ask the question about what the justification for
>     keeping c8bb23cbdbe32f is.  How much does performance increase with
>     it actually?  (On non-(ppc64+XFS) machines, obviously)
> 
>     Advantages:
>     + Trivial
>     + No layering violations
>     + We wouldn’t need to keep track of whether the kernel bug has been
>       fixed or not
>     + Fixes the ppc64+XFS performance problem
> 
>     Disadvantages:
>     - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>       levels, whatever that means

My favorite because it is clean and simple, but Vladimir has a valid
use-case for requiring this performance optimization, so reverting isn't
an option.

Stefan
Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Max Reitz 4 years, 6 months ago
On 27.10.19 13:35, Stefan Hajnoczi wrote:
> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
>> As for how we can address the issue, I see three ways:
>> (1) The one presented in this series: On XFS with aio=native, we extend
>>     tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>     operations) to reach until infinity (INT64_MAX in practice), mark
>>     them serializing and wait for other conflicting requests.
>>
>>     Advantages:
>>     + Limits the impact to very specific cases
>>       (And that means it wouldn’t hurt too much to keep this workaround
>>       even when the XFS driver has been fixed)
>>     + Works around the bug where it happens, namely in file-posix
>>
>>     Disadvantages:
>>     - A bit complex
>>     - A bit of a layering violation (should file-posix have access to
>>       tracked requests?)
> 
> Your patch series is reasonable.  I don't think it's too bad.
> 
> The main question is how to detect the XFS fix once it ships.  XFS
> already has a ton of ioctls, so maybe they don't mind adding a
> feature/quirk bit map ioctl for publishing information about bug fixes
> to userspace.  I didn't see another obvious way of doing it, maybe a
> mount option that the kernel automatically sets and that gets reported
> to userspace?

I’ll add a note to the RH BZ.

> If we imagine that XFS will not provide a mechanism to detect the
> presence of the fix, then could we ask QEMU package maintainers to
> ./configure --disable-xfs-fallocate-beyond-eof-workaround at some point
> in the future when their distro has been shipping a fixed kernel for a
> while?  It's ugly because it doesn't work if the user installs an older
> custom-built kernel on the host.  But at least it will cover 98% of
> users...

:-/

I don’t like it, but I suppose it would work.  We could also
automatically enable this disabling option in configure when uname
reports a kernel version that must include the fix.  (This wouldn’t
work for kernels with backported fixes, but those disappear over
time...)

Max

Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Max Reitz 4 years, 6 months ago
On 28.10.19 10:24, Max Reitz wrote:
> On 27.10.19 13:35, Stefan Hajnoczi wrote:
>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
>>> As for how we can address the issue, I see three ways:
>>> (1) The one presented in this series: On XFS with aio=native, we extend
>>>     tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>>     operations) to reach until infinity (INT64_MAX in practice), mark
>>>     them serializing and wait for other conflicting requests.
>>>
>>>     Advantages:
>>>     + Limits the impact to very specific cases
>>>       (And that means it wouldn’t hurt too much to keep this workaround
>>>       even when the XFS driver has been fixed)
>>>     + Works around the bug where it happens, namely in file-posix
>>>
>>>     Disadvantages:
>>>     - A bit complex
>>>     - A bit of a layering violation (should file-posix have access to
>>>       tracked requests?)
>>
>> Your patch series is reasonable.  I don't think it's too bad.
>>
>> The main question is how to detect the XFS fix once it ships.  XFS
>> already has a ton of ioctls, so maybe they don't mind adding a
>> feature/quirk bit map ioctl for publishing information about bug fixes
>> to userspace.  I didn't see another obvious way of doing it, maybe a
>> mount option that the kernel automatically sets and that gets reported
>> to userspace?
> 
> I’ll add a note to the RH BZ.
> 
>> If we imagine that XFS will not provide a mechanism to detect the
>> presence of the fix, then could we ask QEMU package maintainers to
>> ./configure --disable-xfs-fallocate-beyond-eof-workaround at some point
>> in the future when their distro has been shipping a fixed kernel for a
>> while?  It's ugly because it doesn't work if the user installs an older
>> custom-built kernel on the host.  But at least it will cover 98% of
>> users...
> 
> :-/
> 
> I don’t like it, but I suppose it would work.  We could also
> automatically enable this disabling option in configure when we detect
> uname to report a kernel version that must include the fix.  (This
> wouldn’t work for kernel with backported fixes, but those disappear over
> time...)
I just realized that none of this is going to work for the gluster case
brought up by Nir.  The affected kernel is the remote one and we have no
insight into that.  I don’t think we can do ioctls to XFS over gluster,
can we?

I also realized that uname is a stupid idea because the user may be
running a different kernel version than what runs on the build machine... :(

Max

Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Max Reitz 4 years, 6 months ago
On 28.10.19 10:30, Max Reitz wrote:
> On 28.10.19 10:24, Max Reitz wrote:
>> On 27.10.19 13:35, Stefan Hajnoczi wrote:
>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
>>>> As for how we can address the issue, I see three ways:
>>>> (1) The one presented in this series: On XFS with aio=native, we extend
>>>>     tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>>>     operations) to reach until infinity (INT64_MAX in practice), mark
>>>>     them serializing and wait for other conflicting requests.
>>>>
>>>>     Advantages:
>>>>     + Limits the impact to very specific cases
>>>>       (And that means it wouldn’t hurt too much to keep this workaround
>>>>       even when the XFS driver has been fixed)
>>>>     + Works around the bug where it happens, namely in file-posix
>>>>
>>>>     Disadvantages:
>>>>     - A bit complex
>>>>     - A bit of a layering violation (should file-posix have access to
>>>>       tracked requests?)
>>>
>>> Your patch series is reasonable.  I don't think it's too bad.
>>>
>>> The main question is how to detect the XFS fix once it ships.  XFS
>>> already has a ton of ioctls, so maybe they don't mind adding a
>>> feature/quirk bit map ioctl for publishing information about bug fixes
>>> to userspace.  I didn't see another obvious way of doing it, maybe a
>>> mount option that the kernel automatically sets and that gets reported
>>> to userspace?
>>
>> I’ll add a note to the RH BZ.
>>
>>> If we imagine that XFS will not provide a mechanism to detect the
>>> presence of the fix, then could we ask QEMU package maintainers to
>>> ./configure --disable-xfs-fallocate-beyond-eof-workaround at some point
>>> in the future when their distro has been shipping a fixed kernel for a
>>> while?  It's ugly because it doesn't work if the user installs an older
>>> custom-built kernel on the host.  But at least it will cover 98% of
>>> users...
>>
>> :-/
>>
>> I don’t like it, but I suppose it would work.  We could also
>> automatically enable this disabling option in configure when we detect
>> uname to report a kernel version that must include the fix.  (This
>> wouldn’t work for kernel with backported fixes, but those disappear over
>> time...)
> I just realized that none of this is going to work for the gluster case
> brought up by Nir.  The affected kernel is the remote one and we have no
> insight into that.  I don’t think we can do ioctls to XFS over gluster,
> can we?

On third thought, we could try to detect whether the file is on a remote
filesystem, and if so enable the workaround unconditionally.  I suppose
it wouldn’t hurt performance-wise, given that it’s a remote filesystem
anyway.
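
For what it’s worth, a minimal, hedged sketch of such a remote-filesystem
check, assuming we do it via fstatfs() and a list of known network
filesystem magic numbers (a gluster FUSE mount would show up as FUSE; the
native gluster:// driver bypasses the kernel entirely and would need its
own handling):

  #include <stdbool.h>
  #include <sys/vfs.h>

  /* Values as in linux/magic.h. */
  #define NFS_MAGIC   0x6969
  #define SMB_MAGIC   0x517B
  #define CIFS_MAGIC  0xFF534D42
  #define FUSE_MAGIC  0x65735546

  /* Best-effort check: does the file behind fd live on a network FS? */
  static bool fd_is_on_remote_fs(int fd)
  {
      struct statfs sfs;

      if (fstatfs(fd, &sfs) < 0) {
          return true;  /* be conservative: unknown counts as remote */
      }
      switch ((unsigned long)sfs.f_type) {
      case NFS_MAGIC:
      case SMB_MAGIC:
      case CIFS_MAGIC:
      case FUSE_MAGIC:
          return true;
      default:
          return false;
      }
  }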

Max

Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Vladimir Sementsov-Ogievskiy 4 years, 6 months ago
28.10.2019 12:56, Max Reitz wrote:
> On 28.10.19 10:30, Max Reitz wrote:
>> On 28.10.19 10:24, Max Reitz wrote:
>>> On 27.10.19 13:35, Stefan Hajnoczi wrote:
>>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
>>>>> As for how we can address the issue, I see three ways:
>>>>> (1) The one presented in this series: On XFS with aio=native, we extend
>>>>>      tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>>>>      operations) to reach until infinity (INT64_MAX in practice), mark
>>>>>      them serializing and wait for other conflicting requests.
>>>>>
>>>>>      Advantages:
>>>>>      + Limits the impact to very specific cases
>>>>>        (And that means it wouldn’t hurt too much to keep this workaround
>>>>>        even when the XFS driver has been fixed)
>>>>>      + Works around the bug where it happens, namely in file-posix
>>>>>
>>>>>      Disadvantages:
>>>>>      - A bit complex
>>>>>      - A bit of a layering violation (should file-posix have access to
>>>>>        tracked requests?)
>>>>
>>>> Your patch series is reasonable.  I don't think it's too bad.
>>>>
>>>> The main question is how to detect the XFS fix once it ships.  XFS
>>>> already has a ton of ioctls, so maybe they don't mind adding a
>>>> feature/quirk bit map ioctl for publishing information about bug fixes
>>>> to userspace.  I didn't see another obvious way of doing it, maybe a
>>>> mount option that the kernel automatically sets and that gets reported
>>>> to userspace?
>>>
>>> I’ll add a note to the RH BZ.
>>>
>>>> If we imagine that XFS will not provide a mechanism to detect the
>>>> presence of the fix, then could we ask QEMU package maintainers to
>>>> ./configure --disable-xfs-fallocate-beyond-eof-workaround at some point
>>>> in the future when their distro has been shipping a fixed kernel for a
>>>> while?  It's ugly because it doesn't work if the user installs an older
>>>> custom-built kernel on the host.  But at least it will cover 98% of
>>>> users...
>>>
>>> :-/
>>>
>>> I don’t like it, but I suppose it would work.  We could also
>>> automatically enable this disabling option in configure when we detect
>>> uname to report a kernel version that must include the fix.  (This
>>> wouldn’t work for kernel with backported fixes, but those disappear over
>>> time...)
>> I just realized that none of this is going to work for the gluster case
>> brought up by Nir.  The affected kernel is the remote one and we have no
>> insight into that.  I don’t think we can do ioctls to XFS over gluster,
>> can we?
> 
> On third thought, we could try to detect whether the file is on a remote
> filesystem, and if so enable the workaround unconditionally.  I suppose
> it wouldn’t hurt performance-wise, given that it’s a remote filesystem
> anyway.
> 

I think that for remote filesystems the difference may be even bigger than
for local ones, as the cost of writing real zeroes over the wire, compared
to a fast zero command, is high.

Really, can we live with a simple config option?  Is it so bad?


-- 
Best regards,
Vladimir
Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Max Reitz 4 years, 6 months ago
On 28.10.19 11:07, Vladimir Sementsov-Ogievskiy wrote:
> 28.10.2019 12:56, Max Reitz wrote:
>> On 28.10.19 10:30, Max Reitz wrote:
>>> On 28.10.19 10:24, Max Reitz wrote:
>>>> On 27.10.19 13:35, Stefan Hajnoczi wrote:
>>>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
>>>>>> As for how we can address the issue, I see three ways:
>>>>>> (1) The one presented in this series: On XFS with aio=native, we extend
>>>>>>      tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>>>>>      operations) to reach until infinity (INT64_MAX in practice), mark
>>>>>>      them serializing and wait for other conflicting requests.
>>>>>>
>>>>>>      Advantages:
>>>>>>      + Limits the impact to very specific cases
>>>>>>        (And that means it wouldn’t hurt too much to keep this workaround
>>>>>>        even when the XFS driver has been fixed)
>>>>>>      + Works around the bug where it happens, namely in file-posix
>>>>>>
>>>>>>      Disadvantages:
>>>>>>      - A bit complex
>>>>>>      - A bit of a layering violation (should file-posix have access to
>>>>>>        tracked requests?)
>>>>>
>>>>> Your patch series is reasonable.  I don't think it's too bad.
>>>>>
>>>>> The main question is how to detect the XFS fix once it ships.  XFS
>>>>> already has a ton of ioctls, so maybe they don't mind adding a
>>>>> feature/quirk bit map ioctl for publishing information about bug fixes
>>>>> to userspace.  I didn't see another obvious way of doing it, maybe a
>>>>> mount option that the kernel automatically sets and that gets reported
>>>>> to userspace?
>>>>
>>>> I’ll add a note to the RH BZ.
>>>>
>>>>> If we imagine that XFS will not provide a mechanism to detect the
>>>>> presence of the fix, then could we ask QEMU package maintainers to
>>>>> ./configure --disable-xfs-fallocate-beyond-eof-workaround at some point
>>>>> in the future when their distro has been shipping a fixed kernel for a
>>>>> while?  It's ugly because it doesn't work if the user installs an older
>>>>> custom-built kernel on the host.  But at least it will cover 98% of
>>>>> users...
>>>>
>>>> :-/
>>>>
>>>> I don’t like it, but I suppose it would work.  We could also
>>>> automatically enable this disabling option in configure when we detect
>>>> uname to report a kernel version that must include the fix.  (This
>>>> wouldn’t work for kernel with backported fixes, but those disappear over
>>>> time...)
>>> I just realized that none of this is going to work for the gluster case
>>> brought up by Nir.  The affected kernel is the remote one and we have no
>>> insight into that.  I don’t think we can do ioctls to XFS over gluster,
>>> can we?
>>
>> On third thought, we could try to detect whether the file is on a remote
>> filesystem, and if so enable the workaround unconditionally.  I suppose
>> it wouldn’t hurt performance-wise, given that it’s a remote filesystem
>> anyway.
>>
> 
> I think, for remote, the difference may be even higher than for local, as cost
> of writing real zeroes through the wire vs fast zero command is high.

I was speaking of a workaround in general, and that includes the
workaround presented in this series.

> Really, can we live with simple config option, is it so bad?

The config option won’t work for remote hosts, though.  That’s exactly
the problem.

Max

Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Vladimir Sementsov-Ogievskiy 4 years, 6 months ago
28.10.2019 13:10, Max Reitz wrote:
> On 28.10.19 11:07, Vladimir Sementsov-Ogievskiy wrote:
>> 28.10.2019 12:56, Max Reitz wrote:
>>> On 28.10.19 10:30, Max Reitz wrote:
>>>> On 28.10.19 10:24, Max Reitz wrote:
>>>>> On 27.10.19 13:35, Stefan Hajnoczi wrote:
>>>>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
>>>>>>> As for how we can address the issue, I see three ways:
>>>>>>> (1) The one presented in this series: On XFS with aio=native, we extend
>>>>>>>       tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>>>>>>       operations) to reach until infinity (INT64_MAX in practice), mark
>>>>>>>       them serializing and wait for other conflicting requests.
>>>>>>>
>>>>>>>       Advantages:
>>>>>>>       + Limits the impact to very specific cases
>>>>>>>         (And that means it wouldn’t hurt too much to keep this workaround
>>>>>>>         even when the XFS driver has been fixed)
>>>>>>>       + Works around the bug where it happens, namely in file-posix
>>>>>>>
>>>>>>>       Disadvantages:
>>>>>>>       - A bit complex
>>>>>>>       - A bit of a layering violation (should file-posix have access to
>>>>>>>         tracked requests?)
>>>>>>
>>>>>> Your patch series is reasonable.  I don't think it's too bad.
>>>>>>
>>>>>> The main question is how to detect the XFS fix once it ships.  XFS
>>>>>> already has a ton of ioctls, so maybe they don't mind adding a
>>>>>> feature/quirk bit map ioctl for publishing information about bug fixes
>>>>>> to userspace.  I didn't see another obvious way of doing it, maybe a
>>>>>> mount option that the kernel automatically sets and that gets reported
>>>>>> to userspace?
>>>>>
>>>>> I’ll add a note to the RH BZ.
>>>>>
>>>>>> If we imagine that XFS will not provide a mechanism to detect the
>>>>>> presence of the fix, then could we ask QEMU package maintainers to
>>>>>> ./configure --disable-xfs-fallocate-beyond-eof-workaround at some point
>>>>>> in the future when their distro has been shipping a fixed kernel for a
>>>>>> while?  It's ugly because it doesn't work if the user installs an older
>>>>>> custom-built kernel on the host.  But at least it will cover 98% of
>>>>>> users...
>>>>>
>>>>> :-/
>>>>>
>>>>> I don’t like it, but I suppose it would work.  We could also
>>>>> automatically enable this disabling option in configure when we detect
>>>>> uname to report a kernel version that must include the fix.  (This
>>>>> wouldn’t work for kernel with backported fixes, but those disappear over
>>>>> time...)
>>>> I just realized that none of this is going to work for the gluster case
>>>> brought up by Nir.  The affected kernel is the remote one and we have no
>>>> insight into that.  I don’t think we can do ioctls to XFS over gluster,
>>>> can we?
>>>
>>> On third thought, we could try to detect whether the file is on a remote
>>> filesystem, and if so enable the workaround unconditionally.  I suppose
>>> it wouldn’t hurt performance-wise, given that it’s a remote filesystem
>>> anyway.
>>>
>>
>> I think, for remote, the difference may be even higher than for local, as cost
>> of writing real zeroes through the wire vs fast zero command is high.
> 
> I was speaking of a workaround in general, and that includes the
> workaround presented in this series.

Ah, yes, sorry. Then it should be OK.

> 
>> Really, can we live with simple config option, is it so bad?
> 
> The config option won’t work for remote hosts, though.  That’s exactly
> the problem.
> 

Still, a config option will help people who have XFS setups and want to be
absolutely safe.  (Or, with the opposite default, help people who don't have
XFS setups (Virtuozzo) and want to be fast.)


-- 
Best regards,
Vladimir
Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Kevin Wolf 4 years, 6 months ago
Am 27.10.2019 um 13:35 hat Stefan Hajnoczi geschrieben:
> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
> > As for how we can address the issue, I see three ways:
> > (1) The one presented in this series: On XFS with aio=native, we extend
> >     tracked requests for post-EOF fallocate() calls (i.e., write-zero
> >     operations) to reach until infinity (INT64_MAX in practice), mark
> >     them serializing and wait for other conflicting requests.
> > 
> >     Advantages:
> >     + Limits the impact to very specific cases
> >       (And that means it wouldn’t hurt too much to keep this workaround
> >       even when the XFS driver has been fixed)
> >     + Works around the bug where it happens, namely in file-posix
> > 
> >     Disadvantages:
> >     - A bit complex
> >     - A bit of a layering violation (should file-posix have access to
> >       tracked requests?)
> 
> Your patch series is reasonable.  I don't think it's too bad.
> 
> The main question is how to detect the XFS fix once it ships.  XFS
> already has a ton of ioctls, so maybe they don't mind adding a
> feature/quirk bit map ioctl for publishing information about bug fixes
> to userspace.  I didn't see another obvious way of doing it, maybe a
> mount option that the kernel automatically sets and that gets reported
> to userspace?

I think the CC list is too short for this question. We should involve
the XFS people here.

> If we imagine that XFS will not provide a mechanism to detect the
> presence of the fix, then could we ask QEMU package maintainers to
> ./configure --disable-xfs-fallocate-beyond-eof-workaround at some point
> in the future when their distro has been shipping a fixed kernel for a
> while?  It's ugly because it doesn't work if the user installs an older
> custom-built kernel on the host.  But at least it will cover 98% of
> users...
> 
> > (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
> >     To my knowledge I’m the only one who has provided any benchmarks for
> >     this commit, and even then I was a bit skeptical because it performs
> >     well in some cases and bad in others.  I concluded that it’s
> >     probably worth it because the “some cases” are more likely to occur.
> > 
> >     Now we have this problem of corruption here (granted due to a bug in
> >     the XFS driver), and another report of massively degraded
> >     performance on ppc64
> >     (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
> >     private BZ; I hate that :-/  The report is about 40 % worse
> >     performance for an in-guest fio write benchmark.)
> > 
> >     So I have to ask the question about what the justification for
> >     keeping c8bb23cbdbe32f is.  How much does performance increase with
> >     it actually?  (On non-(ppc64+XFS) machines, obviously)
> > 
> >     Advantages:
> >     + Trivial
> >     + No layering violations
> >     + We wouldn’t need to keep track of whether the kernel bug has been
> >       fixed or not
> >     + Fixes the ppc64+XFS performance problem
> > 
> >     Disadvantages:
> >     - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
> >       levels, whatever that means
> 
> My favorite because it is clean and simple, but Vladimir has a valid
> use-case for requiring this performance optimization so reverting isn't
> an option.

Vladimir also said that qcow2 subclusters would probably also solve his
problem, so maybe reverting and applying the subcluster patches instead
is a possible solution, too?

We already have some cases where the existing handle_alloc_space()
causes performance to actually become worse, and serialising requests as
a workaround isn't going to make performance any better. So even on
these grounds, keeping commit c8bb23cbdbe32f is questionable.

Kevin
Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Vladimir Sementsov-Ogievskiy 4 years, 6 months ago
28.10.2019 14:04, Kevin Wolf wrote:
> Am 27.10.2019 um 13:35 hat Stefan Hajnoczi geschrieben:
>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
>>> As for how we can address the issue, I see three ways:
>>> (1) The one presented in this series: On XFS with aio=native, we extend
>>>      tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>>      operations) to reach until infinity (INT64_MAX in practice), mark
>>>      them serializing and wait for other conflicting requests.
>>>
>>>      Advantages:
>>>      + Limits the impact to very specific cases
>>>        (And that means it wouldn’t hurt too much to keep this workaround
>>>        even when the XFS driver has been fixed)
>>>      + Works around the bug where it happens, namely in file-posix
>>>
>>>      Disadvantages:
>>>      - A bit complex
>>>      - A bit of a layering violation (should file-posix have access to
>>>        tracked requests?)
>>
>> Your patch series is reasonable.  I don't think it's too bad.
>>
>> The main question is how to detect the XFS fix once it ships.  XFS
>> already has a ton of ioctls, so maybe they don't mind adding a
>> feature/quirk bit map ioctl for publishing information about bug fixes
>> to userspace.  I didn't see another obvious way of doing it, maybe a
>> mount option that the kernel automatically sets and that gets reported
>> to userspace?
> 
> I think the CC list is too short for this question. We should involve
> the XFS people here.
> 
>> If we imagine that XFS will not provide a mechanism to detect the
>> presence of the fix, then could we ask QEMU package maintainers to
>> ./configure --disable-xfs-fallocate-beyond-eof-workaround at some point
>> in the future when their distro has been shipping a fixed kernel for a
>> while?  It's ugly because it doesn't work if the user installs an older
>> custom-built kernel on the host.  But at least it will cover 98% of
>> users...
>>
>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>>      To my knowledge I’m the only one who has provided any benchmarks for
>>>      this commit, and even then I was a bit skeptical because it performs
>>>      well in some cases and bad in others.  I concluded that it’s
>>>      probably worth it because the “some cases” are more likely to occur.
>>>
>>>      Now we have this problem of corruption here (granted due to a bug in
>>>      the XFS driver), and another report of massively degraded
>>>      performance on ppc64
>>>      (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>>      private BZ; I hate that :-/  The report is about 40 % worse
>>>      performance for an in-guest fio write benchmark.)
>>>
>>>      So I have to ask the question about what the justification for
>>>      keeping c8bb23cbdbe32f is.  How much does performance increase with
>>>      it actually?  (On non-(ppc64+XFS) machines, obviously)
>>>
>>>      Advantages:
>>>      + Trivial
>>>      + No layering violations
>>>      + We wouldn’t need to keep track of whether the kernel bug has been
>>>        fixed or not
>>>      + Fixes the ppc64+XFS performance problem
>>>
>>>      Disadvantages:
>>>      - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>>        levels, whatever that means
>>
>> My favorite because it is clean and simple, but Vladimir has a valid
>> use-case for requiring this performance optimization so reverting isn't
>> an option.
> 
> Vladimir also said that qcow2 subclusters would probably also solve his
> problem, so maybe reverting and applying the subcluster patches instead
> is a possible solution, too?

I'm not sure about the SSD case; it may need the write-zero optimization anyway.

> 
> We already have some cases where the existing handle_alloc_space()
> causes performance to actually become worse, and serialising requests as
> a workaround isn't going to make performance any better. So even on
> these grounds, keeping commit c8bb23cbdbe32f is questionable.
> 

Can keeping handle_alloc_space() behind some config option be an option?


-- 
Best regards,
Vladimir
Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Max Reitz 4 years, 6 months ago
On 28.10.19 12:25, Vladimir Sementsov-Ogievskiy wrote:
> 28.10.2019 14:04, Kevin Wolf wrote:
>> Am 27.10.2019 um 13:35 hat Stefan Hajnoczi geschrieben:
>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:

[...]

>>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>>>      To my knowledge I’m the only one who has provided any benchmarks for
>>>>      this commit, and even then I was a bit skeptical because it performs
>>>>      well in some cases and bad in others.  I concluded that it’s
>>>>      probably worth it because the “some cases” are more likely to occur.
>>>>
>>>>      Now we have this problem of corruption here (granted due to a bug in
>>>>      the XFS driver), and another report of massively degraded
>>>>      performance on ppc64
>>>>      (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>>>      private BZ; I hate that :-/  The report is about 40 % worse
>>>>      performance for an in-guest fio write benchmark.)
>>>>
>>>>      So I have to ask the question about what the justification for
>>>>      keeping c8bb23cbdbe32f is.  How much does performance increase with
>>>>      it actually?  (On non-(ppc64+XFS) machines, obviously)
>>>>
>>>>      Advantages:
>>>>      + Trivial
>>>>      + No layering violations
>>>>      + We wouldn’t need to keep track of whether the kernel bug has been
>>>>        fixed or not
>>>>      + Fixes the ppc64+XFS performance problem
>>>>
>>>>      Disadvantages:
>>>>      - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>>>        levels, whatever that means
>>>
>>> My favorite because it is clean and simple, but Vladimir has a valid
>>> use-case for requiring this performance optimization so reverting isn't
>>> an option.
>>
>> Vladimir also said that qcow2 subclusters would probably also solve his
>> problem, so maybe reverting and applying the subcluster patches instead
>> is a possible solution, too?
> 
> I'm not sure about ssd case, it may need write-zero optimization anyway.

What exactly do you need?  Do you actually need to write zeroes (e.g.
because you’re storing images on block devices) or would it be
sufficient to just drop the COW areas when bdrv_has_zero_init() and
bdrv_has_zero_init_truncate() are true?

I’m asking because Dave Chinner said
(https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) that
fallocate() is always slow at least with aio=native because it needs to
wait for all concurrent AIO writes to finish, and so it causes the AIO
pipeline to stall.

(He suggested using XFS extent size hints to get basically the same
effect as write-zeroes for free, but I don’t know whether that’s really
useful to us, since unallocated areas on XFS read back as zero anyway.)
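
If we ever wanted to experiment with that, the hint can be set through
the generic FS_IOC_FSSETXATTR ioctl.  A minimal standalone sketch only
(the 1 MiB hint value and the image path are made-up examples, and the
hint normally has to be set before the file has any data extents):

/* Sketch only: set a 1 MiB extent size hint on a file stored on XFS. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* struct fsxattr, FS_IOC_FSSETXATTR, FS_XFLAG_EXTSIZE */

static int set_extent_size_hint(int fd, unsigned int extsize_bytes)
{
    struct fsxattr attr;

    if (ioctl(fd, FS_IOC_FSGETXATTR, &attr) < 0) {
        return -1;
    }
    attr.fsx_xflags |= FS_XFLAG_EXTSIZE;   /* enable the per-file hint */
    attr.fsx_extsize = extsize_bytes;      /* hint size in bytes */
    return ioctl(fd, FS_IOC_FSSETXATTR, &attr);
}

int main(void)
{
    int fd = open("/path/to/image.qcow2", O_RDWR);  /* hypothetical path */

    if (fd < 0 || set_extent_size_hint(fd, 1024 * 1024) < 0) {
        perror("extent size hint");
        return 1;
    }
    close(fd);
    return 0;
}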

>> We already have some cases where the existing handle_alloc_space()
>> causes performance to actually become worse, and serialising requests as
>> a workaround isn't going to make performance any better. So even on
>> these grounds, keeping commit c8bb23cbdbe32f is questionable.
>>
> 
> Can keeping handle_alloc_space under some config option be an option?

Hm.  A config option is weird if you’re the only one who’s going to
enable it.  But other than that I don’t have anything against it.

Max

Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Vladimir Sementsov-Ogievskiy 4 years, 6 months ago
29.10.2019 11:50, Max Reitz wrote:
> On 28.10.19 12:25, Vladimir Sementsov-Ogievskiy wrote:
>> 28.10.2019 14:04, Kevin Wolf wrote:
>>> Am 27.10.2019 um 13:35 hat Stefan Hajnoczi geschrieben:
>>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
> 
> [...]
> 
>>>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>>>>       To my knowledge I’m the only one who has provided any benchmarks for
>>>>>       this commit, and even then I was a bit skeptical because it performs
>>>>>       well in some cases and bad in others.  I concluded that it’s
>>>>>       probably worth it because the “some cases” are more likely to occur.
>>>>>
>>>>>       Now we have this problem of corruption here (granted due to a bug in
>>>>>       the XFS driver), and another report of massively degraded
>>>>>       performance on ppc64
>>>>>       (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>>>>       private BZ; I hate that :-/  The report is about 40 % worse
>>>>>       performance for an in-guest fio write benchmark.)
>>>>>
>>>>>       So I have to ask the question about what the justification for
>>>>>       keeping c8bb23cbdbe32f is.  How much does performance increase with
>>>>>       it actually?  (On non-(ppc64+XFS) machines, obviously)
>>>>>
>>>>>       Advantages:
>>>>>       + Trivial
>>>>>       + No layering violations
>>>>>       + We wouldn’t need to keep track of whether the kernel bug has been
>>>>>         fixed or not
>>>>>       + Fixes the ppc64+XFS performance problem
>>>>>
>>>>>       Disadvantages:
>>>>>       - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>>>>         levels, whatever that means
>>>>
>>>> My favorite because it is clean and simple, but Vladimir has a valid
>>>> use-case for requiring this performance optimization so reverting isn't
>>>> an option.
>>>
>>> Vladimir also said that qcow2 subclusters would probably also solve his
>>> problem, so maybe reverting and applying the subcluster patches instead
>>> is a possible solution, too?
>>
>> I'm not sure about ssd case, it may need write-zero optimization anyway.
> 
> What exactly do you need?  Do you actually need to write zeroes (e.g.
> because you’re storing images on block devices) or would it be
> sufficient to just drop the COW areas when bdrv_has_zero_init() and
> bdrv_has_zero_init_truncate() are true?

Hmm, what do you mean?  We need to zero the COW areas.  The original way is
to write real zeroes; the optimized way is fallocate.  What do you mean by
“drop”?  Mark the subclusters as zeroes in the metadata?

But we'll still have COW areas within a subcluster, and we'll need to zero
them directly.  And fallocate will most probably be faster in the SSD ext4 case.

> 
> I’m asking because Dave Chinner said
> (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) that
> fallocate() is always slow at least with aio=native because it needs to
> wait for all concurrent AIO writes to finish, and so it causes the AIO
> pipeline to stall.
> 
> (He suggested using XFS extent size hints to get the same effect as
> write-zeroes for free, basically, but I don’t know whether that’s really
> useful to us; as unallocated areas on XFS read back as zero anyway.)
> 
>>> We already have some cases where the existing handle_alloc_space()
>>> causes performance to actually become worse, and serialising requests as
>>> a workaround isn't going to make performance any better. So even on
>>> these grounds, keeping commit c8bb23cbdbe32f is questionable.
>>>
>>
>> Can keeping handle_alloc_space under some config option be an option?
> 
> Hm.  A config option is weird if you’re the only one who’s going to
> enable it.  But other than that I don’t have anything against it.
> 

It's just a bit easier for us to maintain a config option than an out-of-tree patch.
On the other hand, it's not a real problem to maintain this one patch separately.
It could return to the tree once the XFS bug is fixed.

-- 
Best regards,
Vladimir
Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Max Reitz 4 years, 6 months ago
On 29.10.19 12:48, Vladimir Sementsov-Ogievskiy wrote:
> 29.10.2019 11:50, Max Reitz wrote:
>> On 28.10.19 12:25, Vladimir Sementsov-Ogievskiy wrote:
>>> 28.10.2019 14:04, Kevin Wolf wrote:
>>>> Am 27.10.2019 um 13:35 hat Stefan Hajnoczi geschrieben:
>>>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
>>
>> [...]
>>
>>>>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>>>>>       To my knowledge I’m the only one who has provided any benchmarks for
>>>>>>       this commit, and even then I was a bit skeptical because it performs
>>>>>>       well in some cases and bad in others.  I concluded that it’s
>>>>>>       probably worth it because the “some cases” are more likely to occur.
>>>>>>
>>>>>>       Now we have this problem of corruption here (granted due to a bug in
>>>>>>       the XFS driver), and another report of massively degraded
>>>>>>       performance on ppc64
>>>>>>       (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>>>>>       private BZ; I hate that :-/  The report is about 40 % worse
>>>>>>       performance for an in-guest fio write benchmark.)
>>>>>>
>>>>>>       So I have to ask the question about what the justification for
>>>>>>       keeping c8bb23cbdbe32f is.  How much does performance increase with
>>>>>>       it actually?  (On non-(ppc64+XFS) machines, obviously)
>>>>>>
>>>>>>       Advantages:
>>>>>>       + Trivial
>>>>>>       + No layering violations
>>>>>>       + We wouldn’t need to keep track of whether the kernel bug has been
>>>>>>         fixed or not
>>>>>>       + Fixes the ppc64+XFS performance problem
>>>>>>
>>>>>>       Disadvantages:
>>>>>>       - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>>>>>         levels, whatever that means
>>>>>
>>>>> My favorite because it is clean and simple, but Vladimir has a valid
>>>>> use-case for requiring this performance optimization so reverting isn't
>>>>> an option.
>>>>
>>>> Vladimir also said that qcow2 subclusters would probably also solve his
>>>> problem, so maybe reverting and applying the subcluster patches instead
>>>> is a possible solution, too?
>>>
>>> I'm not sure about ssd case, it may need write-zero optimization anyway.
>>
>> What exactly do you need?  Do you actually need to write zeroes (e.g.
>> because you’re storing images on block devices) or would it be
>> sufficient to just drop the COW areas when bdrv_has_zero_init() and
>> bdrv_has_zero_init_truncate() are true?
> 
> Hmm, what do you mean? We need to zero COW areas. So, original way is to
> write real zeroes, optimized way is fallocate.. What do you mean by drop?
> Mark sublusters as zeroes by metadata?

Why do you need to zero COW areas?  For normal files, any data will read
as zero if you didn’t write anything there.

> But still we'll have COW areas in subcluster, and we'll need to directly zero
> them.. And fallocate will most probably be faster on ssd ext4 case..
> 
>>
>> I’m asking because Dave Chinner said
>> (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) that
>> fallocate() is always slow at least with aio=native because it needs to
>> wait for all concurrent AIO writes to finish, and so it causes the AIO
>> pipeline to stall.
>>
>> (He suggested using XFS extent size hints to get the same effect as
>> write-zeroes for free, basically, but I don’t know whether that’s really
>> useful to us; as unallocated areas on XFS read back as zero anyway.)
>>
>>>> We already have some cases where the existing handle_alloc_space()
>>>> causes performance to actually become worse, and serialising requests as
>>>> a workaround isn't going to make performance any better. So even on
>>>> these grounds, keeping commit c8bb23cbdbe32f is questionable.
>>>>
>>>
>>> Can keeping handle_alloc_space under some config option be an option?
>>
>> Hm.  A config option is weird if you’re the only one who’s going to
>> enable it.  But other than that I don’t have anything against it.
>>
> 
> It's just a bit easier for us to maintain config option, than out-of-tree patch.
> On the other hand, it's not a real problem to maintain this one patch in separate.
> It may return again to the tree, when XFS bug fixed.

We’ll still have the problem that fallocate() must stall aio=native
requests.

Max

Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Vladimir Sementsov-Ogievskiy 4 years, 6 months ago
29.10.2019 14:55, Max Reitz wrote:
> On 29.10.19 12:48, Vladimir Sementsov-Ogievskiy wrote:
>> 29.10.2019 11:50, Max Reitz wrote:
>>> On 28.10.19 12:25, Vladimir Sementsov-Ogievskiy wrote:
>>>> 28.10.2019 14:04, Kevin Wolf wrote:
>>>>> Am 27.10.2019 um 13:35 hat Stefan Hajnoczi geschrieben:
>>>>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
>>>
>>> [...]
>>>
>>>>>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>>>>>>        To my knowledge I’m the only one who has provided any benchmarks for
>>>>>>>        this commit, and even then I was a bit skeptical because it performs
>>>>>>>        well in some cases and bad in others.  I concluded that it’s
>>>>>>>        probably worth it because the “some cases” are more likely to occur.
>>>>>>>
>>>>>>>        Now we have this problem of corruption here (granted due to a bug in
>>>>>>>        the XFS driver), and another report of massively degraded
>>>>>>>        performance on ppc64
>>>>>>>        (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>>>>>>        private BZ; I hate that :-/  The report is about 40 % worse
>>>>>>>        performance for an in-guest fio write benchmark.)
>>>>>>>
>>>>>>>        So I have to ask the question about what the justification for
>>>>>>>        keeping c8bb23cbdbe32f is.  How much does performance increase with
>>>>>>>        it actually?  (On non-(ppc64+XFS) machines, obviously)
>>>>>>>
>>>>>>>        Advantages:
>>>>>>>        + Trivial
>>>>>>>        + No layering violations
>>>>>>>        + We wouldn’t need to keep track of whether the kernel bug has been
>>>>>>>          fixed or not
>>>>>>>        + Fixes the ppc64+XFS performance problem
>>>>>>>
>>>>>>>        Disadvantages:
>>>>>>>        - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>>>>>>          levels, whatever that means
>>>>>>
>>>>>> My favorite because it is clean and simple, but Vladimir has a valid
>>>>>> use-case for requiring this performance optimization so reverting isn't
>>>>>> an option.
>>>>>
>>>>> Vladimir also said that qcow2 subclusters would probably also solve his
>>>>> problem, so maybe reverting and applying the subcluster patches instead
>>>>> is a possible solution, too?
>>>>
>>>> I'm not sure about ssd case, it may need write-zero optimization anyway.
>>>
>>> What exactly do you need?  Do you actually need to write zeroes (e.g.
>>> because you’re storing images on block devices) or would it be
>>> sufficient to just drop the COW areas when bdrv_has_zero_init() and
>>> bdrv_has_zero_init_truncate() are true?
>>
>> Hmm, what do you mean? We need to zero COW areas. So, original way is to
>> write real zeroes, optimized way is fallocate.. What do you mean by drop?
>> Mark sublusters as zeroes by metadata?
> 
> Why do you need to zero COW areas?  For normal files, any data will read
> as zero if you didn’t write anything there.

Hmm, but when allocating a new cluster in qcow2, it's not guaranteed to be zero,
as it may be a reused, previously allocated cluster.

> 
>> But still we'll have COW areas in subcluster, and we'll need to directly zero
>> them.. And fallocate will most probably be faster on ssd ext4 case..
>>
>>>
>>> I’m asking because Dave Chinner said
>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) that
>>> fallocate() is always slow at least with aio=native because it needs to
>>> wait for all concurrent AIO writes to finish, and so it causes the AIO
>>> pipeline to stall.
>>>
>>> (He suggested using XFS extent size hints to get the same effect as
>>> write-zeroes for free, basically, but I don’t know whether that’s really
>>> useful to us; as unallocated areas on XFS read back as zero anyway.)
>>>
>>>>> We already have some cases where the existing handle_alloc_space()
>>>>> causes performance to actually become worse, and serialising requests as
>>>>> a workaround isn't going to make performance any better. So even on
>>>>> these grounds, keeping commit c8bb23cbdbe32f is questionable.
>>>>>
>>>>
>>>> Can keeping handle_alloc_space under some config option be an option?
>>>
>>> Hm.  A config option is weird if you’re the only one who’s going to
>>> enable it.  But other than that I don’t have anything against it.
>>>
>>
>> It's just a bit easier for us to maintain config option, than out-of-tree patch.
>> On the other hand, it's not a real problem to maintain this one patch in separate.
>> It may return again to the tree, when XFS bug fixed.
> 
> We’ll still have the problem that fallocate() must stall aio=native
> requests.
> 

Does that mean that fallocate is bad in general?  Practice shows the opposite.
At least I have my examples with qemu-img bench.  Can that effect be shown with
qemu-img bench or something?


-- 
Best regards,
Vladimir
Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Max Reitz 4 years, 6 months ago
On 29.10.19 13:05, Vladimir Sementsov-Ogievskiy wrote:
> 29.10.2019 14:55, Max Reitz wrote:
>> On 29.10.19 12:48, Vladimir Sementsov-Ogievskiy wrote:
>>> 29.10.2019 11:50, Max Reitz wrote:
>>>> On 28.10.19 12:25, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 28.10.2019 14:04, Kevin Wolf wrote:
>>>>>> Am 27.10.2019 um 13:35 hat Stefan Hajnoczi geschrieben:
>>>>>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
>>>>
>>>> [...]
>>>>
>>>>>>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>>>>>>>        To my knowledge I’m the only one who has provided any benchmarks for
>>>>>>>>        this commit, and even then I was a bit skeptical because it performs
>>>>>>>>        well in some cases and bad in others.  I concluded that it’s
>>>>>>>>        probably worth it because the “some cases” are more likely to occur.
>>>>>>>>
>>>>>>>>        Now we have this problem of corruption here (granted due to a bug in
>>>>>>>>        the XFS driver), and another report of massively degraded
>>>>>>>>        performance on ppc64
>>>>>>>>        (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>>>>>>>        private BZ; I hate that :-/  The report is about 40 % worse
>>>>>>>>        performance for an in-guest fio write benchmark.)
>>>>>>>>
>>>>>>>>        So I have to ask the question about what the justification for
>>>>>>>>        keeping c8bb23cbdbe32f is.  How much does performance increase with
>>>>>>>>        it actually?  (On non-(ppc64+XFS) machines, obviously)
>>>>>>>>
>>>>>>>>        Advantages:
>>>>>>>>        + Trivial
>>>>>>>>        + No layering violations
>>>>>>>>        + We wouldn’t need to keep track of whether the kernel bug has been
>>>>>>>>          fixed or not
>>>>>>>>        + Fixes the ppc64+XFS performance problem
>>>>>>>>
>>>>>>>>        Disadvantages:
>>>>>>>>        - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>>>>>>>          levels, whatever that means
>>>>>>>
>>>>>>> My favorite because it is clean and simple, but Vladimir has a valid
>>>>>>> use-case for requiring this performance optimization so reverting isn't
>>>>>>> an option.
>>>>>>
>>>>>> Vladimir also said that qcow2 subclusters would probably also solve his
>>>>>> problem, so maybe reverting and applying the subcluster patches instead
>>>>>> is a possible solution, too?
>>>>>
>>>>> I'm not sure about ssd case, it may need write-zero optimization anyway.
>>>>
>>>> What exactly do you need?  Do you actually need to write zeroes (e.g.
>>>> because you’re storing images on block devices) or would it be
>>>> sufficient to just drop the COW areas when bdrv_has_zero_init() and
>>>> bdrv_has_zero_init_truncate() are true?
>>>
>>> Hmm, what do you mean? We need to zero COW areas. So, original way is to
>>> write real zeroes, optimized way is fallocate.. What do you mean by drop?
>>> Mark sublusters as zeroes by metadata?
>>
>> Why do you need to zero COW areas?  For normal files, any data will read
>> as zero if you didn’t write anything there.
> 
> Hmm, but when allocating new cluster in qcow2, it's not guaranteed to be zero,
> as it may be reused previously allocated cluster..

Hm, right.  We could special-case something for beyond the EOF, but I
don’t know whether that really makes it better.

OTOH, maybe allocations at the EOF are the real bottleneck.  Reusing
existing clusters should be rare enough that maybe the existing code
which explicitly writes zeroes is sufficient.

>>
>>> But still we'll have COW areas in subcluster, and we'll need to directly zero
>>> them.. And fallocate will most probably be faster on ssd ext4 case..
>>>
>>>>
>>>> I’m asking because Dave Chinner said
>>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) that
>>>> fallocate() is always slow at least with aio=native because it needs to
>>>> wait for all concurrent AIO writes to finish, and so it causes the AIO
>>>> pipeline to stall.
>>>>
>>>> (He suggested using XFS extent size hints to get the same effect as
>>>> write-zeroes for free, basically, but I don’t know whether that’s really
>>>> useful to us; as unallocated areas on XFS read back as zero anyway.)
>>>>
>>>>>> We already have some cases where the existing handle_alloc_space()
>>>>>> causes performance to actually become worse, and serialising requests as
>>>>>> a workaround isn't going to make performance any better. So even on
>>>>>> these grounds, keeping commit c8bb23cbdbe32f is questionable.
>>>>>>
>>>>>
>>>>> Can keeping handle_alloc_space under some config option be an option?
>>>>
>>>> Hm.  A config option is weird if you’re the only one who’s going to
>>>> enable it.  But other than that I don’t have anything against it.
>>>>
>>>
>>> It's just a bit easier for us to maintain config option, than out-of-tree patch.
>>> On the other hand, it's not a real problem to maintain this one patch in separate.
>>> It may return again to the tree, when XFS bug fixed.
>>
>> We’ll still have the problem that fallocate() must stall aio=native
>> requests.
>>
> 
> Does it mean that fallocate is bad in general? Practice shows the opposite..
> At least I have my examples with qemu-img bench. Can that thing be shown with
> qemu-img bench or something?

I haven’t done benchmarks yet, but I don’t think Laurent will mind if I
paste his fio benchmark results from the unfortunately private BZ here:

         QCOW2 ON |          EXT4         |          XFS          |
                  |                       |                       |
(rw, bs, iodepth) |   2.12.0  |   4.1.0   |   2.12.0  |   4.1.0   |
------------------+-----------+-----------+-----------+-----------+
(write, 16k, 1)   | 1868KiB/s | 1865KiB/s | 18.6MiB/s | 13.7MiB/s |
------------------+-----------+-----------+-----------+-----------+
(write, 64k, 1)   | 7036KiB/s | 7413KiB/s | 27.0MiB/s | 7465KiB/s |
------------------+-----------+-----------+-----------+-----------+
(write,  4k, 8)   |  535KiB/s |  524KiB/s | 13.9MiB/s | 1662KiB/s |
------------------+-----------+-----------+-----------+-----------+

And that just looks like ext4 performs worse with aio=native in general.
But I’ll have to do my own benchmarks first.

Max

Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Vladimir Sementsov-Ogievskiy 4 years, 6 months ago
29.10.2019 15:11, Max Reitz wrote:
> On 29.10.19 13:05, Vladimir Sementsov-Ogievskiy wrote:
>> 29.10.2019 14:55, Max Reitz wrote:
>>> On 29.10.19 12:48, Vladimir Sementsov-Ogievskiy wrote:
>>>> 29.10.2019 11:50, Max Reitz wrote:
>>>>> On 28.10.19 12:25, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> 28.10.2019 14:04, Kevin Wolf wrote:
>>>>>>> Am 27.10.2019 um 13:35 hat Stefan Hajnoczi geschrieben:
>>>>>>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
>>>>>
>>>>> [...]
>>>>>
>>>>>>>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>>>>>>>>         To my knowledge I’m the only one who has provided any benchmarks for
>>>>>>>>>         this commit, and even then I was a bit skeptical because it performs
>>>>>>>>>         well in some cases and bad in others.  I concluded that it’s
>>>>>>>>>         probably worth it because the “some cases” are more likely to occur.
>>>>>>>>>
>>>>>>>>>         Now we have this problem of corruption here (granted due to a bug in
>>>>>>>>>         the XFS driver), and another report of massively degraded
>>>>>>>>>         performance on ppc64
>>>>>>>>>         (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>>>>>>>>         private BZ; I hate that :-/  The report is about 40 % worse
>>>>>>>>>         performance for an in-guest fio write benchmark.)
>>>>>>>>>
>>>>>>>>>         So I have to ask the question about what the justification for
>>>>>>>>>         keeping c8bb23cbdbe32f is.  How much does performance increase with
>>>>>>>>>         it actually?  (On non-(ppc64+XFS) machines, obviously)
>>>>>>>>>
>>>>>>>>>         Advantages:
>>>>>>>>>         + Trivial
>>>>>>>>>         + No layering violations
>>>>>>>>>         + We wouldn’t need to keep track of whether the kernel bug has been
>>>>>>>>>           fixed or not
>>>>>>>>>         + Fixes the ppc64+XFS performance problem
>>>>>>>>>
>>>>>>>>>         Disadvantages:
>>>>>>>>>         - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>>>>>>>>           levels, whatever that means
>>>>>>>>
>>>>>>>> My favorite because it is clean and simple, but Vladimir has a valid
>>>>>>>> use-case for requiring this performance optimization so reverting isn't
>>>>>>>> an option.
>>>>>>>
>>>>>>> Vladimir also said that qcow2 subclusters would probably also solve his
>>>>>>> problem, so maybe reverting and applying the subcluster patches instead
>>>>>>> is a possible solution, too?
>>>>>>
>>>>>> I'm not sure about ssd case, it may need write-zero optimization anyway.
>>>>>
>>>>> What exactly do you need?  Do you actually need to write zeroes (e.g.
>>>>> because you’re storing images on block devices) or would it be
>>>>> sufficient to just drop the COW areas when bdrv_has_zero_init() and
>>>>> bdrv_has_zero_init_truncate() are true?
>>>>
>>>> Hmm, what do you mean? We need to zero COW areas. So, original way is to
>>>> write real zeroes, optimized way is fallocate.. What do you mean by drop?
>>>> Mark sublusters as zeroes by metadata?
>>>
>>> Why do you need to zero COW areas?  For normal files, any data will read
>>> as zero if you didn’t write anything there.
>>
>> Hmm, but when allocating new cluster in qcow2, it's not guaranteed to be zero,
>> as it may be reused previously allocated cluster..
> 
> Hm, right.  We could special-case something for beyond the EOF, but I
> don’t know whether that really makes it better.
> 
> OTOH, maybe allocations at the EOF are the real bottleneck.  Reusing
> existing clusters should be rare enough that maybe the existing code
> which explicitly writes zeroes is sufficient.

But, as I understand it, pre-EOF fallocates are safe on XFS?  So we could just
drop the fallocate call past EOF (it reads as zero anyway) and only do fallocate
pre-EOF (where it's safe)?  (The only special case is a range intersecting EOF,
so it would be better to keep the file length at a cluster-size boundary,
truncating on exit.)
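
A rough sketch of that idea, just to show the shape (the helper name is made
up, and the real file-posix code would go through its own zero-write handlers
rather than calling fallocate() directly like this):

/* Sketch only: skip write-zeroes that lie entirely past EOF and clamp
 * ranges that cross EOF, so fallocate() never has to extend the file. */
#define _GNU_SOURCE
#include <fcntl.h>          /* fallocate() */
#include <linux/falloc.h>   /* FALLOC_FL_ZERO_RANGE */
#include <sys/stat.h>
#include <stdint.h>
#include <errno.h>

static int zero_range_pre_eof_only(int fd, uint64_t offset, uint64_t bytes)
{
    struct stat st;

    if (fstat(fd, &st) < 0) {
        return -errno;
    }
    if (offset >= (uint64_t)st.st_size) {
        /* Entirely past EOF: reads return zeroes anyway, nothing to do. */
        return 0;
    }
    if (offset + bytes > (uint64_t)st.st_size) {
        /* Intersects EOF: clamp so we never allocate past the file end. */
        bytes = (uint64_t)st.st_size - offset;
    }
    if (fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, bytes) < 0) {
        return -errno;
    }
    return 0;
}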

> 
>>>
>>>> But still we'll have COW areas in subcluster, and we'll need to directly zero
>>>> them.. And fallocate will most probably be faster on ssd ext4 case..
>>>>
>>>>>
>>>>> I’m asking because Dave Chinner said
>>>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) that
>>>>> fallocate() is always slow at least with aio=native because it needs to
>>>>> wait for all concurrent AIO writes to finish, and so it causes the AIO
>>>>> pipeline to stall.
>>>>>
>>>>> (He suggested using XFS extent size hints to get the same effect as
>>>>> write-zeroes for free, basically, but I don’t know whether that’s really
>>>>> useful to us; as unallocated areas on XFS read back as zero anyway.)
>>>>>
>>>>>>> We already have some cases where the existing handle_alloc_space()
>>>>>>> causes performance to actually become worse, and serialising requests as
>>>>>>> a workaround isn't going to make performance any better. So even on
>>>>>>> these grounds, keeping commit c8bb23cbdbe32f is questionable.
>>>>>>>
>>>>>>
>>>>>> Can keeping handle_alloc_space under some config option be an option?
>>>>>
>>>>> Hm.  A config option is weird if you’re the only one who’s going to
>>>>> enable it.  But other than that I don’t have anything against it.
>>>>>
>>>>
>>>> It's just a bit easier for us to maintain config option, than out-of-tree patch.
>>>> On the other hand, it's not a real problem to maintain this one patch in separate.
>>>> It may return again to the tree, when XFS bug fixed.
>>>
>>> We’ll still have the problem that fallocate() must stall aio=native
>>> requests.
>>>
>>
>> Does it mean that fallocate is bad in general? Practice shows the opposite..
>> At least I have my examples with qemu-img bench. Can that thing be shown with
>> qemu-img bench or something?
> 
> I haven’t done benchmarks yet, but I don’t think Laurent will mind if I
> paste his fio benchmark results from the unfortunately private BZ here:
> 
>           QCOW2 ON |          EXT4         |          XFS          |
>                    |                       |                       |
> (rw, bs, iodepth) |   2.12.0  |   4.1.0   |   2.12.0  |   4.1.0   |
> ------------------+-----------+-----------+-----------+-----------+
> (write, 16k, 1)   | 1868KiB/s | 1865KiB/s | 18.6MiB/s | 13.7MiB/s |
> ------------------+-----------+-----------+-----------+-----------+
> (write, 64k, 1)   | 7036KiB/s | 7413KiB/s | 27.0MiB/s | 7465KiB/s |
> ------------------+-----------+-----------+-----------+-----------+
> (write,  4k, 8)   |  535KiB/s |  524KiB/s | 13.9MiB/s | 1662KiB/s |
> ------------------+-----------+-----------+-----------+-----------+
> 
> And that just looks like ext4 performs worse with aio=native in general.
>   But I’ll have to do my own benchmarks first.
> 
> Max
> 


-- 
Best regards,
Vladimir
Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Max Reitz 4 years, 6 months ago
On 29.10.19 13:19, Vladimir Sementsov-Ogievskiy wrote:
> 29.10.2019 15:11, Max Reitz wrote:
>> On 29.10.19 13:05, Vladimir Sementsov-Ogievskiy wrote:
>>> 29.10.2019 14:55, Max Reitz wrote:
>>>> On 29.10.19 12:48, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 29.10.2019 11:50, Max Reitz wrote:
>>>>>> On 28.10.19 12:25, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>> 28.10.2019 14:04, Kevin Wolf wrote:
>>>>>>>> Am 27.10.2019 um 13:35 hat Stefan Hajnoczi geschrieben:
>>>>>>>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>>>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>>>>>>>>>         To my knowledge I’m the only one who has provided any benchmarks for
>>>>>>>>>>         this commit, and even then I was a bit skeptical because it performs
>>>>>>>>>>         well in some cases and bad in others.  I concluded that it’s
>>>>>>>>>>         probably worth it because the “some cases” are more likely to occur.
>>>>>>>>>>
>>>>>>>>>>         Now we have this problem of corruption here (granted due to a bug in
>>>>>>>>>>         the XFS driver), and another report of massively degraded
>>>>>>>>>>         performance on ppc64
>>>>>>>>>>         (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>>>>>>>>>         private BZ; I hate that :-/  The report is about 40 % worse
>>>>>>>>>>         performance for an in-guest fio write benchmark.)
>>>>>>>>>>
>>>>>>>>>>         So I have to ask the question about what the justification for
>>>>>>>>>>         keeping c8bb23cbdbe32f is.  How much does performance increase with
>>>>>>>>>>         it actually?  (On non-(ppc64+XFS) machines, obviously)
>>>>>>>>>>
>>>>>>>>>>         Advantages:
>>>>>>>>>>         + Trivial
>>>>>>>>>>         + No layering violations
>>>>>>>>>>         + We wouldn’t need to keep track of whether the kernel bug has been
>>>>>>>>>>           fixed or not
>>>>>>>>>>         + Fixes the ppc64+XFS performance problem
>>>>>>>>>>
>>>>>>>>>>         Disadvantages:
>>>>>>>>>>         - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>>>>>>>>>           levels, whatever that means
>>>>>>>>>
>>>>>>>>> My favorite because it is clean and simple, but Vladimir has a valid
>>>>>>>>> use-case for requiring this performance optimization so reverting isn't
>>>>>>>>> an option.
>>>>>>>>
>>>>>>>> Vladimir also said that qcow2 subclusters would probably also solve his
>>>>>>>> problem, so maybe reverting and applying the subcluster patches instead
>>>>>>>> is a possible solution, too?
>>>>>>>
>>>>>>> I'm not sure about ssd case, it may need write-zero optimization anyway.
>>>>>>
>>>>>> What exactly do you need?  Do you actually need to write zeroes (e.g.
>>>>>> because you’re storing images on block devices) or would it be
>>>>>> sufficient to just drop the COW areas when bdrv_has_zero_init() and
>>>>>> bdrv_has_zero_init_truncate() are true?
>>>>>
>>>>> Hmm, what do you mean? We need to zero COW areas. So, original way is to
>>>>> write real zeroes, optimized way is fallocate.. What do you mean by drop?
>>>>> Mark sublusters as zeroes by metadata?
>>>>
>>>> Why do you need to zero COW areas?  For normal files, any data will read
>>>> as zero if you didn’t write anything there.
>>>
>>> Hmm, but when allocating new cluster in qcow2, it's not guaranteed to be zero,
>>> as it may be reused previously allocated cluster..
>>
>> Hm, right.  We could special-case something for beyond the EOF, but I
>> don’t know whether that really makes it better.
>>
>> OTOH, maybe allocations at the EOF are the real bottleneck.  Reusing
>> existing clusters should be rare enough that maybe the existing code
>> which explicitly writes zeroes is sufficient.
> 
> But, as I understand pre-EOF fallocates are safe in xfs? So, we may just drop calling
> fallocate past-EOF (it's zero anyway) and do fallocate pre-EOF (it's safe) ?

But probably slow still.  And there’s the question of how much
complexity we want to heap onto this.

Max

Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Vladimir Sementsov-Ogievskiy 4 years, 6 months ago
25.10.2019 12:58, Max Reitz wrote:
> Hi,
> 
> It seems to me that there is a bug in Linux’s XFS kernel driver, as
> I’ve explained here:
> 
> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
> 
> In combination with our commit c8bb23cbdbe32f, this may lead to guest
> data corruption when using qcow2 images on XFS with aio=native.
> 
> We can’t wait until the XFS kernel driver is fixed, we should work
> around the problem ourselves.
> 
> This is an RFC for two reasons:
> (1) I don’t know whether this is the right way to address the issue,
> (2) Ideally, we should detect whether the XFS kernel driver is fixed and
>      if so stop applying the workaround.
>      I don’t know how we would go about this, so this series doesn’t do
>      it.  (Hence it’s an RFC.)
> (3) Perhaps it’s a bit of a layering violation to let the file-posix
>      driver access and modify a BdrvTrackedRequest object.
> 
> As for how we can address the issue, I see three ways:
> (1) The one presented in this series: On XFS with aio=native, we extend
>      tracked requests for post-EOF fallocate() calls (i.e., write-zero
>      operations) to reach until infinity (INT64_MAX in practice), mark
>      them serializing and wait for other conflicting requests.
> 
>      Advantages:
>      + Limits the impact to very specific cases
>        (And that means it wouldn’t hurt too much to keep this workaround
>        even when the XFS driver has been fixed)
>      + Works around the bug where it happens, namely in file-posix
> 
>      Disadvantages:
>      - A bit complex
>      - A bit of a layering violation (should file-posix have access to
>        tracked requests?)
> 
> (2) Always skip qcow2’s handle_alloc_space() on XFS.  The XFS bug only
>      becomes visible due to that function: I don’t think qcow2 writes
>      zeroes in any other I/O path, and raw images are fixed in size so
>      post-EOF writes won’t happen.
> 
>      Advantages:
>      + Maybe simpler, depending on how difficult it is to handle the
>        layering violation
>      + Also fixes the performance problem of handle_alloc_space() being
>        slow on ppc64+XFS.
> 
>      Disadvantages:
>      - Huge layering violation because qcow2 would need to know whether
>        the image is stored on XFS or not.
>      - We’d definitely want to skip this workaround when the XFS driver
>        has been fixed, so we need some method to find out whether it has
> 
> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>      To my knowledge I’m the only one who has provided any benchmarks for
>      this commit, and even then I was a bit skeptical because it performs
>      well in some cases and bad in others.  I concluded that it’s
>      probably worth it because the “some cases” are more likely to occur.
> 
>      Now we have this problem of corruption here (granted due to a bug in
>      the XFS driver), and another report of massively degraded
>      performance on ppc64
>      (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>      private BZ; I hate that :-/  The report is about 40 % worse
>      performance for an in-guest fio write benchmark.)
> 
>      So I have to ask the question about what the justification for
>      keeping c8bb23cbdbe32f is.  How much does performance increase with
>      it actually?  (On non-(ppc64+XFS) machines, obviously)
> 
>      Advantages:
>      + Trivial
>      + No layering violations
>      + We wouldn’t need to keep track of whether the kernel bug has been
>        fixed or not
>      + Fixes the ppc64+XFS performance problem
> 
>      Disadvantages:
>      - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>        levels, whatever that means
> 
> So this is the main reason this is an RFC: What should we do?  Is (1)
> really the best choice?
> 
> 
> In any case, I’ve ran the test case I showed in
> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01282.html
> more than ten times with this series applied and the installation
> succeeded every time.  (Without this series, it fails like every other
> time.)
> 
> 

Hi!

First, great thanks for your investigation!

We need the c8bb23cbdbe3 patch because we use 1M clusters, and zeroing 1M takes
a significant amount of time.

I've tested a bit:

test:
for img in /ssd/test.img /test.img; do
  for cl in 64K 1M; do
    for step in 4K 64K 1M; do
      ./qemu-img create -f qcow2 -o cluster_size=$cl $img 15G > /dev/null
      printf '%-15s%-7s%-10s : ' $img cl=$cl step=$step
      ./qemu-img bench -c $((15 * 1024)) -n -s 4K -S $step -t none -w $img \
        | tail -1 | awk '{print $4}'
    done
  done
done

on master:

/ssd/test.img  cl=64K step=4K    : 0.291
/ssd/test.img  cl=64K step=64K   : 0.813
/ssd/test.img  cl=64K step=1M    : 2.799
/ssd/test.img  cl=1M  step=4K    : 0.217
/ssd/test.img  cl=1M  step=64K   : 0.332
/ssd/test.img  cl=1M  step=1M    : 0.685
/test.img      cl=64K step=4K    : 1.751
/test.img      cl=64K step=64K   : 14.811
/test.img      cl=64K step=1M    : 18.321
/test.img      cl=1M  step=4K    : 0.759
/test.img      cl=1M  step=64K   : 13.574
/test.img      cl=1M  step=1M    : 28.970

rerun on master:

/ssd/test.img  cl=64K step=4K    : 0.295
/ssd/test.img  cl=64K step=64K   : 0.803
/ssd/test.img  cl=64K step=1M    : 2.921
/ssd/test.img  cl=1M  step=4K    : 0.233
/ssd/test.img  cl=1M  step=64K   : 0.321
/ssd/test.img  cl=1M  step=1M    : 0.762
/test.img      cl=64K step=4K    : 1.873
/test.img      cl=64K step=64K   : 15.621
/test.img      cl=64K step=1M    : 18.428
/test.img      cl=1M  step=4K    : 0.883
/test.img      cl=1M  step=64K   : 13.484
/test.img      cl=1M  step=1M    : 26.244


on master + revert c8bb23cbdbe32f5c326

/ssd/test.img  cl=64K step=4K    : 0.395
/ssd/test.img  cl=64K step=64K   : 4.231
/ssd/test.img  cl=64K step=1M    : 5.598
/ssd/test.img  cl=1M  step=4K    : 0.352
/ssd/test.img  cl=1M  step=64K   : 2.519
/ssd/test.img  cl=1M  step=1M    : 38.919
/test.img      cl=64K step=4K    : 1.758
/test.img      cl=64K step=64K   : 9.838
/test.img      cl=64K step=1M    : 13.384
/test.img      cl=1M  step=4K    : 1.849
/test.img      cl=1M  step=64K   : 19.405
/test.img      cl=1M  step=1M    : 157.090

rerun:

/ssd/test.img  cl=64K step=4K    : 0.407
/ssd/test.img  cl=64K step=64K   : 3.325
/ssd/test.img  cl=64K step=1M    : 5.641
/ssd/test.img  cl=1M  step=4K    : 0.346
/ssd/test.img  cl=1M  step=64K   : 2.583
/ssd/test.img  cl=1M  step=1M    : 39.692
/test.img      cl=64K step=4K    : 1.727
/test.img      cl=64K step=64K   : 10.058
/test.img      cl=64K step=1M    : 13.441
/test.img      cl=1M  step=4K    : 1.926
/test.img      cl=1M  step=64K   : 19.738
/test.img      cl=1M  step=1M    : 158.268


So it's obvious that c8bb23cbdbe32f5c326 is significant for 1M cluster size, even on a
rotational disk, which means that the previous assumption of calling handle_alloc_space()
only for SSDs is wrong; we need smarter heuristics..

So, I'd prefer (1) or (2).

-- 
Best regards,
Vladimir
Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Vladimir Sementsov-Ogievskiy 4 years, 6 months ago
25.10.2019 16:40, Vladimir Sementsov-Ogievskiy wrote:
> 25.10.2019 12:58, Max Reitz wrote:
>> Hi,
>>
>> It seems to me that there is a bug in Linux’s XFS kernel driver, as
>> I’ve explained here:
>>
>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
>>
>> In combination with our commit c8bb23cbdbe32f, this may lead to guest
>> data corruption when using qcow2 images on XFS with aio=native.
>>
>> We can’t wait until the XFS kernel driver is fixed, we should work
>> around the problem ourselves.
>>
>> This is an RFC for two reasons:
>> (1) I don’t know whether this is the right way to address the issue,
>> (2) Ideally, we should detect whether the XFS kernel driver is fixed and
>>      if so stop applying the workaround.
>>      I don’t know how we would go about this, so this series doesn’t do
>>      it.  (Hence it’s an RFC.)
>> (3) Perhaps it’s a bit of a layering violation to let the file-posix
>>      driver access and modify a BdrvTrackedRequest object.
>>
>> As for how we can address the issue, I see three ways:
>> (1) The one presented in this series: On XFS with aio=native, we extend
>>      tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>      operations) to reach until infinity (INT64_MAX in practice), mark
>>      them serializing and wait for other conflicting requests.
>>
>>      Advantages:
>>      + Limits the impact to very specific cases
>>        (And that means it wouldn’t hurt too much to keep this workaround
>>        even when the XFS driver has been fixed)
>>      + Works around the bug where it happens, namely in file-posix
>>
>>      Disadvantages:
>>      - A bit complex
>>      - A bit of a layering violation (should file-posix have access to
>>        tracked requests?)
>>
>> (2) Always skip qcow2’s handle_alloc_space() on XFS.  The XFS bug only
>>      becomes visible due to that function: I don’t think qcow2 writes
>>      zeroes in any other I/O path, and raw images are fixed in size so
>>      post-EOF writes won’t happen.
>>
>>      Advantages:
>>      + Maybe simpler, depending on how difficult it is to handle the
>>        layering violation
>>      + Also fixes the performance problem of handle_alloc_space() being
>>        slow on ppc64+XFS.
>>
>>      Disadvantages:
>>      - Huge layering violation because qcow2 would need to know whether
>>        the image is stored on XFS or not.
>>      - We’d definitely want to skip this workaround when the XFS driver
>>        has been fixed, so we need some method to find out whether it has
>>
>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>      To my knowledge I’m the only one who has provided any benchmarks for
>>      this commit, and even then I was a bit skeptical because it performs
>>      well in some cases and bad in others.  I concluded that it’s
>>      probably worth it because the “some cases” are more likely to occur.
>>
>>      Now we have this problem of corruption here (granted due to a bug in
>>      the XFS driver), and another report of massively degraded
>>      performance on ppc64
>>      (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>      private BZ; I hate that :-/  The report is about 40 % worse
>>      performance for an in-guest fio write benchmark.)
>>
>>      So I have to ask the question about what the justification for
>>      keeping c8bb23cbdbe32f is.  How much does performance increase with
>>      it actually?  (On non-(ppc64+XFS) machines, obviously)
>>
>>      Advantages:
>>      + Trivial
>>      + No layering violations
>>      + We wouldn’t need to keep track of whether the kernel bug has been
>>        fixed or not
>>      + Fixes the ppc64+XFS performance problem
>>
>>      Disadvantages:
>>      - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>        levels, whatever that means
>>
>> So this is the main reason this is an RFC: What should we do?  Is (1)
>> really the best choice?
>>
>>
>> In any case, I’ve ran the test case I showed in
>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01282.html
>> more than ten times with this series applied and the installation
>> succeeded every time.  (Without this series, it fails like every other
>> time.)
>>
>>
> 
> Hi!
> 
> First, great thanks for your investigation!
> 
> We need c8bb23cbdbe3 patch, because we use 1M clusters, and zeroing 1M is significant
> in time.
> 
> I've tested a bit:
> 
> test:
> for img in /ssd/test.img /test.img; do for cl in 64K 1M; do for step in 4K 64K 1M; do ./qemu-img create -f qcow2 -o cluster_size=$cl $img 15G > /dev/null; printf '%-15s%-7s%-10s : ' $img cl=$cl step=$step; ./qemu-img bench -c $((15 * 1024)) -n -s 4K -S $step -t none -w $img | tail -1 | awk '{print $4}'; done; done; done
> 
> on master:
> 
> /ssd/test.img  cl=64K step=4K    : 0.291
> /ssd/test.img  cl=64K step=64K   : 0.813
> /ssd/test.img  cl=64K step=1M    : 2.799
> /ssd/test.img  cl=1M  step=4K    : 0.217
> /ssd/test.img  cl=1M  step=64K   : 0.332
> /ssd/test.img  cl=1M  step=1M    : 0.685
> /test.img      cl=64K step=4K    : 1.751
> /test.img      cl=64K step=64K   : 14.811
> /test.img      cl=64K step=1M    : 18.321
> /test.img      cl=1M  step=4K    : 0.759
> /test.img      cl=1M  step=64K   : 13.574
> /test.img      cl=1M  step=1M    : 28.970
> 
> rerun on master:
> 
> /ssd/test.img  cl=64K step=4K    : 0.295
> /ssd/test.img  cl=64K step=64K   : 0.803
> /ssd/test.img  cl=64K step=1M    : 2.921
> /ssd/test.img  cl=1M  step=4K    : 0.233
> /ssd/test.img  cl=1M  step=64K   : 0.321
> /ssd/test.img  cl=1M  step=1M    : 0.762
> /test.img      cl=64K step=4K    : 1.873
> /test.img      cl=64K step=64K   : 15.621
> /test.img      cl=64K step=1M    : 18.428
> /test.img      cl=1M  step=4K    : 0.883
> /test.img      cl=1M  step=64K   : 13.484
> /test.img      cl=1M  step=1M    : 26.244
> 
> 
> on master + revert c8bb23cbdbe32f5c326
> 
> /ssd/test.img  cl=64K step=4K    : 0.395
> /ssd/test.img  cl=64K step=64K   : 4.231
> /ssd/test.img  cl=64K step=1M    : 5.598
> /ssd/test.img  cl=1M  step=4K    : 0.352
> /ssd/test.img  cl=1M  step=64K   : 2.519
> /ssd/test.img  cl=1M  step=1M    : 38.919
> /test.img      cl=64K step=4K    : 1.758
> /test.img      cl=64K step=64K   : 9.838
> /test.img      cl=64K step=1M    : 13.384
> /test.img      cl=1M  step=4K    : 1.849
> /test.img      cl=1M  step=64K   : 19.405
> /test.img      cl=1M  step=1M    : 157.090
> 
> rerun:
> 
> /ssd/test.img  cl=64K step=4K    : 0.407
> /ssd/test.img  cl=64K step=64K   : 3.325
> /ssd/test.img  cl=64K step=1M    : 5.641
> /ssd/test.img  cl=1M  step=4K    : 0.346
> /ssd/test.img  cl=1M  step=64K   : 2.583
> /ssd/test.img  cl=1M  step=1M    : 39.692
> /test.img      cl=64K step=4K    : 1.727
> /test.img      cl=64K step=64K   : 10.058
> /test.img      cl=64K step=1M    : 13.441
> /test.img      cl=1M  step=4K    : 1.926
> /test.img      cl=1M  step=64K   : 19.738
> /test.img      cl=1M  step=1M    : 158.268
> 
> 
> So, it's obvious that c8bb23cbdbe32f5c326 is significant for 1M cluster-size, even on rotational
> disk, which means that previous assumption about calling handle_alloc_space() only for ssd is
> wrong, we need smarter heuristics..
> 
> So, I'd prefer (1) or (2).
> 

About the degradation in some cases: I think the problem is that one (a bit larger)
write may be faster than a fast write-zeroes plus a small write, as the latter means
an additional write to the metadata.  And that is expected for small clusters in
conjunction with a rotational disk.  But the actual limit depends on the specific
disk.  So I think a possible solution is to sometimes try working with
handle_alloc_space() and sometimes without, remember the time and length of each
request, and derive a dynamic limit from that...
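
As a very rough sketch of what such a dynamic limit could look like (all names,
the EWMA weighting and the fallback threshold below are made up; nothing like
this exists in the tree):

/* Sketch only: track an exponentially weighted moving average of the
 * observed cost per byte of the two strategies and pick the cheaper one. */
#include <stdint.h>
#include <stdbool.h>

typedef struct AllocCostEstimate {
    double ns_per_byte_zeroize;   /* write-zeroes + small data write */
    double ns_per_byte_full;      /* plain write of the whole COW'd cluster */
} AllocCostEstimate;

static void update_estimate(double *est, uint64_t ns, uint64_t bytes)
{
    double sample = (double)ns / (double)bytes;

    /* EWMA with weight 1/8; the weight is arbitrary. */
    *est = *est > 0 ? (*est * 7 + sample) / 8 : sample;
}

static bool use_handle_alloc_space(const AllocCostEstimate *e,
                                   uint64_t data_bytes, uint64_t cow_bytes)
{
    double cost_zeroize, cost_full;

    if (e->ns_per_byte_zeroize <= 0 || e->ns_per_byte_full <= 0) {
        /* No measurements yet: fall back to a static threshold. */
        return cow_bytes >= 64 * 1024;
    }
    cost_zeroize = e->ns_per_byte_zeroize * cow_bytes
                   + e->ns_per_byte_full * data_bytes;
    cost_full = e->ns_per_byte_full * (cow_bytes + data_bytes);
    return cost_zeroize < cost_full;
}
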
-- 
Best regards,
Vladimir
Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Max Reitz 4 years, 6 months ago
On 25.10.19 15:56, Vladimir Sementsov-Ogievskiy wrote:
> 25.10.2019 16:40, Vladimir Sementsov-Ogievskiy wrote:
>> 25.10.2019 12:58, Max Reitz wrote:
>>> Hi,
>>>
>>> It seems to me that there is a bug in Linux’s XFS kernel driver, as
>>> I’ve explained here:
>>>
>>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
>>>
>>> In combination with our commit c8bb23cbdbe32f, this may lead to guest
>>> data corruption when using qcow2 images on XFS with aio=native.
>>>
>>> We can’t wait until the XFS kernel driver is fixed, we should work
>>> around the problem ourselves.
>>>
>>> This is an RFC for two reasons:
>>> (1) I don’t know whether this is the right way to address the issue,
>>> (2) Ideally, we should detect whether the XFS kernel driver is fixed and
>>>      if so stop applying the workaround.
>>>      I don’t know how we would go about this, so this series doesn’t do
>>>      it.  (Hence it’s an RFC.)
>>> (3) Perhaps it’s a bit of a layering violation to let the file-posix
>>>      driver access and modify a BdrvTrackedRequest object.
>>>
>>> As for how we can address the issue, I see three ways:
>>> (1) The one presented in this series: On XFS with aio=native, we extend
>>>      tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>>      operations) to reach until infinity (INT64_MAX in practice), mark
>>>      them serializing and wait for other conflicting requests.
>>>
>>>      Advantages:
>>>      + Limits the impact to very specific cases
>>>        (And that means it wouldn’t hurt too much to keep this workaround
>>>        even when the XFS driver has been fixed)
>>>      + Works around the bug where it happens, namely in file-posix
>>>
>>>      Disadvantages:
>>>      - A bit complex
>>>      - A bit of a layering violation (should file-posix have access to
>>>        tracked requests?)
>>>
>>> (2) Always skip qcow2’s handle_alloc_space() on XFS.  The XFS bug only
>>>      becomes visible due to that function: I don’t think qcow2 writes
>>>      zeroes in any other I/O path, and raw images are fixed in size so
>>>      post-EOF writes won’t happen.
>>>
>>>      Advantages:
>>>      + Maybe simpler, depending on how difficult it is to handle the
>>>        layering violation
>>>      + Also fixes the performance problem of handle_alloc_space() being
>>>        slow on ppc64+XFS.
>>>
>>>      Disadvantages:
>>>      - Huge layering violation because qcow2 would need to know whether
>>>        the image is stored on XFS or not.
>>>      - We’d definitely want to skip this workaround when the XFS driver
>>>        has been fixed, so we need some method to find out whether it has
>>>
>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>>      To my knowledge I’m the only one who has provided any benchmarks for
>>>      this commit, and even then I was a bit skeptical because it performs
>>>      well in some cases and bad in others.  I concluded that it’s
>>>      probably worth it because the “some cases” are more likely to occur.
>>>
>>>      Now we have this problem of corruption here (granted due to a bug in
>>>      the XFS driver), and another report of massively degraded
>>>      performance on ppc64
>>>      (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>>      private BZ; I hate that :-/  The report is about 40 % worse
>>>      performance for an in-guest fio write benchmark.)
>>>
>>>      So I have to ask the question about what the justification for
>>>      keeping c8bb23cbdbe32f is.  How much does performance increase with
>>>      it actually?  (On non-(ppc64+XFS) machines, obviously)
>>>
>>>      Advantages:
>>>      + Trivial
>>>      + No layering violations
>>>      + We wouldn’t need to keep track of whether the kernel bug has been
>>>        fixed or not
>>>      + Fixes the ppc64+XFS performance problem
>>>
>>>      Disadvantages:
>>>      - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>>        levels, whatever that means
>>>
>>> So this is the main reason this is an RFC: What should we do?  Is (1)
>>> really the best choice?
>>>
>>>
>>> In any case, I’ve ran the test case I showed in
>>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01282.html
>>> more than ten times with this series applied and the installation
>>> succeeded every time.  (Without this series, it fails like every other
>>> time.)
>>>
>>>
>>
>> Hi!
>>
>> First, great thanks for your investigation!
>>
>> We need c8bb23cbdbe3 patch, because we use 1M clusters, and zeroing 1M is significant
>> in time.
>>
>> I've tested a bit:
>>
>> test:
>> for img in /ssd/test.img /test.img; do for cl in 64K 1M; do for step in 4K 64K 1M; do ./qemu-img create -f qcow2 -o cluster_size=$cl $img 15G > /dev/null; printf '%-15s%-7s%-10s : ' $img cl=$cl step=$step; ./qemu-img bench -c $((15 * 1024)) -n -s 4K -S $step -t none -w $img | tail -1 | awk '{print $4}'; done; done; done
>>
>> on master:
>>
>> /ssd/test.img  cl=64K step=4K    : 0.291
>> /ssd/test.img  cl=64K step=64K   : 0.813
>> /ssd/test.img  cl=64K step=1M    : 2.799
>> /ssd/test.img  cl=1M  step=4K    : 0.217
>> /ssd/test.img  cl=1M  step=64K   : 0.332
>> /ssd/test.img  cl=1M  step=1M    : 0.685
>> /test.img      cl=64K step=4K    : 1.751
>> /test.img      cl=64K step=64K   : 14.811
>> /test.img      cl=64K step=1M    : 18.321
>> /test.img      cl=1M  step=4K    : 0.759
>> /test.img      cl=1M  step=64K   : 13.574
>> /test.img      cl=1M  step=1M    : 28.970
>>
>> rerun on master:
>>
>> /ssd/test.img  cl=64K step=4K    : 0.295
>> /ssd/test.img  cl=64K step=64K   : 0.803
>> /ssd/test.img  cl=64K step=1M    : 2.921
>> /ssd/test.img  cl=1M  step=4K    : 0.233
>> /ssd/test.img  cl=1M  step=64K   : 0.321
>> /ssd/test.img  cl=1M  step=1M    : 0.762
>> /test.img      cl=64K step=4K    : 1.873
>> /test.img      cl=64K step=64K   : 15.621
>> /test.img      cl=64K step=1M    : 18.428
>> /test.img      cl=1M  step=4K    : 0.883
>> /test.img      cl=1M  step=64K   : 13.484
>> /test.img      cl=1M  step=1M    : 26.244
>>
>>
>> on master + revert c8bb23cbdbe32f5c326
>>
>> /ssd/test.img  cl=64K step=4K    : 0.395
>> /ssd/test.img  cl=64K step=64K   : 4.231
>> /ssd/test.img  cl=64K step=1M    : 5.598
>> /ssd/test.img  cl=1M  step=4K    : 0.352
>> /ssd/test.img  cl=1M  step=64K   : 2.519
>> /ssd/test.img  cl=1M  step=1M    : 38.919
>> /test.img      cl=64K step=4K    : 1.758
>> /test.img      cl=64K step=64K   : 9.838
>> /test.img      cl=64K step=1M    : 13.384
>> /test.img      cl=1M  step=4K    : 1.849
>> /test.img      cl=1M  step=64K   : 19.405
>> /test.img      cl=1M  step=1M    : 157.090
>>
>> rerun:
>>
>> /ssd/test.img  cl=64K step=4K    : 0.407
>> /ssd/test.img  cl=64K step=64K   : 3.325
>> /ssd/test.img  cl=64K step=1M    : 5.641
>> /ssd/test.img  cl=1M  step=4K    : 0.346
>> /ssd/test.img  cl=1M  step=64K   : 2.583
>> /ssd/test.img  cl=1M  step=1M    : 39.692
>> /test.img      cl=64K step=4K    : 1.727
>> /test.img      cl=64K step=64K   : 10.058
>> /test.img      cl=64K step=1M    : 13.441
>> /test.img      cl=1M  step=4K    : 1.926
>> /test.img      cl=1M  step=64K   : 19.738
>> /test.img      cl=1M  step=1M    : 158.268
>>
>>
>> So, it's obvious that c8bb23cbdbe32f5c326 is significant for 1M cluster-size, even on rotational
>> disk, which means that previous assumption about calling handle_alloc_space() only for ssd is
>> wrong, we need smarter heuristics..
>>
>> So, I'd prefer (1) or (2).

OK.  I wonder whether that problem would go away with Berto’s subcluster
series, though.

> About degradation in some cases: I think the problem is that one (a bit larger)
> write may be faster than fast-write-zeroes + small write, as the latter means
> additional write to metadata. And it's expected for small clusters in
> conjunction with rotational disk. But the actual limit is dependent on specific
> disk. So, I think possible solution is just sometimes try work with
> handle_alloc_space and sometimes without, remember time and length of request
> and make dynamic limit...

Maybe make a decision based both on the ratio of data size to COW area
length (only invoke handle_alloc_space() under a certain threshold), and
the absolute COW area length (always invoke it above a certain
threshold, unless the ratio doesn’t allow it)?
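
To make that concrete, here is a minimal sketch of such a two-threshold
check (plain C; nothing here comes from this series, and the names and
constants are invented purely for illustration):

/* Hypothetical helper: combine a data/COW-ratio check with an absolute
 * COW-length check.  The thresholds are arbitrary placeholders. */
#include <stdbool.h>
#include <stdint.h>

#define RATIO_LIMIT    0.25            /* upper bound on data/COW ratio  */
#define RATIO_SMALL    0.0625          /* stricter bound for small areas */
#define COW_LEN_ALWAYS (1024 * 1024)   /* absolute threshold: 1 MiB      */

static bool use_handle_alloc_space(uint64_t data_bytes, uint64_t cow_bytes)
{
    if (cow_bytes == 0) {
        return false;                  /* nothing to zero out */
    }

    double ratio = (double)data_bytes / (double)cow_bytes;

    if (ratio > RATIO_LIMIT) {
        return false;                  /* the ratio doesn't allow it */
    }
    if (cow_bytes >= COW_LEN_ALWAYS) {
        return true;                   /* large COW area: always zero it */
    }
    return ratio <= RATIO_SMALL;       /* small COW area: require a tiny ratio */
}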

Max

Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Vladimir Sementsov-Ogievskiy 4 years, 6 months ago
25.10.2019 17:19, Max Reitz wrote:
> On 25.10.19 15:56, Vladimir Sementsov-Ogievskiy wrote:
>> 25.10.2019 16:40, Vladimir Sementsov-Ogievskiy wrote:
>>> 25.10.2019 12:58, Max Reitz wrote:
>>>> Hi,
>>>>
>>>> It seems to me that there is a bug in Linux’s XFS kernel driver, as
>>>> I’ve explained here:
>>>>
>>>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
>>>>
>>>> In combination with our commit c8bb23cbdbe32f, this may lead to guest
>>>> data corruption when using qcow2 images on XFS with aio=native.
>>>>
>>>> We can’t wait until the XFS kernel driver is fixed, we should work
>>>> around the problem ourselves.
>>>>
>>>> This is an RFC for two reasons:
>>>> (1) I don’t know whether this is the right way to address the issue,
>>>> (2) Ideally, we should detect whether the XFS kernel driver is fixed and
>>>>       if so stop applying the workaround.
>>>>       I don’t know how we would go about this, so this series doesn’t do
>>>>       it.  (Hence it’s an RFC.)
>>>> (3) Perhaps it’s a bit of a layering violation to let the file-posix
>>>>       driver access and modify a BdrvTrackedRequest object.
>>>>
>>>> As for how we can address the issue, I see three ways:
>>>> (1) The one presented in this series: On XFS with aio=native, we extend
>>>>       tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>>>       operations) to reach until infinity (INT64_MAX in practice), mark
>>>>       them serializing and wait for other conflicting requests.
>>>>
>>>>       Advantages:
>>>>       + Limits the impact to very specific cases
>>>>         (And that means it wouldn’t hurt too much to keep this workaround
>>>>         even when the XFS driver has been fixed)
>>>>       + Works around the bug where it happens, namely in file-posix
>>>>
>>>>       Disadvantages:
>>>>       - A bit complex
>>>>       - A bit of a layering violation (should file-posix have access to
>>>>         tracked requests?)
>>>>
>>>> (2) Always skip qcow2’s handle_alloc_space() on XFS.  The XFS bug only
>>>>       becomes visible due to that function: I don’t think qcow2 writes
>>>>       zeroes in any other I/O path, and raw images are fixed in size so
>>>>       post-EOF writes won’t happen.
>>>>
>>>>       Advantages:
>>>>       + Maybe simpler, depending on how difficult it is to handle the
>>>>         layering violation
>>>>       + Also fixes the performance problem of handle_alloc_space() being
>>>>         slow on ppc64+XFS.
>>>>
>>>>       Disadvantages:
>>>>       - Huge layering violation because qcow2 would need to know whether
>>>>         the image is stored on XFS or not.
>>>>       - We’d definitely want to skip this workaround when the XFS driver
>>>>         has been fixed, so we need some method to find out whether it has
>>>>
>>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>>>       To my knowledge I’m the only one who has provided any benchmarks for
>>>>       this commit, and even then I was a bit skeptical because it performs
>>>>       well in some cases and bad in others.  I concluded that it’s
>>>>       probably worth it because the “some cases” are more likely to occur.
>>>>
>>>>       Now we have this problem of corruption here (granted due to a bug in
>>>>       the XFS driver), and another report of massively degraded
>>>>       performance on ppc64
>>>>       (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>>>       private BZ; I hate that :-/  The report is about 40 % worse
>>>>       performance for an in-guest fio write benchmark.)
>>>>
>>>>       So I have to ask the question about what the justification for
>>>>       keeping c8bb23cbdbe32f is.  How much does performance increase with
>>>>       it actually?  (On non-(ppc64+XFS) machines, obviously)
>>>>
>>>>       Advantages:
>>>>       + Trivial
>>>>       + No layering violations
>>>>       + We wouldn’t need to keep track of whether the kernel bug has been
>>>>         fixed or not
>>>>       + Fixes the ppc64+XFS performance problem
>>>>
>>>>       Disadvantages:
>>>>       - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>>>         levels, whatever that means
>>>>
>>>> So this is the main reason this is an RFC: What should we do?  Is (1)
>>>> really the best choice?
>>>>
>>>>
>>>> In any case, I’ve ran the test case I showed in
>>>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01282.html
>>>> more than ten times with this series applied and the installation
>>>> succeeded every time.  (Without this series, it fails like every other
>>>> time.)
>>>>
>>>>
>>>
>>> Hi!
>>>
>>> First, great thanks for your investigation!
>>>
>>> We need c8bb23cbdbe3 patch, because we use 1M clusters, and zeroing 1M is significant
>>> in time.
>>>
>>> I've tested a bit:
>>>
>>> test:
>>> for img in /ssd/test.img /test.img; do for cl in 64K 1M; do for step in 4K 64K 1M; do ./qemu-img create -f qcow2 -o cluster_size=$cl $img 15G > /dev/null; printf '%-15s%-7s%-10s : ' $img cl=$cl step=$step; ./qemu-img bench -c $((15 * 1024)) -n -s 4K -S $step -t none -w $img | tail -1 | awk '{print $4}'; done; done; done
>>>
>>> on master:
>>>
>>> /ssd/test.img  cl=64K step=4K    : 0.291
>>> /ssd/test.img  cl=64K step=64K   : 0.813
>>> /ssd/test.img  cl=64K step=1M    : 2.799
>>> /ssd/test.img  cl=1M  step=4K    : 0.217
>>> /ssd/test.img  cl=1M  step=64K   : 0.332
>>> /ssd/test.img  cl=1M  step=1M    : 0.685
>>> /test.img      cl=64K step=4K    : 1.751
>>> /test.img      cl=64K step=64K   : 14.811
>>> /test.img      cl=64K step=1M    : 18.321
>>> /test.img      cl=1M  step=4K    : 0.759
>>> /test.img      cl=1M  step=64K   : 13.574
>>> /test.img      cl=1M  step=1M    : 28.970
>>>
>>> rerun on master:
>>>
>>> /ssd/test.img  cl=64K step=4K    : 0.295
>>> /ssd/test.img  cl=64K step=64K   : 0.803
>>> /ssd/test.img  cl=64K step=1M    : 2.921
>>> /ssd/test.img  cl=1M  step=4K    : 0.233
>>> /ssd/test.img  cl=1M  step=64K   : 0.321
>>> /ssd/test.img  cl=1M  step=1M    : 0.762
>>> /test.img      cl=64K step=4K    : 1.873
>>> /test.img      cl=64K step=64K   : 15.621
>>> /test.img      cl=64K step=1M    : 18.428
>>> /test.img      cl=1M  step=4K    : 0.883
>>> /test.img      cl=1M  step=64K   : 13.484
>>> /test.img      cl=1M  step=1M    : 26.244
>>>
>>>
>>> on master + revert c8bb23cbdbe32f5c326
>>>
>>> /ssd/test.img  cl=64K step=4K    : 0.395
>>> /ssd/test.img  cl=64K step=64K   : 4.231
>>> /ssd/test.img  cl=64K step=1M    : 5.598
>>> /ssd/test.img  cl=1M  step=4K    : 0.352
>>> /ssd/test.img  cl=1M  step=64K   : 2.519
>>> /ssd/test.img  cl=1M  step=1M    : 38.919
>>> /test.img      cl=64K step=4K    : 1.758
>>> /test.img      cl=64K step=64K   : 9.838
>>> /test.img      cl=64K step=1M    : 13.384
>>> /test.img      cl=1M  step=4K    : 1.849
>>> /test.img      cl=1M  step=64K   : 19.405
>>> /test.img      cl=1M  step=1M    : 157.090
>>>
>>> rerun:
>>>
>>> /ssd/test.img  cl=64K step=4K    : 0.407
>>> /ssd/test.img  cl=64K step=64K   : 3.325
>>> /ssd/test.img  cl=64K step=1M    : 5.641
>>> /ssd/test.img  cl=1M  step=4K    : 0.346
>>> /ssd/test.img  cl=1M  step=64K   : 2.583
>>> /ssd/test.img  cl=1M  step=1M    : 39.692
>>> /test.img      cl=64K step=4K    : 1.727
>>> /test.img      cl=64K step=64K   : 10.058
>>> /test.img      cl=64K step=1M    : 13.441
>>> /test.img      cl=1M  step=4K    : 1.926
>>> /test.img      cl=1M  step=64K   : 19.738
>>> /test.img      cl=1M  step=1M    : 158.268
>>>
>>>
>>> So, it's obvious that c8bb23cbdbe32f5c326 is significant for 1M cluster-size, even on rotational
>>> disk, which means that previous assumption about calling handle_alloc_space() only for ssd is
>>> wrong, we need smarter heuristics..
>>>
>>> So, I'd prefer (1) or (2).
> 
> OK.  I wonder whether that problem would go away with Berto’s subcluster
> series, though.

Very possible, I thought about it too.

> 
>> About degradation in some cases: I think the problem is that one (a bit larger)
>> write may be faster than fast-write-zeroes + small write, as the latter means
>> additional write to metadata. And it's expected for small clusters in
>> conjunction with rotational disk. But the actual limit is dependent on specific
>> disk. So, I think possible solution is just sometimes try work with
>> handle_alloc_space and sometimes without, remember time and length of request
>> and make dynamic limit...
> 
> Maybe make a decision based both on the ratio of data size to COW area
> length (only invoke handle_alloc_space() under a certain threshold), and
> the absolute COW area length (always invoke it above a certain
> threshold, unless the ratio doesn’t allow it)?
> 

Yes, something like this..

without handle_alloc_space, time = time(write aligned up to cluster)
with handle_alloc_space, time = time(fast zero write) + time(original write)

If we have some statistics on normal-write vs. zero-write timing, we can just
calculate both variants and choose the faster one.

if (predict_zero_write_time(aligned up request) + predict_write_time(request) < predict_write_time(aligned up request)) {
    use handle_alloc_space()
}
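
A rough sketch of what that dynamic limit could look like (purely
illustrative C; none of these names or statistics exist in QEMU, and the
EMA weighting is an arbitrary choice):

/* Track per-byte cost of normal writes and write-zeroes requests with
 * exponential moving averages, then predict which variant of the
 * allocating write should be faster. */
#include <stdbool.h>
#include <stdint.h>

typedef struct WriteStats {
    double write_ns_per_byte;   /* EMA over completed normal writes */
    double zero_ns_per_byte;    /* EMA over completed write-zeroes  */
} WriteStats;

static void stats_account(double *ema, uint64_t bytes, uint64_t elapsed_ns)
{
    double sample = (double)elapsed_ns / (double)bytes;   /* bytes > 0 assumed */
    *ema = (*ema == 0.0) ? sample : *ema * 0.9 + sample * 0.1;
}

/* "Zero the whole cluster, then write the data" vs. "write the data
 * padded up to the cluster boundaries". */
static bool prefer_handle_alloc_space(const WriteStats *s,
                                      uint64_t data_bytes,
                                      uint64_t cluster_bytes)
{
    if (s->write_ns_per_byte == 0.0 || s->zero_ns_per_byte == 0.0) {
        return true;    /* no samples yet; arbitrary default */
    }

    double t_with    = s->zero_ns_per_byte * cluster_bytes +
                       s->write_ns_per_byte * data_bytes;
    double t_without = s->write_ns_per_byte * cluster_bytes;

    return t_with < t_without;
}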


-- 
Best regards,
Vladimir
Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Stefan Hajnoczi 4 years, 6 months ago
On Fri, Oct 25, 2019 at 02:36:49PM +0000, Vladimir Sementsov-Ogievskiy wrote:
> 25.10.2019 17:19, Max Reitz wrote:
> > On 25.10.19 15:56, Vladimir Sementsov-Ogievskiy wrote:
> >> 25.10.2019 16:40, Vladimir Sementsov-Ogievskiy wrote:
> >>> 25.10.2019 12:58, Max Reitz wrote:
> >>>> Hi,
> >>>>
> >>>> It seems to me that there is a bug in Linux’s XFS kernel driver, as
> >>>> I’ve explained here:
> >>>>
> >>>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
> >>>>
> >>>> In combination with our commit c8bb23cbdbe32f, this may lead to guest
> >>>> data corruption when using qcow2 images on XFS with aio=native.
> >>>>
> >>>> We can’t wait until the XFS kernel driver is fixed, we should work
> >>>> around the problem ourselves.
> >>>>
> >>>> This is an RFC for two reasons:
> >>>> (1) I don’t know whether this is the right way to address the issue,
> >>>> (2) Ideally, we should detect whether the XFS kernel driver is fixed and
> >>>>       if so stop applying the workaround.
> >>>>       I don’t know how we would go about this, so this series doesn’t do
> >>>>       it.  (Hence it’s an RFC.)
> >>>> (3) Perhaps it’s a bit of a layering violation to let the file-posix
> >>>>       driver access and modify a BdrvTrackedRequest object.
> >>>>
> >>>> As for how we can address the issue, I see three ways:
> >>>> (1) The one presented in this series: On XFS with aio=native, we extend
> >>>>       tracked requests for post-EOF fallocate() calls (i.e., write-zero
> >>>>       operations) to reach until infinity (INT64_MAX in practice), mark
> >>>>       them serializing and wait for other conflicting requests.
> >>>>
> >>>>       Advantages:
> >>>>       + Limits the impact to very specific cases
> >>>>         (And that means it wouldn’t hurt too much to keep this workaround
> >>>>         even when the XFS driver has been fixed)
> >>>>       + Works around the bug where it happens, namely in file-posix
> >>>>
> >>>>       Disadvantages:
> >>>>       - A bit complex
> >>>>       - A bit of a layering violation (should file-posix have access to
> >>>>         tracked requests?)
> >>>>
> >>>> (2) Always skip qcow2’s handle_alloc_space() on XFS.  The XFS bug only
> >>>>       becomes visible due to that function: I don’t think qcow2 writes
> >>>>       zeroes in any other I/O path, and raw images are fixed in size so
> >>>>       post-EOF writes won’t happen.
> >>>>
> >>>>       Advantages:
> >>>>       + Maybe simpler, depending on how difficult it is to handle the
> >>>>         layering violation
> >>>>       + Also fixes the performance problem of handle_alloc_space() being
> >>>>         slow on ppc64+XFS.
> >>>>
> >>>>       Disadvantages:
> >>>>       - Huge layering violation because qcow2 would need to know whether
> >>>>         the image is stored on XFS or not.
> >>>>       - We’d definitely want to skip this workaround when the XFS driver
> >>>>         has been fixed, so we need some method to find out whether it has
> >>>>
> >>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
> >>>>       To my knowledge I’m the only one who has provided any benchmarks for
> >>>>       this commit, and even then I was a bit skeptical because it performs
> >>>>       well in some cases and bad in others.  I concluded that it’s
> >>>>       probably worth it because the “some cases” are more likely to occur.
> >>>>
> >>>>       Now we have this problem of corruption here (granted due to a bug in
> >>>>       the XFS driver), and another report of massively degraded
> >>>>       performance on ppc64
> >>>>       (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
> >>>>       private BZ; I hate that :-/  The report is about 40 % worse
> >>>>       performance for an in-guest fio write benchmark.)
> >>>>
> >>>>       So I have to ask the question about what the justification for
> >>>>       keeping c8bb23cbdbe32f is.  How much does performance increase with
> >>>>       it actually?  (On non-(ppc64+XFS) machines, obviously)
> >>>>
> >>>>       Advantages:
> >>>>       + Trivial
> >>>>       + No layering violations
> >>>>       + We wouldn’t need to keep track of whether the kernel bug has been
> >>>>         fixed or not
> >>>>       + Fixes the ppc64+XFS performance problem
> >>>>
> >>>>       Disadvantages:
> >>>>       - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
> >>>>         levels, whatever that means
> >>>>
> >>>> So this is the main reason this is an RFC: What should we do?  Is (1)
> >>>> really the best choice?
> >>>>
> >>>>
> >>>> In any case, I’ve ran the test case I showed in
> >>>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01282.html
> >>>> more than ten times with this series applied and the installation
> >>>> succeeded every time.  (Without this series, it fails like every other
> >>>> time.)
> >>>>
> >>>>
> >>>
> >>> Hi!
> >>>
> >>> First, great thanks for your investigation!
> >>>
> >>> We need c8bb23cbdbe3 patch, because we use 1M clusters, and zeroing 1M is significant
> >>> in time.
> >>>
> >>> I've tested a bit:
> >>>
> >>> test:
> >>> for img in /ssd/test.img /test.img; do for cl in 64K 1M; do for step in 4K 64K 1M; do ./qemu-img create -f qcow2 -o cluster_size=$cl $img 15G > /dev/null; printf '%-15s%-7s%-10s : ' $img cl=$cl step=$step; ./qemu-img bench -c $((15 * 1024)) -n -s 4K -S $step -t none -w $img | tail -1 | awk '{print $4}'; done; done; done
> >>>
> >>> on master:
> >>>
> >>> /ssd/test.img  cl=64K step=4K    : 0.291
> >>> /ssd/test.img  cl=64K step=64K   : 0.813
> >>> /ssd/test.img  cl=64K step=1M    : 2.799
> >>> /ssd/test.img  cl=1M  step=4K    : 0.217
> >>> /ssd/test.img  cl=1M  step=64K   : 0.332
> >>> /ssd/test.img  cl=1M  step=1M    : 0.685
> >>> /test.img      cl=64K step=4K    : 1.751
> >>> /test.img      cl=64K step=64K   : 14.811
> >>> /test.img      cl=64K step=1M    : 18.321
> >>> /test.img      cl=1M  step=4K    : 0.759
> >>> /test.img      cl=1M  step=64K   : 13.574
> >>> /test.img      cl=1M  step=1M    : 28.970
> >>>
> >>> rerun on master:
> >>>
> >>> /ssd/test.img  cl=64K step=4K    : 0.295
> >>> /ssd/test.img  cl=64K step=64K   : 0.803
> >>> /ssd/test.img  cl=64K step=1M    : 2.921
> >>> /ssd/test.img  cl=1M  step=4K    : 0.233
> >>> /ssd/test.img  cl=1M  step=64K   : 0.321
> >>> /ssd/test.img  cl=1M  step=1M    : 0.762
> >>> /test.img      cl=64K step=4K    : 1.873
> >>> /test.img      cl=64K step=64K   : 15.621
> >>> /test.img      cl=64K step=1M    : 18.428
> >>> /test.img      cl=1M  step=4K    : 0.883
> >>> /test.img      cl=1M  step=64K   : 13.484
> >>> /test.img      cl=1M  step=1M    : 26.244
> >>>
> >>>
> >>> on master + revert c8bb23cbdbe32f5c326
> >>>
> >>> /ssd/test.img  cl=64K step=4K    : 0.395
> >>> /ssd/test.img  cl=64K step=64K   : 4.231
> >>> /ssd/test.img  cl=64K step=1M    : 5.598
> >>> /ssd/test.img  cl=1M  step=4K    : 0.352
> >>> /ssd/test.img  cl=1M  step=64K   : 2.519
> >>> /ssd/test.img  cl=1M  step=1M    : 38.919
> >>> /test.img      cl=64K step=4K    : 1.758
> >>> /test.img      cl=64K step=64K   : 9.838
> >>> /test.img      cl=64K step=1M    : 13.384
> >>> /test.img      cl=1M  step=4K    : 1.849
> >>> /test.img      cl=1M  step=64K   : 19.405
> >>> /test.img      cl=1M  step=1M    : 157.090
> >>>
> >>> rerun:
> >>>
> >>> /ssd/test.img  cl=64K step=4K    : 0.407
> >>> /ssd/test.img  cl=64K step=64K   : 3.325
> >>> /ssd/test.img  cl=64K step=1M    : 5.641
> >>> /ssd/test.img  cl=1M  step=4K    : 0.346
> >>> /ssd/test.img  cl=1M  step=64K   : 2.583
> >>> /ssd/test.img  cl=1M  step=1M    : 39.692
> >>> /test.img      cl=64K step=4K    : 1.727
> >>> /test.img      cl=64K step=64K   : 10.058
> >>> /test.img      cl=64K step=1M    : 13.441
> >>> /test.img      cl=1M  step=4K    : 1.926
> >>> /test.img      cl=1M  step=64K   : 19.738
> >>> /test.img      cl=1M  step=1M    : 158.268
> >>>
> >>>
> >>> So, it's obvious that c8bb23cbdbe32f5c326 is significant for 1M cluster-size, even on rotational
> >>> disk, which means that previous assumption about calling handle_alloc_space() only for ssd is
> >>> wrong, we need smarter heuristics..
> >>>
> >>> So, I'd prefer (1) or (2).
> > 
> > OK.  I wonder whether that problem would go away with Berto’s subcluster
> > series, though.
> 
> Very possible, I thought about it too.
> 
> > 
> >> About degradation in some cases: I think the problem is that one (a bit larger)
> >> write may be faster than fast-write-zeroes + small write, as the latter means
> >> additional write to metadata. And it's expected for small clusters in
> >> conjunction with rotational disk. But the actual limit is dependent on specific
> >> disk. So, I think possible solution is just sometimes try work with
> >> handle_alloc_space and sometimes without, remember time and length of request
> >> and make dynamic limit...
> > 
> > Maybe make a decision based both on the ratio of data size to COW area
> > length (only invoke handle_alloc_space() under a certain threshold), and
> > the absolute COW area length (always invoke it above a certain
> > threshold, unless the ratio doesn’t allow it)?
> > 
> 
> Yes, something like this..
> 
> without handle_alloc_space, time = time(write aligned up to cluster)
> with handle_alloc_space, time = time(fast zero write) + time(original write)
> 
> If we have some statistics on normal-write vs zero-write timing, we can just
> calculate both variants and choose faster.
> 
> if (predict_zero_write_time(aligned up request) + predict_write_time(request) < predict_write_time(aligned up request)) {
>     use handle_alloc_space()
> }

Self-tuning based on request latency works great on a quiet host.  If
there are other processes submitting I/O to the disk then measured
latencies may be bogus.

Stefan
Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Alberto Garcia 4 years, 6 months ago
On Fri 25 Oct 2019 04:19:30 PM CEST, Max Reitz wrote:
>>> So, it's obvious that c8bb23cbdbe32f5c326 is significant for 1M
>>> cluster-size, even on rotational disk, which means that previous
>>> assumption about calling handle_alloc_space() only for ssd is wrong,
>>> we need smarter heuristics..
>>>
>>> So, I'd prefer (1) or (2).
>
> OK.  I wonder whether that problem would go away with Berto’s subcluster
> series, though.

Catching up with this now. I was told about this last week at the KVM
Forum, but if the problem comes with the use of fallocate() and XFS,
then I don't think subclusters will solve it.

handle_alloc_space() is used to fill a cluster with zeroes when there's
COW, and the same happens with subclusters, just at the subcluster
level instead, of course.

What can happen, if the subcluster size matches the filesystem block
size, is that there's no need for any COW and therefore the bug is never
triggered. But that's not quite the same as a fix :-)
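
As a quick back-of-the-envelope illustration of that (standalone C, not
related to the actual qcow2 code): for a 4 kB allocating write at the
start of a cluster, subclusters shrink the COW area from almost the
whole cluster to almost one subcluster, but it does not disappear:

#include <stdio.h>

int main(void)
{
    const long write_len  = 4L * 1024;          /* guest write            */
    const long cluster    = 2L * 1024 * 1024;   /* qcow2 cluster size     */
    const long subcluster = 64L * 1024;         /* 2 MB / 32 subclusters  */

    printf("COW without subclusters: %ld KiB\n", (cluster - write_len) / 1024);
    printf("COW with subclusters:    %ld KiB\n", (subcluster - write_len) / 1024);
    return 0;
}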

> Maybe make a decision based both on the ratio of data size to COW area
> length (only invoke handle_alloc_space() under a certain threshold),
> and the absolute COW area length (always invoke it above a certain
> threshold, unless the ratio doesn’t allow it)?

Maybe combining that with the smaller clusters/subclusters can work
around the problem. The maximum subcluster size is 64KB (for a 2MB
cluster).

Berto

Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Max Reitz 4 years, 6 months ago
On 04.11.19 15:03, Alberto Garcia wrote:
> On Fri 25 Oct 2019 04:19:30 PM CEST, Max Reitz wrote:
>>>> So, it's obvious that c8bb23cbdbe32f5c326 is significant for 1M
>>>> cluster-size, even on rotational disk, which means that previous
>>>> assumption about calling handle_alloc_space() only for ssd is wrong,
>>>> we need smarter heuristics..
>>>>
>>>> So, I'd prefer (1) or (2).
>>
>> OK.  I wonder whether that problem would go away with Berto’s subcluster
>> series, though.
> 
> Catching up with this now. I was told about this last week at the KVM
> Forum, but if the problem comes with the use of fallocate() and XFS,
> then I don't think subclusters will solve it.
> 
> handle_alloc_space() is used to fill a cluster with zeroes when there's
> COW, and the same happens with subclusters, just at the subcluster
> level instead, of course.
> 
> What can happen, if the subcluster size matches the filesystem block
> size, is that there's no need for any COW and therefore the bug is never
> triggered. But that's not quite the same as a fix :-)

No, what I meant was that the original problem that led to c8bb23cbdbe
would go away.

c8bb23cbdbe was added because small writes to new clusters are slow when
the clusters are large (because you need to do a 2 MB write on the host
for a 4 kB write from the guest).  So handle_alloc_space() was added to
alleviate the problem with a zero-write instead of actually writing zeroes.
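
For illustration, here is a standalone sketch (Linux-only; this is not
the actual qcow2 or file-posix code) of the two ways of producing that
zeroed area in the host file: explicitly writing a buffer of zeroes
versus the zero-write that handle_alloc_space() relies on:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#ifndef FALLOC_FL_ZERO_RANGE
#define FALLOC_FL_ZERO_RANGE 0x10   /* value from <linux/falloc.h> */
#endif

/* Slow path: actually write len bytes of zeroes at ofs. */
static int zero_range_slow(int fd, off_t ofs, off_t len)
{
    static const char buf[65536];   /* zero-initialized */

    while (len > 0) {
        size_t chunk = len < (off_t)sizeof(buf) ? (size_t)len : sizeof(buf);
        ssize_t n = pwrite(fd, buf, chunk, ofs);
        if (n <= 0) {
            return -1;
        }
        ofs += n;
        len -= n;
    }
    return 0;
}

/* Fast path: ask the filesystem to zero the range.  On filesystems that
 * support it (XFS does), this avoids writing the data itself. */
static int zero_range_fast(int fd, off_t ofs, off_t len)
{
    return fallocate(fd, FALLOC_FL_ZERO_RANGE, ofs, len);
}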

The question is whether there is no need for handle_alloc_space() with
subclusters because a normal write with explicit zeroes being written in
the COW areas would be sufficiently quick.  (Because the subclusters for
2 MB clusters are just 64 kB in size.)

If that were so (right now it doesn’t look like it), we could revert
c8bb23cbdbe and wouldn’t see the bug anymore.

Max

>> Maybe make a decision based both on the ratio of data size to COW area
>> length (only invoke handle_alloc_space() under a certain threshold),
>> and the absolute COW area length (always invoke it above a certain
>> threshold, unless the ratio doesn’t allow it)?
> 
> Maybe combining that with the smaller clusters/subclusters can work
> around the problem. The maximum subcluster size is 64KB (for a 2MB
> cluster).
> 
> Berto
> 


Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Alberto Garcia 4 years, 6 months ago
On Mon 04 Nov 2019 03:25:12 PM CET, Max Reitz wrote:
>>>>> So, it's obvious that c8bb23cbdbe32f5c326 is significant for 1M
>>>>> cluster-size, even on rotational disk, which means that previous
>>>>> assumption about calling handle_alloc_space() only for ssd is wrong,
>>>>> we need smarter heuristics..
>>>>>
>>>>> So, I'd prefer (1) or (2).
>>>
>>> OK.  I wonder whether that problem would go away with Berto’s subcluster
>>> series, though.
>> 
>> Catching up with this now. I was told about this last week at the KVM
>> Forum, but if the problem comes with the use of fallocate() and XFS,
>> then I don't think subclusters will solve it.
>> 
>> handle_alloc_space() is used to fill a cluster with zeroes when there's
>> COW, and the same happens with subclusters, just at the subcluster
>> level instead, of course.
>> 
>> What can happen, if the subcluster size matches the filesystem block
>> size, is that there's no need for any COW and therefore the bug is never
>> triggered. But that's not quite the same as a fix :-)
>
> No, what I meant was that the original problem that led to c8bb23cbdbe
> would go away.

Ah, right. Not quite, according to my numbers:

|--------------+----------------+-----------------+-------------|
| Cluster size | subclusters=on | subclusters=off | fallocate() |
|--------------+----------------+-----------------+-------------|
|       256 KB |     10182 IOPS |        966 IOPS |  14007 IOPS |
|       512 KB |      7919 IOPS |        563 IOPS |  13442 IOPS |
|      1024 KB |      5050 IOPS |        465 IOPS |  13887 IOPS |
|      2048 KB |      2465 IOPS |        271 IOPS |  13885 IOPS |
|--------------+----------------+-----------------+-------------|

There's obviously no backing image, and only the last column uses
handle_alloc_space() / fallocate().

Berto

Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Max Reitz 4 years, 6 months ago
On 04.11.19 16:12, Alberto Garcia wrote:
> On Mon 04 Nov 2019 03:25:12 PM CET, Max Reitz wrote:
>>>>>> So, it's obvious that c8bb23cbdbe32f5c326 is significant for 1M
>>>>>> cluster-size, even on rotational disk, which means that previous
>>>>>> assumption about calling handle_alloc_space() only for ssd is wrong,
>>>>>> we need smarter heuristics..
>>>>>>
>>>>>> So, I'd prefer (1) or (2).
>>>>
>>>> OK.  I wonder whether that problem would go away with Berto’s subcluster
>>>> series, though.
>>>
>>> Catching up with this now. I was told about this last week at the KVM
>>> Forum, but if the problem comes with the use of fallocate() and XFS,
>>> then I don't think subclusters will solve it.
>>>
>>> handle_alloc_space() is used to fill a cluster with zeroes when there's
>>> COW, and the same happens with subclusters, just at the subcluster
>>> level instead, of course.
>>>
>>> What can happen, if the subcluster size matches the filesystem block
>>> size, is that there's no need for any COW and therefore the bug is never
>>> triggered. But that's not quite the same as a fix :-)
>>
>> No, what I meant was that the original problem that led to c8bb23cbdbe
>> would go away.
> 
> Ah, right. Not quite, according to my numbers:
> 
> |--------------+----------------+-----------------+-------------|
> | Cluster size | subclusters=on | subclusters=off | fallocate() |
> |--------------+----------------+-----------------+-------------|
> |       256 KB |     10182 IOPS |        966 IOPS |  14007 IOPS |
> |       512 KB |      7919 IOPS |        563 IOPS |  13442 IOPS |
> |      1024 KB |      5050 IOPS |        465 IOPS |  13887 IOPS |
> |      2048 KB |      2465 IOPS |        271 IOPS |  13885 IOPS |
> |--------------+----------------+-----------------+-------------|
> 
> There's obviously no backing image, and only the last column uses
> handle_alloc_space() / fallocate().

Thanks for providing some numbers!

It was my impression, too, that subclusters wouldn’t solve it.  But it
didn’t seem like that big of a difference to me.  Did you run this with
aio=native?  (Because that’s where we have the XFS problem)

Max

Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Alberto Garcia 4 years, 6 months ago
On Mon 04 Nov 2019 04:14:56 PM CET, Max Reitz wrote:

>>> No, what I meant was that the original problem that led to
>>> c8bb23cbdbe would go away.
>> 
>> Ah, right. Not quite, according to my numbers:
>> 
>> |--------------+----------------+-----------------+-------------|
>> | Cluster size | subclusters=on | subclusters=off | fallocate() |
>> |--------------+----------------+-----------------+-------------|
>> |       256 KB |     10182 IOPS |        966 IOPS |  14007 IOPS |
>> |       512 KB |      7919 IOPS |        563 IOPS |  13442 IOPS |
>> |      1024 KB |      5050 IOPS |        465 IOPS |  13887 IOPS |
>> |      2048 KB |      2465 IOPS |        271 IOPS |  13885 IOPS |
>> |--------------+----------------+-----------------+-------------|
>> 
>> There's obviously no backing image, and only the last column uses
>> handle_alloc_space() / fallocate().
>
> Thanks for providing some numbers!
>
> It was my impression, too, that subclusters wouldn’t solve it.  But it
> didn’t seem like that big of a difference to me.  Did you run this
> with aio=native?  (Because that’s where we have the XFS problem)

Here's with aio=native

|--------------+----------------+-----------------+-------------|
| Cluster size | subclusters=on | subclusters=off | fallocate() |
|--------------+----------------+-----------------+-------------|
|       256 KB |     19897 IOPS |        891 IOPS |  32194 IOPS |
|       512 KB |     17881 IOPS |        436 IOPS |  33092 IOPS |
|      1024 KB |     17050 IOPS |        341 IOPS |  32768 IOPS |
|      2048 KB |      7854 IOPS |        207 IOPS |  30944 IOPS |
|--------------+----------------+-----------------+-------------|

Berto

Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Max Reitz 4 years, 6 months ago
On 04.11.19 16:49, Alberto Garcia wrote:
> On Mon 04 Nov 2019 04:14:56 PM CET, Max Reitz wrote:
> 
>>>> No, what I meant was that the original problem that led to
>>>> c8bb23cbdbe would go away.
>>>
>>> Ah, right. Not quite, according to my numbers:
>>>
>>> |--------------+----------------+-----------------+-------------|
>>> | Cluster size | subclusters=on | subclusters=off | fallocate() |
>>> |--------------+----------------+-----------------+-------------|
>>> |       256 KB |     10182 IOPS |        966 IOPS |  14007 IOPS |
>>> |       512 KB |      7919 IOPS |        563 IOPS |  13442 IOPS |
>>> |      1024 KB |      5050 IOPS |        465 IOPS |  13887 IOPS |
>>> |      2048 KB |      2465 IOPS |        271 IOPS |  13885 IOPS |
>>> |--------------+----------------+-----------------+-------------|
>>>
>>> There's obviously no backing image, and only the last column uses
>>> handle_alloc_space() / fallocate().
>>
>> Thanks for providing some numbers!
>>
>> It was my impression, too, that subclusters wouldn’t solve it.  But it
>> didn’t seem like that big of a difference to me.  Did you run this
>> with aio=native?  (Because that’s where we have the XFS problem)
> 
> Here's with aio=native
> 
> |--------------+----------------+-----------------+-------------|
> | Cluster size | subclusters=on | subclusters=off | fallocate() |
> |--------------+----------------+-----------------+-------------|
> |       256 KB |     19897 IOPS |        891 IOPS |  32194 IOPS |
> |       512 KB |     17881 IOPS |        436 IOPS |  33092 IOPS |
> |      1024 KB |     17050 IOPS |        341 IOPS |  32768 IOPS |
> |      2048 KB |      7854 IOPS |        207 IOPS |  30944 IOPS |
> |--------------+----------------+-----------------+-------------|

OK, thanks again. :-)

So it seems right not to revert the commit and just work around the XFS
bug in file-posix.

Max

Re: [RFC 0/3] block/file-posix: Work around XFS bug
Posted by Kevin Wolf 4 years, 6 months ago
Am 25.10.2019 um 16:19 hat Max Reitz geschrieben:
> On 25.10.19 15:56, Vladimir Sementsov-Ogievskiy wrote:
> > 25.10.2019 16:40, Vladimir Sementsov-Ogievskiy wrote:
> >> 25.10.2019 12:58, Max Reitz wrote:
> >>> Hi,
> >>>
> >>> It seems to me that there is a bug in Linux’s XFS kernel driver, as
> >>> I’ve explained here:
> >>>
> >>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
> >>>
> >>> In combination with our commit c8bb23cbdbe32f, this may lead to guest
> >>> data corruption when using qcow2 images on XFS with aio=native.
> >>>
> >>> We can’t wait until the XFS kernel driver is fixed, we should work
> >>> around the problem ourselves.
> >>>
> >>> This is an RFC for two reasons:
> >>> (1) I don’t know whether this is the right way to address the issue,
> >>> (2) Ideally, we should detect whether the XFS kernel driver is fixed and
> >>>      if so stop applying the workaround.
> >>>      I don’t know how we would go about this, so this series doesn’t do
> >>>      it.  (Hence it’s an RFC.)
> >>> (3) Perhaps it’s a bit of a layering violation to let the file-posix
> >>>      driver access and modify a BdrvTrackedRequest object.
> >>>
> >>> As for how we can address the issue, I see three ways:
> >>> (1) The one presented in this series: On XFS with aio=native, we extend
> >>>      tracked requests for post-EOF fallocate() calls (i.e., write-zero
> >>>      operations) to reach until infinity (INT64_MAX in practice), mark
> >>>      them serializing and wait for other conflicting requests.
> >>>
> >>>      Advantages:
> >>>      + Limits the impact to very specific cases
> >>>        (And that means it wouldn’t hurt too much to keep this workaround
> >>>        even when the XFS driver has been fixed)
> >>>      + Works around the bug where it happens, namely in file-posix
> >>>
> >>>      Disadvantages:
> >>>      - A bit complex
> >>>      - A bit of a layering violation (should file-posix have access to
> >>>        tracked requests?)
> >>>
> >>> (2) Always skip qcow2’s handle_alloc_space() on XFS.  The XFS bug only
> >>>      becomes visible due to that function: I don’t think qcow2 writes
> >>>      zeroes in any other I/O path, and raw images are fixed in size so
> >>>      post-EOF writes won’t happen.
> >>>
> >>>      Advantages:
> >>>      + Maybe simpler, depending on how difficult it is to handle the
> >>>        layering violation
> >>>      + Also fixes the performance problem of handle_alloc_space() being
> >>>        slow on ppc64+XFS.
> >>>
> >>>      Disadvantages:
> >>>      - Huge layering violation because qcow2 would need to know whether
> >>>        the image is stored on XFS or not.
> >>>      - We’d definitely want to skip this workaround when the XFS driver
> >>>        has been fixed, so we need some method to find out whether it has
> >>>
> >>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
> >>>      To my knowledge I’m the only one who has provided any benchmarks for
> >>>      this commit, and even then I was a bit skeptical because it performs
> >>>      well in some cases and bad in others.  I concluded that it’s
> >>>      probably worth it because the “some cases” are more likely to occur.
> >>>
> >>>      Now we have this problem of corruption here (granted due to a bug in
> >>>      the XFS driver), and another report of massively degraded
> >>>      performance on ppc64
> >>>      (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
> >>>      private BZ; I hate that :-/  The report is about 40 % worse
> >>>      performance for an in-guest fio write benchmark.)
> >>>
> >>>      So I have to ask the question about what the justification for
> >>>      keeping c8bb23cbdbe32f is.  How much does performance increase with
> >>>      it actually?  (On non-(ppc64+XFS) machines, obviously)
> >>>
> >>>      Advantages:
> >>>      + Trivial
> >>>      + No layering violations
> >>>      + We wouldn’t need to keep track of whether the kernel bug has been
> >>>        fixed or not
> >>>      + Fixes the ppc64+XFS performance problem
> >>>
> >>>      Disadvantages:
> >>>      - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
> >>>        levels, whatever that means
> >>>
> >>> So this is the main reason this is an RFC: What should we do?  Is (1)
> >>> really the best choice?
> >>>
> >>>
> >>> In any case, I’ve ran the test case I showed in
> >>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01282.html
> >>> more than ten times with this series applied and the installation
> >>> succeeded every time.  (Without this series, it fails like every other
> >>> time.)
> >>>
> >>>
> >>
> >> Hi!
> >>
> >> First, great thanks for your investigation!
> >>
> >> We need c8bb23cbdbe3 patch, because we use 1M clusters, and zeroing 1M is significant
> >> in time.
> >>
> >> I've tested a bit:
> >>
> >> test:
> >> for img in /ssd/test.img /test.img; do for cl in 64K 1M; do for step in 4K 64K 1M; do ./qemu-img create -f qcow2 -o cluster_size=$cl $img 15G > /dev/null; printf '%-15s%-7s%-10s : ' $img cl=$cl step=$step; ./qemu-img bench -c $((15 * 1024)) -n -s 4K -S $step -t none -w $img | tail -1 | awk '{print $4}'; done; done; done
> >>
> >> on master:
> >>
> >> /ssd/test.img  cl=64K step=4K    : 0.291
> >> /ssd/test.img  cl=64K step=64K   : 0.813
> >> /ssd/test.img  cl=64K step=1M    : 2.799
> >> /ssd/test.img  cl=1M  step=4K    : 0.217
> >> /ssd/test.img  cl=1M  step=64K   : 0.332
> >> /ssd/test.img  cl=1M  step=1M    : 0.685
> >> /test.img      cl=64K step=4K    : 1.751
> >> /test.img      cl=64K step=64K   : 14.811
> >> /test.img      cl=64K step=1M    : 18.321
> >> /test.img      cl=1M  step=4K    : 0.759
> >> /test.img      cl=1M  step=64K   : 13.574
> >> /test.img      cl=1M  step=1M    : 28.970
> >>
> >> rerun on master:
> >>
> >> /ssd/test.img  cl=64K step=4K    : 0.295
> >> /ssd/test.img  cl=64K step=64K   : 0.803
> >> /ssd/test.img  cl=64K step=1M    : 2.921
> >> /ssd/test.img  cl=1M  step=4K    : 0.233
> >> /ssd/test.img  cl=1M  step=64K   : 0.321
> >> /ssd/test.img  cl=1M  step=1M    : 0.762
> >> /test.img      cl=64K step=4K    : 1.873
> >> /test.img      cl=64K step=64K   : 15.621
> >> /test.img      cl=64K step=1M    : 18.428
> >> /test.img      cl=1M  step=4K    : 0.883
> >> /test.img      cl=1M  step=64K   : 13.484
> >> /test.img      cl=1M  step=1M    : 26.244
> >>
> >>
> >> on master + revert c8bb23cbdbe32f5c326
> >>
> >> /ssd/test.img  cl=64K step=4K    : 0.395
> >> /ssd/test.img  cl=64K step=64K   : 4.231
> >> /ssd/test.img  cl=64K step=1M    : 5.598
> >> /ssd/test.img  cl=1M  step=4K    : 0.352
> >> /ssd/test.img  cl=1M  step=64K   : 2.519
> >> /ssd/test.img  cl=1M  step=1M    : 38.919
> >> /test.img      cl=64K step=4K    : 1.758
> >> /test.img      cl=64K step=64K   : 9.838
> >> /test.img      cl=64K step=1M    : 13.384
> >> /test.img      cl=1M  step=4K    : 1.849
> >> /test.img      cl=1M  step=64K   : 19.405
> >> /test.img      cl=1M  step=1M    : 157.090
> >>
> >> rerun:
> >>
> >> /ssd/test.img  cl=64K step=4K    : 0.407
> >> /ssd/test.img  cl=64K step=64K   : 3.325
> >> /ssd/test.img  cl=64K step=1M    : 5.641
> >> /ssd/test.img  cl=1M  step=4K    : 0.346
> >> /ssd/test.img  cl=1M  step=64K   : 2.583
> >> /ssd/test.img  cl=1M  step=1M    : 39.692
> >> /test.img      cl=64K step=4K    : 1.727
> >> /test.img      cl=64K step=64K   : 10.058
> >> /test.img      cl=64K step=1M    : 13.441
> >> /test.img      cl=1M  step=4K    : 1.926
> >> /test.img      cl=1M  step=64K   : 19.738
> >> /test.img      cl=1M  step=1M    : 158.268
> >>
> >>
> >> So, it's obvious that c8bb23cbdbe32f5c326 is significant for 1M cluster-size, even on rotational
> >> disk, which means that previous assumption about calling handle_alloc_space() only for ssd is
> >> wrong, we need smarter heuristics..
> >>
> >> So, I'd prefer (1) or (2).
> 
> OK.  I wonder whether that problem would go away with Berto’s subcluster
> series, though.
> 
> > About degradation in some cases: I think the problem is that one (a bit larger)
> > write may be faster than fast-write-zeroes + small write, as the latter means
> > additional write to metadata. And it's expected for small clusters in
> > conjunction with rotational disk. But the actual limit is dependent on specific
> > disk. So, I think possible solution is just sometimes try work with
> > handle_alloc_space and sometimes without, remember time and length of request
> > and make dynamic limit...
> 
> Maybe make a decision based both on the ratio of data size to COW area
> length (only invoke handle_alloc_space() under a certain threshold), and
> the absolute COW area length (always invoke it above a certain
> threshold, unless the ratio doesn’t allow it)?

I'm not sure that I would like this level of complexity in this code
path...

Kevin