Hi,

It seems to me that there is a bug in Linux’s XFS kernel driver, as I’ve explained here:

https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html

In combination with our commit c8bb23cbdbe32f, this may lead to guest data corruption when using qcow2 images on XFS with aio=native.

We can’t wait until the XFS kernel driver is fixed; we should work around the problem ourselves.

This is an RFC for three reasons:
(1) I don’t know whether this is the right way to address the issue.
(2) Ideally, we should detect whether the XFS kernel driver is fixed and, if so, stop applying the workaround.  I don’t know how we would go about this, so this series doesn’t do it.  (Hence it’s an RFC.)
(3) Perhaps it’s a bit of a layering violation to let the file-posix driver access and modify a BdrvTrackedRequest object.

As for how we can address the issue, I see three ways:

(1) The one presented in this series: On XFS with aio=native, we extend tracked requests for post-EOF fallocate() calls (i.e., write-zero operations) to reach until infinity (INT64_MAX in practice), mark them serializing, and wait for other conflicting requests.

    Advantages:
    + Limits the impact to very specific cases (and that means it wouldn’t hurt too much to keep this workaround even when the XFS driver has been fixed)
    + Works around the bug where it happens, namely in file-posix

    Disadvantages:
    - A bit complex
    - A bit of a layering violation (should file-posix have access to tracked requests?)

(2) Always skip qcow2’s handle_alloc_space() on XFS.  The XFS bug only becomes visible due to that function: I don’t think qcow2 writes zeroes in any other I/O path, and raw images are fixed in size, so post-EOF writes won’t happen.

    Advantages:
    + Maybe simpler, depending on how difficult it is to handle the layering violation
    + Also fixes the performance problem of handle_alloc_space() being slow on ppc64+XFS
    Disadvantages:
    - Huge layering violation, because qcow2 would need to know whether the image is stored on XFS or not
    - We’d definitely want to skip this workaround when the XFS driver has been fixed, so we need some method to find out whether it has

(3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.  To my knowledge, I’m the only one who has provided any benchmarks for this commit, and even then I was a bit skeptical because it performs well in some cases and badly in others.  I concluded that it’s probably worth it because the “some cases” are more likely to occur.

    Now we have this problem of corruption here (granted, due to a bug in the XFS driver), and another report of massively degraded performance on ppc64 (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a private BZ; I hate that :-/  The report is about 40 % worse performance for an in-guest fio write benchmark.)

    So I have to ask what the justification for keeping c8bb23cbdbe32f is.  How much does it actually increase performance?  (On non-(ppc64+XFS) machines, obviously.)

    Advantages:
    + Trivial
    + No layering violations
    + We wouldn’t need to keep track of whether the kernel bug has been fixed or not
    + Fixes the ppc64+XFS performance problem

    Disadvantages:
    - Reverts cluster allocation performance to pre-c8bb23cbdbe32f levels, whatever that means

So this is the main reason this is an RFC: What should we do?  Is (1) really the best choice?

In any case, I’ve run the test case I showed in https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01282.html more than ten times with this series applied, and the installation succeeded every time.  (Without this series, it fails about every other time.)
Max Reitz (3):
  block: Make wait/mark serialising requests public
  block/file-posix: Detect XFS with CONFIG_FALLOCATE
  block/file-posix: Let post-EOF fallocate serialize

 include/block/block_int.h |  3 +++
 block/file-posix.c        | 46 +++++++++++++++++++++++++++++++++++++--
 block/io.c                | 24 ++++++++++----------
 3 files changed, 59 insertions(+), 14 deletions(-)

-- 
2.21.0
Patchew URL: https://patchew.org/QEMU/20191025095849.25283-1-mreitz@redhat.com/

Hi,

This series failed the docker-quick@centos7 build test. Please find the testing commands and their output below. If you have Docker installed, you can probably reproduce it locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
make docker-image-centos7 V=1 NETWORK=1
time make docker-test-quick@centos7 SHOW_ENV=1 J=14 NETWORK=1
=== TEST SCRIPT END ===

  CC      block/sheepdog.o
  CC      block/accounting.o
/tmp/qemu-test/src/block/file-posix.c: In function 'raw_open_common':
/tmp/qemu-test/src/block/file-posix.c:671:5: error: implicit declaration of function 'platform_test_xfs_fd' [-Werror=implicit-function-declaration]
     if (platform_test_xfs_fd(s->fd)) {
     ^
/tmp/qemu-test/src/block/file-posix.c:671:5: error: nested extern declaration of 'platform_test_xfs_fd' [-Werror=nested-externs]
cc1: all warnings being treated as errors
make: *** [block/file-posix.o] Error 1
make: *** Waiting for unfinished jobs....
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 662, in <module>
---
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', '--label', 'com.qemu.instance.uuid=c74d47e9bfb24107b6e94885fa8a2151', '-u', '1003', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', '/home/patchew2/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', '/var/tmp/patchew-tester-tmp-ytazf4e4/src/docker-src.2019-10-25-20.11.53.7609:/var/tmp/qemu:z,ro', 'qemu:centos7', '/var/tmp/qemu/run', 'test-quick']' returned non-zero exit status 2.
filter=--filter=label=com.qemu.instance.uuid=c74d47e9bfb24107b6e94885fa8a2151
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-ytazf4e4/src'
make: *** [docker-run-test-quick@centos7] Error 2

real    2m32.235s
user    0m8.092s

The full log is available at
http://patchew.org/logs/20191025095849.25283-1-mreitz@redhat.com/testing.docker-quick@centos7/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
On Fri, 25 Oct 2019 at 11:09, Max Reitz <mreitz@redhat.com> wrote:
>
> Hi,
>
> It seems to me that there is a bug in Linux’s XFS kernel driver, as
> I’ve explained here:
>
> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
>
> In combination with our commit c8bb23cbdbe32f, this may lead to guest
> data corruption when using qcow2 images on XFS with aio=native.

Have we reported that upstream to the xfs folks?

thanks
-- PMM
On 25.10.19 15:46, Peter Maydell wrote:
> On Fri, 25 Oct 2019 at 11:09, Max Reitz <mreitz@redhat.com> wrote:
>> [...]
>> In combination with our commit c8bb23cbdbe32f, this may lead to guest
>> data corruption when using qcow2 images on XFS with aio=native.
>
> Have we reported that upstream to the xfs folks?

I’ve created an RH BZ here:

https://bugzilla.redhat.com/show_bug.cgi?id=1765547

So at least some XFS folks are aware of it (and I trust them to inform the others as necessary :-)).  (Eric Sandeen has been part of the discussion for the last couple of days already.)

Max
On Fri, 25 Oct 2019 at 15:16, Max Reitz <mreitz@redhat.com> wrote:
> [...]
> I’ve created an RH BZ here:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1765547

Currently "You are not authorized to access bug #1765547." for anonymous browsers, just fyi.

thanks
-- PMM
On 25.10.19 16:17, Peter Maydell wrote:
> On Fri, 25 Oct 2019 at 15:16, Max Reitz <mreitz@redhat.com> wrote:
>> [...]
>> https://bugzilla.redhat.com/show_bug.cgi?id=1765547
>
> Currently "You are not authorized to access bug #1765547." for
> anonymous browsers, just fyi.

Err.  Oops.  That wasn’t my intention.  I hate that misfeature.

Thanks for telling me; I fixed it. :-)

Max
On Fri, 25 Oct 2019 at 15:21, Max Reitz <mreitz@redhat.com> wrote:
> On 25.10.19 16:17, Peter Maydell wrote:
> > Currently "You are not authorized to access bug #1765547." for
> > anonymous browsers, just fyi.
>
> Err.  Oops.  That wasn’t my intention.  I hate that misfeature.
>
> Thanks for telling me, I fixed it. :-)

Yeah, that’s a really well written bug report, you definitely don’t want to hide it away :-)  Thanks for making it public.

-- PMM
On Fri, Oct 25, 2019 at 1:11 PM Max Reitz <mreitz@redhat.com> wrote:
> [...]
> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
> [...]
>     Advantages:
>     + Trivial
>     + No layering violations
>     + We wouldn’t need to keep track of whether the kernel bug has been
>       fixed or not
>     + Fixes the ppc64+XFS performance problem
>
>     Disadvantages:
>     - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>       levels, whatever that means

Correctness is more important than performance, so this is my preference as a user.

Nir

> So this is the main reason this is an RFC: What should we do?  Is (1)
> really the best choice?
26.10.2019 20:37, Nir Soffer wrote:
> On Fri, Oct 25, 2019 at 1:11 PM Max Reitz <mreitz@redhat.com> wrote:
>> [...]
>> [...]
>> Disadvantages:
>> - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>   levels, whatever that means
>
> Correctness is more important than performance, so this is my
> preference as a user.

Hmm, still, it is XFS that is incorrect, not QEMU.  This bug may be triggered by other software, or by another scenario in QEMU (not sure).
-- 
Best regards,
Vladimir
On 26.10.19 19:52, Vladimir Sementsov-Ogievskiy wrote:
> 26.10.2019 20:37, Nir Soffer wrote:
>> [...]
>> Correctness is more important than performance, so this is my
>> preference as a user.
> Hmm, still, incorrect is XFS, not Qemu.  This bug may be triggered by
> another software, or may be another scenario in Qemu (not sure).

I don’t see any other image format (but qcow2, raw, and filters) creating zero-writes, apart from the parallels format, which does so before writes can happen to that area.  So with parallels, it’s impossible to get a data write behind an ongoing zero-write.

Raw and filters do not initiate zero-writes on their own, so there must be something else that requests them.  Block jobs and guests are limited to generally fixed-size disks, so I don’t see a way to generate zero-writes beyond the EOF there.

Which leaves us with qcow2.  To my knowledge, the only place where qcow2 generates zero-writes in an I/O path (so they can occur concurrently with data writes) is handle_alloc_space().  Therefore, dropping it should remove the only place where the XFS bug can be triggered.

That the bug is in XFS and not in qemu is a good argument from a moral standpoint, but it isn’t very pragmatic or reasonable from a technical standpoint.  There is no fix for XFS yet, so right now the situation is that there is a bug that causes guest data loss, that qemu can prevent it, and that there is no other workaround or fix.  So we must work around it.

Of course, that doesn’t mean that we have to do everything in our power to implement a workaround no matter the cost.  We do indeed have to balance how far we’re willing to go to fix it against how much we impact non-XFS workloads.

Honestly, I caught myself saying “Well, that’s too bad for XFS over gluster if XFS is broken”.  But that just isn’t a reasonable thing to say.

I suppose what we could do is drop the is_xfs check in patch 3.  The only outcome is that you can no longer do concurrent data writes past zero-writes past the EOF.  And as I’ve claimed above, this only affects handle_alloc_space() from qcow2.

Would it be so bad if we simply did that on every filesystem?

Max
On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
> As for how we can address the issue, I see three ways:
> (1) The one presented in this series: On XFS with aio=native, we extend
>     tracked requests for post-EOF fallocate() calls (i.e., write-zero
>     operations) to reach until infinity (INT64_MAX in practice), mark
>     them serializing and wait for other conflicting requests.
> [...]

Your patch series is reasonable.  I don’t think it’s too bad.

The main question is how to detect the XFS fix once it ships.  XFS already has a ton of ioctls, so maybe they don’t mind adding a feature/quirk bitmap ioctl for publishing information about bug fixes to userspace.  I didn’t see another obvious way of doing it; maybe a mount option that the kernel automatically sets and that gets reported to userspace?

If we imagine that XFS will not provide a mechanism to detect the presence of the fix, then could we ask QEMU package maintainers to ./configure --disable-xfs-fallocate-beyond-eof-workaround at some point in the future, when their distro has been shipping a fixed kernel for a while?  It’s ugly because it doesn’t work if the user installs an older custom-built kernel on the host.  But at least it will cover 98% of users...

> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
> [...]
>     Disadvantages:
>     - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>       levels, whatever that means

My favorite, because it is clean and simple, but Vladimir has a valid use case for requiring this performance optimization, so reverting isn’t an option.

Stefan
On 27.10.19 13:35, Stefan Hajnoczi wrote:
> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:
>> [...]
>
> Your patch series is reasonable.  I don't think it's too bad.
>
> The main question is how to detect the XFS fix once it ships.  XFS
> already has a ton of ioctls, so maybe they don't mind adding a
> feature/quirk bit map ioctl for publishing information about bug fixes
> to userspace.  I didn't see another obvious way of doing it, maybe a
> mount option that the kernel automatically sets and that gets reported
> to userspace?

I’ll add a note to the RH BZ.

> If we imagine that XFS will not provide a mechanism to detect the
> presence of the fix, then could we ask QEMU package maintainers to
> ./configure --disable-xfs-fallocate-beyond-eof-workaround at some point
> in the future when their distro has been shipping a fixed kernel for a
> while?  It's ugly because it doesn't work if the user installs an older
> custom-built kernel on the host.  But at least it will cover 98% of
> users...

:-/

I don’t like it, but I suppose it would work.  We could also automatically enable this disabling option in configure when we detect uname to report a kernel version that must include the fix.  (This wouldn’t work for kernels with backported fixes, but those disappear over time...)

Max
On 28.10.19 10:24, Max Reitz wrote:
> On 27.10.19 13:35, Stefan Hajnoczi wrote:
>> [...]
>> If we imagine that XFS will not provide a mechanism to detect the
>> presence of the fix, then could we ask QEMU package maintainers to
>> ./configure --disable-xfs-fallocate-beyond-eof-workaround at some point
>> in the future when their distro has been shipping a fixed kernel for a
>> while?
>
> I don’t like it, but I suppose it would work.  We could also
> automatically enable this disabling option in configure when we detect
> uname to report a kernel version that must include the fix.
> (This wouldn’t work for kernels with backported fixes, but those
> disappear over time...)

I just realized that none of this is going to work for the gluster case brought up by Nir.  The affected kernel is the remote one, and we have no insight into that.  I don’t think we can do ioctls to XFS over gluster, can we?

I also realized that uname is a stupid idea, because the user may be running a different kernel version than what runs on the build machine...

:(

Max
On 28.10.19 10:30, Max Reitz wrote:
> On 28.10.19 10:24, Max Reitz wrote:
>> On 27.10.19 13:35, Stefan Hajnoczi wrote:

[...]

> I just realized that none of this is going to work for the gluster case
> brought up by Nir. The affected kernel is the remote one, and we have
> no insight into that. I don’t think we can do ioctls to XFS over
> gluster, can we?

On third thought, we could try to detect whether the file is on a remote
filesystem, and if so enable the workaround unconditionally. I suppose
it wouldn’t hurt performance-wise, given that it’s a remote filesystem
anyway.

Max
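For what it’s worth, such a remote-filesystem check could be built on fstatfs() and the f_type magic numbers from linux/magic.h. This is only a sketch: which filesystems to count as “remote” is an assumption (gluster FUSE mounts show up as FUSE_SUPER_MAGIC, but native gluster:// access wouldn’t go through a local fd at all), and the function name is made up:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/vfs.h>   /* fstatfs() */
#include <unistd.h>

/* Filesystem magic numbers, as listed in linux/magic.h. */
#define NFS_SUPER_MAGIC   0x6969
#define SMB_SUPER_MAGIC   0x517B
#define CIFS_MAGIC_NUMBER 0xFF534D42
#define FUSE_SUPER_MAGIC  0x65735546  /* gluster/NFS-ganesha FUSE mounts */

/*
 * Sketch: return true if the file referred to by fd lives on a known
 * network filesystem.  On error we conservatively report "remote" so
 * that the workaround would stay enabled.
 */
static bool raw_is_on_remote_fs(int fd)
{
    struct statfs stfs;

    if (fstatfs(fd, &stfs) < 0) {
        return true;  /* unknown -> play it safe */
    }

    switch ((unsigned int)stfs.f_type) {
    case NFS_SUPER_MAGIC:
    case SMB_SUPER_MAGIC:
    case CIFS_MAGIC_NUMBER:
    case FUSE_SUPER_MAGIC:
        return true;
    default:
        return false;
    }
}
```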
28.10.2019 12:56, Max Reitz wrote:
> On 28.10.19 10:30, Max Reitz wrote:
>> On 28.10.19 10:24, Max Reitz wrote:

[...]

>> I just realized that none of this is going to work for the gluster case
>> brought up by Nir. The affected kernel is the remote one, and we have
>> no insight into that. I don’t think we can do ioctls to XFS over
>> gluster, can we?
>
> On third thought, we could try to detect whether the file is on a remote
> filesystem, and if so enable the workaround unconditionally. I suppose
> it wouldn’t hurt performance-wise, given that it’s a remote filesystem
> anyway.

I think that for remote filesystems the difference may be even larger
than for local ones, as the cost of writing real zeroes over the wire,
compared to a fast zero command, is high.

Really, can we live with a simple config option? Is it so bad?

--
Best regards,
Vladimir
On 28.10.19 11:07, Vladimir Sementsov-Ogievskiy wrote:
> 28.10.2019 12:56, Max Reitz wrote:

[...]

>> On third thought, we could try to detect whether the file is on a remote
>> filesystem, and if so enable the workaround unconditionally. I suppose
>> it wouldn’t hurt performance-wise, given that it’s a remote filesystem
>> anyway.
>
> I think that for remote filesystems the difference may be even larger
> than for local ones, as the cost of writing real zeroes over the wire,
> compared to a fast zero command, is high.

I was speaking of a workaround in general, and that includes the
workaround presented in this series.

> Really, can we live with a simple config option? Is it so bad?

The config option won’t work for remote hosts, though. That’s exactly
the problem.

Max
28.10.2019 13:10, Max Reitz wrote:
> On 28.10.19 11:07, Vladimir Sementsov-Ogievskiy wrote:

[...]

>> I think that for remote filesystems the difference may be even larger
>> than for local ones, as the cost of writing real zeroes over the wire,
>> compared to a fast zero command, is high.
>
> I was speaking of a workaround in general, and that includes the
> workaround presented in this series.

Ah, yes, sorry. Then it should be OK.

>> Really, can we live with a simple config option? Is it so bad?
>
> The config option won’t work for remote hosts, though. That’s exactly
> the problem.

Still, a config option will help people who have XFS setups and want to
be absolutely safe. (Or, with another default, help people who don't
have XFS setups (Virtuozzo) and want to be fast.)

--
Best regards,
Vladimir
Am 27.10.2019 um 13:35 hat Stefan Hajnoczi geschrieben:
> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:

[...]

> Your patch series is reasonable. I don't think it's too bad.
>
> The main question is how to detect the XFS fix once it ships. XFS
> already has a ton of ioctls, so maybe they don't mind adding a
> feature/quirk bit map ioctl for publishing information about bug fixes
> to userspace. I didn't see another obvious way of doing it, maybe a
> mount option that the kernel automatically sets and that gets reported
> to userspace?

I think the CC list is too short for this question. We should involve
the XFS people here.

> > (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
> >     To my knowledge I’m the only one who has provided any benchmarks
> >     for this commit, and even then I was a bit skeptical because it
> >     performs well in some cases and bad in others. I concluded that
> >     it’s probably worth it because the “some cases” are more likely
> >     to occur.
> >
> >     Now we have this problem of corruption here (granted due to a bug
> >     in the XFS driver), and another report of massively degraded
> >     performance on ppc64
> >     (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
> >     private BZ; I hate that :-/  The report is about 40 % worse
> >     performance for an in-guest fio write benchmark.)
> >
> >     So I have to ask the question about what the justification for
> >     keeping c8bb23cbdbe32f is. How much does performance increase
> >     with it actually? (On non-(ppc64+XFS) machines, obviously)
> >
> >     Advantages:
> >     + Trivial
> >     + No layering violations
> >     + We wouldn’t need to keep track of whether the kernel bug has
> >       been fixed or not
> >     + Fixes the ppc64+XFS performance problem
> >
> >     Disadvantages:
> >     - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
> >       levels, whatever that means
>
> My favorite because it is clean and simple, but Vladimir has a valid
> use-case for requiring this performance optimization so reverting isn't
> an option.

Vladimir also said that qcow2 subclusters would probably also solve his
problem, so maybe reverting and applying the subcluster patches instead
is a possible solution, too?

We already have some cases where the existing handle_alloc_space()
causes performance to actually become worse, and serialising requests as
a workaround isn't going to make performance any better. So even on
these grounds, keeping commit c8bb23cbdbe32f is questionable.

Kevin
28.10.2019 14:04, Kevin Wolf wrote:
> Am 27.10.2019 um 13:35 hat Stefan Hajnoczi geschrieben:

[...]

>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.

[...]

>> My favorite because it is clean and simple, but Vladimir has a valid
>> use-case for requiring this performance optimization so reverting isn't
>> an option.
>
> Vladimir also said that qcow2 subclusters would probably also solve his
> problem, so maybe reverting and applying the subcluster patches instead
> is a possible solution, too?

I'm not sure about the SSD case; it may need the write-zero optimization
anyway.

> We already have some cases where the existing handle_alloc_space()
> causes performance to actually become worse, and serialising requests as
> a workaround isn't going to make performance any better. So even on
> these grounds, keeping commit c8bb23cbdbe32f is questionable.

Can keeping handle_alloc_space() under some config option be an option?

--
Best regards,
Vladimir
On 28.10.19 12:25, Vladimir Sementsov-Ogievskiy wrote:
> 28.10.2019 14:04, Kevin Wolf wrote:
>> Am 27.10.2019 um 13:35 hat Stefan Hajnoczi geschrieben:
>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote:

[...]

>> Vladimir also said that qcow2 subclusters would probably also solve his
>> problem, so maybe reverting and applying the subcluster patches instead
>> is a possible solution, too?
>
> I'm not sure about the SSD case; it may need the write-zero optimization
> anyway.

What exactly do you need? Do you actually need to write zeroes (e.g.
because you’re storing images on block devices), or would it be
sufficient to just drop the COW areas when bdrv_has_zero_init() and
bdrv_has_zero_init_truncate() are true?

I’m asking because Dave Chinner said
(https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) that
fallocate() is always slow at least with aio=native, because it needs to
wait for all concurrent AIO writes to finish, and so it causes the AIO
pipeline to stall.

(He suggested using XFS extent size hints to get the same effect as
write-zeroes for free, basically, but I don’t know whether that’s really
useful to us, as unallocated areas on XFS read back as zero anyway.)

> Can keeping handle_alloc_space() under some config option be an option?

Hm. A config option is weird if you’re the only one who’s going to
enable it. But other than that I don’t have anything against it.

Max
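For reference, the extent size hints Dave mentioned are set through the FS_IOC_FSSETXATTR ioctl. Whether they actually help QEMU here is an open question; the following only sketches the mechanism, and the function name and the idea of applying it to an image file are assumptions (the hint should be set while the file is still empty):

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/fs.h>   /* struct fsxattr, FS_IOC_FS[GS]ETXATTR, FS_XFLAG_EXTSIZE */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Sketch: ask the filesystem to allocate extents of at least `extsize`
 * bytes for this file.  On XFS, a small write then allocates a whole
 * aligned extent, and the surrounding range reads back as zero without
 * an explicit write-zeroes operation.  Returns 0 on success, -1 on
 * failure (e.g. ENOTTY/EOPNOTSUPP on filesystems without extent size
 * hints -- the caller would just fall back to doing nothing).
 */
static int set_extent_size_hint(int fd, unsigned int extsize)
{
    struct fsxattr attr;

    if (ioctl(fd, FS_IOC_FSGETXATTR, &attr) < 0) {
        return -1;
    }

    attr.fsx_xflags |= FS_XFLAG_EXTSIZE;
    attr.fsx_extsize = extsize;

    if (ioctl(fd, FS_IOC_FSSETXATTR, &attr) < 0) {
        return -1;
    }
    return 0;
}
```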
29.10.2019 11:50, Max Reitz wrote:
> On 28.10.19 12:25, Vladimir Sementsov-Ogievskiy wrote:

[...]

>> I'm not sure about the SSD case; it may need the write-zero optimization
>> anyway.
>
> What exactly do you need? Do you actually need to write zeroes (e.g.
> because you’re storing images on block devices), or would it be
> sufficient to just drop the COW areas when bdrv_has_zero_init() and
> bdrv_has_zero_init_truncate() are true?

Hmm, what do you mean? We need to zero the COW areas. The original way
is to write real zeroes; the optimized way is fallocate. What do you
mean by "drop"? Mark subclusters as zeroed in the metadata? But we'll
still have COW areas within a subcluster, and we'll need to zero them
directly. And fallocate will most probably be faster in the SSD-on-ext4
case.

> I’m asking because Dave Chinner said
> (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) that
> fallocate() is always slow at least with aio=native, because it needs to
> wait for all concurrent AIO writes to finish, and so it causes the AIO
> pipeline to stall.
>
> (He suggested using XFS extent size hints to get the same effect as
> write-zeroes for free, basically, but I don’t know whether that’s really
> useful to us, as unallocated areas on XFS read back as zero anyway.)

[...]

>> Can keeping handle_alloc_space() under some config option be an option?
>
> Hm. A config option is weird if you’re the only one who’s going to
> enable it. But other than that I don’t have anything against it.

It's just a bit easier for us to maintain a config option than an
out-of-tree patch. On the other hand, it's not a real problem to
maintain this one patch separately. It may return to the tree once the
XFS bug is fixed.

--
Best regards,
Vladimir
On 29.10.19 12:48, Vladimir Sementsov-Ogievskiy wrote:
> 29.10.2019 11:50, Max Reitz wrote:

[...]

>> What exactly do you need? Do you actually need to write zeroes (e.g.
>> because you’re storing images on block devices), or would it be
>> sufficient to just drop the COW areas when bdrv_has_zero_init() and
>> bdrv_has_zero_init_truncate() are true?
>
> Hmm, what do you mean? We need to zero the COW areas. The original way
> is to write real zeroes; the optimized way is fallocate. What do you
> mean by "drop"? Mark subclusters as zeroed in the metadata?

Why do you need to zero the COW areas? For normal files, any data will
read back as zero if you didn’t write anything there.

[...]

> It's just a bit easier for us to maintain a config option than an
> out-of-tree patch. On the other hand, it's not a real problem to
> maintain this one patch separately. It may return to the tree once the
> XFS bug is fixed.

We’ll still have the problem that fallocate() must stall aio=native
requests.

Max
29.10.2019 14:55, Max Reitz wrote: > On 29.10.19 12:48, Vladimir Sementsov-Ogievskiy wrote: >> 29.10.2019 11:50, Max Reitz wrote: >>> On 28.10.19 12:25, Vladimir Sementsov-Ogievskiy wrote: >>>> 28.10.2019 14:04, Kevin Wolf wrote: >>>>> Am 27.10.2019 um 13:35 hat Stefan Hajnoczi geschrieben: >>>>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote: >>> >>> [...] >>> >>>>>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f. >>>>>>> To my knowledge I’m the only one who has provided any benchmarks for >>>>>>> this commit, and even then I was a bit skeptical because it performs >>>>>>> well in some cases and bad in others. I concluded that it’s >>>>>>> probably worth it because the “some cases” are more likely to occur. >>>>>>> >>>>>>> Now we have this problem of corruption here (granted due to a bug in >>>>>>> the XFS driver), and another report of massively degraded >>>>>>> performance on ppc64 >>>>>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a >>>>>>> private BZ; I hate that :-/ The report is about 40 % worse >>>>>>> performance for an in-guest fio write benchmark.) >>>>>>> >>>>>>> So I have to ask the question about what the justification for >>>>>>> keeping c8bb23cbdbe32f is. How much does performance increase with >>>>>>> it actually? (On non-(ppc64+XFS) machines, obviously) >>>>>>> >>>>>>> Advantages: >>>>>>> + Trivial >>>>>>> + No layering violations >>>>>>> + We wouldn’t need to keep track of whether the kernel bug has been >>>>>>> fixed or not >>>>>>> + Fixes the ppc64+XFS performance problem >>>>>>> >>>>>>> Disadvantages: >>>>>>> - Reverts cluster allocation performance to pre-c8bb23cbdbe32f >>>>>>> levels, whatever that means >>>>>> >>>>>> My favorite because it is clean and simple, but Vladimir has a valid >>>>>> use-case for requiring this performance optimization so reverting isn't >>>>>> an option. 
>>>>> >>>>> Vladimir also said that qcow2 subclusters would probably also solve his >>>>> problem, so maybe reverting and applying the subcluster patches instead >>>>> is a possible solution, too? >>>> >>>> I'm not sure about ssd case, it may need write-zero optimization anyway. >>> >>> What exactly do you need? Do you actually need to write zeroes (e.g. >>> because you’re storing images on block devices) or would it be >>> sufficient to just drop the COW areas when bdrv_has_zero_init() and >>> bdrv_has_zero_init_truncate() are true? >> >> Hmm, what do you mean? We need to zero COW areas. So, original way is to >> write real zeroes, optimized way is fallocate.. What do you mean by drop? >> Mark sublusters as zeroes by metadata? > > Why do you need to zero COW areas? For normal files, any data will read > as zero if you didn’t write anything there. Hmm, but when allocating new cluster in qcow2, it's not guaranteed to be zero, as it may be reused previously allocated cluster.. > >> But still we'll have COW areas in subcluster, and we'll need to directly zero >> them.. And fallocate will most probably be faster on ssd ext4 case.. >> >>> >>> I’m asking because Dave Chinner said >>> (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) that >>> fallocate() is always slow at least with aio=native because it needs to >>> wait for all concurrent AIO writes to finish, and so it causes the AIO >>> pipeline to stall. >>> >>> (He suggested using XFS extent size hints to get the same effect as >>> write-zeroes for free, basically, but I don’t know whether that’s really >>> useful to us; as unallocated areas on XFS read back as zero anyway.) >>> >>>>> We already have some cases where the existing handle_alloc_space() >>>>> causes performance to actually become worse, and serialising requests as >>>>> a workaround isn't going to make performance any better. So even on >>>>> these grounds, keeping commit c8bb23cbdbe32f is questionable. 
>>>>> >>>> Can keeping handle_alloc_space under some config option be an option? >>> Hm. A config option is weird if you’re the only one who’s going to >>> enable it. But other than that I don’t have anything against it. >>> >> >> It's just a bit easier for us to maintain config option, than out-of-tree patch. >> On the other hand, it's not a real problem to maintain this one patch in separate. >> It may return again to the tree, when XFS bug fixed. > > We’ll still have the problem that fallocate() must stall aio=native > requests. >

Does this mean that fallocate() is bad in general? Practice shows the
opposite; at least I have my examples with qemu-img bench. Can that
stall be shown with qemu-img bench or something?

-- 
Best regards,
Vladimir
On 29.10.19 13:05, Vladimir Sementsov-Ogievskiy wrote: > 29.10.2019 14:55, Max Reitz wrote: >> On 29.10.19 12:48, Vladimir Sementsov-Ogievskiy wrote: >>> 29.10.2019 11:50, Max Reitz wrote: >>>> On 28.10.19 12:25, Vladimir Sementsov-Ogievskiy wrote: >>>>> 28.10.2019 14:04, Kevin Wolf wrote: >>>>>> Am 27.10.2019 um 13:35 hat Stefan Hajnoczi geschrieben: >>>>>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote: >>>> >>>> [...] >>>> >>>>>>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f. >>>>>>>> To my knowledge I’m the only one who has provided any benchmarks for >>>>>>>> this commit, and even then I was a bit skeptical because it performs >>>>>>>> well in some cases and bad in others. I concluded that it’s >>>>>>>> probably worth it because the “some cases” are more likely to occur. >>>>>>>> >>>>>>>> Now we have this problem of corruption here (granted due to a bug in >>>>>>>> the XFS driver), and another report of massively degraded >>>>>>>> performance on ppc64 >>>>>>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a >>>>>>>> private BZ; I hate that :-/ The report is about 40 % worse >>>>>>>> performance for an in-guest fio write benchmark.) >>>>>>>> >>>>>>>> So I have to ask the question about what the justification for >>>>>>>> keeping c8bb23cbdbe32f is. How much does performance increase with >>>>>>>> it actually? (On non-(ppc64+XFS) machines, obviously) >>>>>>>> >>>>>>>> Advantages: >>>>>>>> + Trivial >>>>>>>> + No layering violations >>>>>>>> + We wouldn’t need to keep track of whether the kernel bug has been >>>>>>>> fixed or not >>>>>>>> + Fixes the ppc64+XFS performance problem >>>>>>>> >>>>>>>> Disadvantages: >>>>>>>> - Reverts cluster allocation performance to pre-c8bb23cbdbe32f >>>>>>>> levels, whatever that means >>>>>>> >>>>>>> My favorite because it is clean and simple, but Vladimir has a valid >>>>>>> use-case for requiring this performance optimization so reverting isn't >>>>>>> an option. 
>>>>>> >>>>>> Vladimir also said that qcow2 subclusters would probably also solve his >>>>>> problem, so maybe reverting and applying the subcluster patches instead >>>>>> is a possible solution, too? >>>>> >>>>> I'm not sure about ssd case, it may need write-zero optimization anyway. >>>> >>>> What exactly do you need? Do you actually need to write zeroes (e.g. >>>> because you’re storing images on block devices) or would it be >>>> sufficient to just drop the COW areas when bdrv_has_zero_init() and >>>> bdrv_has_zero_init_truncate() are true? >>> >>> Hmm, what do you mean? We need to zero COW areas. So, original way is to >>> write real zeroes, optimized way is fallocate.. What do you mean by drop? >>> Mark sublusters as zeroes by metadata? >> >> Why do you need to zero COW areas? For normal files, any data will read >> as zero if you didn’t write anything there. > > Hmm, but when allocating new cluster in qcow2, it's not guaranteed to be zero, > as it may be reused previously allocated cluster.. Hm, right. We could special-case something for beyond the EOF, but I don’t know whether that really makes it better. OTOH, maybe allocations at the EOF are the real bottleneck. Reusing existing clusters should be rare enough that maybe the existing code which explicitly writes zeroes is sufficient. >> >>> But still we'll have COW areas in subcluster, and we'll need to directly zero >>> them.. And fallocate will most probably be faster on ssd ext4 case.. >>> >>>> >>>> I’m asking because Dave Chinner said >>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) that >>>> fallocate() is always slow at least with aio=native because it needs to >>>> wait for all concurrent AIO writes to finish, and so it causes the AIO >>>> pipeline to stall. >>>> >>>> (He suggested using XFS extent size hints to get the same effect as >>>> write-zeroes for free, basically, but I don’t know whether that’s really >>>> useful to us; as unallocated areas on XFS read back as zero anyway.) 
>>>> >>>>>> We already have some cases where the existing handle_alloc_space() >>>>>> causes performance to actually become worse, and serialising requests as >>>>>> a workaround isn't going to make performance any better. So even on >>>>>> these grounds, keeping commit c8bb23cbdbe32f is questionable. >>>>>> >>>>> >>>>> Can keeping handle_alloc_space under some config option be an option? >>>> >>>> Hm. A config option is weird if you’re the only one who’s going to >>>> enable it. But other than that I don’t have anything against it. >>>> >>> >>> It's just a bit easier for us to maintain config option, than out-of-tree patch. >>> On the other hand, it's not a real problem to maintain this one patch in separate. >>> It may return again to the tree, when XFS bug fixed. >> >> We’ll still have the problem that fallocate() must stall aio=native >> requests. >> > > Does it mean that fallocate is bad in general? Practice shows the opposite.. > At least I have my examples with qemu-img bench. Can that thing be shown with > qemu-img bench or something?

I haven’t done benchmarks yet, but I don’t think Laurent will mind if I
paste his fio benchmark results from the unfortunately private BZ here:

QCOW2 ON          |         EXT4          |          XFS          |
                  |                       |                       |
(rw, bs, iodepth) |  2.12.0   |   4.1.0   |  2.12.0   |   4.1.0   |
------------------+-----------+-----------+-----------+-----------+
(write, 16k, 1)   | 1868KiB/s | 1865KiB/s | 18.6MiB/s | 13.7MiB/s |
------------------+-----------+-----------+-----------+-----------+
(write, 64k, 1)   | 7036KiB/s | 7413KiB/s | 27.0MiB/s | 7465KiB/s |
------------------+-----------+-----------+-----------+-----------+
(write, 4k, 8)    |  535KiB/s |  524KiB/s | 13.9MiB/s | 1662KiB/s |
------------------+-----------+-----------+-----------+-----------+

And that just looks like ext4 performs worse with aio=native in general.
But I’ll have to do my own benchmarks first.

Max
29.10.2019 15:11, Max Reitz wrote: > On 29.10.19 13:05, Vladimir Sementsov-Ogievskiy wrote: >> 29.10.2019 14:55, Max Reitz wrote: >>> On 29.10.19 12:48, Vladimir Sementsov-Ogievskiy wrote: >>>> 29.10.2019 11:50, Max Reitz wrote: >>>>> On 28.10.19 12:25, Vladimir Sementsov-Ogievskiy wrote: >>>>>> 28.10.2019 14:04, Kevin Wolf wrote: >>>>>>> Am 27.10.2019 um 13:35 hat Stefan Hajnoczi geschrieben: >>>>>>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote: >>>>> >>>>> [...] >>>>> >>>>>>>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f. >>>>>>>>> To my knowledge I’m the only one who has provided any benchmarks for >>>>>>>>> this commit, and even then I was a bit skeptical because it performs >>>>>>>>> well in some cases and bad in others. I concluded that it’s >>>>>>>>> probably worth it because the “some cases” are more likely to occur. >>>>>>>>> >>>>>>>>> Now we have this problem of corruption here (granted due to a bug in >>>>>>>>> the XFS driver), and another report of massively degraded >>>>>>>>> performance on ppc64 >>>>>>>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a >>>>>>>>> private BZ; I hate that :-/ The report is about 40 % worse >>>>>>>>> performance for an in-guest fio write benchmark.) >>>>>>>>> >>>>>>>>> So I have to ask the question about what the justification for >>>>>>>>> keeping c8bb23cbdbe32f is. How much does performance increase with >>>>>>>>> it actually? 
(On non-(ppc64+XFS) machines, obviously) >>>>>>>>> >>>>>>>>> Advantages: >>>>>>>>> + Trivial >>>>>>>>> + No layering violations >>>>>>>>> + We wouldn’t need to keep track of whether the kernel bug has been >>>>>>>>> fixed or not >>>>>>>>> + Fixes the ppc64+XFS performance problem >>>>>>>>> >>>>>>>>> Disadvantages: >>>>>>>>> - Reverts cluster allocation performance to pre-c8bb23cbdbe32f >>>>>>>>> levels, whatever that means >>>>>>>> >>>>>>>> My favorite because it is clean and simple, but Vladimir has a valid >>>>>>>> use-case for requiring this performance optimization so reverting isn't >>>>>>>> an option. >>>>>>> >>>>>>> Vladimir also said that qcow2 subclusters would probably also solve his >>>>>>> problem, so maybe reverting and applying the subcluster patches instead >>>>>>> is a possible solution, too? >>>>>> >>>>>> I'm not sure about ssd case, it may need write-zero optimization anyway. >>>>> >>>>> What exactly do you need? Do you actually need to write zeroes (e.g. >>>>> because you’re storing images on block devices) or would it be >>>>> sufficient to just drop the COW areas when bdrv_has_zero_init() and >>>>> bdrv_has_zero_init_truncate() are true? >>>> >>>> Hmm, what do you mean? We need to zero COW areas. So, original way is to >>>> write real zeroes, optimized way is fallocate.. What do you mean by drop? >>>> Mark sublusters as zeroes by metadata? >>> >>> Why do you need to zero COW areas? For normal files, any data will read >>> as zero if you didn’t write anything there. >> >> Hmm, but when allocating new cluster in qcow2, it's not guaranteed to be zero, >> as it may be reused previously allocated cluster.. > > Hm, right. We could special-case something for beyond the EOF, but I > don’t know whether that really makes it better. > > OTOH, maybe allocations at the EOF are the real bottleneck. Reusing > existing clusters should be rare enough that maybe the existing code > which explicitly writes zeroes is sufficient. 
But, as I understand it, pre-EOF fallocates are safe on XFS? So we may just drop the fallocate call past EOF (that area is zero anyway) and do fallocate pre-EOF (where it's safe)? (The only special case is a request intersecting EOF, so it would be better to keep the file length at a cluster-size boundary, truncating on exit.) > >>> >>>> But still we'll have COW areas in subcluster, and we'll need to directly zero >>>> them.. And fallocate will most probably be faster on ssd ext4 case.. >>>> >>>>> >>>>> I’m asking because Dave Chinner said >>>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) that >>>>> fallocate() is always slow at least with aio=native because it needs to >>>>> wait for all concurrent AIO writes to finish, and so it causes the AIO >>>>> pipeline to stall. >>>>> >>>>> (He suggested using XFS extent size hints to get the same effect as >>>>> write-zeroes for free, basically, but I don’t know whether that’s really >>>>> useful to us; as unallocated areas on XFS read back as zero anyway.) >>>>> >>>>>>> We already have some cases where the existing handle_alloc_space() >>>>>>> causes performance to actually become worse, and serialising requests as >>>>>>> a workaround isn't going to make performance any better. So even on >>>>>>> these grounds, keeping commit c8bb23cbdbe32f is questionable. >>>>>>> >>>>>> >>>>>> Can keeping handle_alloc_space under some config option be an option? >>>>> >>>>> Hm. A config option is weird if you’re the only one who’s going to >>>>> enable it. But other than that I don’t have anything against it. >>>>> >>>> >>>> It's just a bit easier for us to maintain config option, than out-of-tree patch. >>>> On the other hand, it's not a real problem to maintain this one patch in separate. >>>> It may return again to the tree, when XFS bug fixed. >>> >>> We’ll still have the problem that fallocate() must stall aio=native >>> requests. >>> >> >> Does it mean that fallocate is bad in general? Practice shows the opposite..
>> At least I have my examples with qemu-img bench. Can that thing be shown with
>> qemu-img bench or something?
> 
> I haven’t done benchmarks yet, but I don’t think Laurent will mind if I
> paste his fio benchmark results from the unfortunately private BZ here:
> 
> QCOW2 ON          |         EXT4          |          XFS          |
>                   |                       |                       |
> (rw, bs, iodepth) |  2.12.0   |   4.1.0   |  2.12.0   |   4.1.0   |
> ------------------+-----------+-----------+-----------+-----------+
> (write, 16k, 1)   | 1868KiB/s | 1865KiB/s | 18.6MiB/s | 13.7MiB/s |
> ------------------+-----------+-----------+-----------+-----------+
> (write, 64k, 1)   | 7036KiB/s | 7413KiB/s | 27.0MiB/s | 7465KiB/s |
> ------------------+-----------+-----------+-----------+-----------+
> (write, 4k, 8)    |  535KiB/s |  524KiB/s | 13.9MiB/s | 1662KiB/s |
> ------------------+-----------+-----------+-----------+-----------+
> 
> And that just looks like ext4 performs worse with aio=native in general.
> But I’ll have to do my own benchmarks first.
> 
> Max

-- 
Best regards,
Vladimir
On 29.10.19 13:19, Vladimir Sementsov-Ogievskiy wrote: > 29.10.2019 15:11, Max Reitz wrote: >> On 29.10.19 13:05, Vladimir Sementsov-Ogievskiy wrote: >>> 29.10.2019 14:55, Max Reitz wrote: >>>> On 29.10.19 12:48, Vladimir Sementsov-Ogievskiy wrote: >>>>> 29.10.2019 11:50, Max Reitz wrote: >>>>>> On 28.10.19 12:25, Vladimir Sementsov-Ogievskiy wrote: >>>>>>> 28.10.2019 14:04, Kevin Wolf wrote: >>>>>>>> Am 27.10.2019 um 13:35 hat Stefan Hajnoczi geschrieben: >>>>>>>>> On Fri, Oct 25, 2019 at 11:58:46AM +0200, Max Reitz wrote: >>>>>> >>>>>> [...] >>>>>> >>>>>>>>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f. >>>>>>>>>> To my knowledge I’m the only one who has provided any benchmarks for >>>>>>>>>> this commit, and even then I was a bit skeptical because it performs >>>>>>>>>> well in some cases and bad in others. I concluded that it’s >>>>>>>>>> probably worth it because the “some cases” are more likely to occur. >>>>>>>>>> >>>>>>>>>> Now we have this problem of corruption here (granted due to a bug in >>>>>>>>>> the XFS driver), and another report of massively degraded >>>>>>>>>> performance on ppc64 >>>>>>>>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a >>>>>>>>>> private BZ; I hate that :-/ The report is about 40 % worse >>>>>>>>>> performance for an in-guest fio write benchmark.) >>>>>>>>>> >>>>>>>>>> So I have to ask the question about what the justification for >>>>>>>>>> keeping c8bb23cbdbe32f is. How much does performance increase with >>>>>>>>>> it actually? 
(On non-(ppc64+XFS) machines, obviously) >>>>>>>>>> >>>>>>>>>> Advantages: >>>>>>>>>> + Trivial >>>>>>>>>> + No layering violations >>>>>>>>>> + We wouldn’t need to keep track of whether the kernel bug has been >>>>>>>>>> fixed or not >>>>>>>>>> + Fixes the ppc64+XFS performance problem >>>>>>>>>> >>>>>>>>>> Disadvantages: >>>>>>>>>> - Reverts cluster allocation performance to pre-c8bb23cbdbe32f >>>>>>>>>> levels, whatever that means >>>>>>>>> >>>>>>>>> My favorite because it is clean and simple, but Vladimir has a valid >>>>>>>>> use-case for requiring this performance optimization so reverting isn't >>>>>>>>> an option. >>>>>>>> >>>>>>>> Vladimir also said that qcow2 subclusters would probably also solve his >>>>>>>> problem, so maybe reverting and applying the subcluster patches instead >>>>>>>> is a possible solution, too? >>>>>>> >>>>>>> I'm not sure about ssd case, it may need write-zero optimization anyway. >>>>>> >>>>>> What exactly do you need? Do you actually need to write zeroes (e.g. >>>>>> because you’re storing images on block devices) or would it be >>>>>> sufficient to just drop the COW areas when bdrv_has_zero_init() and >>>>>> bdrv_has_zero_init_truncate() are true? >>>>> >>>>> Hmm, what do you mean? We need to zero COW areas. So, original way is to >>>>> write real zeroes, optimized way is fallocate.. What do you mean by drop? >>>>> Mark sublusters as zeroes by metadata? >>>> >>>> Why do you need to zero COW areas? For normal files, any data will read >>>> as zero if you didn’t write anything there. >>> >>> Hmm, but when allocating new cluster in qcow2, it's not guaranteed to be zero, >>> as it may be reused previously allocated cluster.. >> >> Hm, right. We could special-case something for beyond the EOF, but I >> don’t know whether that really makes it better. >> >> OTOH, maybe allocations at the EOF are the real bottleneck. 
Reusing >> existing clusters should be rare enough that maybe the existing code >> which explicitly writes zeroes is sufficient. > > But, as I understand pre-EOF fallocates are safe in xfs? So, we may just drop calling > fallocate past-EOF (it's zero anyway) and do fallocate pre-EOF (it's safe) ?

But probably still slow. And there’s the question of how much complexity
we want to heap onto this.

Max
25.10.2019 12:58, Max Reitz wrote: > Hi, > > It seems to me that there is a bug in Linux’s XFS kernel driver, as > I’ve explained here: > > https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html > > In combination with our commit c8bb23cbdbe32f, this may lead to guest > data corruption when using qcow2 images on XFS with aio=native. > > We can’t wait until the XFS kernel driver is fixed, we should work > around the problem ourselves. > > This is an RFC for two reasons: > (1) I don’t know whether this is the right way to address the issue, > (2) Ideally, we should detect whether the XFS kernel driver is fixed and > if so stop applying the workaround. > I don’t know how we would go about this, so this series doesn’t do > it. (Hence it’s an RFC.) > (3) Perhaps it’s a bit of a layering violation to let the file-posix > driver access and modify a BdrvTrackedRequest object. > > As for how we can address the issue, I see three ways: > (1) The one presented in this series: On XFS with aio=native, we extend > tracked requests for post-EOF fallocate() calls (i.e., write-zero > operations) to reach until infinity (INT64_MAX in practice), mark > them serializing and wait for other conflicting requests. > > Advantages: > + Limits the impact to very specific cases > (And that means it wouldn’t hurt too much to keep this workaround > even when the XFS driver has been fixed) > + Works around the bug where it happens, namely in file-posix > > Disadvantages: > - A bit complex > - A bit of a layering violation (should file-posix have access to > tracked requests?) > > (2) Always skip qcow2’s handle_alloc_space() on XFS. The XFS bug only > becomes visible due to that function: I don’t think qcow2 writes > zeroes in any other I/O path, and raw images are fixed in size so > post-EOF writes won’t happen. 
> > Advantages: > + Maybe simpler, depending on how difficult it is to handle the > layering violation > + Also fixes the performance problem of handle_alloc_space() being > slow on ppc64+XFS. > > Disadvantages: > - Huge layering violation because qcow2 would need to know whether > the image is stored on XFS or not. > - We’d definitely want to skip this workaround when the XFS driver > has been fixed, so we need some method to find out whether it has > > (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f. > To my knowledge I’m the only one who has provided any benchmarks for > this commit, and even then I was a bit skeptical because it performs > well in some cases and bad in others. I concluded that it’s > probably worth it because the “some cases” are more likely to occur. > > Now we have this problem of corruption here (granted due to a bug in > the XFS driver), and another report of massively degraded > performance on ppc64 > (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a > private BZ; I hate that :-/ The report is about 40 % worse > performance for an in-guest fio write benchmark.) > > So I have to ask the question about what the justification for > keeping c8bb23cbdbe32f is. How much does performance increase with > it actually? (On non-(ppc64+XFS) machines, obviously) > > Advantages: > + Trivial > + No layering violations > + We wouldn’t need to keep track of whether the kernel bug has been > fixed or not > + Fixes the ppc64+XFS performance problem > > Disadvantages: > - Reverts cluster allocation performance to pre-c8bb23cbdbe32f > levels, whatever that means > > So this is the main reason this is an RFC: What should we do? Is (1) > really the best choice? > > > In any case, I’ve ran the test case I showed in > https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01282.html > more than ten times with this series applied and the installation > succeeded every time. (Without this series, it fails like every other > time.) 
> >

Hi!

First, great thanks for your investigation!

We need the c8bb23cbdbe3 patch, because we use 1M clusters, and zeroing
1M takes significant time.

I've tested a bit:

test:
for img in /ssd/test.img /test.img; do for cl in 64K 1M; do for step in 4K 64K 1M; do ./qemu-img create -f qcow2 -o cluster_size=$cl $img 15G > /dev/null; printf '%-15s%-7s%-10s : ' $img cl=$cl step=$step; ./qemu-img bench -c $((15 * 1024)) -n -s 4K -S $step -t none -w $img | tail -1 | awk '{print $4}'; done; done; done

on master:

/ssd/test.img  cl=64K  step=4K   : 0.291
/ssd/test.img  cl=64K  step=64K  : 0.813
/ssd/test.img  cl=64K  step=1M   : 2.799
/ssd/test.img  cl=1M   step=4K   : 0.217
/ssd/test.img  cl=1M   step=64K  : 0.332
/ssd/test.img  cl=1M   step=1M   : 0.685
/test.img      cl=64K  step=4K   : 1.751
/test.img      cl=64K  step=64K  : 14.811
/test.img      cl=64K  step=1M   : 18.321
/test.img      cl=1M   step=4K   : 0.759
/test.img      cl=1M   step=64K  : 13.574
/test.img      cl=1M   step=1M   : 28.970

rerun on master:

/ssd/test.img  cl=64K  step=4K   : 0.295
/ssd/test.img  cl=64K  step=64K  : 0.803
/ssd/test.img  cl=64K  step=1M   : 2.921
/ssd/test.img  cl=1M   step=4K   : 0.233
/ssd/test.img  cl=1M   step=64K  : 0.321
/ssd/test.img  cl=1M   step=1M   : 0.762
/test.img      cl=64K  step=4K   : 1.873
/test.img      cl=64K  step=64K  : 15.621
/test.img      cl=64K  step=1M   : 18.428
/test.img      cl=1M   step=4K   : 0.883
/test.img      cl=1M   step=64K  : 13.484
/test.img      cl=1M   step=1M   : 26.244

on master + revert c8bb23cbdbe32f5c326:

/ssd/test.img  cl=64K  step=4K   : 0.395
/ssd/test.img  cl=64K  step=64K  : 4.231
/ssd/test.img  cl=64K  step=1M   : 5.598
/ssd/test.img  cl=1M   step=4K   : 0.352
/ssd/test.img  cl=1M   step=64K  : 2.519
/ssd/test.img  cl=1M   step=1M   : 38.919
/test.img      cl=64K  step=4K   : 1.758
/test.img      cl=64K  step=64K  : 9.838
/test.img      cl=64K  step=1M   : 13.384
/test.img      cl=1M   step=4K   : 1.849
/test.img      cl=1M   step=64K  : 19.405
/test.img      cl=1M   step=1M   : 157.090

rerun:

/ssd/test.img  cl=64K  step=4K   : 0.407
/ssd/test.img  cl=64K  step=64K  : 3.325
/ssd/test.img  cl=64K  step=1M   : 5.641
/ssd/test.img  cl=1M   step=4K   : 0.346
/ssd/test.img  cl=1M   step=64K  : 2.583
/ssd/test.img  cl=1M   step=1M   : 39.692
/test.img      cl=64K  step=4K   : 1.727
/test.img      cl=64K  step=64K  : 10.058
/test.img      cl=64K  step=1M   : 13.441
/test.img      cl=1M   step=4K   : 1.926
/test.img      cl=1M   step=64K  : 19.738
/test.img      cl=1M   step=1M   : 158.268

So it's obvious that c8bb23cbdbe32f5c326 is significant for 1M cluster
size, even on a rotational disk, which means the previous assumption
about calling handle_alloc_space() only for SSDs is wrong; we need
smarter heuristics.

So, I'd prefer (1) or (2).

-- 
Best regards,
Vladimir
25.10.2019 16:40, Vladimir Sementsov-Ogievskiy wrote: > 25.10.2019 12:58, Max Reitz wrote: >> Hi, >> >> It seems to me that there is a bug in Linux’s XFS kernel driver, as >> I’ve explained here: >> >> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html >> >> In combination with our commit c8bb23cbdbe32f, this may lead to guest >> data corruption when using qcow2 images on XFS with aio=native. >> >> We can’t wait until the XFS kernel driver is fixed, we should work >> around the problem ourselves. >> >> This is an RFC for two reasons: >> (1) I don’t know whether this is the right way to address the issue, >> (2) Ideally, we should detect whether the XFS kernel driver is fixed and >> if so stop applying the workaround. >> I don’t know how we would go about this, so this series doesn’t do >> it. (Hence it’s an RFC.) >> (3) Perhaps it’s a bit of a layering violation to let the file-posix >> driver access and modify a BdrvTrackedRequest object. >> >> As for how we can address the issue, I see three ways: >> (1) The one presented in this series: On XFS with aio=native, we extend >> tracked requests for post-EOF fallocate() calls (i.e., write-zero >> operations) to reach until infinity (INT64_MAX in practice), mark >> them serializing and wait for other conflicting requests. >> >> Advantages: >> + Limits the impact to very specific cases >> (And that means it wouldn’t hurt too much to keep this workaround >> even when the XFS driver has been fixed) >> + Works around the bug where it happens, namely in file-posix >> >> Disadvantages: >> - A bit complex >> - A bit of a layering violation (should file-posix have access to >> tracked requests?) >> >> (2) Always skip qcow2’s handle_alloc_space() on XFS. The XFS bug only >> becomes visible due to that function: I don’t think qcow2 writes >> zeroes in any other I/O path, and raw images are fixed in size so >> post-EOF writes won’t happen. 
>> >> Advantages: >> + Maybe simpler, depending on how difficult it is to handle the >> layering violation >> + Also fixes the performance problem of handle_alloc_space() being >> slow on ppc64+XFS. >> >> Disadvantages: >> - Huge layering violation because qcow2 would need to know whether >> the image is stored on XFS or not. >> - We’d definitely want to skip this workaround when the XFS driver >> has been fixed, so we need some method to find out whether it has >> >> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f. >> To my knowledge I’m the only one who has provided any benchmarks for >> this commit, and even then I was a bit skeptical because it performs >> well in some cases and bad in others. I concluded that it’s >> probably worth it because the “some cases” are more likely to occur. >> >> Now we have this problem of corruption here (granted due to a bug in >> the XFS driver), and another report of massively degraded >> performance on ppc64 >> (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a >> private BZ; I hate that :-/ The report is about 40 % worse >> performance for an in-guest fio write benchmark.) >> >> So I have to ask the question about what the justification for >> keeping c8bb23cbdbe32f is. How much does performance increase with >> it actually? (On non-(ppc64+XFS) machines, obviously) >> >> Advantages: >> + Trivial >> + No layering violations >> + We wouldn’t need to keep track of whether the kernel bug has been >> fixed or not >> + Fixes the ppc64+XFS performance problem >> >> Disadvantages: >> - Reverts cluster allocation performance to pre-c8bb23cbdbe32f >> levels, whatever that means >> >> So this is the main reason this is an RFC: What should we do? Is (1) >> really the best choice? >> >> >> In any case, I’ve ran the test case I showed in >> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01282.html >> more than ten times with this series applied and the installation >> succeeded every time. 
(Without this series, it fails like every other >> time.) >> >> > > Hi! > > First, great thanks for your investigation! > > We need c8bb23cbdbe3 patch, because we use 1M clusters, and zeroing 1M is significant > in time. > > I've tested a bit: > > test: > for img in /ssd/test.img /test.img; do for cl in 64K 1M; do for step in 4K 64K 1M; do ./qemu-img create -f qcow2 -o cluster_size=$cl $img 15G > /dev/null; printf '%-15s%-7s%-10s : ' $img cl=$cl step=$step; ./qemu-img bench -c $((15 * 1024)) -n -s 4K -S $step -t none -w $img | tail -1 | awk '{print $4}'; done; done; done > > on master: > > /ssd/test.img cl=64K step=4K : 0.291 > /ssd/test.img cl=64K step=64K : 0.813 > /ssd/test.img cl=64K step=1M : 2.799 > /ssd/test.img cl=1M step=4K : 0.217 > /ssd/test.img cl=1M step=64K : 0.332 > /ssd/test.img cl=1M step=1M : 0.685 > /test.img cl=64K step=4K : 1.751 > /test.img cl=64K step=64K : 14.811 > /test.img cl=64K step=1M : 18.321 > /test.img cl=1M step=4K : 0.759 > /test.img cl=1M step=64K : 13.574 > /test.img cl=1M step=1M : 28.970 > > rerun on master: > > /ssd/test.img cl=64K step=4K : 0.295 > /ssd/test.img cl=64K step=64K : 0.803 > /ssd/test.img cl=64K step=1M : 2.921 > /ssd/test.img cl=1M step=4K : 0.233 > /ssd/test.img cl=1M step=64K : 0.321 > /ssd/test.img cl=1M step=1M : 0.762 > /test.img cl=64K step=4K : 1.873 > /test.img cl=64K step=64K : 15.621 > /test.img cl=64K step=1M : 18.428 > /test.img cl=1M step=4K : 0.883 > /test.img cl=1M step=64K : 13.484 > /test.img cl=1M step=1M : 26.244 > > > on master + revert c8bb23cbdbe32f5c326 > > /ssd/test.img cl=64K step=4K : 0.395 > /ssd/test.img cl=64K step=64K : 4.231 > /ssd/test.img cl=64K step=1M : 5.598 > /ssd/test.img cl=1M step=4K : 0.352 > /ssd/test.img cl=1M step=64K : 2.519 > /ssd/test.img cl=1M step=1M : 38.919 > /test.img cl=64K step=4K : 1.758 > /test.img cl=64K step=64K : 9.838 > /test.img cl=64K step=1M : 13.384 > /test.img cl=1M step=4K : 1.849 > /test.img cl=1M step=64K : 19.405 > /test.img cl=1M step=1M : 
157.090 > > rerun: > > /ssd/test.img cl=64K step=4K : 0.407 > /ssd/test.img cl=64K step=64K : 3.325 > /ssd/test.img cl=64K step=1M : 5.641 > /ssd/test.img cl=1M step=4K : 0.346 > /ssd/test.img cl=1M step=64K : 2.583 > /ssd/test.img cl=1M step=1M : 39.692 > /test.img cl=64K step=4K : 1.727 > /test.img cl=64K step=64K : 10.058 > /test.img cl=64K step=1M : 13.441 > /test.img cl=1M step=4K : 1.926 > /test.img cl=1M step=64K : 19.738 > /test.img cl=1M step=1M : 158.268 > > > So, it's obvious that c8bb23cbdbe32f5c326 is significant for 1M cluster-size, even on rotational > disk, which means that previous assumption about calling handle_alloc_space() only for ssd is > wrong, we need smarter heuristics.. > > So, I'd prefer (1) or (2).

About the degradation in some cases: I think the problem is that one
(slightly larger) write may be faster than fast-write-zeroes + a small
write, as the latter means an additional write to metadata. That is
expected for small clusters in conjunction with a rotational disk, but
the actual limit depends on the specific disk. So, a possible solution
is to sometimes run handle_alloc_space() and sometimes not, remember the
time and length of each request, and derive a dynamic limit...

-- 
Best regards,
Vladimir
On 25.10.19 15:56, Vladimir Sementsov-Ogievskiy wrote:
> 25.10.2019 16:40, Vladimir Sementsov-Ogievskiy wrote:
>> 25.10.2019 12:58, Max Reitz wrote:
>>> [...]
>>
>> First, great thanks for your investigation!
>>
>> [...]
>>
>> So, it's obvious that c8bb23cbdbe32f5c326 is significant for 1M
>> cluster-size, even on a rotational disk, which means that the previous
>> assumption about calling handle_alloc_space() only for ssd is wrong; we
>> need smarter heuristics..
>>
>> So, I'd prefer (1) or (2).

OK.  I wonder whether that problem would go away with Berto’s subcluster
series, though.

> About degradation in some cases: I think the problem is that one (a bit
> larger) write may be faster than a fast write-zeroes plus a small write, as
> the latter means an additional write to metadata.  And that is expected for
> small clusters in conjunction with a rotational disk.  But the actual limit
> depends on the specific disk.  So, I think a possible solution is to
> sometimes try working with handle_alloc_space() and sometimes without,
> remember the time and length of each request, and derive a dynamic limit...

Maybe make a decision based both on the ratio of data size to COW area
length (only invoke handle_alloc_space() under a certain threshold), and
the absolute COW area length (always invoke it above a certain
threshold, unless the ratio doesn’t allow it)?

Max
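Max’s two-threshold idea could be sketched roughly as below.  This is a
hypothetical illustration only: the threshold values, constant names, and the
helper function are all invented here, and nothing like this exists in QEMU’s
qcow2 code.

```python
# Hypothetical sketch of the two-threshold decision: only zero-write when the
# guest data is a small fraction of the cluster, and always zero-write for
# large absolute COW areas (unless the ratio already ruled it out).
# All thresholds are made-up illustration values.

RATIO_LIMIT = 0.5          # never zero-write when data >= 50% of the cluster
ABS_LIMIT = 256 * 1024     # always zero-write for COW areas >= 256 kB
SMALL_RATIO = 0.125        # below ABS_LIMIT, only zero-write for tiny writes

def should_use_handle_alloc_space(data_len: int, cow_len: int) -> bool:
    """Decide whether to zero the whole cluster instead of writing COW data."""
    total = data_len + cow_len
    if total == 0:
        return False
    ratio = data_len / total
    if ratio >= RATIO_LIMIT:
        return False       # the ratio doesn't allow it: one plain write wins
    if cow_len >= ABS_LIMIT:
        return True        # huge COW area: fast zeroing always wins
    return ratio < SMALL_RATIO

# A 4 kB guest write into a fresh 2 MB cluster clearly qualifies:
print(should_use_handle_alloc_space(4096, 2 * 1024 * 1024 - 4096))  # True
```

The two knobs map directly onto the proposal: `RATIO_LIMIT` is the "unless the
ratio doesn’t allow it" veto, and `ABS_LIMIT` is the "always invoke it above a
certain threshold" rule.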
25.10.2019 17:19, Max Reitz wrote:
> On 25.10.19 15:56, Vladimir Sementsov-Ogievskiy wrote:
>> [...]
>>
>> So, I'd prefer (1) or (2).
>
> OK.  I wonder whether that problem would go away with Berto’s subcluster
> series, though.

Very possible, I thought about it too.

> Maybe make a decision based both on the ratio of data size to COW area
> length (only invoke handle_alloc_space() under a certain threshold), and
> the absolute COW area length (always invoke it above a certain
> threshold, unless the ratio doesn’t allow it)?

Yes, something like this..

without handle_alloc_space: time = time(write aligned up to cluster)
with handle_alloc_space:    time = time(fast zero write) + time(original write)

If we have some statistics on normal-write vs zero-write timing, we can just
calculate both variants and choose the faster one:

if (predict_zero_write_time(aligned-up request) + predict_write_time(request) <
        predict_write_time(aligned-up request)) {
    use handle_alloc_space()
}

-- 
Best regards,
Vladimir
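The prediction-based heuristic above could be sketched as follows.  This is a
toy model, not QEMU code: the `WriteStats` class, its cost model (fixed cost
plus per-byte cost), and all starting constants are assumptions made for
illustration.

```python
# Sketch of the predict-and-compare heuristic: keep running statistics on
# plain-write and zero-write cost, predict both strategies, pick the faster.
# The linear fixed+per-byte cost model and all constants are invented here.

class WriteStats:
    """Exponentially weighted estimate of request cost: fixed + per-byte."""
    def __init__(self, fixed_us=100.0, per_byte_us=0.01, alpha=0.1):
        self.fixed_us = fixed_us        # per-request overhead (microseconds)
        self.per_byte_us = per_byte_us  # cost per byte written
        self.alpha = alpha              # smoothing factor for updates

    def predict(self, nbytes):
        return self.fixed_us + self.per_byte_us * nbytes

    def record(self, nbytes, elapsed_us):
        # Attribute the prediction error to the per-byte term; a real
        # implementation would fit both terms over a sample window.
        if nbytes:
            observed = max(elapsed_us - self.fixed_us, 0.0) / nbytes
            self.per_byte_us += self.alpha * (observed - self.per_byte_us)

def use_handle_alloc_space(write, zero, req_len, cluster_len):
    # zero-write the whole cluster plus the data write, vs one aligned-up write
    return (zero.predict(cluster_len) + write.predict(req_len)
            < write.predict(cluster_len))
```

With plausible numbers this reproduces the observed behaviour: a 4 kB write
into a 2 MB cluster favours the zero-write, while a 60 kB write into a 64 kB
cluster does not, because the aligned-up plain write is barely larger than the
data write itself.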
On Fri, Oct 25, 2019 at 02:36:49PM +0000, Vladimir Sementsov-Ogievskiy wrote:
> 25.10.2019 17:19, Max Reitz wrote:
> > [...]
> >
> > Maybe make a decision based both on the ratio of data size to COW area
> > length (only invoke handle_alloc_space() under a certain threshold), and
> > the absolute COW area length (always invoke it above a certain
> > threshold, unless the ratio doesn’t allow it)?
>
> Yes, something like this..
> without handle_alloc_space: time = time(write aligned up to cluster)
> with handle_alloc_space:    time = time(fast zero write) + time(original write)
>
> If we have some statistics on normal-write vs zero-write timing, we can just
> calculate both variants and choose the faster one:
>
> if (predict_zero_write_time(aligned-up request) + predict_write_time(request) <
>         predict_write_time(aligned-up request)) {
>     use handle_alloc_space()
> }

Self-tuning based on request latency works great on a quiet host.  If there
are other processes submitting I/O to the disk then measured latencies may be
bogus.

Stefan
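One standard way to soften the problem Stefan raises (an assumption here, not
something QEMU implements) is to feed the predictor a windowed median instead
of raw samples, so a burst of contended, slow requests does not poison the
estimate:

```python
# Sketch of an outlier-resistant latency estimator: a sliding window whose
# median shrugs off occasional slow samples caused by competing host I/O.
from collections import deque
from statistics import median

class RobustLatency:
    def __init__(self, window=32):
        self.samples = deque(maxlen=window)  # oldest samples fall off the end

    def record(self, elapsed_us):
        self.samples.append(elapsed_us)

    def estimate(self):
        # The median ignores a minority of outliers; a mean would not.
        return median(self.samples) if self.samples else None

lat = RobustLatency(window=5)
for us in (100, 102, 98, 5000, 101):  # one contended outlier among the samples
    lat.record(us)
print(lat.estimate())  # 101
```

This only helps against occasional interference; if the disk is persistently
busy, every sample is contended and no estimator can recover the quiet-host
latency, which is Stefan’s deeper point.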
On Fri 25 Oct 2019 04:19:30 PM CEST, Max Reitz wrote:
>>> So, it's obvious that c8bb23cbdbe32f5c326 is significant for 1M
>>> cluster-size, even on rotational disk, which means that previous
>>> assumption about calling handle_alloc_space() only for ssd is wrong,
>>> we need smarter heuristics..
>>>
>>> So, I'd prefer (1) or (2).
>
> OK.  I wonder whether that problem would go away with Berto’s subcluster
> series, though.

Catching up with this now.  I was told about this last week at the KVM
Forum, but if the problem comes from the use of fallocate() on XFS, then
I don't think subclusters will solve it.

handle_alloc_space() is used to fill a cluster with zeroes when there's
COW, and that happens the same with subclusters, just at the subcluster
level instead, of course.

What can happen, if the subcluster size matches the filesystem block
size, is that there's no need for any COW and therefore the bug is never
triggered.  But that's not quite the same as a fix :-)

> Maybe make a decision based both on the ratio of data size to COW area
> length (only invoke handle_alloc_space() under a certain threshold),
> and the absolute COW area length (always invoke it above a certain
> threshold, unless the ratio doesn’t allow it)?

Maybe combining that with the smaller clusters/subclusters can work
around the problem.  The maximum subcluster size is 64KB (for a 2MB
cluster).

Berto
On 04.11.19 15:03, Alberto Garcia wrote:
> On Fri 25 Oct 2019 04:19:30 PM CEST, Max Reitz wrote:
>> [...]
>>
>> OK.  I wonder whether that problem would go away with Berto’s subcluster
>> series, though.
>
> Catching up with this now.  I was told about this last week at the KVM
> Forum, but if the problem comes from the use of fallocate() on XFS, then
> I don't think subclusters will solve it.
>
> handle_alloc_space() is used to fill a cluster with zeroes when there's
> COW, and that happens the same with subclusters, just at the subcluster
> level instead, of course.
>
> What can happen, if the subcluster size matches the filesystem block
> size, is that there's no need for any COW and therefore the bug is never
> triggered.  But that's not quite the same as a fix :-)

No, what I meant was that the original problem that led to c8bb23cbdbe
would go away.

c8bb23cbdbe was added because small writes to new clusters are slow when
the clusters are large (because you need to do a 2 MB write on the host
for a 4 kB write from the guest).  So handle_alloc_space() was added to
alleviate the problem with a zero-write instead of actually writing
zeroes.

The question is whether there is no need for handle_alloc_space() with
subclusters because a normal write with explicit zeroes being written in
the COW areas would be sufficiently quick.  (Because the subclusters for
2 MB clusters are just 64 kB in size.)

If that were so (right now it doesn’t look like it), we could revert
c8bb23cbdbe and wouldn’t see the bug anymore.

Max

>> Maybe make a decision based both on the ratio of data size to COW area
>> length (only invoke handle_alloc_space() under a certain threshold),
>> and the absolute COW area length (always invoke it above a certain
>> threshold, unless the ratio doesn’t allow it)?
>
> Maybe combining that with the smaller clusters/subclusters can work
> around the problem.  The maximum subcluster size is 64KB (for a 2MB
> cluster).
>
> Berto
On Mon 04 Nov 2019 03:25:12 PM CET, Max Reitz wrote:
>> [...]
>
> No, what I meant was that the original problem that led to c8bb23cbdbe
> would go away.

Ah, right.  Not quite, according to my numbers:

|--------------+----------------+-----------------+-------------|
| Cluster size | subclusters=on | subclusters=off | fallocate() |
|--------------+----------------+-----------------+-------------|
|       256 KB |     10182 IOPS |        966 IOPS |  14007 IOPS |
|       512 KB |      7919 IOPS |        563 IOPS |  13442 IOPS |
|      1024 KB |      5050 IOPS |        465 IOPS |  13887 IOPS |
|      2048 KB |      2465 IOPS |        271 IOPS |  13885 IOPS |
|--------------+----------------+-----------------+-------------|

There's obviously no backing image, and only the last column uses
handle_alloc_space() / fallocate().

Berto
On 04.11.19 16:12, Alberto Garcia wrote: > On Mon 04 Nov 2019 03:25:12 PM CET, Max Reitz wrote: >>>>>> So, it's obvious that c8bb23cbdbe32f5c326 is significant for 1M >>>>>> cluster-size, even on rotational disk, which means that previous >>>>>> assumption about calling handle_alloc_space() only for ssd is wrong, >>>>>> we need smarter heuristics.. >>>>>> >>>>>> So, I'd prefer (1) or (2). >>>> >>>> OK. I wonder whether that problem would go away with Berto’s subcluster >>>> series, though. >>> >>> Catching up with this now. I was told about this last week at the KVM >>> Forum, but if the problems comes with the use of fallocate() and XFS, >>> the I don't think subclusters will solve it. >>> >>> handle_alloc_space() is used to fill a cluster with zeroes when there's >>> COW, and that happens the same with subclusters, just at the subcluster >>> level instead of course. >>> >>> What can happen, if the subcluster size matches the filesystem block >>> size, is that there's no need for any COW and therefore the bug is never >>> triggered. But that's not quite the same as a fix :-) >> >> No, what I meant was that the original problem that led to c8bb23cbdbe >> would go away. > > Ah, right. Not quite, according to my numbers: > > |--------------+----------------+-----------------+-------------| > | Cluster size | subclusters=on | subclusters=off | fallocate() | > |--------------+----------------+-----------------+-------------| > | 256 KB | 10182 IOPS | 966 IOPS | 14007 IOPS | > | 512 KB | 7919 IOPS | 563 IOPS | 13442 IOPS | > | 1024 KB | 5050 IOPS | 465 IOPS | 13887 IOPS | > | 2048 KB | 2465 IOPS | 271 IOPS | 13885 IOPS | > |--------------+----------------+-----------------+-------------| > > There's obviously no backing image, and only the last column uses > handle_alloc_space() / fallocate(). Thanks for providing some numbers! It was my impression, too, that subclusters wouldn’t solve it. But it didn’t seem like that big of a difference to me. 
Did you run this with aio=native?  (Because that’s where we have the
XFS problem)

Max
On Mon 04 Nov 2019 04:14:56 PM CET, Max Reitz wrote:
>>> No, what I meant was that the original problem that led to
>>> c8bb23cbdbe would go away.
>>
>> Ah, right. Not quite, according to my numbers:
>>
>> |--------------+----------------+-----------------+-------------|
>> | Cluster size | subclusters=on | subclusters=off | fallocate() |
>> |--------------+----------------+-----------------+-------------|
>> |       256 KB |     10182 IOPS |        966 IOPS |  14007 IOPS |
>> |       512 KB |      7919 IOPS |        563 IOPS |  13442 IOPS |
>> |      1024 KB |      5050 IOPS |        465 IOPS |  13887 IOPS |
>> |      2048 KB |      2465 IOPS |        271 IOPS |  13885 IOPS |
>> |--------------+----------------+-----------------+-------------|
>>
>> There's obviously no backing image, and only the last column uses
>> handle_alloc_space() / fallocate().
>
> Thanks for providing some numbers!
>
> It was my impression, too, that subclusters wouldn’t solve it.  But it
> didn’t seem like that big of a difference to me.  Did you run this
> with aio=native?  (Because that’s where we have the XFS problem)

Here's with aio=native:

|--------------+----------------+-----------------+-------------|
| Cluster size | subclusters=on | subclusters=off | fallocate() |
|--------------+----------------+-----------------+-------------|
|       256 KB |     19897 IOPS |        891 IOPS |  32194 IOPS |
|       512 KB |     17881 IOPS |        436 IOPS |  33092 IOPS |
|      1024 KB |     17050 IOPS |        341 IOPS |  32768 IOPS |
|      2048 KB |      7854 IOPS |        207 IOPS |  30944 IOPS |
|--------------+----------------+-----------------+-------------|

Berto
On 04.11.19 16:49, Alberto Garcia wrote:
> On Mon 04 Nov 2019 04:14:56 PM CET, Max Reitz wrote:
>
>>>> No, what I meant was that the original problem that led to
>>>> c8bb23cbdbe would go away.
>>>
>>> Ah, right. Not quite, according to my numbers:
>>>
>>> |--------------+----------------+-----------------+-------------|
>>> | Cluster size | subclusters=on | subclusters=off | fallocate() |
>>> |--------------+----------------+-----------------+-------------|
>>> |       256 KB |     10182 IOPS |        966 IOPS |  14007 IOPS |
>>> |       512 KB |      7919 IOPS |        563 IOPS |  13442 IOPS |
>>> |      1024 KB |      5050 IOPS |        465 IOPS |  13887 IOPS |
>>> |      2048 KB |      2465 IOPS |        271 IOPS |  13885 IOPS |
>>> |--------------+----------------+-----------------+-------------|
>>>
>>> There's obviously no backing image, and only the last column uses
>>> handle_alloc_space() / fallocate().
>>
>> Thanks for providing some numbers!
>>
>> It was my impression, too, that subclusters wouldn’t solve it.  But it
>> didn’t seem like that big of a difference to me.  Did you run this
>> with aio=native?  (Because that’s where we have the XFS problem)
>
> Here's with aio=native:
>
> |--------------+----------------+-----------------+-------------|
> | Cluster size | subclusters=on | subclusters=off | fallocate() |
> |--------------+----------------+-----------------+-------------|
> |       256 KB |     19897 IOPS |        891 IOPS |  32194 IOPS |
> |       512 KB |     17881 IOPS |        436 IOPS |  33092 IOPS |
> |      1024 KB |     17050 IOPS |        341 IOPS |  32768 IOPS |
> |      2048 KB |      7854 IOPS |        207 IOPS |  30944 IOPS |
> |--------------+----------------+-----------------+-------------|

OK, thanks again. :-)

So it seems right not to revert the commit and just work around the XFS
bug in file-posix.

Max
Am 25.10.2019 um 16:19 hat Max Reitz geschrieben:
> On 25.10.19 15:56, Vladimir Sementsov-Ogievskiy wrote:
> > 25.10.2019 16:40, Vladimir Sementsov-Ogievskiy wrote:
> >> 25.10.2019 12:58, Max Reitz wrote:
> >>> Hi,
> >>>
> >>> It seems to me that there is a bug in Linux’s XFS kernel driver, as
> >>> I’ve explained here:
> >>>
> >>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
> >>>
> >>> In combination with our commit c8bb23cbdbe32f, this may lead to
> >>> guest data corruption when using qcow2 images on XFS with
> >>> aio=native.
> >>>
> >>> We can’t wait until the XFS kernel driver is fixed; we should work
> >>> around the problem ourselves.
> >>>
> >>> This is an RFC for three reasons:
> >>> (1) I don’t know whether this is the right way to address the
> >>>     issue.
> >>> (2) Ideally, we should detect whether the XFS kernel driver is
> >>>     fixed and if so stop applying the workaround.  I don’t know how
> >>>     we would go about this, so this series doesn’t do it.  (Hence
> >>>     it’s an RFC.)
> >>> (3) Perhaps it’s a bit of a layering violation to let the
> >>>     file-posix driver access and modify a BdrvTrackedRequest
> >>>     object.
> >>>
> >>> As for how we can address the issue, I see three ways:
> >>> (1) The one presented in this series: On XFS with aio=native, we
> >>>     extend tracked requests for post-EOF fallocate() calls (i.e.,
> >>>     write-zero operations) to reach until infinity (INT64_MAX in
> >>>     practice), mark them serializing and wait for other conflicting
> >>>     requests.
> >>>
> >>>     Advantages:
> >>>     + Limits the impact to very specific cases
> >>>       (And that means it wouldn’t hurt too much to keep this
> >>>       workaround even when the XFS driver has been fixed)
> >>>     + Works around the bug where it happens, namely in file-posix
> >>>
> >>>     Disadvantages:
> >>>     - A bit complex
> >>>     - A bit of a layering violation (should file-posix have access
> >>>       to tracked requests?)
> >>>
> >>> (2) Always skip qcow2’s handle_alloc_space() on XFS.
> >>>     The XFS bug only becomes visible due to that function: I don’t
> >>>     think qcow2 writes zeroes in any other I/O path, and raw images
> >>>     are fixed in size so post-EOF writes won’t happen.
> >>>
> >>>     Advantages:
> >>>     + Maybe simpler, depending on how difficult it is to handle the
> >>>       layering violation
> >>>     + Also fixes the performance problem of handle_alloc_space()
> >>>       being slow on ppc64+XFS.
> >>>
> >>>     Disadvantages:
> >>>     - Huge layering violation because qcow2 would need to know
> >>>       whether the image is stored on XFS or not.
> >>>     - We’d definitely want to skip this workaround when the XFS
> >>>       driver has been fixed, so we need some method to find out
> >>>       whether it has.
> >>>
> >>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
> >>>     To my knowledge I’m the only one who has provided any
> >>>     benchmarks for this commit, and even then I was a bit skeptical
> >>>     because it performs well in some cases and badly in others.  I
> >>>     concluded that it’s probably worth it because the “some cases”
> >>>     are more likely to occur.
> >>>
> >>>     Now we have this problem of corruption here (granted, due to a
> >>>     bug in the XFS driver), and another report of massively
> >>>     degraded performance on ppc64
> >>>     (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
> >>>     private BZ; I hate that :-/  The report is about 40 % worse
> >>>     performance for an in-guest fio write benchmark.)
> >>>
> >>>     So I have to ask the question about what the justification for
> >>>     keeping c8bb23cbdbe32f is.  How much does performance increase
> >>>     with it actually?
> >>>     (On non-(ppc64+XFS) machines, obviously)
> >>>
> >>>     Advantages:
> >>>     + Trivial
> >>>     + No layering violations
> >>>     + We wouldn’t need to keep track of whether the kernel bug has
> >>>       been fixed or not
> >>>     + Fixes the ppc64+XFS performance problem
> >>>
> >>>     Disadvantages:
> >>>     - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
> >>>       levels, whatever that means
> >>>
> >>> So this is the main reason this is an RFC: What should we do?  Is
> >>> (1) really the best choice?
> >>>
> >>> In any case, I’ve run the test case I showed in
> >>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01282.html
> >>> more than ten times with this series applied and the installation
> >>> succeeded every time.  (Without this series, it fails like every
> >>> other time.)
> >>
> >> Hi!
> >>
> >> First, great thanks for your investigation!
> >>
> >> We need the c8bb23cbdbe3 patch because we use 1M clusters, and
> >> zeroing 1M takes significant time.
> >>
> >> I've tested a bit.  Test:
> >>
> >> for img in /ssd/test.img /test.img; do
> >>     for cl in 64K 1M; do
> >>         for step in 4K 64K 1M; do
> >>             ./qemu-img create -f qcow2 -o cluster_size=$cl $img 15G > /dev/null
> >>             printf '%-15s%-7s%-10s : ' $img cl=$cl step=$step
> >>             ./qemu-img bench -c $((15 * 1024)) -n -s 4K -S $step -t none -w $img | tail -1 | awk '{print $4}'
> >>         done
> >>     done
> >> done
> >>
> >> on master:
> >>
> >> /ssd/test.img  cl=64K  step=4K   : 0.291
> >> /ssd/test.img  cl=64K  step=64K  : 0.813
> >> /ssd/test.img  cl=64K  step=1M   : 2.799
> >> /ssd/test.img  cl=1M   step=4K   : 0.217
> >> /ssd/test.img  cl=1M   step=64K  : 0.332
> >> /ssd/test.img  cl=1M   step=1M   : 0.685
> >> /test.img      cl=64K  step=4K   : 1.751
> >> /test.img      cl=64K  step=64K  : 14.811
> >> /test.img      cl=64K  step=1M   : 18.321
> >> /test.img      cl=1M   step=4K   : 0.759
> >> /test.img      cl=1M   step=64K  : 13.574
> >> /test.img      cl=1M   step=1M   : 28.970
> >>
> >> rerun on master:
> >>
> >> /ssd/test.img  cl=64K  step=4K   : 0.295
> >> /ssd/test.img  cl=64K  step=64K  : 0.803
> >> /ssd/test.img  cl=64K  step=1M   : 2.921
> >> /ssd/test.img  cl=1M   step=4K   : 0.233
> >> /ssd/test.img  cl=1M   step=64K  : 0.321
> >> /ssd/test.img  cl=1M   step=1M   : 0.762
> >> /test.img      cl=64K  step=4K   : 1.873
> >> /test.img      cl=64K  step=64K  : 15.621
> >> /test.img      cl=64K  step=1M   : 18.428
> >> /test.img      cl=1M   step=4K   : 0.883
> >> /test.img      cl=1M   step=64K  : 13.484
> >> /test.img      cl=1M   step=1M   : 26.244
> >>
> >> on master + revert c8bb23cbdbe32f5c326:
> >>
> >> /ssd/test.img  cl=64K  step=4K   : 0.395
> >> /ssd/test.img  cl=64K  step=64K  : 4.231
> >> /ssd/test.img  cl=64K  step=1M   : 5.598
> >> /ssd/test.img  cl=1M   step=4K   : 0.352
> >> /ssd/test.img  cl=1M   step=64K  : 2.519
> >> /ssd/test.img  cl=1M   step=1M   : 38.919
> >> /test.img      cl=64K  step=4K   : 1.758
> >> /test.img      cl=64K  step=64K  : 9.838
> >> /test.img      cl=64K  step=1M   : 13.384
> >> /test.img      cl=1M   step=4K   : 1.849
> >> /test.img      cl=1M   step=64K  : 19.405
> >> /test.img      cl=1M   step=1M   : 157.090
> >>
> >> rerun:
> >>
> >> /ssd/test.img  cl=64K  step=4K   : 0.407
> >> /ssd/test.img  cl=64K  step=64K  : 3.325
> >> /ssd/test.img  cl=64K  step=1M   : 5.641
> >> /ssd/test.img  cl=1M   step=4K   : 0.346
> >> /ssd/test.img  cl=1M   step=64K  : 2.583
> >> /ssd/test.img  cl=1M   step=1M   : 39.692
> >> /test.img      cl=64K  step=4K   : 1.727
> >> /test.img      cl=64K  step=64K  : 10.058
> >> /test.img      cl=64K  step=1M   : 13.441
> >> /test.img      cl=1M   step=4K   : 1.926
> >> /test.img      cl=1M   step=64K  : 19.738
> >> /test.img      cl=1M   step=1M   : 158.268
> >>
> >> So, it's obvious that c8bb23cbdbe32f5c326 is significant for 1M
> >> cluster-size, even on rotational disk, which means that the previous
> >> assumption about calling handle_alloc_space() only for ssd is wrong;
> >> we need smarter heuristics.
> >>
> >> So, I'd prefer (1) or (2).
>
> OK.  I wonder whether that problem would go away with Berto’s
> subcluster series, though.
> > About degradation in some cases: I think the problem is that one (a
> > bit larger) write may be faster than fast-write-zeroes + small
> > write, as the latter means an additional write to metadata.  And
> > that's expected for small clusters in conjunction with a rotational
> > disk.  But the actual limit depends on the specific disk.  So, I
> > think a possible solution is just to sometimes try working with
> > handle_alloc_space() and sometimes without, remember the time and
> > length of the request, and derive a dynamic limit...
>
> Maybe make a decision based both on the ratio of data size to COW area
> length (only invoke handle_alloc_space() under a certain threshold),
> and the absolute COW area length (always invoke it above a certain
> threshold, unless the ratio doesn’t allow it)?

I'm not sure that I would like this level of complexity in this code
path...

Kevin