[Qemu-devel] [PATCH for-2.9?] file-posix: Make bdrv_flush() failure permanent without O_DIRECT

Kevin Wolf posted 1 patch 7 years, 1 month ago
git fetch https://github.com/patchew-project/qemu tags/patchew/20170322210005.16533-1-kwolf@redhat.com
[Qemu-devel] [PATCH for-2.9?] file-posix: Make bdrv_flush() failure permanent without O_DIRECT
Posted by Kevin Wolf 7 years, 1 month ago
Success for bdrv_flush() means that all previously written data is safe
on disk. For fdatasync(), the best semantics we can hope for on Linux
(without O_DIRECT) is that all data that was written since the last call
was successfully written back. Therefore, and because we can't redo all
writes after a flush failure, we have to give up after a single
fdatasync() failure. After this failure, we would never be able to make
the promise that a successful bdrv_flush() makes.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/file-posix.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index 53febd3..beb7a4f 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -144,6 +144,7 @@ typedef struct BDRVRawState {
     bool has_write_zeroes:1;
     bool discard_zeroes:1;
     bool use_linux_aio:1;
+    bool page_cache_inconsistent:1;
     bool has_fallocate;
     bool needs_alignment;
 } BDRVRawState;
@@ -824,10 +825,31 @@ static ssize_t handle_aiocb_ioctl(RawPosixAIOData *aiocb)
 
 static ssize_t handle_aiocb_flush(RawPosixAIOData *aiocb)
 {
+    BDRVRawState *s = aiocb->bs->opaque;
     int ret;
 
+    if (s->page_cache_inconsistent) {
+        return -EIO;
+    }
+
     ret = qemu_fdatasync(aiocb->aio_fildes);
     if (ret == -1) {
+        /* There is no clear definition of the semantics of a failing fsync(),
+         * so we may have to assume the worst. The sad truth is that this
+         * assumption is correct for Linux. Some pages are now probably marked
+         * clean in the page cache even though they are inconsistent with the
+         * on-disk contents. The next fdatasync() call would succeed, but no
+         * further writeback attempt will be made. We can't get back to a state
+         * in which we know what is on disk (we would have to rewrite
+         * everything that was touched since the last fdatasync() at least), so
+         * make bdrv_flush() fail permanently. Given that the behaviour isn't
+         * really defined, I have little hope that other OSes are doing better.
+         *
+         * Obviously, this doesn't affect O_DIRECT, which bypasses the page
+         * cache. */
+        if ((s->open_flags & O_DIRECT) == 0) {
+            s->page_cache_inconsistent = true;
+        }
         return -errno;
     }
     return 0;
-- 
2.9.3
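
For readers who want to try the idea outside QEMU, here is a minimal standalone sketch of the same "sticky failure" pattern. It is not the QEMU code: FlushState, sticky_flush(), sync_fn and the two fake_sync_*() helpers are hypothetical names invented for this illustration, and the sync call is passed in as a callback only so that the failure path can be exercised without a failing disk.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant part of BDRVRawState. */
typedef struct FlushState {
    int fd;
    bool direct_io;               /* true if the file was opened with O_DIRECT */
    bool page_cache_inconsistent; /* set after the first failed sync */
} FlushState;

/* Sync callback with fdatasync()-like semantics: 0 on success,
 * -1 with errno set on failure. (Assumption made for testability.) */
typedef int (*sync_fn)(int fd);

/* Flush once; after a failure on a page-cache-backed file, fail forever. */
static int sticky_flush(FlushState *s, sync_fn do_sync)
{
    if (s->page_cache_inconsistent) {
        /* An earlier sync failed, so we can no longer promise that all
         * previously written data is on disk. */
        return -EIO;
    }

    if (do_sync(s->fd) == -1) {
        int err = errno; /* save errno before doing anything else */

        /* O_DIRECT bypasses the page cache, so only latch the failure
         * for buffered I/O. */
        if (!s->direct_io) {
            s->page_cache_inconsistent = true;
        }
        return -err;
    }
    return 0;
}

/* Fake sync functions used to drive both paths in a test. */
static int fake_sync_fail(int fd)
{
    (void)fd;
    errno = ENOSPC;
    return -1;
}

static int fake_sync_ok(int fd)
{
    (void)fd;
    return 0;
}
```

With buffered I/O, a call using fake_sync_fail returns -ENOSPC and every later call returns -EIO, even if the underlying sync would now succeed. With direct_io set, the failure is reported once and a later successful sync returns 0 again.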


Re: [Qemu-devel] [PATCH for-2.9?] file-posix: Make bdrv_flush() failure permanent without O_DIRECT
Posted by Fam Zheng 7 years, 1 month ago
On Wed, 03/22 22:00, Kevin Wolf wrote:
> Success for bdrv_flush() means that all previously written data is safe
> on disk. For fdatasync(), the best semantics we can hope for on Linux
> (without O_DIRECT) is that all data that was written since the last call
> was successfully written back. Therefore, and because we can't redo all
> writes after a flush failure, we have to give up after a single
> fdatasync() failure. After this failure, we would never be able to make
> the promise that a successful bdrv_flush() makes.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block/file-posix.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 53febd3..beb7a4f 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -144,6 +144,7 @@ typedef struct BDRVRawState {
>      bool has_write_zeroes:1;
>      bool discard_zeroes:1;
>      bool use_linux_aio:1;
> +    bool page_cache_inconsistent:1;
>      bool has_fallocate;
>      bool needs_alignment;
>  } BDRVRawState;
> @@ -824,10 +825,31 @@ static ssize_t handle_aiocb_ioctl(RawPosixAIOData *aiocb)
>  
>  static ssize_t handle_aiocb_flush(RawPosixAIOData *aiocb)
>  {
> +    BDRVRawState *s = aiocb->bs->opaque;
>      int ret;
>  
> +    if (s->page_cache_inconsistent) {
> +        return -EIO;
> +    }
> +
>      ret = qemu_fdatasync(aiocb->aio_fildes);
>      if (ret == -1) {
> +        /* There is no clear definition of the semantics of a failing fsync(),
> +         * so we may have to assume the worst. The sad truth is that this
> +         * assumption is correct for Linux. Some pages are now probably marked
> +         * clean in the page cache even though they are inconsistent with the
> +         * on-disk contents. The next fdatasync() call would succeed, but no
> +         * further writeback attempt will be made. We can't get back to a state
> +         * in which we know what is on disk (we would have to rewrite
> +         * everything that was touched since the last fdatasync() at least), so
> +         * make bdrv_flush() fail permanently. Given that the behaviour isn't
> +         * really defined, I have little hope that other OSes are doing better.
> +         *
> +         * Obviously, this doesn't affect O_DIRECT, which bypasses the page
> +         * cache. */
> +        if ((s->open_flags & O_DIRECT) == 0) {
> +            s->page_cache_inconsistent = true;
> +        }
>          return -errno;
>      }
>      return 0;
> -- 
> 2.9.3
> 
> 

Reviewed-by: Fam Zheng <famz@redhat.com>

Re: [Qemu-devel] [PATCH for-2.9?] file-posix: Make bdrv_flush() failure permanent without O_DIRECT
Posted by Eric Blake 7 years, 1 month ago
On 03/22/2017 04:00 PM, Kevin Wolf wrote:
> Success for bdrv_flush() means that all previously written data is safe
> on disk. For fdatasync(), the best semantics we can hope for on Linux
> (without O_DIRECT) is that all data that was written since the last call
> was successfully written back. Therefore, and because we can't redo all
> writes after a flush failure, we have to give up after a single
> fdatasync() failure. After this failure, we would never be able to make
> the promise that a successful bdrv_flush() makes.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block/file-posix.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)

Makes sense for 2.9 (it doesn't change the data loss, but it alerts the
user to the data loss a lot sooner, perhaps before things get even
worse).

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Re: [Qemu-devel] [Qemu-block] [PATCH for-2.9?] file-posix: Make bdrv_flush() failure permanent without O_DIRECT
Posted by Stefan Hajnoczi 7 years, 1 month ago
On Wed, Mar 22, 2017 at 10:00:05PM +0100, Kevin Wolf wrote:
> Success for bdrv_flush() means that all previously written data is safe
> on disk. For fdatasync(), the best semantics we can hope for on Linux
> (without O_DIRECT) is that all data that was written since the last call
> was successfully written back. Therefore, and because we can't redo all
> writes after a flush failure, we have to give up after a single
> fdatasync() failure. After this failure, we would never be able to make
> the promise that a successful bdrv_flush() makes.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block/file-posix.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

Re: [Qemu-devel] [Qemu-block] [PATCH for-2.9?] file-posix: Make bdrv_flush() failure permanent without O_DIRECT
Posted by Max Reitz 7 years ago
On 22.03.2017 22:00, Kevin Wolf wrote:
> Success for bdrv_flush() means that all previously written data is safe
> on disk. For fdatasync(), the best semantics we can hope for on Linux
> (without O_DIRECT) is that all data that was written since the last call
> was successfully written back. Therefore, and because we can't redo all
> writes after a flush failure, we have to give up after a single
> fdatasync() failure. After this failure, we would never be able to make
> the promise that a successful bdrv_flush() makes.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block/file-posix.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)

Thanks, applied to my block branch for 2.9:

https://github.com/XanClic/qemu/commits/block

Max