[Qemu-devel] [RFC 0/2] block/file-posix: allow -drive cache.direct=off live migration

[Qemu-devel] [RFC 0/2] block/file-posix: allow -drive cache.direct=off live migration
Posted by Stefan Hajnoczi 6 years ago
file-posix.c only supports shared storage live migration with -drive
cache.direct=on due to cache consistency issues.  There are two main shared
storage configurations: files on NFS and host block devices on SAN LUNs.

The problem is that QEMU starts on the destination host before the source host
has written everything out to the disk.  The page cache on the destination host
may contain stale data read when QEMU opened the image file (before migration
handover).  Using O_DIRECT avoids this problem but prevents users from taking
advantage of the host page cache.

Although cache=none is the recommended setting for virtualization use cases,
there are scenarios where cache=writeback makes sense.  If the guest has much
less RAM than the host or many guests share the same backing file, then the
host page cache can significantly improve disk I/O performance.

This patch series implements .bdrv_co_invalidate_cache() for block/file-posix.c
on Linux so that shared storage live migration works.  I have sent it as an RFC
because cache consistency is not binary: there are corner cases, which I've
described in the actual patch, and this may require more discussion.

Regarding NFS, QEMU relies on O_DIRECT rather than close-to-open
consistency (see nfs(5)), the basic consistency guarantee that NFS
provides.  With this patch series applied, cache consistency is no longer
provided by O_DIRECT.

This patch series relies on fdatasync(2) (source) +
posix_fadvise(POSIX_FADV_DONTNEED) (destination) instead.  I believe it is safe
for both NFS and SAN LUNs.  Maybe we should use fsync(2) instead of
fdatasync(2) so that NFS has up-to-date inode metadata?
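
To make the sequence concrete, here is a minimal standalone sketch of the
idea using plain POSIX calls.  The function names, the error handling, and
the single-process structure are illustrative assumptions, not code from
the patches:

#define _POSIX_C_SOURCE 200112L

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Source host: flush dirty pages to stable storage before migration
 * handover.  fsync(2) could be used instead if up-to-date NFS inode
 * metadata is needed. */
static int source_flush(int fd)
{
    return fdatasync(fd);
}

/* Destination host: drop cached pages that may be stale because they were
 * read before the source finished writing.  Offset 0 with length 0 applies
 * the advice to the whole file. */
static int destination_invalidate(int fd)
{
    /* posix_fadvise() returns 0 on success or a positive error number. */
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}

int main(int argc, char **argv)
{
    int fd, err;

    if (argc != 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* In a real migration these two calls happen on different hosts; both
     * are shown here only to demonstrate the call sequence. */
    if (source_flush(fd) != 0) {
        perror("fdatasync");
    }
    err = destination_invalidate(fd);
    if (err != 0) {
        fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
    }
    close(fd);
    return 0;
}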

Stefan Hajnoczi (2):
  block/file-posix: implement bdrv_co_invalidate_cache() on Linux
  block/file-posix: verify page cache is not used

 block/file-posix.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 110 insertions(+)

-- 
2.14.3


Re: [Qemu-devel] [RFC 0/2] block/file-posix: allow -drive cache.direct=off live migration
Posted by Eric Blake 6 years ago
On 04/19/2018 02:52 AM, Stefan Hajnoczi wrote:
> file-posix.c only supports shared storage live migration with -drive
> cache.direct=on due to cache consistency issues.  There are two main shared
> storage configurations: files on NFS and host block devices on SAN LUNs.
> 
> The problem is that QEMU starts on the destination host before the source host
> has written everything out to the disk.  The page cache on the destination host
> may contain stale data read when QEMU opened the image file (before migration
> handover).  Using O_DIRECT avoids this problem but prevents users from taking
> advantage of the host page cache.
> 
> Although cache=none is the recommended setting for virtualization use cases,
> there are scenarios where cache=writeback makes sense.  If the guest has much
> less RAM than the host or many guests share the same backing file, then the
> host page cache can significantly improve disk I/O performance.
> 
> This patch series implements .bdrv_co_invalidate_cache() for block/file-posix.c
> on Linux so that shared storage live migration works.  I have sent it as an RFC
> because cache consistency is not binary: there are corner cases, which I've
> described in the actual patch, and this may require more discussion.

Interesting, in that the NBD list is also discussing the possible
standardization of an NBD_CMD_CACHE command (based on existing practice
in the xNBD implementation), and covering whether that MIGHT be worth
doing as a thin wrapper that corresponds to posix_fadvise() semantics.
Thus, if NBD_CMD_CACHE learns flags, we could support
.bdrv_co_invalidate_cache() through the NBD protocol driver, in addition
to the POSIX file driver.  Obviously, your usage invalidates the cache
of the entire file; but does it also make sense to expose a start/length
subset invalidation, to map more closely onto posix_fadvise() semantics?
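
As a rough sketch of the thin-wrapper idea, a server-side handler could
translate the request directly into a posix_fadvise() call.  The flag names
and values below are invented for illustration; the actual NBD_CMD_CACHE
semantics are still under discussion:

#define _POSIX_C_SOURCE 200112L

#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical flags -- not part of any spec yet. */
#define NBD_CACHE_FLAG_DONTNEED  (1 << 0)

/* Translate a cache request for [offset, offset + length) into advice on
 * the export's file descriptor.  posix_fadvise() returns 0 on success or
 * a positive error number. */
static int handle_cache_cmd(int fd, uint64_t offset, uint32_t length,
                            uint16_t flags)
{
    int advice = (flags & NBD_CACHE_FLAG_DONTNEED) ? POSIX_FADV_DONTNEED
                                                   : POSIX_FADV_WILLNEED;
    return posix_fadvise(fd, (off_t)offset, (off_t)length, advice);
}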

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-devel] [RFC 0/2] block/file-posix: allow -drive cache.direct=off live migration
Posted by Stefan Hajnoczi 6 years ago
On Thu, Apr 19, 2018 at 11:09:53AM -0500, Eric Blake wrote:
> On 04/19/2018 02:52 AM, Stefan Hajnoczi wrote:
> > file-posix.c only supports shared storage live migration with -drive
> > cache.direct=on due to cache consistency issues.  There are two main shared
> > storage configurations: files on NFS and host block devices on SAN LUNs.
> > 
> > The problem is that QEMU starts on the destination host before the source host
> > has written everything out to the disk.  The page cache on the destination host
> > may contain stale data read when QEMU opened the image file (before migration
> > handover).  Using O_DIRECT avoids this problem but prevents users from taking
> > advantage of the host page cache.
> > 
> > Although cache=none is the recommended setting for virtualization use cases,
> > there are scenarios where cache=writeback makes sense.  If the guest has much
> > less RAM than the host or many guests share the same backing file, then the
> > host page cache can significantly improve disk I/O performance.
> > 
> > This patch series implements .bdrv_co_invalidate_cache() for block/file-posix.c
> > on Linux so that shared storage live migration works.  I have sent it as an RFC
> > because cache consistency is not binary: there are corner cases, which I've
> > described in the actual patch, and this may require more discussion.
> 
> Interesting, in that the NBD list is also discussing the possible
> standardization of an NBD_CMD_CACHE command (based on existing practice
> in the xNBD implementation), and covering whether that MIGHT be worth
> doing as a thin wrapper that corresponds to posix_fadvise() semantics.
> Thus, if NBD_CMD_CACHE learns flags, we could support
> .bdrv_co_invalidate_cache() through the NBD protocol driver, in addition
> to the POSIX file driver.  Obviously, your usage invalidates the cache
> of the entire file; but does it also make sense to expose a start/length
> subset invalidation, to map more closely onto posix_fadvise() semantics?

bdrv_co_invalidate_cache() is currently only used by migration before
using a file that may have been touched by the other host.  I don't
think start/length will be needed for that use case.

Can you describe how NBD will use cache invalidation?  Maybe this will
help me understand other use cases.

Stefan
Re: [Qemu-devel] [RFC 0/2] block/file-posix: allow -drive cache.direct=off live migration
Posted by Eric Blake 6 years ago
On 04/19/2018 10:05 PM, Stefan Hajnoczi wrote:

>>> This patch series implements .bdrv_co_invalidate_cache() for block/file-posix.c
>>> on Linux so that shared storage live migration works.  I have sent it as an RFC
>>> because cache consistency is not binary: there are corner cases, which I've
>>> described in the actual patch, and this may require more discussion.
>>
>> Interesting, in that the NBD list is also discussing the possible
>> standardization of an NBD_CMD_CACHE command (based on existing practice
>> in the xNBD implementation), and covering whether that MIGHT be worth
>> doing as a thin wrapper that corresponds to posix_fadvise() semantics.
>> Thus, if NBD_CMD_CACHE learns flags, we could support
>> .bdrv_co_invalidate_cache() through the NBD protocol driver, in addition
>> to the POSIX file driver.  Obviously, your usage invalidates the cache
>> of the entire file; but does it also make sense to expose a start/length
>> subset invalidation, to map more closely onto posix_fadvise() semantics?
> 
> bdrv_co_invalidate_cache() is currently only used by migration before
> using a file that may have been touched by the other host.  I don't
> think start/length will be needed for that use case.
> 
> Can you describe how NBD will use cache invalidation?  Maybe this will
> help me understand other use cases.

That's where things are still under discussion - no one has yet provided
a case that would benefit from a POSIX_FADV_DONTNEED over just a range
of the file [1]; on the other hand, if you know an implementation has a
limited cache, then control over the various posix_fadvise() flags across
ranges of the file may lead to more optimal behavior.  And posix_fadvise()
does have the ability to
work over the entire file (offset 0 length 0) or a subrange (any offset
and nonzero length).  So I'm also fine if .bdrv_co_invalidate_cache()
doesn't expose offset/length parameters, particularly if NBD can't come
up with an actual use case that would benefit.

[1] https://lists.debian.org/nbd/2018/04/msg00020.html
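
To make the two call forms concrete (fd is just an assumed open file
descriptor; the function exists only for illustration):

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>

static void drop_cache_examples(int fd)
{
    /* Entire file: length 0 means "from offset to the end of the file". */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    /* Subrange: only the 64 KiB starting at byte offset 4096. */
    posix_fadvise(fd, 4096, 65536, POSIX_FADV_DONTNEED);
}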

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-devel] [RFC 0/2] block/file-posix: allow -drive cache.direct=off live migration
Posted by Stefan Hajnoczi 6 years ago
On Fri, Apr 20, 2018 at 08:53:33AM -0500, Eric Blake wrote:
> On 04/19/2018 10:05 PM, Stefan Hajnoczi wrote:
> 
> >>> This patch series implements .bdrv_co_invalidate_cache() for block/file-posix.c
> >>> on Linux so that shared storage live migration works.  I have sent it as an RFC
> >>> because cache consistency is not binary: there are corner cases, which I've
> >>> described in the actual patch, and this may require more discussion.
> >>
> >> Interesting, in that the NBD list is also discussing the possible
> >> standardization of an NBD_CMD_CACHE command (based on existing practice
> >> in the xNBD implementation), and covering whether that MIGHT be worth
> >> doing as a thin wrapper that corresponds to posix_fadvise() semantics.
> >> Thus, if NBD_CMD_CACHE learns flags, we could support
> >> .bdrv_co_invalidate_cache() through the NBD protocol driver, in addition
> >> to the POSIX file driver.  Obviously, your usage invalidates the cache
> >> of the entire file; but does it also make sense to expose a start/length
> >> subset invalidation, to map more closely onto posix_fadvise() semantics?
> > 
> > bdrv_co_invalidate_cache() is currently only used by migration before
> > using a file that may have been touched by the other host.  I don't
> > think start/length will be needed for that use case.
> > 
> > Can you describe how NBD will use cache invalidation?  Maybe this will
> > help me understand other use cases.
> 
> That's where things are still under discussion - no one has yet provided
> a case that would benefit from a POSIX_FADV_DONTNEED over just a range
> of the file [1]; on the other hand, if you know an implementation has a
> limited cache, then control over the various posix_fadvise() flags across
> ranges of the file may lead to more optimal behavior.  And posix_fadvise()
> does have the ability to
> work over the entire file (offset 0 length 0) or a subrange (any offset
> and nonzero length).  So I'm also fine if .bdrv_co_invalidate_cache()
> doesn't expose offset/length parameters, particularly if NBD can't come
> up with an actual use case that would benefit.
> 
> [1] https://lists.debian.org/nbd/2018/04/msg00020.html

Okay.  .bdrv_co_invalidate_cache() is an internal interface.  We can
change it later to support NBD functionality, if necessary.  Let's keep
it without start/length for now.

Stefan