file-posix.c only supports shared storage live migration with -drive
cache.direct=off due to cache consistency issues.  There are two main shared
storage configurations: files on NFS and host block devices on SAN LUNs.

The problem is that QEMU starts on the destination host before the source host
has written everything out to the disk.  The page cache on the destination host
may contain stale data read when QEMU opened the image file (before migration
handover).  Using O_DIRECT avoids this problem but prevents users from taking
advantage of the host page cache.

Although cache=none is the recommended setting for virtualization use cases,
there are scenarios where cache=writeback makes sense.  If the guest has much
less RAM than the host or many guests share the same backing file, then the
host page cache can significantly improve disk I/O performance.

This patch series implements .bdrv_co_invalidate_cache() for block/file-posix.c
on Linux so that shared storage live migration works.  I have sent it as an RFC
because cache consistency is not binary: there are corner cases, which I've
described in the actual patch, and this may require more discussion.

Regarding NFS, QEMU relies on O_DIRECT rather than the close-to-open
consistency model (see nfs(5)), which is the basic guarantee provided by NFS.
After this patch, cache consistency is no longer provided by O_DIRECT.  This
patch series relies on fdatasync(2) (source) + posix_fadvise(POSIX_FADV_DONTNEED)
(destination) instead.  I believe it is safe for both NFS and SAN LUNs.  Maybe
we should use fsync(2) instead of fdatasync(2) so that NFS has up-to-date inode
metadata?

Stefan Hajnoczi (2):
  block/file-posix: implement bdrv_co_invalidate_cache() on Linux
  block/file-posix: verify page cache is not used

 block/file-posix.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 110 insertions(+)

-- 
2.14.3
On 04/19/2018 02:52 AM, Stefan Hajnoczi wrote:
> file-posix.c only supports shared storage live migration with -drive
> cache.direct=off due to cache consistency issues.  There are two main shared
> storage configurations: files on NFS and host block devices on SAN LUNs.
>
> The problem is that QEMU starts on the destination host before the source host
> has written everything out to the disk.  The page cache on the destination host
> may contain stale data read when QEMU opened the image file (before migration
> handover).  Using O_DIRECT avoids this problem but prevents users from taking
> advantage of the host page cache.
>
> Although cache=none is the recommended setting for virtualization use cases,
> there are scenarios where cache=writeback makes sense.  If the guest has much
> less RAM than the host or many guests share the same backing file, then the
> host page cache can significantly improve disk I/O performance.
>
> This patch series implements .bdrv_co_invalidate_cache() for block/file-posix.c
> on Linux so that shared storage live migration works.  I have sent it as an RFC
> because cache consistency is not binary, there are corner cases which I've
> described in the actual patch, and this may require more discussion.

Interesting, in that the NBD list is also discussing the possible
standardization of an NBD_CMD_CACHE command (based on existing practice
in the xNBD implementation), and covering whether that MIGHT be worth
doing as a thin wrapper that corresponds to posix_fadvise() semantics.
Thus, if NBD_CMD_CACHE learns flags, we could support
.bdrv_co_invalidate_cache() through the NBD protocol driver, in addition
to the POSIX file driver.  Obviously, your usage invalidates the cache
of the entire file; but does it also make sense to expose a start/length
subset invalidation, for better exposure to posix_fadvise() semantics?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
On Thu, Apr 19, 2018 at 11:09:53AM -0500, Eric Blake wrote:
> On 04/19/2018 02:52 AM, Stefan Hajnoczi wrote:
> > This patch series implements .bdrv_co_invalidate_cache() for block/file-posix.c
> > on Linux so that shared storage live migration works.  I have sent it as an RFC
> > because cache consistency is not binary, there are corner cases which I've
> > described in the actual patch, and this may require more discussion.
>
> Interesting, in that the NBD list is also discussing the possible
> standardization of an NBD_CMD_CACHE command (based on existing practice
> in the xNBD implementation), and covering whether that MIGHT be worth
> doing as a thin wrapper that corresponds to posix_fadvise() semantics.
> Thus, if NBD_CMD_CACHE learns flags, we could support
> .bdrv_co_invalidate_cache() through the NBD protocol driver, in addition
> to the POSIX file driver.  Obviously, your usage invalidates the cache
> of the entire file; but does it also make sense to expose a start/length
> subset invalidation, for better exposure to posix_fadvise() semantics?

bdrv_co_invalidate_cache() is currently only used by migration before
using a file that may have been touched by the other host.  I don't
think start/length will be needed for that use case.

Can you describe how NBD will use cache invalidation?  Maybe this will
help me understand other use cases.

Stefan
On 04/19/2018 10:05 PM, Stefan Hajnoczi wrote:
>>> This patch series implements .bdrv_co_invalidate_cache() for block/file-posix.c
>>> on Linux so that shared storage live migration works.  I have sent it as an RFC
>>> because cache consistency is not binary, there are corner cases which I've
>>> described in the actual patch, and this may require more discussion.
>>
>> Interesting, in that the NBD list is also discussing the possible
>> standardization of an NBD_CMD_CACHE command (based on existing practice
>> in the xNBD implementation), and covering whether that MIGHT be worth
>> doing as a thin wrapper that corresponds to posix_fadvise() semantics.
>> Thus, if NBD_CMD_CACHE learns flags, we could support
>> .bdrv_co_invalidate_cache() through the NBD protocol driver, in addition
>> to the POSIX file driver.  Obviously, your usage invalidates the cache
>> of the entire file; but does it also make sense to expose a start/length
>> subset invalidation, for better exposure to posix_fadvise() semantics?
>
> bdrv_co_invalidate_cache() is currently only used by migration before
> using a file that may have been touched by the other host.  I don't
> think start/length will be needed for that use case.
>
> Can you describe how NBD will use cache invalidation?  Maybe this will
> help me understand other use cases.

That's where things are still under discussion: no one has yet provided
a case that would benefit from POSIX_FADV_DONTNEED over just a range
of the file [1].  On the other hand, it might make sense that if you know
an implementation has a limited cache, then having control over the
various posix_fadvise() flags over various ranges of the file may lead
to better behavior.  And posix_fadvise() does have the ability to
work over the entire file (offset 0, length 0) or a subrange (any offset
and nonzero length).

So I'm also fine if .bdrv_co_invalidate_cache() doesn't expose
offset/length parameters, particularly if NBD can't come up with an
actual use case that would benefit.

[1] https://lists.debian.org/nbd/2018/04/msg00020.html

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
On Fri, Apr 20, 2018 at 08:53:33AM -0500, Eric Blake wrote:
> That's where things are still under discussion: no one has yet provided
> a case that would benefit from POSIX_FADV_DONTNEED over just a range
> of the file [1].  On the other hand, it might make sense that if you know
> an implementation has a limited cache, then having control over the
> various posix_fadvise() flags over various ranges of the file may lead
> to better behavior.  And posix_fadvise() does have the ability to
> work over the entire file (offset 0, length 0) or a subrange (any offset
> and nonzero length).
>
> So I'm also fine if .bdrv_co_invalidate_cache() doesn't expose
> offset/length parameters, particularly if NBD can't come up with an
> actual use case that would benefit.
>
> [1] https://lists.debian.org/nbd/2018/04/msg00020.html

Okay.  .bdrv_co_invalidate_cache() is an internal interface.  We can
change it later to support NBD functionality, if necessary.  Let's keep
it without start/length for now.

Stefan