Linux 4.15 introduces a new mmap flag, MAP_SYNC, which can be used to
guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
files on an ext4/xfs file system mounted with '-o dax').

A description of MAP_SYNC and MAP_SHARED_VALIDATE can be found at

    https://patchwork.kernel.org/patch/10028151/

This patchset enables QEMU to use the MAP_SYNC flag for
memory-backend-file, in order to guarantee the guest write persistence
to backend files supporting DAX.

A new auto/on/off option 'sync' is added to memory-backend-file:

- on: try to pass MAP_SYNC to mmap(2); if MAP_SYNC is not supported or
  'share=off', QEMU will abort
- off: never pass MAP_SYNC to mmap(2)
- auto (default): if MAP_SYNC is supported and 'share=on', work as if
  'sync=on'; otherwise, work as if 'sync=off'

Changes in v4:
 * Add patch 1-3 to switch some functions to a single 'flags'
   parameter. (Michael S. Tsirkin)
 * v3 patch 1-3 become v4 patch 4-6.
 * Patch 4: move definitions of MAP_SYNC and MAP_SHARED_VALIDATE to a
   new header file under include/standard-headers/linux/.
   (Michael S. Tsirkin)
 * Patch 6: refine the description of the 'sync' option.
   (Michael S. Tsirkin)

Changes in v3:
 * Patch 1: add MAP_SHARED_VALIDATE in both sync=on and sync=auto
   cases, and add back the retry mechanism. MAP_SYNC will be ignored
   by Linux kernel 4.15 if MAP_SHARED_VALIDATE is missing.
 * Patch 1: define MAP_SYNC and MAP_SHARED_VALIDATE as 0 on non-Linux
   platforms in order to make qemu_ram_mmap() compile on those
   platforms.
 * Patch 2&3: include more information in error messages of
   memory-backend in the hope of helping users identify the error.
   (Dr. David Alan Gilbert)
 * Patch 3: fix typo in the commit message. (Dr. David Alan Gilbert)

Changes in v2:
 * Add 'sync' option to control the use of MAP_SYNC. (Eduardo Habkost)
 * Remove the unnecessary setting of MAP_SHARED_VALIDATE in some cases
   and the retry mechanism in qemu_ram_mmap(). (Michael S. Tsirkin)
 * Move OS dependent definitions of MAP_SYNC and MAP_SHARED_VALIDATE
   to osdep.h. (Michael S. Tsirkin)

Haozhong Zhang (6):
  util/mmap-alloc: switch qemu_ram_mmap() to 'flags' parameter
  exec: switch qemu_ram_alloc_from_{file,fd} to the 'flags' parameter
  memory: switch memory_region_init_ram_from_file() to 'flags' parameter
  util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap()
  hostmem: add more information in error messages
  hostmem-file: add 'sync' option

 backends/hostmem-file.c               | 51 ++++++++++++++++++++++++++++++++---
 backends/hostmem.c                    | 11 +++++---
 docs/nvdimm.txt                       | 20 +++++++++++++-
 exec.c                                | 15 ++++++-----
 include/exec/memory.h                 | 36 +++++++++++++++++++++++--
 include/exec/ram_addr.h               | 29 ++++++++++++++++++--
 include/qemu/mmap-alloc.h             | 23 +++++++++++++++-
 include/standard-headers/linux/mman.h | 42 +++++++++++++++++++++++++++++
 memory.c                              |  8 +++---
 numa.c                                |  2 +-
 qemu-options.hx                       | 22 ++++++++++++++-
 util/mmap-alloc.c                     | 31 ++++++++++++++++++---
 util/oslib-posix.c                    |  2 +-
 13 files changed, 261 insertions(+), 31 deletions(-)
 create mode 100644 include/standard-headers/linux/mman.h

-- 
2.14.1
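The on/off/auto semantics of the 'sync' option described in the cover letter can be sketched as a small decision function. This is only an illustration of the rules as stated there, not code from the patches: the names SyncMode and resolve_sync() are hypothetical, and the -1 return value merely stands in for the "QEMU will abort" case.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; not taken from the patch series. */
typedef enum { SYNC_AUTO, SYNC_ON, SYNC_OFF } SyncMode;

/*
 * Decide whether MAP_SYNC should be requested.
 * Returns 1 to request MAP_SYNC, 0 to map without it, and -1 when
 * 'sync=on' cannot be honoured (the case where the cover letter says
 * QEMU will abort).
 */
int resolve_sync(SyncMode mode, bool share, bool map_sync_supported)
{
    /* MAP_SYNC is only considered when supported and 'share=on'. */
    bool usable = share && map_sync_supported;

    switch (mode) {
    case SYNC_ON:
        return usable ? 1 : -1;
    case SYNC_OFF:
        return 0;
    case SYNC_AUTO:
        return usable ? 1 : 0;
    }
    assert(0 && "invalid SyncMode");
    return -1;
}
```

Keeping the decision separate from the mmap(2) call makes the three modes easy to check in isolation.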
On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
<haozhong.zhang@intel.com> wrote:
> Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
> guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
> files on ext4/xfs file system mounted with '-o dax').

Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
metadata is in sync after a fault. However, that does not make
filesystem-DAX safe for use with QEMU, because we still need to
coordinate DMA with filesystem operations. There is no way to do that
coordination from within a guest. QEMU needs to use device-dax if the
guest might ever perform DMA to a virtual-pmem range. See this patch
set for more details on the DAX vs DMA problem [1].

I think we need to enforce this in the host kernel, i.e. do not allow
file-backed DAX pages to be mapped in EPT entries unless / until we have
a solution to the DMA synchronization problem.

Apologies for not noticing this earlier.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-December/013704.html
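For readers unfamiliar with the flag under discussion, the following is a minimal user-space sketch of requesting MAP_SYNC together with MAP_SHARED_VALIDATE and falling back to a plain shared mapping. It is not the qemu_ram_mmap() code from the series. The helper name map_shared_try_sync() is hypothetical, the fallback flag values are the generic Linux ones and are an assumption on other architectures, and treating EOPNOTSUPP and EINVAL as "not supported" is likewise an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Fallbacks for older libc headers. These are the generic Linux values;
 * assuming them for every architecture is a simplification. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

/*
 * Hypothetical helper: map 'len' bytes of 'fd' shared and writable,
 * asking for MAP_SYNC first. *got_sync is 1 only if the MAP_SYNC
 * mapping succeeded. Returns MAP_FAILED on error.
 */
void *map_shared_try_sync(int fd, size_t len, int *got_sync)
{
    void *p;

    assert(got_sync != NULL);
    *got_sync = 0;

    /* MAP_SYNC is only validated when paired with MAP_SHARED_VALIDATE. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (p != MAP_FAILED) {
        *got_sync = 1;
        return p;
    }

    /* Assumed "flag not supported" errors; anything else is a real failure. */
    if (errno != EOPNOTSUPP && errno != EINVAL) {
        return MAP_FAILED;
    }

    /* Retry without MAP_SYNC. */
    return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}
```

As the reply above points out, a successful MAP_SYNC mapping says nothing about whether DMA to that mapping is safe.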
On 01/31/18 14:25 -0800, Dan Williams wrote:
> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
> <haozhong.zhang@intel.com> wrote:
> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
> > files on ext4/xfs file system mounted with '-o dax').
>
> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
> metadata is in sync after a fault. However, that does not make
> filesystem-DAX safe for use with QEMU, because we still need to
> coordinate DMA with filesystem operations. There is no way to do that
> coordination from within a guest. QEMU needs to use device-dax if the
> guest might ever perform DMA to a virtual-pmem range. See this patch
> set for more details on the DAX vs DMA problem [1]. I think we need to
> enforce this in the host kernel. I.e. do not allow file backed DAX
> pages to be mapped in EPT entries unless / until we have a solution to
> the DMA synchronization problem. Apologies for not noticing this
> earlier.

QEMU does not truncate or punch holes in the file once it has been
mmap()'ed. Does the problem [1] still exist in that case?

Thanks,
Haozhong

>
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-December/013704.html
On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
<haozhong.zhang@intel.com> wrote:
> On 01/31/18 14:25 -0800, Dan Williams wrote:
>> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
>> <haozhong.zhang@intel.com> wrote:
>> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
>> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
>> > files on ext4/xfs file system mounted with '-o dax').
>>
>> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
>> metadata is in sync after a fault. However, that does not make
>> filesystem-DAX safe for use with QEMU, because we still need to
>> coordinate DMA with filesystem operations. There is no way to do that
>> coordination from within a guest. QEMU needs to use device-dax if the
>> guest might ever perform DMA to a virtual-pmem range. See this patch
>> set for more details on the DAX vs DMA problem [1]. I think we need to
>> enforce this in the host kernel. I.e. do not allow file backed DAX
>> pages to be mapped in EPT entries unless / until we have a solution to
>> the DMA synchronization problem. Apologies for not noticing this
>> earlier.
>
> QEMU does not truncate or punch holes in the file once it has been
> mmap()'ed. Does the problem [1] still exist in that case?

Something else on the system might. The only agent that could enforce
protection is the kernel, and the kernel will likely just disallow
passing addresses from filesystem-dax vmas through to a guest
altogether. I think there's even a problem in the non-DAX case unless
KVM is pinning pages while they are handed out to a guest. The problem
is that we don't have a page cache page to pin in the DAX case.
On 01/31/18 16:08 -0800, Dan Williams wrote:
> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
> <haozhong.zhang@intel.com> wrote:
> > On 01/31/18 14:25 -0800, Dan Williams wrote:
> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
> >> <haozhong.zhang@intel.com> wrote:
> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
> >> > files on ext4/xfs file system mounted with '-o dax').
> >>
> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
> >> metadata is in sync after a fault. However, that does not make
> >> filesystem-DAX safe for use with QEMU, because we still need to
> >> coordinate DMA with filesystem operations. There is no way to do that
> >> coordination from within a guest. QEMU needs to use device-dax if the
> >> guest might ever perform DMA to a virtual-pmem range. See this patch
> >> set for more details on the DAX vs DMA problem [1]. I think we need to
> >> enforce this in the host kernel. I.e. do not allow file backed DAX
> >> pages to be mapped in EPT entries unless / until we have a solution to
> >> the DMA synchronization problem. Apologies for not noticing this
> >> earlier.
> >
> > QEMU does not truncate or punch holes in the file once it has been
> > mmap()'ed. Does the problem [1] still exist in that case?
>
> Something else on the system might. The only agent that could enforce
> protection is the kernel, and the kernel will likely just disallow
> passing addresses from filesystem-dax vmas through to a guest
> altogether. I think there's even a problem in the non-DAX case unless
> KVM is pinning pages while they are handed out to a guest. The problem
> is that we don't have a page cache page to pin in the DAX case.
>

Does it mean any user-space code like

    ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
    // make DMA to ptr

is unsafe?
On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang <haozhong.zhang@intel.com> wrote: > On 01/31/18 16:08 -0800, Dan Williams wrote: >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang >> <haozhong.zhang@intel.com> wrote: >> > On 01/31/18 14:25 -0800, Dan Williams wrote: >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang >> >> <haozhong.zhang@intel.com> wrote: >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to >> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g., >> >> > files on ext4/xfs file system mounted with '-o dax'). >> >> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the >> >> metadata is in sync after a fault. However, that does not make >> >> filesystem-DAX safe for use with QEMU, because we still need to >> >> coordinate DMA with fileystem operations. There is no way to do that >> >> coordination from within a guest. QEMU needs to use device-dax if the >> >> guest might ever perform DMA to a virtual-pmem range. See this patch >> >> set for more details on the DAX vs DMA problem [1]. I think we need to >> >> enforce this in the host kernel. I.e. do not allow file backed DAX >> >> pages to be mapped in EPT entries unless / until we have a solution to >> >> the DMA synchronization problem. Apologies for not noticing this >> >> earlier. >> > >> > QEMU does not truncate or punch holes of the file once it has been >> > mmap()'ed. Does the problem [1] still exist in such case? >> >> Something else on the system might. The only agent that could enforce >> protection is the kernel, and the kernel will likely just disallow >> passing addresses from filesystem-dax vmas through to a guest >> altogether. I think there's even a problem in the non-DAX case unless >> KVM is pinning pages while they are handed out to a guest. The problem >> is that we don't have a page cache page to pin in the DAX case. 
>>
>
> Does it mean any user-space code like
>     ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
>     // make DMA to ptr
> is unsafe?

Yes, it is currently unsafe because there is no coordination with the
filesystem if it decides to make block layout changes. We can fix that
in the non-virtualization case by having the filesystem wait for DMA
completion callbacks (i.e. wait for all pages to be idle), but as far
as I can see we can't do the same coordination for DMA initiated by a
guest device driver.
+ vfio maintainer Alex Williamson in case my understanding of vfio is incorrect. On 01/31/18 16:32 -0800, Dan Williams wrote: > On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang > <haozhong.zhang@intel.com> wrote: > > On 01/31/18 16:08 -0800, Dan Williams wrote: > >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang > >> <haozhong.zhang@intel.com> wrote: > >> > On 01/31/18 14:25 -0800, Dan Williams wrote: > >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang > >> >> <haozhong.zhang@intel.com> wrote: > >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to > >> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g., > >> >> > files on ext4/xfs file system mounted with '-o dax'). > >> >> > >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the > >> >> metadata is in sync after a fault. However, that does not make > >> >> filesystem-DAX safe for use with QEMU, because we still need to > >> >> coordinate DMA with fileystem operations. There is no way to do that > >> >> coordination from within a guest. QEMU needs to use device-dax if the > >> >> guest might ever perform DMA to a virtual-pmem range. See this patch > >> >> set for more details on the DAX vs DMA problem [1]. I think we need to > >> >> enforce this in the host kernel. I.e. do not allow file backed DAX > >> >> pages to be mapped in EPT entries unless / until we have a solution to > >> >> the DMA synchronization problem. Apologies for not noticing this > >> >> earlier. > >> > > >> > QEMU does not truncate or punch holes of the file once it has been > >> > mmap()'ed. Does the problem [1] still exist in such case? > >> > >> Something else on the system might. The only agent that could enforce > >> protection is the kernel, and the kernel will likely just disallow > >> passing addresses from filesystem-dax vmas through to a guest > >> altogether. 
> >> I think there's even a problem in the non-DAX case unless
> >> KVM is pinning pages while they are handed out to a guest. The problem
> >> is that we don't have a page cache page to pin in the DAX case.
> >>
> >
> > Does it mean any user-space code like
> >     ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
> >     // make DMA to ptr
> > is unsafe?
>
> Yes, it is currently unsafe because there is no coordination with the
> filesystem if it decides to make block layout changes. We can fix that
> in the non-virtualization case by having the filesystem wait for DMA
> completion callbacks (i.e. wait for all pages to be idle), but as far
> as I can see we can't do the same coordination for DMA initiated by a
> guest device driver.
>

I think that fix [1] also works for KVM/QEMU. Guest DMA is performed
on two types of devices:

1. For emulated devices, the guest DMA requests are trapped and
   actually performed by QEMU on the host side. The host side fix [1]
   can cover this case.

2. For passthrough devices, vfio pins all pages used by the guest,
   including those backed by dax mode files, if any device is passed
   through to it. If I read the commit message in [2] correctly,
   operations that change the page-to-file offset association of pages
   from dax mode files will be deferred until the reference count of
   the affected pages becomes 1. That is, if any passthrough device
   is used with a VM, the changes of page-to-file offset will not be
   able to happen until the VM is shut down, so the fix [1] still takes
   effect here.

Another question is how a user-space application (e.g., QEMU) knows
whether it's safe to mmap a file on the DAX file system?

[1] https://lists.01.org/pipermail/linux-nvdimm/2017-December/013704.html
[2] https://lists.01.org/pipermail/linux-nvdimm/2017-December/013713.html

Thanks,
Haozhong
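The reference-count argument in point 2 can be pictured with a toy model. This is plain user-space C that only mirrors the description above ("deferred until the reference count of the affected pages becomes 1"); struct toy_page and the toy_* functions are invented for illustration and do not correspond to kernel structures or to the code in [1] or [2].

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a DAX page; not a kernel structure. */
struct toy_page {
    int refcount;   /* 1 means idle, following the description of [2] */
};

/* Model of a pin taken on behalf of DMA (e.g. by a passthrough driver). */
void toy_pin(struct toy_page *p)
{
    p->refcount++;
}

/* Model of releasing such a pin. */
void toy_unpin(struct toy_page *p)
{
    assert(p->refcount > 1);
    p->refcount--;
}

/*
 * Model of the filesystem-side check: a layout change (truncate, hole
 * punch) may proceed only when every affected page is idle.
 */
bool toy_layout_change_allowed(const struct toy_page *pages, int n)
{
    for (int i = 0; i < n; i++) {
        if (pages[i].refcount != 1) {
            return false;
        }
    }
    return true;
}
```

In this model a pin that is held for the lifetime of a VM keeps toy_layout_change_allowed() false for just as long.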
On Wed, Jan 31, 2018 at 6:29 PM, Haozhong Zhang <haozhong.zhang@intel.com> wrote: > + vfio maintainer Alex Williamson in case my understanding of vfio is incorrect. > > On 01/31/18 16:32 -0800, Dan Williams wrote: >> On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang >> <haozhong.zhang@intel.com> wrote: >> > On 01/31/18 16:08 -0800, Dan Williams wrote: >> >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang >> >> <haozhong.zhang@intel.com> wrote: >> >> > On 01/31/18 14:25 -0800, Dan Williams wrote: >> >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang >> >> >> <haozhong.zhang@intel.com> wrote: >> >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to >> >> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g., >> >> >> > files on ext4/xfs file system mounted with '-o dax'). >> >> >> >> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the >> >> >> metadata is in sync after a fault. However, that does not make >> >> >> filesystem-DAX safe for use with QEMU, because we still need to >> >> >> coordinate DMA with fileystem operations. There is no way to do that >> >> >> coordination from within a guest. QEMU needs to use device-dax if the >> >> >> guest might ever perform DMA to a virtual-pmem range. See this patch >> >> >> set for more details on the DAX vs DMA problem [1]. I think we need to >> >> >> enforce this in the host kernel. I.e. do not allow file backed DAX >> >> >> pages to be mapped in EPT entries unless / until we have a solution to >> >> >> the DMA synchronization problem. Apologies for not noticing this >> >> >> earlier. >> >> > >> >> > QEMU does not truncate or punch holes of the file once it has been >> >> > mmap()'ed. Does the problem [1] still exist in such case? >> >> >> >> Something else on the system might. 
>> >> The only agent that could enforce
>> >> protection is the kernel, and the kernel will likely just disallow
>> >> passing addresses from filesystem-dax vmas through to a guest
>> >> altogether. I think there's even a problem in the non-DAX case unless
>> >> KVM is pinning pages while they are handed out to a guest. The problem
>> >> is that we don't have a page cache page to pin in the DAX case.
>> >>
>> >
>> > Does it mean any user-space code like
>> >     ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
>> >     // make DMA to ptr
>> > is unsafe?
>>
>> Yes, it is currently unsafe because there is no coordination with the
>> filesystem if it decides to make block layout changes. We can fix that
>> in the non-virtualization case by having the filesystem wait for DMA
>> completion callbacks (i.e. wait for all pages to be idle), but as far
>> as I can see we can't do the same coordination for DMA initiated by a
>> guest device driver.
>>
>
> I think that fix [1] also works for KVM/QEMU. Guest DMA is performed
> on two types of devices:
>
> 1. For emulated devices, the guest DMA requests are trapped and
>    actually performed by QEMU on the host side. The host side fix [1]
>    can cover this case.
>
> 2. For passthrough devices, vfio pins all pages used by the guest,
>    including those backed by dax mode files, if any device is passed
>    through to it. If I read the commit message in [2] correctly,
>    operations that change the page-to-file offset association of pages
>    from dax mode files will be deferred until the reference count of
>    the affected pages becomes 1. That is, if any passthrough device
>    is used with a VM, the changes of page-to-file offset will not be
>    able to happen until the VM is shut down, so the fix [1] still takes
>    effect here.

This sounds like a longterm mapping under control of vfio and not the
filesystem. See get_user_pages_longterm(); it is a problem if pages
are pinned indefinitely, especially DAX pages. It sounds like vfio is
in the same boat as RDMA and cannot support long-lived pins of DAX
pages. As of 4.15, RDMA to filesystem-DAX pages has been disabled. The
eventual fix will be to create a "memory-registration with lease"
semantic available for RDMA so that the kernel can forcibly revoke page
pinning to perform physical layout changes. In the near term it seems
vaddr_get_pfn() needs to be fixed to use get_user_pages_longterm() so
that filesystem-dax mappings are explicitly disallowed.

> Another question is how a user-space application (e.g., QEMU) knows
> whether it's safe to mmap a file on the DAX file system?

I think we fix vaddr_get_pfn() to start failing for DAX mappings
unless/until we can add a "with lease" mechanism. Userspace will know
when it is safe again when vfio stops failing.
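The policy proposed here, refusing long-lived pins of filesystem-DAX mappings while leaving other mappings alone, can be illustrated with another toy model. Nothing below is kernel code: struct toy_vma and both functions are hypothetical, and the choice of -EOPNOTSUPP as the failure value is an assumption made for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a mapping; not the kernel's vm_area_struct. */
struct toy_vma {
    bool is_fsdax;  /* true if the mapping is backed by a filesystem-DAX file */
};

/* Model of a short-lived pin: allowed for any mapping in this sketch. */
int toy_pin_short(const struct toy_vma *vma)
{
    assert(vma != NULL);
    return 0;
}

/*
 * Model of a long-lived pin: refused for filesystem-DAX mappings, since
 * the filesystem could otherwise never change the block layout.
 * The error value is an assumption of this sketch.
 */
int toy_pin_longterm(const struct toy_vma *vma)
{
    assert(vma != NULL);
    return vma->is_fsdax ? -EOPNOTSUPP : 0;
}
```

A caller in the position of vfio would see the failure at pin time, which is the "vfio stops failing" signal mentioned above.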
[ adding Michal and lsf-pci ] On Wed, Jan 31, 2018 at 7:02 PM, Dan Williams <dan.j.williams@intel.com> wrote: > On Wed, Jan 31, 2018 at 6:29 PM, Haozhong Zhang > <haozhong.zhang@intel.com> wrote: >> + vfio maintainer Alex Williamson in case my understanding of vfio is incorrect. >> >> On 01/31/18 16:32 -0800, Dan Williams wrote: >>> On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang >>> <haozhong.zhang@intel.com> wrote: >>> > On 01/31/18 16:08 -0800, Dan Williams wrote: >>> >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang >>> >> <haozhong.zhang@intel.com> wrote: >>> >> > On 01/31/18 14:25 -0800, Dan Williams wrote: >>> >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang >>> >> >> <haozhong.zhang@intel.com> wrote: >>> >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to >>> >> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g., >>> >> >> > files on ext4/xfs file system mounted with '-o dax'). >>> >> >> >>> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the >>> >> >> metadata is in sync after a fault. However, that does not make >>> >> >> filesystem-DAX safe for use with QEMU, because we still need to >>> >> >> coordinate DMA with fileystem operations. There is no way to do that >>> >> >> coordination from within a guest. QEMU needs to use device-dax if the >>> >> >> guest might ever perform DMA to a virtual-pmem range. See this patch >>> >> >> set for more details on the DAX vs DMA problem [1]. I think we need to >>> >> >> enforce this in the host kernel. I.e. do not allow file backed DAX >>> >> >> pages to be mapped in EPT entries unless / until we have a solution to >>> >> >> the DMA synchronization problem. Apologies for not noticing this >>> >> >> earlier. >>> >> > >>> >> > QEMU does not truncate or punch holes of the file once it has been >>> >> > mmap()'ed. Does the problem [1] still exist in such case? >>> >> >>> >> Something else on the system might. 
The only agent that could enforce >>> >> protection is the kernel, and the kernel will likely just disallow >>> >> passing addresses from filesystem-dax vmas through to a guest >>> >> altogether. I think there's even a problem in the non-DAX case unless >>> >> KVM is pinning pages while they are handed out to a guest. The problem >>> >> is that we don't have a page cache page to pin in the DAX case. >>> >> >>> > >>> > Does it mean any user-space code like >>> > ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem >>> > // make DMA to ptr >>> > is unsafe? >>> >>> Yes, it is currently unsafe because there is no coordination with the >>> filesytem if it decides to make block layout changes. We can fix that >>> in the non-virtualization case by having the filesystem wait for DMA >>> completion callbacks (i.e. what for all pages to be idle), but as far >>> as I can see we can't do the same coordination for DMA initiated by a >>> guest device driver. >>> >> >> I think that fix [1] also works for KVM/QEMU. The guest DMA are >> performed on two types of devices: >> >> 1. For emulated devices, the guest DMA requests are trapped and >> actually performed by QEMU on the host side. The host side fix [1] >> can cover this case. >> >> 2. For passthrough devices, vfio pins all pages, including those >> backed by dax mode files, used by the guest if any device is >> passthroughed to it. If I read the commit message in [2] correctly, >> operations that change the page-to-file offset association of pages >> from dax mode files will be deferred until the reference count of >> the affected pages becomes 1. That is, if any passthrough device >> is used with a VM, the changes of page-to-file offset will not be >> able to happen until the VM is shutdown, so the fix [1] still takes >> effect here. > > This sounds like a longterm mapping under control of vfio and not the > filesystem. 
> See get_user_pages_longterm(); it is a problem if pages
> are pinned indefinitely, especially DAX pages. It sounds like vfio is
> in the same boat as RDMA and cannot support long-lived pins of DAX
> pages. As of 4.15, RDMA to filesystem-DAX pages has been disabled. The
> eventual fix will be to create a "memory-registration with lease"
> semantic available for RDMA so that the kernel can forcibly revoke page
> pinning to perform physical layout changes. In the near term it seems
> vaddr_get_pfn() needs to be fixed to use get_user_pages_longterm() so
> that filesystem-dax mappings are explicitly disallowed.
>
>> Another question is how a user-space application (e.g., QEMU) knows
>> whether it's safe to mmap a file on the DAX file system?
>
> I think we fix vaddr_get_pfn() to start failing for DAX mappings
> unless/until we can add a "with lease" mechanism. Userspace will know
> when it is safe again when vfio stops failing.

Btw, there is an LSF/MM topic proposal on this subject [1].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-January/013935.html
On 01/31/18 19:02 -0800, Dan Williams wrote: > On Wed, Jan 31, 2018 at 6:29 PM, Haozhong Zhang > <haozhong.zhang@intel.com> wrote: > > + vfio maintainer Alex Williamson in case my understanding of vfio is incorrect. > > > > On 01/31/18 16:32 -0800, Dan Williams wrote: > >> On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang > >> <haozhong.zhang@intel.com> wrote: > >> > On 01/31/18 16:08 -0800, Dan Williams wrote: > >> >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang > >> >> <haozhong.zhang@intel.com> wrote: > >> >> > On 01/31/18 14:25 -0800, Dan Williams wrote: > >> >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang > >> >> >> <haozhong.zhang@intel.com> wrote: > >> >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to > >> >> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g., > >> >> >> > files on ext4/xfs file system mounted with '-o dax'). > >> >> >> > >> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the > >> >> >> metadata is in sync after a fault. However, that does not make > >> >> >> filesystem-DAX safe for use with QEMU, because we still need to > >> >> >> coordinate DMA with fileystem operations. There is no way to do that > >> >> >> coordination from within a guest. QEMU needs to use device-dax if the > >> >> >> guest might ever perform DMA to a virtual-pmem range. See this patch > >> >> >> set for more details on the DAX vs DMA problem [1]. I think we need to > >> >> >> enforce this in the host kernel. I.e. do not allow file backed DAX > >> >> >> pages to be mapped in EPT entries unless / until we have a solution to > >> >> >> the DMA synchronization problem. Apologies for not noticing this > >> >> >> earlier. > >> >> > > >> >> > QEMU does not truncate or punch holes of the file once it has been > >> >> > mmap()'ed. Does the problem [1] still exist in such case? > >> >> > >> >> Something else on the system might. 
The only agent that could enforce > >> >> protection is the kernel, and the kernel will likely just disallow > >> >> passing addresses from filesystem-dax vmas through to a guest > >> >> altogether. I think there's even a problem in the non-DAX case unless > >> >> KVM is pinning pages while they are handed out to a guest. The problem > >> >> is that we don't have a page cache page to pin in the DAX case. > >> >> > >> > > >> > Does it mean any user-space code like > >> > ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem > >> > // make DMA to ptr > >> > is unsafe? > >> > >> Yes, it is currently unsafe because there is no coordination with the > >> filesytem if it decides to make block layout changes. We can fix that > >> in the non-virtualization case by having the filesystem wait for DMA > >> completion callbacks (i.e. what for all pages to be idle), but as far > >> as I can see we can't do the same coordination for DMA initiated by a > >> guest device driver. > >> > > > > I think that fix [1] also works for KVM/QEMU. The guest DMA are > > performed on two types of devices: > > > > 1. For emulated devices, the guest DMA requests are trapped and > > actually performed by QEMU on the host side. The host side fix [1] > > can cover this case. > > > > 2. For passthrough devices, vfio pins all pages, including those > > backed by dax mode files, used by the guest if any device is > > passthroughed to it. If I read the commit message in [2] correctly, > > operations that change the page-to-file offset association of pages > > from dax mode files will be deferred until the reference count of > > the affected pages becomes 1. That is, if any passthrough device > > is used with a VM, the changes of page-to-file offset will not be > > able to happen until the VM is shutdown, so the fix [1] still takes > > effect here. > > This sounds like a longterm mapping under control of vfio and not the > filesystem. 
> See get_user_pages_longterm(); it is a problem if pages
> are pinned indefinitely, especially DAX pages. It sounds like vfio is
> in the same boat as RDMA and cannot support long-lived pins of DAX
> pages. As of 4.15, RDMA to filesystem-DAX pages has been disabled. The
> eventual fix will be to create a "memory-registration with lease"
> semantic available for RDMA so that the kernel can forcibly revoke page
> pinning to perform physical layout changes. In the near term it seems
> vaddr_get_pfn() needs to be fixed to use get_user_pages_longterm() so
> that filesystem-dax mappings are explicitly disallowed.

It seems that KVM and VFIO need to switch to get_user_pages_longterm(),
which fails to get pages backed by dax mode files.

However, get_user_pages() and its variants in the current KVM and
VFIO may be called after a VM starts running, e.g., when handling an
EPT violation on demand or when hotplugging a passthrough device to the
VM, so simply switching to the longterm version would cause a VM crash
in those cases. Therefore, QEMU also needs to be patched, or its
documentation updated, so that dax files are not used with
memory-backend-file. Paolo, Radim and Alex, what do you think?

Thanks,
Haozhong

>
> > Another question is how a user-space application (e.g., QEMU) knows
> > whether it's safe to mmap a file on the DAX file system?
>
> I think we fix vaddr_get_pfn() to start failing for DAX mappings
> unless/until we can add a "with lease" mechanism. Userspace will know
> when it is safe again when vfio stops failing.
>
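If QEMU were patched along these lines, one very rough starting point would be to look at what kind of object the backend path names, since device-dax nodes (recommended earlier in the thread) are character devices while filesystem-DAX backends are regular files. The sketch below is an assumption-laden illustration, not a proposal from the thread: BackendKind and classify_backend() are hypothetical names, a character device is not necessarily a device-dax node, and this check cannot tell whether a regular file lives on a DAX mount.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* Hypothetical classification used only in this sketch. */
typedef enum {
    BACKEND_ERROR = -1,     /* stat(2) failed */
    BACKEND_REGULAR_FILE,   /* may or may not be on a DAX mount */
    BACKEND_CHAR_DEVICE,    /* candidate for device-dax, not proof of it */
    BACKEND_OTHER
} BackendKind;

/* Classify the object named by 'path' using stat(2). */
BackendKind classify_backend(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0) {
        return BACKEND_ERROR;
    }
    if (S_ISCHR(st.st_mode)) {
        return BACKEND_CHAR_DEVICE;
    }
    if (S_ISREG(st.st_mode)) {
        return BACKEND_REGULAR_FILE;
    }
    return BACKEND_OTHER;
}
```

A real check would need additional information from the kernel to decide whether a regular file is DAX-backed, which is the open question raised in the thread.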
On Thu, 1 Feb 2018 18:17:44 +0800
Haozhong Zhang <haozhong.zhang@intel.com> wrote:

> On 01/31/18 19:02 -0800, Dan Williams wrote:
> > On Wed, Jan 31, 2018 at 6:29 PM, Haozhong Zhang
> > <haozhong.zhang@intel.com> wrote:
> > > + vfio maintainer Alex Williamson in case my understanding of vfio is incorrect.
> > >
> > > On 01/31/18 16:32 -0800, Dan Williams wrote:
> > >> On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang
> > >> <haozhong.zhang@intel.com> wrote:
> > >> > On 01/31/18 16:08 -0800, Dan Williams wrote:
> > >> >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
> > >> >> <haozhong.zhang@intel.com> wrote:
> > >> >> > On 01/31/18 14:25 -0800, Dan Williams wrote:
> > >> >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
> > >> >> >> <haozhong.zhang@intel.com> wrote:
> > >> >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
> > >> >> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
> > >> >> >> > files on ext4/xfs file system mounted with '-o dax').
> > >> >> >>
> > >> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
> > >> >> >> metadata is in sync after a fault. However, that does not make
> > >> >> >> filesystem-DAX safe for use with QEMU, because we still need to
> > >> >> >> coordinate DMA with filesystem operations. There is no way to do that
> > >> >> >> coordination from within a guest. QEMU needs to use device-dax if the
> > >> >> >> guest might ever perform DMA to a virtual-pmem range. See this patch
> > >> >> >> set for more details on the DAX vs DMA problem [1]. I think we need to
> > >> >> >> enforce this in the host kernel. I.e. do not allow file backed DAX
> > >> >> >> pages to be mapped in EPT entries unless / until we have a solution to
> > >> >> >> the DMA synchronization problem. Apologies for not noticing this
> > >> >> >> earlier.
> > >> >> >
> > >> >> > QEMU does not truncate or punch holes of the file once it has been
> > >> >> > mmap()'ed. Does the problem [1] still exist in such a case?
> > >> >>
> > >> >> Something else on the system might. The only agent that could enforce
> > >> >> protection is the kernel, and the kernel will likely just disallow
> > >> >> passing addresses from filesystem-dax vmas through to a guest
> > >> >> altogether. I think there's even a problem in the non-DAX case unless
> > >> >> KVM is pinning pages while they are handed out to a guest. The problem
> > >> >> is that we don't have a page cache page to pin in the DAX case.
> > >> >>
> > >> >
> > >> > Does it mean any user-space code like
> > >> >     ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
> > >> >     // make DMA to ptr
> > >> > is unsafe?
> > >>
> > >> Yes, it is currently unsafe because there is no coordination with the
> > >> filesystem if it decides to make block layout changes. We can fix that
> > >> in the non-virtualization case by having the filesystem wait for DMA
> > >> completion callbacks (i.e. wait for all pages to be idle), but as far
> > >> as I can see we can't do the same coordination for DMA initiated by a
> > >> guest device driver.
> > >>
> > >
> > > I think that fix [1] also works for KVM/QEMU. The guest DMA are
> > > performed on two types of devices:
> > >
> > > 1. For emulated devices, the guest DMA requests are trapped and
> > >    actually performed by QEMU on the host side. The host side fix [1]
> > >    can cover this case.
> > >
> > > 2. For passthrough devices, vfio pins all pages, including those
> > >    backed by dax mode files, used by the guest if any device is
> > >    passed through to it. If I read the commit message in [2] correctly,
> > >    operations that change the page-to-file offset association of pages
> > >    from dax mode files will be deferred until the reference count of
> > >    the affected pages becomes 1. That is, if any passthrough device
> > >    is used with a VM, the changes of page-to-file offset will not be
> > >    able to happen until the VM is shutdown, so the fix [1] still takes
> > >    effect here.
> >
> > This sounds like a longterm mapping under control of vfio and not the
> > filesystem. See get_user_pages_longterm(), it is a problem if pages
> > are pinned indefinitely especially DAX. It sounds like vfio is in the
> > same boat as RDMA and cannot support long lived pins of DAX pages. As
> > of 4.15 RDMA to filesystem-DAX pages has been disabled. The eventual
> > fix will be to create a "memory-registration with lease" semantic
> > available for RDMA so that the kernel can forcibly revoke page pinning
> > to perform physical layout changes. In the near term it seems
> > vaddr_get_pfn() needs to be fixed to use get_user_pages_longterm() so
> > that filesystem-dax mappings are explicitly disallowed.
>
> It seems that KVM and VFIO need to switch to get_user_pages_longterm(),
> which fails to get pages backed by dax mode files.
>
> However, as get_user_pages() and its variants in the current KVM and
> VFIO may be called after a VM starts running, e.g., when handling an EPT
> violation on demand or hotplugging a passthrough device to the VM,
> simply switching to the longterm version would cause a VM crash in those
> cases. Therefore, QEMU also needs to be patched or documented to not
> use dax files with memory-backend-file. Paolo, Radim and Alex, what do
> you think?

Yeah, it looks like vaddr_get_pfn() needs to do its own vma_is_fsdax()
check or be converted to the _longterm gup variant. On hot-adding an
assigned device to a VM, QEMU should just fail the initfn of the device,
which would be non-fatal to the VM. OTOH, if one of these problem
mappings can be hot added to the VM, such as via memory hotplug, I think
the mapping failure would be fatal to the VM.

Thanks,
Alex
On Wed, Jan 31, 2018 at 07:02:27PM -0800, Dan Williams wrote:
> > Another question is how a user-space application (e.g., QEMU) knows
> > whether it's safe to mmap a file on the DAX file system?
>
> I think we fix vaddr_get_pfn() to start failing for DAX mappings
> unless/until we can add a "with lease" mechanism. Userspace will know
> when it is safe again when vfio stops failing.

I read some of the discussion around that but could not figure out
what exactly happens if a file is truncated by a malicious
userspace. Could you let me know, please?

Thanks!

--
MST
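
For readers following the mmap(2) side of this thread, the following is a
minimal userspace sketch of the "try MAP_SYNC, otherwise fall back" pattern
that the cover letter describes for 'sync=auto'. It is not QEMU's
qemu_ram_mmap(); the helper name map_shared_try_sync() is hypothetical, and
the fallback definitions of MAP_SHARED_VALIDATE and MAP_SYNC assume the
generic Linux UAPI values (as used on x86). As Dan points out above, a
successful MAP_SYNC mapping only keeps filesystem metadata in sync after a
fault; it does nothing about the DAX vs DMA coordination problem.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumed generic Linux UAPI values, used only if libc headers lack them. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

/*
 * Hypothetical helper: map the first `len` bytes of `fd` shared and
 * writable. MAP_SYNC is requested together with MAP_SHARED_VALIDATE,
 * because the kernel silently ignores unknown flags such as MAP_SYNC
 * when only plain MAP_SHARED is given.
 *
 * On return, *got_sync tells the caller whether the mapping was made
 * with MAP_SYNC. Returns MAP_FAILED (with errno set) on error.
 */
static void *map_shared_try_sync(int fd, size_t len, bool *got_sync)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (p != MAP_FAILED) {
        *got_sync = true;
        return p;
    }

    /*
     * EOPNOTSUPP: the file or filesystem does not support MAP_SYNC.
     * EINVAL: the kernel predates MAP_SHARED_VALIDATE.
     * Any other error is reported to the caller unchanged.
     */
    if (errno != EOPNOTSUPP && errno != EINVAL) {
        return MAP_FAILED;
    }

    /* Fall back to an ordinary shared mapping without MAP_SYNC. */
    *got_sync = false;
    return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}
```

A caller that wants 'sync=on' behaviour would treat got_sync == false as an
error instead of accepting the fallback mapping.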