[v4] nvdimm: support MAP_SYNC for memory-backend-file

[Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file

Posted by Haozhong Zhang 6 years, 2 months ago

Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
files on ext4/xfs file system mounted with '-o dax').

A description of MAP_SYNC and MAP_SHARED_VALIDATE can be found at
    https://patchwork.kernel.org/patch/10028151/

This patchset enables QEMU to use MAP_SYNC flag for memory-backend-file,
in order to guarantee the guest write persistence to backend files
supporting DAX.

A new auto on/off option 'sync' is added to memory-backend-file:
 - on:  try to pass MAP_SYNC to mmap(2); if MAP_SYNC is not supported or
        'share=off', QEMU will abort
 - off: never pass MAP_SYNC to mmap(2)
 - auto (default): if MAP_SYNC is supported and 'share=on', work as if
        'sync=on'; otherwise, work as if 'sync=off'

Changes in v4:
 * Add patch 1-3 to switch some functions to a single 'flags'
   parameters. (Michael S. Tsirkin)
 * v3 patch 1-3 become v4 patch 4-6.
 * Patch 4: move definitions of MAP_SYNC and MAP_SHARED_VALIDATE to a
   new header file under include/standard-headers/linux/. (Michael S. Tsirkin)
 * Patch 6: refine the description of the 'sync' option. (Michael S. Tsirkin)

Changes in v3:
 * Patch 1: add MAP_SHARED_VALIDATE in both sync=on and sync=auto
   cases, and add back the retry mechanism. MAP_SYNC will be ignored
   by Linux kernel 4.15 if MAP_SHARED_VALIDATE is missed.
 * Patch 1: define MAP_SYNC and MAP_SHARED_VALIDATE as 0 on non-Linux
   platforms in order to make qemu_ram_mmap() compile on those platforms.
 * Patch 2&3: include more information in error messages of
   memory-backend in hope to help user to identify the error.
   (Dr. David Alan Gilbert)
 * Patch 3: fix typo in the commit message. (Dr. David Alan Gilbert)

Changes in v2:
 * Add 'sync' option to control the use of MAP_SYNC. (Eduardo Habkost)
 * Remove the unnecessary set of MAP_SHARED_VALIDATE in some cases and
   the retry mechanism in qemu_ram_mmap(). (Michael S. Tsirkin)
 * Move OS dependent definitions of MAP_SYNC and MAP_SHARED_VALIDATE
   to osdep.h. (Michael S. Tsirkin)


Haozhong Zhang (6):
  util/mmap-alloc: switch qemu_ram_mmap() to 'flags' parameter
  exec: switch qemu_ram_alloc_from_{file,fd} to the 'flags' parameter
  memory: switch memory_region_init_ram_from_file() to 'flags' parameter
  util/mmap-alloc: support MAP_SYNC in qemu_ram_mmap()
  hostmem: add more information in error messages
  hostmem-file: add 'sync' option

 backends/hostmem-file.c               | 51 ++++++++++++++++++++++++++++++++---
 backends/hostmem.c                    | 11 +++++---
 docs/nvdimm.txt                       | 20 +++++++++++++-
 exec.c                                | 15 ++++++-----
 include/exec/memory.h                 | 36 +++++++++++++++++++++++--
 include/exec/ram_addr.h               | 29 ++++++++++++++++++--
 include/qemu/mmap-alloc.h             | 23 +++++++++++++++-
 include/standard-headers/linux/mman.h | 42 +++++++++++++++++++++++++++++
 memory.c                              |  8 +++---
 numa.c                                |  2 +-
 qemu-options.hx                       | 22 ++++++++++++++-
 util/mmap-alloc.c                     | 31 ++++++++++++++++++---
 util/oslib-posix.c                    |  2 +-
 13 files changed, 261 insertions(+), 31 deletions(-)
 create mode 100644 include/standard-headers/linux/mman.h

-- 
2.14.1

Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file

Posted by Dan Williams 6 years, 2 months ago

On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
<haozhong.zhang@intel.com> wrote:
> Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
> guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
> files on ext4/xfs file system mounted with '-o dax').

Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
metadata is in sync after a fault. However, that does not make
filesystem-DAX safe for use with QEMU, because we still need to
coordinate DMA with fileystem operations. There is no way to do that
coordination from within a guest. QEMU needs to use device-dax if the
guest might ever perform DMA to a virtual-pmem range. See this patch
set for more details on the DAX vs DMA problem [1]. I think we need to
enforce this in the host kernel. I.e. do not allow file backed DAX
pages to be mapped in EPT entries unless / until we have a solution to
the DMA synchronization problem. Apologies for not noticing this
earlier.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-December/013704.html

Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file

Posted by Haozhong Zhang 6 years, 2 months ago

On 01/31/18 14:25 -0800, Dan Williams wrote:
> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
> <haozhong.zhang@intel.com> wrote:
> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
> > files on ext4/xfs file system mounted with '-o dax').
> 
> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
> metadata is in sync after a fault. However, that does not make
> filesystem-DAX safe for use with QEMU, because we still need to
> coordinate DMA with fileystem operations. There is no way to do that
> coordination from within a guest. QEMU needs to use device-dax if the
> guest might ever perform DMA to a virtual-pmem range. See this patch
> set for more details on the DAX vs DMA problem [1]. I think we need to
> enforce this in the host kernel. I.e. do not allow file backed DAX
> pages to be mapped in EPT entries unless / until we have a solution to
> the DMA synchronization problem. Apologies for not noticing this
> earlier.

QEMU does not truncate or punch holes of the file once it has been
mmap()'ed. Does the problem [1] still exist in such case?

Thanks,
Haozhong

> 
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-December/013704.html

Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file

Posted by Dan Williams 6 years, 2 months ago

On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
<haozhong.zhang@intel.com> wrote:
> On 01/31/18 14:25 -0800, Dan Williams wrote:
>> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
>> <haozhong.zhang@intel.com> wrote:
>> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
>> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
>> > files on ext4/xfs file system mounted with '-o dax').
>>
>> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
>> metadata is in sync after a fault. However, that does not make
>> filesystem-DAX safe for use with QEMU, because we still need to
>> coordinate DMA with fileystem operations. There is no way to do that
>> coordination from within a guest. QEMU needs to use device-dax if the
>> guest might ever perform DMA to a virtual-pmem range. See this patch
>> set for more details on the DAX vs DMA problem [1]. I think we need to
>> enforce this in the host kernel. I.e. do not allow file backed DAX
>> pages to be mapped in EPT entries unless / until we have a solution to
>> the DMA synchronization problem. Apologies for not noticing this
>> earlier.
>
> QEMU does not truncate or punch holes of the file once it has been
> mmap()'ed. Does the problem [1] still exist in such case?

Something else on the system might. The only agent that could enforce
protection is the kernel, and the kernel will likely just disallow
passing addresses from filesystem-dax vmas through to a guest
altogether. I think there's even a problem in the non-DAX case unless
KVM is pinning pages while they are handed out to a guest. The problem
is that we don't have a page cache page to pin in the DAX case.

Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file

Posted by Haozhong Zhang 6 years, 2 months ago

On 01/31/18 16:08 -0800, Dan Williams wrote:
> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
> <haozhong.zhang@intel.com> wrote:
> > On 01/31/18 14:25 -0800, Dan Williams wrote:
> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
> >> <haozhong.zhang@intel.com> wrote:
> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
> >> > files on ext4/xfs file system mounted with '-o dax').
> >>
> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
> >> metadata is in sync after a fault. However, that does not make
> >> filesystem-DAX safe for use with QEMU, because we still need to
> >> coordinate DMA with fileystem operations. There is no way to do that
> >> coordination from within a guest. QEMU needs to use device-dax if the
> >> guest might ever perform DMA to a virtual-pmem range. See this patch
> >> set for more details on the DAX vs DMA problem [1]. I think we need to
> >> enforce this in the host kernel. I.e. do not allow file backed DAX
> >> pages to be mapped in EPT entries unless / until we have a solution to
> >> the DMA synchronization problem. Apologies for not noticing this
> >> earlier.
> >
> > QEMU does not truncate or punch holes of the file once it has been
> > mmap()'ed. Does the problem [1] still exist in such case?
> 
> Something else on the system might. The only agent that could enforce
> protection is the kernel, and the kernel will likely just disallow
> passing addresses from filesystem-dax vmas through to a guest
> altogether. I think there's even a problem in the non-DAX case unless
> KVM is pinning pages while they are handed out to a guest. The problem
> is that we don't have a page cache page to pin in the DAX case.
> 

Does it mean any user-space code like
  ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
  // make DMA to ptr
is unsafe?

Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file

Posted by Dan Williams 6 years, 2 months ago

On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang
<haozhong.zhang@intel.com> wrote:
> On 01/31/18 16:08 -0800, Dan Williams wrote:
>> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
>> <haozhong.zhang@intel.com> wrote:
>> > On 01/31/18 14:25 -0800, Dan Williams wrote:
>> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
>> >> <haozhong.zhang@intel.com> wrote:
>> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
>> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
>> >> > files on ext4/xfs file system mounted with '-o dax').
>> >>
>> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
>> >> metadata is in sync after a fault. However, that does not make
>> >> filesystem-DAX safe for use with QEMU, because we still need to
>> >> coordinate DMA with fileystem operations. There is no way to do that
>> >> coordination from within a guest. QEMU needs to use device-dax if the
>> >> guest might ever perform DMA to a virtual-pmem range. See this patch
>> >> set for more details on the DAX vs DMA problem [1]. I think we need to
>> >> enforce this in the host kernel. I.e. do not allow file backed DAX
>> >> pages to be mapped in EPT entries unless / until we have a solution to
>> >> the DMA synchronization problem. Apologies for not noticing this
>> >> earlier.
>> >
>> > QEMU does not truncate or punch holes of the file once it has been
>> > mmap()'ed. Does the problem [1] still exist in such case?
>>
>> Something else on the system might. The only agent that could enforce
>> protection is the kernel, and the kernel will likely just disallow
>> passing addresses from filesystem-dax vmas through to a guest
>> altogether. I think there's even a problem in the non-DAX case unless
>> KVM is pinning pages while they are handed out to a guest. The problem
>> is that we don't have a page cache page to pin in the DAX case.
>>
>
> Does it mean any user-space code like
>   ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
>   // make DMA to ptr
> is unsafe?

Yes, it is currently unsafe because there is no coordination with the
filesytem if it decides to make block layout changes. We can fix that
in the non-virtualization case by having the filesystem wait for DMA
completion callbacks (i.e. what for all pages to be idle), but as far
as I can see we can't do the same coordination for DMA initiated by a
guest device driver.

Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file

Posted by Haozhong Zhang 6 years, 2 months ago

+ vfio maintainer Alex Williamson in case my understanding of vfio is incorrect.

On 01/31/18 16:32 -0800, Dan Williams wrote:
> On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang
> <haozhong.zhang@intel.com> wrote:
> > On 01/31/18 16:08 -0800, Dan Williams wrote:
> >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
> >> <haozhong.zhang@intel.com> wrote:
> >> > On 01/31/18 14:25 -0800, Dan Williams wrote:
> >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
> >> >> <haozhong.zhang@intel.com> wrote:
> >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
> >> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
> >> >> > files on ext4/xfs file system mounted with '-o dax').
> >> >>
> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
> >> >> metadata is in sync after a fault. However, that does not make
> >> >> filesystem-DAX safe for use with QEMU, because we still need to
> >> >> coordinate DMA with fileystem operations. There is no way to do that
> >> >> coordination from within a guest. QEMU needs to use device-dax if the
> >> >> guest might ever perform DMA to a virtual-pmem range. See this patch
> >> >> set for more details on the DAX vs DMA problem [1]. I think we need to
> >> >> enforce this in the host kernel. I.e. do not allow file backed DAX
> >> >> pages to be mapped in EPT entries unless / until we have a solution to
> >> >> the DMA synchronization problem. Apologies for not noticing this
> >> >> earlier.
> >> >
> >> > QEMU does not truncate or punch holes of the file once it has been
> >> > mmap()'ed. Does the problem [1] still exist in such case?
> >>
> >> Something else on the system might. The only agent that could enforce
> >> protection is the kernel, and the kernel will likely just disallow
> >> passing addresses from filesystem-dax vmas through to a guest
> >> altogether. I think there's even a problem in the non-DAX case unless
> >> KVM is pinning pages while they are handed out to a guest. The problem
> >> is that we don't have a page cache page to pin in the DAX case.
> >>
> >
> > Does it mean any user-space code like
> >   ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
> >   // make DMA to ptr
> > is unsafe?
> 
> Yes, it is currently unsafe because there is no coordination with the
> filesytem if it decides to make block layout changes. We can fix that
> in the non-virtualization case by having the filesystem wait for DMA
> completion callbacks (i.e. what for all pages to be idle), but as far
> as I can see we can't do the same coordination for DMA initiated by a
> guest device driver.
> 

I think that fix [1] also works for KVM/QEMU. The guest DMA are
performed on two types of devices:

1. For emulated devices, the guest DMA requests are trapped and
   actually performed by QEMU on the host side. The host side fix [1]
   can cover this case.

2. For passthrough devices, vfio pins all pages, including those
   backed by dax mode files, used by the guest if any device is
   passthroughed to it. If I read the commit message in [2] correctly,
   operations that change the page-to-file offset association of pages
   from dax mode files will be deferred until the reference count of
   the affected pages becomes 1.  That is, if any passthrough device
   is used with a VM, the changes of page-to-file offset will not be
   able to happen until the VM is shutdown, so the fix [1] still takes
   effect here.

Another question is how a user-space application (e.g., QEMU) knows
whether it's safe to mmap a file on the DAX file system?

[1] https://lists.01.org/pipermail/linux-nvdimm/2017-December/013704.html
[2] https://lists.01.org/pipermail/linux-nvdimm/2017-December/013713.html


Thanks,
Haozhong

Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file

Posted by Dan Williams 6 years, 2 months ago

On Wed, Jan 31, 2018 at 6:29 PM, Haozhong Zhang
<haozhong.zhang@intel.com> wrote:
> + vfio maintainer Alex Williamson in case my understanding of vfio is incorrect.
>
> On 01/31/18 16:32 -0800, Dan Williams wrote:
>> On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang
>> <haozhong.zhang@intel.com> wrote:
>> > On 01/31/18 16:08 -0800, Dan Williams wrote:
>> >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
>> >> <haozhong.zhang@intel.com> wrote:
>> >> > On 01/31/18 14:25 -0800, Dan Williams wrote:
>> >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
>> >> >> <haozhong.zhang@intel.com> wrote:
>> >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
>> >> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
>> >> >> > files on ext4/xfs file system mounted with '-o dax').
>> >> >>
>> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
>> >> >> metadata is in sync after a fault. However, that does not make
>> >> >> filesystem-DAX safe for use with QEMU, because we still need to
>> >> >> coordinate DMA with fileystem operations. There is no way to do that
>> >> >> coordination from within a guest. QEMU needs to use device-dax if the
>> >> >> guest might ever perform DMA to a virtual-pmem range. See this patch
>> >> >> set for more details on the DAX vs DMA problem [1]. I think we need to
>> >> >> enforce this in the host kernel. I.e. do not allow file backed DAX
>> >> >> pages to be mapped in EPT entries unless / until we have a solution to
>> >> >> the DMA synchronization problem. Apologies for not noticing this
>> >> >> earlier.
>> >> >
>> >> > QEMU does not truncate or punch holes of the file once it has been
>> >> > mmap()'ed. Does the problem [1] still exist in such case?
>> >>
>> >> Something else on the system might. The only agent that could enforce
>> >> protection is the kernel, and the kernel will likely just disallow
>> >> passing addresses from filesystem-dax vmas through to a guest
>> >> altogether. I think there's even a problem in the non-DAX case unless
>> >> KVM is pinning pages while they are handed out to a guest. The problem
>> >> is that we don't have a page cache page to pin in the DAX case.
>> >>
>> >
>> > Does it mean any user-space code like
>> >   ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
>> >   // make DMA to ptr
>> > is unsafe?
>>
>> Yes, it is currently unsafe because there is no coordination with the
>> filesytem if it decides to make block layout changes. We can fix that
>> in the non-virtualization case by having the filesystem wait for DMA
>> completion callbacks (i.e. what for all pages to be idle), but as far
>> as I can see we can't do the same coordination for DMA initiated by a
>> guest device driver.
>>
>
> I think that fix [1] also works for KVM/QEMU. The guest DMA are
> performed on two types of devices:
>
> 1. For emulated devices, the guest DMA requests are trapped and
>    actually performed by QEMU on the host side. The host side fix [1]
>    can cover this case.
>
> 2. For passthrough devices, vfio pins all pages, including those
>    backed by dax mode files, used by the guest if any device is
>    passthroughed to it. If I read the commit message in [2] correctly,
>    operations that change the page-to-file offset association of pages
>    from dax mode files will be deferred until the reference count of
>    the affected pages becomes 1.  That is, if any passthrough device
>    is used with a VM, the changes of page-to-file offset will not be
>    able to happen until the VM is shutdown, so the fix [1] still takes
>    effect here.

This sounds like a longterm mapping under control of vfio and not the
filesystem. See get_user_pages_longterm(), it is a problem if pages
are pinned indefinitely especially DAX. It sounds like vfio is in the
same boat as RDMA and cannot support long lived pins of DAX pages. As
of 4.15 RDMA to filesystem-DAX pages has been disabled. The eventual
fix will be to create a "memory-registration with lease" semantic
available for RDMA so that the kernel can forcibly revoke page pinning
to perform physical layout changes. In the near it seems
vaddr_get_pfn() needs to be fixed to use get_user_pages_longterm() so
that filesystem-dax mappings are explicitly disallowed.

> Another question is how a user-space application (e.g., QEMU) knows
> whether it's safe to mmap a file on the DAX file system?

I think we fix vaddr_get_pfn() to start failing for DAX mappings
unless/until we can add a "with lease" mechanism. Userspace will know
when it is safe again when vfio stops failing.

Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file

Posted by Dan Williams 6 years, 2 months ago

[ adding Michal and lsf-pci ]

On Wed, Jan 31, 2018 at 7:02 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Wed, Jan 31, 2018 at 6:29 PM, Haozhong Zhang
> <haozhong.zhang@intel.com> wrote:
>> + vfio maintainer Alex Williamson in case my understanding of vfio is incorrect.
>>
>> On 01/31/18 16:32 -0800, Dan Williams wrote:
>>> On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang
>>> <haozhong.zhang@intel.com> wrote:
>>> > On 01/31/18 16:08 -0800, Dan Williams wrote:
>>> >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
>>> >> <haozhong.zhang@intel.com> wrote:
>>> >> > On 01/31/18 14:25 -0800, Dan Williams wrote:
>>> >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
>>> >> >> <haozhong.zhang@intel.com> wrote:
>>> >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
>>> >> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
>>> >> >> > files on ext4/xfs file system mounted with '-o dax').
>>> >> >>
>>> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
>>> >> >> metadata is in sync after a fault. However, that does not make
>>> >> >> filesystem-DAX safe for use with QEMU, because we still need to
>>> >> >> coordinate DMA with fileystem operations. There is no way to do that
>>> >> >> coordination from within a guest. QEMU needs to use device-dax if the
>>> >> >> guest might ever perform DMA to a virtual-pmem range. See this patch
>>> >> >> set for more details on the DAX vs DMA problem [1]. I think we need to
>>> >> >> enforce this in the host kernel. I.e. do not allow file backed DAX
>>> >> >> pages to be mapped in EPT entries unless / until we have a solution to
>>> >> >> the DMA synchronization problem. Apologies for not noticing this
>>> >> >> earlier.
>>> >> >
>>> >> > QEMU does not truncate or punch holes of the file once it has been
>>> >> > mmap()'ed. Does the problem [1] still exist in such case?
>>> >>
>>> >> Something else on the system might. The only agent that could enforce
>>> >> protection is the kernel, and the kernel will likely just disallow
>>> >> passing addresses from filesystem-dax vmas through to a guest
>>> >> altogether. I think there's even a problem in the non-DAX case unless
>>> >> KVM is pinning pages while they are handed out to a guest. The problem
>>> >> is that we don't have a page cache page to pin in the DAX case.
>>> >>
>>> >
>>> > Does it mean any user-space code like
>>> >   ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
>>> >   // make DMA to ptr
>>> > is unsafe?
>>>
>>> Yes, it is currently unsafe because there is no coordination with the
>>> filesytem if it decides to make block layout changes. We can fix that
>>> in the non-virtualization case by having the filesystem wait for DMA
>>> completion callbacks (i.e. what for all pages to be idle), but as far
>>> as I can see we can't do the same coordination for DMA initiated by a
>>> guest device driver.
>>>
>>
>> I think that fix [1] also works for KVM/QEMU. The guest DMA are
>> performed on two types of devices:
>>
>> 1. For emulated devices, the guest DMA requests are trapped and
>>    actually performed by QEMU on the host side. The host side fix [1]
>>    can cover this case.
>>
>> 2. For passthrough devices, vfio pins all pages, including those
>>    backed by dax mode files, used by the guest if any device is
>>    passthroughed to it. If I read the commit message in [2] correctly,
>>    operations that change the page-to-file offset association of pages
>>    from dax mode files will be deferred until the reference count of
>>    the affected pages becomes 1.  That is, if any passthrough device
>>    is used with a VM, the changes of page-to-file offset will not be
>>    able to happen until the VM is shutdown, so the fix [1] still takes
>>    effect here.
>
> This sounds like a longterm mapping under control of vfio and not the
> filesystem. See get_user_pages_longterm(), it is a problem if pages
> are pinned indefinitely especially DAX. It sounds like vfio is in the
> same boat as RDMA and cannot support long lived pins of DAX pages. As
> of 4.15 RDMA to filesystem-DAX pages has been disabled. The eventual
> fix will be to create a "memory-registration with lease" semantic
> available for RDMA so that the kernel can forcibly revoke page pinning
> to perform physical layout changes. In the near it seems
> vaddr_get_pfn() needs to be fixed to use get_user_pages_longterm() so
> that filesystem-dax mappings are explicitly disallowed.
>
>> Another question is how a user-space application (e.g., QEMU) knows
>> whether it's safe to mmap a file on the DAX file system?
>
> I think we fix vaddr_get_pfn() to start failing for DAX mappings
> unless/until we can add a "with lease" mechanism. Userspace will know
> when it is safe again when vfio stops failing.

Btw, there is an LSF/MM topic proposal on this subject [1].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-January/013935.html

Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file

Posted by Haozhong Zhang 6 years, 2 months ago

On 01/31/18 19:02 -0800, Dan Williams wrote:
> On Wed, Jan 31, 2018 at 6:29 PM, Haozhong Zhang
> <haozhong.zhang@intel.com> wrote:
> > + vfio maintainer Alex Williamson in case my understanding of vfio is incorrect.
> >
> > On 01/31/18 16:32 -0800, Dan Williams wrote:
> >> On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang
> >> <haozhong.zhang@intel.com> wrote:
> >> > On 01/31/18 16:08 -0800, Dan Williams wrote:
> >> >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
> >> >> <haozhong.zhang@intel.com> wrote:
> >> >> > On 01/31/18 14:25 -0800, Dan Williams wrote:
> >> >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
> >> >> >> <haozhong.zhang@intel.com> wrote:
> >> >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
> >> >> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
> >> >> >> > files on ext4/xfs file system mounted with '-o dax').
> >> >> >>
> >> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
> >> >> >> metadata is in sync after a fault. However, that does not make
> >> >> >> filesystem-DAX safe for use with QEMU, because we still need to
> >> >> >> coordinate DMA with fileystem operations. There is no way to do that
> >> >> >> coordination from within a guest. QEMU needs to use device-dax if the
> >> >> >> guest might ever perform DMA to a virtual-pmem range. See this patch
> >> >> >> set for more details on the DAX vs DMA problem [1]. I think we need to
> >> >> >> enforce this in the host kernel. I.e. do not allow file backed DAX
> >> >> >> pages to be mapped in EPT entries unless / until we have a solution to
> >> >> >> the DMA synchronization problem. Apologies for not noticing this
> >> >> >> earlier.
> >> >> >
> >> >> > QEMU does not truncate or punch holes of the file once it has been
> >> >> > mmap()'ed. Does the problem [1] still exist in such case?
> >> >>
> >> >> Something else on the system might. The only agent that could enforce
> >> >> protection is the kernel, and the kernel will likely just disallow
> >> >> passing addresses from filesystem-dax vmas through to a guest
> >> >> altogether. I think there's even a problem in the non-DAX case unless
> >> >> KVM is pinning pages while they are handed out to a guest. The problem
> >> >> is that we don't have a page cache page to pin in the DAX case.
> >> >>
> >> >
> >> > Does it mean any user-space code like
> >> >   ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
> >> >   // make DMA to ptr
> >> > is unsafe?
> >>
> >> Yes, it is currently unsafe because there is no coordination with the
> >> filesytem if it decides to make block layout changes. We can fix that
> >> in the non-virtualization case by having the filesystem wait for DMA
> >> completion callbacks (i.e. what for all pages to be idle), but as far
> >> as I can see we can't do the same coordination for DMA initiated by a
> >> guest device driver.
> >>
> >
> > I think that fix [1] also works for KVM/QEMU. The guest DMA are
> > performed on two types of devices:
> >
> > 1. For emulated devices, the guest DMA requests are trapped and
> >    actually performed by QEMU on the host side. The host side fix [1]
> >    can cover this case.
> >
> > 2. For passthrough devices, vfio pins all pages, including those
> >    backed by dax mode files, used by the guest if any device is
> >    passthroughed to it. If I read the commit message in [2] correctly,
> >    operations that change the page-to-file offset association of pages
> >    from dax mode files will be deferred until the reference count of
> >    the affected pages becomes 1.  That is, if any passthrough device
> >    is used with a VM, the changes of page-to-file offset will not be
> >    able to happen until the VM is shutdown, so the fix [1] still takes
> >    effect here.
> 
> This sounds like a longterm mapping under control of vfio and not the
> filesystem. See get_user_pages_longterm(), it is a problem if pages
> are pinned indefinitely especially DAX. It sounds like vfio is in the
> same boat as RDMA and cannot support long lived pins of DAX pages. As
> of 4.15 RDMA to filesystem-DAX pages has been disabled. The eventual
> fix will be to create a "memory-registration with lease" semantic
> available for RDMA so that the kernel can forcibly revoke page pinning
> to perform physical layout changes. In the near it seems
> vaddr_get_pfn() needs to be fixed to use get_user_pages_longterm() so
> that filesystem-dax mappings are explicitly disallowed.

It seems that KVM and VFIO need to switch to get_user_pages_longterm()
which fails getting pages backed by dax mode files.

However, as get_user_pages() and its variants in the current KVM and
VFIO may be called after a VM starts running, e.g., handling EPT
violation on demand, and hotplugging a passthrough device to VM,
simply switching to the longterm version would cause VM crash in those
cases. Therefore, it also needs to patch or document in QEMU to not
use dax files with memory-backend-file. Paolo, Radim and Alex, what do
you think?

Thanks,
Haozhong

> 
> > Another question is how a user-space application (e.g., QEMU) knows
> > whether it's safe to mmap a file on the DAX file system?
> 
> I think we fix vaddr_get_pfn() to start failing for DAX mappings
> unless/until we can add a "with lease" mechanism. Userspace will know
> when it is safe again when vfio stops failing.
>

Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file

Posted by Alex Williamson 6 years, 2 months ago

On Thu, 1 Feb 2018 18:17:44 +0800
Haozhong Zhang <haozhong.zhang@intel.com> wrote:

> On 01/31/18 19:02 -0800, Dan Williams wrote:
> > On Wed, Jan 31, 2018 at 6:29 PM, Haozhong Zhang
> > <haozhong.zhang@intel.com> wrote:  
> > > + vfio maintainer Alex Williamson in case my understanding of vfio is incorrect.
> > >
> > > On 01/31/18 16:32 -0800, Dan Williams wrote:  
> > >> On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang
> > >> <haozhong.zhang@intel.com> wrote:  
> > >> > On 01/31/18 16:08 -0800, Dan Williams wrote:  
> > >> >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
> > >> >> <haozhong.zhang@intel.com> wrote:  
> > >> >> > On 01/31/18 14:25 -0800, Dan Williams wrote:  
> > >> >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
> > >> >> >> <haozhong.zhang@intel.com> wrote:  
> > >> >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
> > >> >> >> > guarantee the write persistence to mmap'ed files supporting DAX (e.g.,
> > >> >> >> > files on ext4/xfs file system mounted with '-o dax').  
> > >> >> >>
> > >> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
> > >> >> >> metadata is in sync after a fault. However, that does not make
> > >> >> >> filesystem-DAX safe for use with QEMU, because we still need to
> > >> >> >> coordinate DMA with fileystem operations. There is no way to do that
> > >> >> >> coordination from within a guest. QEMU needs to use device-dax if the
> > >> >> >> guest might ever perform DMA to a virtual-pmem range. See this patch
> > >> >> >> set for more details on the DAX vs DMA problem [1]. I think we need to
> > >> >> >> enforce this in the host kernel. I.e. do not allow file backed DAX
> > >> >> >> pages to be mapped in EPT entries unless / until we have a solution to
> > >> >> >> the DMA synchronization problem. Apologies for not noticing this
> > >> >> >> earlier.  
> > >> >> >
> > >> >> > QEMU does not truncate or punch holes of the file once it has been
> > >> >> > mmap()'ed. Does the problem [1] still exist in such case?  
> > >> >>
> > >> >> Something else on the system might. The only agent that could enforce
> > >> >> protection is the kernel, and the kernel will likely just disallow
> > >> >> passing addresses from filesystem-dax vmas through to a guest
> > >> >> altogether. I think there's even a problem in the non-DAX case unless
> > >> >> KVM is pinning pages while they are handed out to a guest. The problem
> > >> >> is that we don't have a page cache page to pin in the DAX case.
> > >> >>  
> > >> >
> > >> > Does it mean any user-space code like
> > >> >   ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
> > >> >   // make DMA to ptr
> > >> > is unsafe?  
> > >>
> > >> Yes, it is currently unsafe because there is no coordination with the
> > >> filesytem if it decides to make block layout changes. We can fix that
> > >> in the non-virtualization case by having the filesystem wait for DMA
> > >> completion callbacks (i.e. what for all pages to be idle), but as far
> > >> as I can see we can't do the same coordination for DMA initiated by a
> > >> guest device driver.
> > >>  
> > >
> > > I think that fix [1] also works for KVM/QEMU. The guest DMA are
> > > performed on two types of devices:
> > >
> > > 1. For emulated devices, the guest DMA requests are trapped and
> > >    actually performed by QEMU on the host side. The host side fix [1]
> > >    can cover this case.
> > >
> > > 2. For passthrough devices, vfio pins all pages, including those
> > >    backed by dax mode files, used by the guest if any device is
> > >    passthroughed to it. If I read the commit message in [2] correctly,
> > >    operations that change the page-to-file offset association of pages
> > >    from dax mode files will be deferred until the reference count of
> > >    the affected pages becomes 1.  That is, if any passthrough device
> > >    is used with a VM, the changes of page-to-file offset will not be
> > >    able to happen until the VM is shutdown, so the fix [1] still takes
> > >    effect here.  
> > 
> > This sounds like a longterm mapping under control of vfio and not the
> > filesystem. See get_user_pages_longterm(), it is a problem if pages
> > are pinned indefinitely especially DAX. It sounds like vfio is in the
> > same boat as RDMA and cannot support long lived pins of DAX pages. As
> > of 4.15 RDMA to filesystem-DAX pages has been disabled. The eventual
> > fix will be to create a "memory-registration with lease" semantic
> > available for RDMA so that the kernel can forcibly revoke page pinning
> > to perform physical layout changes. In the near it seems
> > vaddr_get_pfn() needs to be fixed to use get_user_pages_longterm() so
> > that filesystem-dax mappings are explicitly disallowed.  
> 
> It seems that KVM and VFIO need to switch to get_user_pages_longterm()
> which fails getting pages backed by dax mode files.
> 
> However, as get_user_pages() and its variants in the current KVM and
> VFIO may be called after a VM starts running, e.g., handling EPT
> violation on demand, and hotplugging a passthrough device to VM,
> simply switching to the longterm version would cause VM crash in those
> cases. Therefore, it also needs to patch or document in QEMU to not
> use dax files with memory-backend-file. Paolo, Radim and Alex, what do
> you think?

Yeah, it looks like vaddr_get_pfn() needs to do its own vma_is_fsdax()
check or convert it to the _longterm gup variant.  On hot-adding an
assigned device to a VM, QEMU should just fail the initfn of the
device, which would be non-fatal to the VM.  OTOH, if one of these
problem mappings can be hot added to the VM, such as via memory
hotplug, I think the mapping failure would be fatal to the VM.  Thanks,

Alex

Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file

Posted by Michael S. Tsirkin 6 years, 2 months ago

On Wed, Jan 31, 2018 at 07:02:27PM -0800, Dan Williams wrote:
> > Another question is how a user-space application (e.g., QEMU) knows
> > whether it's safe to mmap a file on the DAX file system?
> 
> I think we fix vaddr_get_pfn() to start failing for DAX mappings
> unless/until we can add a "with lease" mechanism. Userspace will know
> when it is safe again when vfio stops failing.

I read some of the discussion around that but could not figure out what
exactly does happen if a file is truncated by a malicious userspace.
Could you let me know pls? Thanks!

-- 
MST