[Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure

Haozhong Zhang posted 4 patches 6 years, 12 months ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20170331084147.32716-1-haozhong.zhang@intel.com
Test checkpatch passed
Test docker passed
Test s390x failed
[Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
Posted by Haozhong Zhang 6 years, 12 months ago
This patch series constructs the flush hint address structures for
nvdimm devices in QEMU.

It's of course not for 2.9. I'm sending it out early in order to get
comments on one point I'm uncertain about (see the detailed explanation
below). Thanks in advance for any comments!


Background
---------------
Flush hint address structure is a substructure of NFIT and specifies
one or more addresses, namely Flush Hint Addresses. Software can write
to any one of these flush hint addresses to cause any preceding writes
to the NVDIMM region to be flushed out of the intervening platform
buffers to the targeted NVDIMM. More details can be found in ACPI Spec
6.1, Section 5.2.25.8 "Flush Hint Address Structure".
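
For quick reference, the layout described there can be sketched as a
packed C struct (the field names below are illustrative, chosen to match
the spec's wording rather than anything in QEMU's code):

  #include <stdint.h>

  /* ACPI 6.1, Section 5.2.25.8: Flush Hint Address Structure */
  struct nfit_flush_hint_address {
      uint16_t type;               /* 6 = Flush Hint Address Structure */
      uint16_t length;             /* size of this structure in bytes */
      uint32_t nfit_device_handle; /* the NVDIMM this structure refers to */
      uint16_t hint_count;         /* number of Flush Hint Addresses below */
      uint8_t  reserved[6];
      uint64_t hint_address[];     /* hint_count 64-bit system physical
                                      addresses, one per flush hint */
  } __attribute__((packed));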


Why is it RFC?
---------------
The RFC tag is added because I'm not sure whether the way this patch
series allocates the guest flush hint addresses is the right one.

QEMU needs to trap guest accesses (at least for writes) to the flush
hint addresses in order to perform the necessary flush on the host
back store. Therefore, QEMU needs to create IO memory regions that
cover those flush hint addresses. In order to create those IO memory
regions, QEMU needs to know the flush hint addresses or their offsets
relative to other known memory regions in advance. So far, so good.

Flush hint addresses are in the guest address space. Looking at how
the current NVDIMM ACPI in QEMU allocates the DSM buffer, it's natural
to take the same approach for flush hint addresses, i.e. let the guest
firmware allocate them from free addresses and patch them into the flush
hint address structure. (*Please correct me if my understanding below is wrong.*)
However, the current allocation and pointer patching are transparent
to QEMU, so QEMU will be unaware of the flush hint addresses, and
consequently have no way to create corresponding IO memory regions in
order to trap guest accesses.

Alternatively, this patch series moves the allocation of flush hint
addresses to QEMU:

1. (Patch 1) We reserve an address range after the end address of each
   nvdimm device. Its size is specified by the user via a new pc-dimm
   option 'reserved-size'.

   For the following example,
        -object memory-backend-file,id=mem0,size=4G,...
        -device nvdimm,id=dimm0,memdev=mem0,reserved-size=4K,...
        -device pc-dimm,id=dimm1,...
   if dimm0 is allocated to address N ~ N+4G, the address of dimm1
   will start from N+4G+4K or higher. N+4G ~ N+4G+4K is reserved for
   dimm0.

2. (Patch 4) When NVDIMM ACPI code builds the flush hint address
   structure for each nvdimm device, it will allocate them from the
   above reserved area, e.g. the flush hint addresses of above dimm0
   are allocated in N+4G ~ N+4G+4K. The addresses are known to QEMU in
   this way, so QEMU can easily create IO memory regions for them.

   If the reserved area is not present or too small, QEMU will report
   errors.
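
In rough terms, QEMU would end up doing something like the following.
This is an illustrative sketch, not the code in patch 4: flush_hint_ops,
flush_hint_register() and the reserved_base/reserved_size parameters are
made-up names, flush_hint_mr would be a new MemoryRegion field on
NVDIMMDevice, and nvdimm_flush() is the flush helper added in patch 2.

  static uint64_t flush_hint_read(void *opaque, hwaddr addr, unsigned size)
  {
      return 0;    /* reads of a flush hint address have no defined meaning */
  }

  static void flush_hint_write(void *opaque, hwaddr addr,
                               uint64_t data, unsigned size)
  {
      /* any write to the reserved range flushes the host back store */
      nvdimm_flush(opaque);
  }

  static const MemoryRegionOps flush_hint_ops = {
      .read = flush_hint_read,
      .write = flush_hint_write,
      .endianness = DEVICE_LITTLE_ENDIAN,
  };

  /* map the reserved area that follows the nvdimm's data area */
  static void flush_hint_register(NVDIMMDevice *nvdimm, MemoryRegion *sysmem,
                                  hwaddr reserved_base, uint64_t reserved_size)
  {
      memory_region_init_io(&nvdimm->flush_hint_mr, OBJECT(nvdimm),
                            &flush_hint_ops, nvdimm, "nvdimm-flush-hint",
                            reserved_size);
      memory_region_add_subregion(sysmem, reserved_base,
                                  &nvdimm->flush_hint_mr);
  }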


How to test?
---------------
Add the options 'flush-hint' and 'reserved-size' when creating an nvdimm
device, e.g.
    qemu-system-x86_64 -machine pc,nvdimm \
                       -m 4G,slots=4,maxmem=128G \
                       -object memory-backend-file,id=mem1,share,mem-path=/dev/pmem0 \
                       -device nvdimm,id=nv1,memdev=mem1,reserved-size=4K,flush-hint \
                       ...

The guest OS should be able to find a flush hint address structure in
NFIT. For a guest Linux kernel v4.8 or later, which supports flush hints,
if QEMU is built with NVDIMM_DEBUG = 1 in include/hw/mem/nvdimm.h, it
will print debug messages like
    nvdimm: Write Flush Hint: offset 0x0, data 0x1
    nvdimm: Write Flush Hint: offset 0x4, data 0x0
when Linux performs a flush via the flush hint addresses.



Haozhong Zhang (4):
  pc-dimm: add 'reserved-size' to reserve address range after the ending address
  nvdimm: add functions to initialize and perform flush on back store
  nvdimm acpi: record the cache line size in AcpiNVDIMMState
  nvdimm acpi: build flush hint address structure if required

 hw/acpi/nvdimm.c         | 111 ++++++++++++++++++++++++++++++++++++++++++++---
 hw/i386/pc.c             |   5 ++-
 hw/i386/pc_piix.c        |   2 +-
 hw/i386/pc_q35.c         |   2 +-
 hw/mem/nvdimm.c          |  48 ++++++++++++++++++++
 hw/mem/pc-dimm.c         |  48 ++++++++++++++++++--
 include/hw/mem/nvdimm.h  |  20 ++++++++-
 include/hw/mem/pc-dimm.h |   2 +
 8 files changed, 224 insertions(+), 14 deletions(-)

-- 
2.10.1


Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
Posted by Xiao Guangrong 6 years, 11 months ago

On 31/03/2017 4:41 PM, Haozhong Zhang wrote:
> This patch series constructs the flush hint address structures for
> nvdimm devices in QEMU.
>
> It's of course not for 2.9. I send it out early in order to get
> comments on one point I'm uncertain (see the detailed explanation
> below). Thanks for any comments in advance!
>
>
> Background
> ---------------
> Flush hint address structure is a substructure of NFIT and specifies
> one or more addresses, namely Flush Hint Addresses. Software can write
> to any one of these flush hint addresses to cause any preceding writes
> to the NVDIMM region to be flushed out of the intervening platform
> buffers to the targeted NVDIMM. More details can be found in ACPI Spec
> 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
>
>
> Why is it RFC?
> ---------------
> RFC is added because I'm not sure whether the way in this patch series
> that allocates the guest flush hint addresses is right.
>
> QEMU needs to trap guest accesses (at least for writes) to the flush
> hint addresses in order to perform the necessary flush on the host
> back store. Therefore, QEMU needs to create IO memory regions that
> cover those flush hint addresses. In order to create those IO memory
> regions, QEMU needs to know the flush hint addresses or their offsets
> to other known memory regions in advance. So far looks good.
>
> Flush hint addresses are in the guest address space. Looking at how
> the current NVDIMM ACPI in QEMU allocates the DSM buffer, it's natural
> to take the same way for flush hint addresses, i.e. let the guest
> firmware allocate from free addresses and patch them in the flush hint
> address structure. (*Please correct me If my following understand is wrong*)
> However, the current allocation and pointer patching are transparent
> to QEMU, so QEMU will be unaware of the flush hint addresses, and
> consequently have no way to create corresponding IO memory regions in
> order to trap guest accesses.

Er, it is awkward, and the flush hint table is static, so it may not be
easily patched.

>
> Alternatively, this patch series moves the allocation of flush hint
> addresses to QEMU:
>
> 1. (Patch 1) We reserve an address range after the end address of each
>    nvdimm device. Its size is specified by the user via a new pc-dimm
>    option 'reserved-size'.
>

Should we make it work only for nvdimm?

>    For the following example,
>         -object memory-backend-file,id=mem0,size=4G,...
>         -device nvdimm,id=dimm0,memdev=mem0,reserved-size=4K,...
>         -device pc-dimm,id=dimm1,...
>    if dimm0 is allocated to address N ~ N+4G, the address of dimm1
>    will start from N+4G+4K or higher. N+4G ~ N+4G+4K is reserved for
>    dimm0.
>
> 2. (Patch 4) When NVDIMM ACPI code builds the flush hint address
>    structure for each nvdimm device, it will allocate them from the
>    above reserved area, e.g. the flush hint addresses of above dimm0
>    are allocated in N+4G ~ N+4G+4K. The addresses are known to QEMU in
>    this way, so QEMU can easily create IO memory regions for them.
>
>    If the reserved area is not present or too small, QEMU will report
>    errors.
>

Should we make 'reserved-size' always be page-aligned and transparent to
the user, i.e., automatically reserve 4K if 'flush-hint' is specified?


Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
Posted by Haozhong Zhang 6 years, 11 months ago
On 04/06/17 17:39 +0800, Xiao Guangrong wrote:
> 
> 
> On 31/03/2017 4:41 PM, Haozhong Zhang wrote:
> > This patch series constructs the flush hint address structures for
> > nvdimm devices in QEMU.
> > 
> > It's of course not for 2.9. I send it out early in order to get
> > comments on one point I'm uncertain (see the detailed explanation
> > below). Thanks for any comments in advance!
> > 
> > 
> > Background
> > ---------------
> > Flush hint address structure is a substructure of NFIT and specifies
> > one or more addresses, namely Flush Hint Addresses. Software can write
> > to any one of these flush hint addresses to cause any preceding writes
> > to the NVDIMM region to be flushed out of the intervening platform
> > buffers to the targeted NVDIMM. More details can be found in ACPI Spec
> > 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
> > 
> > 
> > Why is it RFC?
> > ---------------
> > RFC is added because I'm not sure whether the way in this patch series
> > that allocates the guest flush hint addresses is right.
> > 
> > QEMU needs to trap guest accesses (at least for writes) to the flush
> > hint addresses in order to perform the necessary flush on the host
> > back store. Therefore, QEMU needs to create IO memory regions that
> > cover those flush hint addresses. In order to create those IO memory
> > regions, QEMU needs to know the flush hint addresses or their offsets
> > to other known memory regions in advance. So far looks good.
> > 
> > Flush hint addresses are in the guest address space. Looking at how
> > the current NVDIMM ACPI in QEMU allocates the DSM buffer, it's natural
> > to take the same way for flush hint addresses, i.e. let the guest
> > firmware allocate from free addresses and patch them in the flush hint
> > address structure. (*Please correct me If my following understand is wrong*)
> > However, the current allocation and pointer patching are transparent
> > to QEMU, so QEMU will be unaware of the flush hint addresses, and
> > consequently have no way to create corresponding IO memory regions in
> > order to trap guest accesses.
> 
> Er, it is awkward and flush-hint-table is static which may not be
> easily patched.
> 
> > 
> > Alternatively, this patch series moves the allocation of flush hint
> > addresses to QEMU:
> > 
> > 1. (Patch 1) We reserve an address range after the end address of each
> >    nvdimm device. Its size is specified by the user via a new pc-dimm
> >    option 'reserved-size'.
> > 
> 
> We should make it only work for nvdimm?
>

Yes, we can check whether the machine option 'nvdimm' is present when
plugging the nvdimm.

> >    For the following example,
> >         -object memory-backend-file,id=mem0,size=4G,...
> >         -device nvdimm,id=dimm0,memdev=mem0,reserved-size=4K,...
> >         -device pc-dimm,id=dimm1,...
> >    if dimm0 is allocated to address N ~ N+4G, the address of dimm1
> >    will start from N+4G+4K or higher. N+4G ~ N+4G+4K is reserved for
> >    dimm0.
> > 
> > 2. (Patch 4) When NVDIMM ACPI code builds the flush hint address
> >    structure for each nvdimm device, it will allocate them from the
> >    above reserved area, e.g. the flush hint addresses of above dimm0
> >    are allocated in N+4G ~ N+4G+4K. The addresses are known to QEMU in
> >    this way, so QEMU can easily create IO memory regions for them.
> > 
> >    If the reserved area is not present or too small, QEMU will report
> >    errors.
> > 
> 
> We should make 'reserved-size' always be page-aligned and should be
> transparent to the user, i.e, automatically reserve 4k if 'flush-hint'
> is specified?
>

4K alignment is already enforced by current memory plug code.

As for the automatic reservation, is a non-zero default value
acceptable under QEMU's design conventions in general?

Thanks,
Haozhong

Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
Posted by Xiao Guangrong 6 years, 11 months ago

On 04/06/2017 05:58 PM, Haozhong Zhang wrote:
> On 04/06/17 17:39 +0800, Xiao Guangrong wrote:
>>
>>
>> On 31/03/2017 4:41 PM, Haozhong Zhang wrote:
>>> This patch series constructs the flush hint address structures for
>>> nvdimm devices in QEMU.
>>>
>>> It's of course not for 2.9. I send it out early in order to get
>>> comments on one point I'm uncertain (see the detailed explanation
>>> below). Thanks for any comments in advance!
>>>
>>>
>>> Background
>>> ---------------
>>> Flush hint address structure is a substructure of NFIT and specifies
>>> one or more addresses, namely Flush Hint Addresses. Software can write
>>> to any one of these flush hint addresses to cause any preceding writes
>>> to the NVDIMM region to be flushed out of the intervening platform
>>> buffers to the targeted NVDIMM. More details can be found in ACPI Spec
>>> 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
>>>
>>>
>>> Why is it RFC?
>>> ---------------
>>> RFC is added because I'm not sure whether the way in this patch series
>>> that allocates the guest flush hint addresses is right.
>>>
>>> QEMU needs to trap guest accesses (at least for writes) to the flush
>>> hint addresses in order to perform the necessary flush on the host
>>> back store. Therefore, QEMU needs to create IO memory regions that
>>> cover those flush hint addresses. In order to create those IO memory
>>> regions, QEMU needs to know the flush hint addresses or their offsets
>>> to other known memory regions in advance. So far looks good.
>>>
>>> Flush hint addresses are in the guest address space. Looking at how
>>> the current NVDIMM ACPI in QEMU allocates the DSM buffer, it's natural
>>> to take the same way for flush hint addresses, i.e. let the guest
>>> firmware allocate from free addresses and patch them in the flush hint
>>> address structure. (*Please correct me If my following understand is wrong*)
>>> However, the current allocation and pointer patching are transparent
>>> to QEMU, so QEMU will be unaware of the flush hint addresses, and
>>> consequently have no way to create corresponding IO memory regions in
>>> order to trap guest accesses.
>>
>> Er, it is awkward and flush-hint-table is static which may not be
>> easily patched.
>>
>>>
>>> Alternatively, this patch series moves the allocation of flush hint
>>> addresses to QEMU:
>>>
>>> 1. (Patch 1) We reserve an address range after the end address of each
>>>    nvdimm device. Its size is specified by the user via a new pc-dimm
>>>    option 'reserved-size'.
>>>
>>
>> We should make it only work for nvdimm?
>>
>
> Yes, we can check whether the machine option 'nvdimm' is present when
> plugging the nvdimm.
>
>>>    For the following example,
>>>         -object memory-backend-file,id=mem0,size=4G,...
>>>         -device nvdimm,id=dimm0,memdev=mem0,reserved-size=4K,...
>>>         -device pc-dimm,id=dimm1,...
>>>    if dimm0 is allocated to address N ~ N+4G, the address of dimm1
>>>    will start from N+4G+4K or higher. N+4G ~ N+4G+4K is reserved for
>>>    dimm0.
>>>
>>> 2. (Patch 4) When NVDIMM ACPI code builds the flush hint address
>>>    structure for each nvdimm device, it will allocate them from the
>>>    above reserved area, e.g. the flush hint addresses of above dimm0
>>>    are allocated in N+4G ~ N+4G+4K. The addresses are known to QEMU in
>>>    this way, so QEMU can easily create IO memory regions for them.
>>>
>>>    If the reserved area is not present or too small, QEMU will report
>>>    errors.
>>>
>>
>> We should make 'reserved-size' always be page-aligned and should be
>> transparent to the user, i.e, automatically reserve 4k if 'flush-hint'
>> is specified?
>>
>
> 4K alignment is already enforced by current memory plug code.
>
> About the automatic reservation, is a non-zero default value
> acceptable by qemu design/convention in general?

There's no need to make it a user-visible parameter; a field in the
dimm-dev or nvdimm-dev struct indicating the reserved size is enough.



Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
Posted by Stefan Hajnoczi 6 years, 11 months ago
On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
> This patch series constructs the flush hint address structures for
> nvdimm devices in QEMU.
> 
> It's of course not for 2.9. I send it out early in order to get
> comments on one point I'm uncertain (see the detailed explanation
> below). Thanks for any comments in advance!
> Background
> ---------------

Extra background:

Flush Hint Addresses are necessary because:

1. Some hardware configurations may require them.  In other words, a
   cache flush instruction is not enough to persist data.

2. The host file system may need fsync(2) calls (e.g. to persist
   metadata changes).

Without Flush Hint Addresses only some NVDIMM configurations actually
guarantee data persistence.

> Flush hint address structure is a substructure of NFIT and specifies
> one or more addresses, namely Flush Hint Addresses. Software can write
> to any one of these flush hint addresses to cause any preceding writes
> to the NVDIMM region to be flushed out of the intervening platform
> buffers to the targeted NVDIMM. More details can be found in ACPI Spec
> 6.1, Section 5.2.25.8 "Flush Hint Address Structure".

Do you have performance data?  I'm concerned that Flush Hint Address
hardware interface is not virtualization-friendly.

In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:

  wmb();
  for (i = 0; i < nd_region->ndr_mappings; i++)
      if (ndrd_get_flush_wpq(ndrd, i, 0))
          writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
  wmb();

That looks pretty lightweight - it's an MMIO write between write
barriers.

This patch implements the MMIO write like this:

  void nvdimm_flush(NVDIMMDevice *nvdimm)
  {
      if (nvdimm->backend_fd != -1) {
          /*
           * If the backend store is a physical NVDIMM device, fsync()
           * will trigger the flush via the flush hint on the host device.
           */
          fsync(nvdimm->backend_fd);
      }
  }

The MMIO store instruction turned into a synchronous fsync(2) system
call plus vmexit/vmenter and QEMU userspace context switch:

1. The vcpu blocks during the fsync(2) system call.  The MMIO write
   instruction has an unexpected and huge latency.

2. The vcpu thread holds the QEMU global mutex so all other threads
   (including the monitor) are blocked during fsync(2).  Other vcpu
   threads may block if they vmexit.

It is hard to implement this efficiently in QEMU.  This is why I said
the hardware interface is not virtualization-friendly.  It's cheap on
real hardware but expensive under virtualization.

We should think about the optimal way of implementing Flush Hint
Addresses in QEMU.  But if there is no reasonable way to implement them
then I think it's better *not* to implement them, just like the Block
Window feature which is also not virtualization-friendly.  Users who
want a block device can use virtio-blk.  I don't think NVDIMM Block
Window can achieve better performance than virtio-blk under
virtualization (although I'm happy to be proven wrong).

Some ideas for a faster implementation:

1. Use memory_region_clear_global_locking() to avoid taking the QEMU
   global mutex.  Little synchronization is necessary as long as the
   NVDIMM device isn't hot unplugged (not yet supported anyway).

2. Can the host kernel provide a way to mmap Address Flush Hints from
   the physical NVDIMM in cases where the configuration does not require
   host kernel interception?  That way QEMU can map the physical
   NVDIMM's Address Flush Hints directly into the guest.  The hypervisor
   is bypassed and performance would be good.

I'm not sure there is anything we can do to make the case where the host
kernel wants an fsync(2) fast :(.

Benchmark results would be important for deciding how big the problem
is.
Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
Posted by Haozhong Zhang 6 years, 11 months ago
On 04/06/17 10:43 +0100, Stefan Hajnoczi wrote:
> On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
> > This patch series constructs the flush hint address structures for
> > nvdimm devices in QEMU.
> > 
> > It's of course not for 2.9. I send it out early in order to get
> > comments on one point I'm uncertain (see the detailed explanation
> > below). Thanks for any comments in advance!
> > Background
> > ---------------
> 
> Extra background:
> 
> Flush Hint Addresses are necessary because:
> 
> 1. Some hardware configurations may require them.  In other words, a
>    cache flush instruction is not enough to persist data.
> 
> 2. The host file system may need fsync(2) calls (e.g. to persist
>    metadata changes).
> 
> Without Flush Hint Addresses only some NVDIMM configurations actually
> guarantee data persistence.
> 
> > Flush hint address structure is a substructure of NFIT and specifies
> > one or more addresses, namely Flush Hint Addresses. Software can write
> > to any one of these flush hint addresses to cause any preceding writes
> > to the NVDIMM region to be flushed out of the intervening platform
> > buffers to the targeted NVDIMM. More details can be found in ACPI Spec
> > 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
> 
> Do you have performance data?  I'm concerned that Flush Hint Address
> hardware interface is not virtualization-friendly.
>

I haven't tested how much vNVDIMM performance drops with this patch
series.

I tested the fsync latency of a regular file on bare metal by writing
1 GB of random data to a file (on an ext4 fs on an SSD) and then
performing fsync. The average latency of fsync in that case is 3 ms.
I currently don't have NVDIMM hardware, so I cannot get its latency
data. Anyway, as your comments below suggest, the latency should be
even larger for a VM.
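
A micro-benchmark of roughly this shape (a simplified sketch, not
necessarily the exact test used) measures that: it writes 1 GB to a
file and then times a single fsync.

  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
      const size_t chunk = 1 << 20;        /* 1 MB per write() */
      const size_t total = 1UL << 30;      /* 1 GB in total */
      char *buf = malloc(chunk);
      struct timespec t0, t1;
      int fd;

      if (argc < 2 || !buf) {
          fprintf(stderr, "usage: %s <file>\n", argv[0]);
          return 1;
      }
      fd = open(argv[1], O_CREAT | O_TRUNC | O_WRONLY, 0644);
      if (fd < 0) {
          perror("open");
          return 1;
      }
      memset(buf, 0x5a, chunk);            /* stand-in for random data */
      for (size_t done = 0; done < total; done += chunk) {
          if (write(fd, buf, chunk) != (ssize_t)chunk) {
              perror("write");
              return 1;
          }
      }
      clock_gettime(CLOCK_MONOTONIC, &t0);
      fsync(fd);                           /* the latency being measured */
      clock_gettime(CLOCK_MONOTONIC, &t1);
      printf("fsync latency: %.3f ms\n",
             (t1.tv_sec - t0.tv_sec) * 1e3 +
             (t1.tv_nsec - t0.tv_nsec) / 1e6);
      close(fd);
      free(buf);
      return 0;
  }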

> In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:
> 
>   wmb();
>   for (i = 0; i < nd_region->ndr_mappings; i++)
>       if (ndrd_get_flush_wpq(ndrd, i, 0))
>           writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
>   wmb();
> 
> That looks pretty lightweight - it's an MMIO write between write
> barriers.
> 
> This patch implements the MMIO write like this:
> 
>   void nvdimm_flush(NVDIMMDevice *nvdimm)
>   {
>       if (nvdimm->backend_fd != -1) {
>           /*
>            * If the backend store is a physical NVDIMM device, fsync()
>            * will trigger the flush via the flush hint on the host device.
>            */
>           fsync(nvdimm->backend_fd);
>       }
>   }
> 
> The MMIO store instruction turned into a synchronous fsync(2) system
> call plus vmexit/vmenter and QEMU userspace context switch:
> 
> 1. The vcpu blocks during the fsync(2) system call.  The MMIO write
>    instruction has an unexpected and huge latency.
> 
> 2. The vcpu thread holds the QEMU global mutex so all other threads
>    (including the monitor) are blocked during fsync(2).  Other vcpu
>    threads may block if they vmexit.
> 
> It is hard to implement this efficiently in QEMU.  This is why I said
> the hardware interface is not virtualization-friendly.  It's cheap on
> real hardware but expensive under virtualization.
>

I don't have NVDIMM hardware, so I don't know the latency of writing to
the host flush hint address. Dan, do you have any latency data from
bare metal?

> We should think about the optimal way of implementing Flush Hint
> Addresses in QEMU.  But if there is no reasonable way to implement them
> then I think it's better *not* to implement them, just like the Block
> Window feature which is also not virtualization-friendly.  Users who
> want a block device can use virtio-blk.  I don't think NVDIMM Block
> Window can achieve better performance than virtio-blk under
> virtualization (although I'm happy to be proven wrong).
> 
> Some ideas for a faster implementation:
> 
> 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
>    global mutex.  Little synchronization is necessary as long as the
>    NVDIMM device isn't hot unplugged (not yet supported anyway).
>

The ACPI spec does not say whether multiple parallel writes to the same
flush hint address are allowed. If they are, I think we can remove the
global locking requirement for the MMIO memory region of the vNVDIMM's
flush hint addresses.
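
If that turns out to be allowed, the QEMU-side change would be small.
Something like the following (reusing the placeholder names from the
sketch in the cover letter) could let the flush-hint write handler run
outside the big QEMU lock:

  /*
   * After creating the flush-hint MMIO region, opt it out of the global
   * lock so the fsync() in its write handler does not block other vcpus.
   */
  memory_region_init_io(&nvdimm->flush_hint_mr, OBJECT(nvdimm),
                        &flush_hint_ops, nvdimm, "nvdimm-flush-hint",
                        reserved_size);
  memory_region_clear_global_locking(&nvdimm->flush_hint_mr);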

> 2. Can the host kernel provide a way to mmap Address Flush Hints from
>    the physical NVDIMM in cases where the configuration does not require
>    host kernel interception?  That way QEMU can map the physical
>    NVDIMM's Address Flush Hints directly into the guest.  The hypervisor
>    is bypassed and performance would be good.
>

It may work if the backend store of the vNVDIMM is a physical NVDIMM
region and the latency of writing to the host flush hint address is
much lower than that of fsync.

However, if the backend store is a regular file, then we still need to
use fsync.

> I'm not sure there is anything we can do to make the case where the host
> kernel wants an fsync(2) fast :(.
> 
> Benchmark results would be important for deciding how big the problem
> is.

Let me collect performance data w/ and w/o this patch series.

Thanks,
Haozhong

Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
Posted by Stefan Hajnoczi 6 years, 11 months ago
On Thu, Apr 06, 2017 at 06:31:17PM +0800, Haozhong Zhang wrote:
> On 04/06/17 10:43 +0100, Stefan Hajnoczi wrote:
> > On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
> > We should think about the optimal way of implementing Flush Hint
> > Addresses in QEMU.  But if there is no reasonable way to implement them
> > then I think it's better *not* to implement them, just like the Block
> > Window feature which is also not virtualization-friendly.  Users who
> > want a block device can use virtio-blk.  I don't think NVDIMM Block
> > Window can achieve better performance than virtio-blk under
> > virtualization (although I'm happy to be proven wrong).
> > 
> > Some ideas for a faster implementation:
> > 
> > 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
> >    global mutex.  Little synchronization is necessary as long as the
> >    NVDIMM device isn't hot unplugged (not yet supported anyway).
> >
> 
> ACPI spec does not say it allows or disallows multiple writes to the
> same flush hint address in parallel. If it can, I think we can remove
> the global locking requirement for the MMIO memory region of the flush
> hint address of vNVDIMM.

The Linux code tries to spread the writes but two CPUs can write to the
same address sometimes.

It doesn't matter if two vcpus access the same address because QEMU just
invokes fsync(2).  The host kernel has synchronization to make the fsync
safe.

Stefan
Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
Posted by Xiao Guangrong 6 years, 11 months ago

On 04/06/2017 05:43 PM, Stefan Hajnoczi wrote:
> On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
>> This patch series constructs the flush hint address structures for
>> nvdimm devices in QEMU.
>>
>> It's of course not for 2.9. I send it out early in order to get
>> comments on one point I'm uncertain (see the detailed explanation
>> below). Thanks for any comments in advance!
>> Background
>> ---------------
>
> Extra background:
>
> Flush Hint Addresses are necessary because:
>
> 1. Some hardware configurations may require them.  In other words, a
>    cache flush instruction is not enough to persist data.
>
> 2. The host file system may need fsync(2) calls (e.g. to persist
>    metadata changes).
>
> Without Flush Hint Addresses only some NVDIMM configurations actually
> guarantee data persistence.
>
>> Flush hint address structure is a substructure of NFIT and specifies
>> one or more addresses, namely Flush Hint Addresses. Software can write
>> to any one of these flush hint addresses to cause any preceding writes
>> to the NVDIMM region to be flushed out of the intervening platform
>> buffers to the targeted NVDIMM. More details can be found in ACPI Spec
>> 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
>
> Do you have performance data?  I'm concerned that Flush Hint Address
> hardware interface is not virtualization-friendly.
>
> In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:
>
>   wmb();
>   for (i = 0; i < nd_region->ndr_mappings; i++)
>       if (ndrd_get_flush_wpq(ndrd, i, 0))
>           writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
>   wmb();
>
> That looks pretty lightweight - it's an MMIO write between write
> barriers.
>
> This patch implements the MMIO write like this:
>
>   void nvdimm_flush(NVDIMMDevice *nvdimm)
>   {
>       if (nvdimm->backend_fd != -1) {
>           /*
>            * If the backend store is a physical NVDIMM device, fsync()
>            * will trigger the flush via the flush hint on the host device.
>            */
>           fsync(nvdimm->backend_fd);
>       }
>   }
>
> The MMIO store instruction turned into a synchronous fsync(2) system
> call plus vmexit/vmenter and QEMU userspace context switch:
>
> 1. The vcpu blocks during the fsync(2) system call.  The MMIO write
>    instruction has an unexpected and huge latency.
>
> 2. The vcpu thread holds the QEMU global mutex so all other threads
>    (including the monitor) are blocked during fsync(2).  Other vcpu
>    threads may block if they vmexit.
>
> It is hard to implement this efficiently in QEMU.  This is why I said
> the hardware interface is not virtualization-friendly.  It's cheap on
> real hardware but expensive under virtualization.
>
> We should think about the optimal way of implementing Flush Hint
> Addresses in QEMU.  But if there is no reasonable way to implement them
> then I think it's better *not* to implement them, just like the Block
> Window feature which is also not virtualization-friendly.  Users who
> want a block device can use virtio-blk.  I don't think NVDIMM Block
> Window can achieve better performance than virtio-blk under
> virtualization (although I'm happy to be proven wrong).
>
> Some ideas for a faster implementation:
>
> 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
>    global mutex.  Little synchronization is necessary as long as the
>    NVDIMM device isn't hot unplugged (not yet supported anyway).
>
> 2. Can the host kernel provide a way to mmap Address Flush Hints from
>    the physical NVDIMM in cases where the configuration does not require
>    host kernel interception?  That way QEMU can map the physical
>    NVDIMM's Address Flush Hints directly into the guest.  The hypervisor
>    is bypassed and performance would be good.
>
> I'm not sure there is anything we can do to make the case where the host
> kernel wants an fsync(2) fast :(.

Good point.

We can assume that flushing the CPU cache is always sufficient for
persistence on Intel hardware, so the flush hint table is not
needed if the vNVDIMM is backed by a real Intel NVDIMM device.

If the vNVDIMM device is backed by a regular file, I think
fsync is the bottleneck rather than this MMIO virtualization. :(




Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
Posted by Haozhong Zhang 6 years, 11 months ago
On 04/06/17 20:02 +0800, Xiao Guangrong wrote:
> 
> 
> On 04/06/2017 05:43 PM, Stefan Hajnoczi wrote:
> > On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
> > > This patch series constructs the flush hint address structures for
> > > nvdimm devices in QEMU.
> > > 
> > > It's of course not for 2.9. I send it out early in order to get
> > > comments on one point I'm uncertain (see the detailed explanation
> > > below). Thanks for any comments in advance!
> > > Background
> > > ---------------
> > 
> > Extra background:
> > 
> > Flush Hint Addresses are necessary because:
> > 
> > 1. Some hardware configurations may require them.  In other words, a
> >    cache flush instruction is not enough to persist data.
> > 
> > 2. The host file system may need fsync(2) calls (e.g. to persist
> >    metadata changes).
> > 
> > Without Flush Hint Addresses only some NVDIMM configurations actually
> > guarantee data persistence.
> > 
> > > Flush hint address structure is a substructure of NFIT and specifies
> > > one or more addresses, namely Flush Hint Addresses. Software can write
> > > to any one of these flush hint addresses to cause any preceding writes
> > > to the NVDIMM region to be flushed out of the intervening platform
> > > buffers to the targeted NVDIMM. More details can be found in ACPI Spec
> > > 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
> > 
> > Do you have performance data?  I'm concerned that Flush Hint Address
> > hardware interface is not virtualization-friendly.
> > 
> > In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:
> > 
> >   wmb();
> >   for (i = 0; i < nd_region->ndr_mappings; i++)
> >       if (ndrd_get_flush_wpq(ndrd, i, 0))
> >           writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
> >   wmb();
> > 
> > That looks pretty lightweight - it's an MMIO write between write
> > barriers.
> > 
> > This patch implements the MMIO write like this:
> > 
> >   void nvdimm_flush(NVDIMMDevice *nvdimm)
> >   {
> >       if (nvdimm->backend_fd != -1) {
> >           /*
> >            * If the backend store is a physical NVDIMM device, fsync()
> >            * will trigger the flush via the flush hint on the host device.
> >            */
> >           fsync(nvdimm->backend_fd);
> >       }
> >   }
> > 
> > The MMIO store instruction turned into a synchronous fsync(2) system
> > call plus vmexit/vmenter and QEMU userspace context switch:
> > 
> > 1. The vcpu blocks during the fsync(2) system call.  The MMIO write
> >    instruction has an unexpected and huge latency.
> > 
> > 2. The vcpu thread holds the QEMU global mutex so all other threads
> >    (including the monitor) are blocked during fsync(2).  Other vcpu
> >    threads may block if they vmexit.
> > 
> > It is hard to implement this efficiently in QEMU.  This is why I said
> > the hardware interface is not virtualization-friendly.  It's cheap on
> > real hardware but expensive under virtualization.
> > 
> > We should think about the optimal way of implementing Flush Hint
> > Addresses in QEMU.  But if there is no reasonable way to implement them
> > then I think it's better *not* to implement them, just like the Block
> > Window feature which is also not virtualization-friendly.  Users who
> > want a block device can use virtio-blk.  I don't think NVDIMM Block
> > Window can achieve better performance than virtio-blk under
> > virtualization (although I'm happy to be proven wrong).
> > 
> > Some ideas for a faster implementation:
> > 
> > 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
> >    global mutex.  Little synchronization is necessary as long as the
> >    NVDIMM device isn't hot unplugged (not yet supported anyway).
> > 
> > 2. Can the host kernel provide a way to mmap Address Flush Hints from
> >    the physical NVDIMM in cases where the configuration does not require
> >    host kernel interception?  That way QEMU can map the physical
> >    NVDIMM's Address Flush Hints directly into the guest.  The hypervisor
> >    is bypassed and performance would be good.
> > 
> > I'm not sure there is anything we can do to make the case where the host
> > kernel wants an fsync(2) fast :(.
> 
> Good point.
> 
> We can assume flush-CPU-cache-to-make-persistence is always
> available on Intel's hardware so that flush-hint-table is not
> needed if the vNVDIMM is based on a real Intel's NVDIMM device.
>

We can let users of qemu (e.g. libvirt) detect whether the backend
device supports ADR, and pass 'flush-hint' option to qemu only if ADR
is not supported.

> If the vNVDIMM device is based on the regular file, i think
> fsync is the bottleneck rather than this mmio-virtualization. :(
> 

Yes, fsync() on a regular file is the bottleneck. We may either

1/ perform the host-side flush in an asynchronous way which will not
   block the vcpu for too long (a sketch follows below),

or

2/ not provide a strong durability guarantee for non-NVDIMM backends and
   not emulate the flush hint for the guest at all. (I know 1/ does not
   provide a strong durability guarantee either.)
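
Option 1/ could look roughly like the following (a minimal sketch using
a plain worker thread rather than QEMU's own async infrastructure; all
names are made up). The MMIO write handler would only queue a request
and return, so the guest's write completes before the data is actually
durable, which is exactly the weaker guarantee mentioned above:

  #include <pthread.h>
  #include <stdbool.h>
  #include <unistd.h>

  static pthread_mutex_t flush_lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  flush_cond = PTHREAD_COND_INITIALIZER;
  static bool flush_pending;
  static int backend_fd = -1;      /* fd of the vNVDIMM backend file */

  /* called from the flush-hint MMIO write handler: queue and return */
  static void flush_request(void)
  {
      pthread_mutex_lock(&flush_lock);
      flush_pending = true;
      pthread_cond_signal(&flush_cond);
      pthread_mutex_unlock(&flush_lock);
  }

  /* worker thread: the slow fsync() runs outside the vcpu's MMIO path */
  static void *flush_worker(void *opaque)
  {
      (void)opaque;
      for (;;) {
          pthread_mutex_lock(&flush_lock);
          while (!flush_pending) {
              pthread_cond_wait(&flush_cond, &flush_lock);
          }
          flush_pending = false;
          pthread_mutex_unlock(&flush_lock);

          fsync(backend_fd);
      }
      return NULL;
  }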

Haozhong

Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
Posted by Dan Williams 6 years, 11 months ago
[ adding Christoph ]

On Tue, Apr 11, 2017 at 1:41 AM, Haozhong Zhang
<haozhong.zhang@intel.com> wrote:
> On 04/06/17 20:02 +0800, Xiao Guangrong wrote:
>>
>>
>> On 04/06/2017 05:43 PM, Stefan Hajnoczi wrote:
>> > On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
>> > > This patch series constructs the flush hint address structures for
>> > > nvdimm devices in QEMU.
>> > >
>> > > It's of course not for 2.9. I send it out early in order to get
>> > > comments on one point I'm uncertain (see the detailed explanation
>> > > below). Thanks for any comments in advance!
>> > > Background
>> > > ---------------
>> >
>> > Extra background:
>> >
>> > Flush Hint Addresses are necessary because:
>> >
>> > 1. Some hardware configurations may require them.  In other words, a
>> >    cache flush instruction is not enough to persist data.
>> >
>> > 2. The host file system may need fsync(2) calls (e.g. to persist
>> >    metadata changes).
>> >
>> > Without Flush Hint Addresses only some NVDIMM configurations actually
>> > guarantee data persistence.
>> >
>> > > Flush hint address structure is a substructure of NFIT and specifies
>> > > one or more addresses, namely Flush Hint Addresses. Software can write
>> > > to any one of these flush hint addresses to cause any preceding writes
>> > > to the NVDIMM region to be flushed out of the intervening platform
>> > > buffers to the targeted NVDIMM. More details can be found in ACPI Spec
>> > > 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
>> >
>> > Do you have performance data?  I'm concerned that Flush Hint Address
>> > hardware interface is not virtualization-friendly.
>> >
>> > In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:
>> >
>> >   wmb();
>> >   for (i = 0; i < nd_region->ndr_mappings; i++)
>> >       if (ndrd_get_flush_wpq(ndrd, i, 0))
>> >           writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
>> >   wmb();
>> >
>> > That looks pretty lightweight - it's an MMIO write between write
>> > barriers.
>> >
>> > This patch implements the MMIO write like this:
>> >
>> >   void nvdimm_flush(NVDIMMDevice *nvdimm)
>> >   {
>> >       if (nvdimm->backend_fd != -1) {
>> >           /*
>> >            * If the backend store is a physical NVDIMM device, fsync()
>> >            * will trigger the flush via the flush hint on the host device.
>> >            */
>> >           fsync(nvdimm->backend_fd);
>> >       }
>> >   }
>> >
>> > The MMIO store instruction turned into a synchronous fsync(2) system
>> > call plus vmexit/vmenter and QEMU userspace context switch:
>> >
>> > 1. The vcpu blocks during the fsync(2) system call.  The MMIO write
>> >    instruction has an unexpected and huge latency.
>> >
>> > 2. The vcpu thread holds the QEMU global mutex so all other threads
>> >    (including the monitor) are blocked during fsync(2).  Other vcpu
>> >    threads may block if they vmexit.
>> >
>> > It is hard to implement this efficiently in QEMU.  This is why I said
>> > the hardware interface is not virtualization-friendly.  It's cheap on
>> > real hardware but expensive under virtualization.
>> >
>> > We should think about the optimal way of implementing Flush Hint
>> > Addresses in QEMU.  But if there is no reasonable way to implement them
>> > then I think it's better *not* to implement them, just like the Block
>> > Window feature which is also not virtualization-friendly.  Users who
>> > want a block device can use virtio-blk.  I don't think NVDIMM Block
>> > Window can achieve better performance than virtio-blk under
>> > virtualization (although I'm happy to be proven wrong).
>> >
>> > Some ideas for a faster implementation:
>> >
>> > 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
>> >    global mutex.  Little synchronization is necessary as long as the
>> >    NVDIMM device isn't hot unplugged (not yet supported anyway).
>> >
>> > 2. Can the host kernel provide a way to mmap Address Flush Hints from
>> >    the physical NVDIMM in cases where the configuration does not require
>> >    host kernel interception?  That way QEMU can map the physical
>> >    NVDIMM's Address Flush Hints directly into the guest.  The hypervisor
>> >    is bypassed and performance would be good.
>> >
>> > I'm not sure there is anything we can do to make the case where the host
>> > kernel wants an fsync(2) fast :(.
>>
>> Good point.
>>
>> We can assume flush-CPU-cache-to-make-persistence is always
>> available on Intel's hardware so that flush-hint-table is not
>> needed if the vNVDIMM is based on a real Intel's NVDIMM device.
>>
>
> We can let users of qemu (e.g. libvirt) detect whether the backend
> device supports ADR, and pass 'flush-hint' option to qemu only if ADR
> is not supported.
>

There currently is no ACPI mechanism to detect the presence of ADR.
Also, you still need the flush for fs metadata management.

>> If the vNVDIMM device is based on the regular file, i think
>> fsync is the bottleneck rather than this mmio-virtualization. :(
>>
>
> Yes, fsync() on the regular file is the bottleneck. We may either
>
> 1/ perform the host-side flush in an asynchronous way which will not
>    block vcpu too long,
>
> or
>
> 2/ not provide strong durability guarantee for non-NVDIMM backend and
>    not emulate flush-hint for guest at all. (I know 1/ does not
>    provide strong durability guarantee either).

or

3/ Use device-dax as a stop-gap until we can get an efficient fsync()
overhead reduction (or bypass) mechanism built and accepted for
filesystem-dax.

Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
Posted by Dan Williams 6 years, 11 months ago
On Tue, Apr 11, 2017 at 7:56 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> [ adding Christoph ]
>
> On Tue, Apr 11, 2017 at 1:41 AM, Haozhong Zhang
> <haozhong.zhang@intel.com> wrote:
>> On 04/06/17 20:02 +0800, Xiao Guangrong wrote:
>>>
>>>
>>> On 04/06/2017 05:43 PM, Stefan Hajnoczi wrote:
>>> > On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
>>> > > This patch series constructs the flush hint address structures for
>>> > > nvdimm devices in QEMU.
>>> > >
>>> > > It's of course not for 2.9. I send it out early in order to get
>>> > > comments on one point I'm uncertain (see the detailed explanation
>>> > > below). Thanks for any comments in advance!
>>> > > Background
>>> > > ---------------
>>> >
>>> > Extra background:
>>> >
>>> > Flush Hint Addresses are necessary because:
>>> >
>>> > 1. Some hardware configurations may require them.  In other words, a
>>> >    cache flush instruction is not enough to persist data.
>>> >
>>> > 2. The host file system may need fsync(2) calls (e.g. to persist
>>> >    metadata changes).
>>> >
>>> > Without Flush Hint Addresses only some NVDIMM configurations actually
>>> > guarantee data persistence.
>>> >
>>> > > Flush hint address structure is a substructure of NFIT and specifies
>>> > > one or more addresses, namely Flush Hint Addresses. Software can write
>>> > > to any one of these flush hint addresses to cause any preceding writes
>>> > > to the NVDIMM region to be flushed out of the intervening platform
>>> > > buffers to the targeted NVDIMM. More details can be found in ACPI Spec
>>> > > 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
>>> >
>>> > Do you have performance data?  I'm concerned that Flush Hint Address
>>> > hardware interface is not virtualization-friendly.
>>> >
>>> > In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:
>>> >
>>> >   wmb();
>>> >   for (i = 0; i < nd_region->ndr_mappings; i++)
>>> >       if (ndrd_get_flush_wpq(ndrd, i, 0))
>>> >           writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
>>> >   wmb();
>>> >
>>> > That looks pretty lightweight - it's an MMIO write between write
>>> > barriers.
>>> >
>>> > This patch implements the MMIO write like this:
>>> >
>>> >   void nvdimm_flush(NVDIMMDevice *nvdimm)
>>> >   {
>>> >       if (nvdimm->backend_fd != -1) {
>>> >           /*
>>> >            * If the backend store is a physical NVDIMM device, fsync()
>>> >            * will trigger the flush via the flush hint on the host device.
>>> >            */
>>> >           fsync(nvdimm->backend_fd);
>>> >       }
>>> >   }
>>> >
>>> > The MMIO store instruction turned into a synchronous fsync(2) system
>>> > call plus vmexit/vmenter and QEMU userspace context switch:
>>> >
>>> > 1. The vcpu blocks during the fsync(2) system call.  The MMIO write
>>> >    instruction has an unexpected and huge latency.
>>> >
>>> > 2. The vcpu thread holds the QEMU global mutex so all other threads
>>> >    (including the monitor) are blocked during fsync(2).  Other vcpu
>>> >    threads may block if they vmexit.
>>> >
>>> > It is hard to implement this efficiently in QEMU.  This is why I said
>>> > the hardware interface is not virtualization-friendly.  It's cheap on
>>> > real hardware but expensive under virtualization.
>>> >
>>> > We should think about the optimal way of implementing Flush Hint
>>> > Addresses in QEMU.  But if there is no reasonable way to implement them
>>> > then I think it's better *not* to implement them, just like the Block
>>> > Window feature which is also not virtualization-friendly.  Users who
>>> > want a block device can use virtio-blk.  I don't think NVDIMM Block
>>> > Window can achieve better performance than virtio-blk under
>>> > virtualization (although I'm happy to be proven wrong).
>>> >
>>> > Some ideas for a faster implementation:
>>> >
>>> > 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
>>> >    global mutex.  Little synchronization is necessary as long as the
>>> >    NVDIMM device isn't hot unplugged (not yet supported anyway).
>>> >
>>> > 2. Can the host kernel provide a way to mmap Address Flush Hints from
>>> >    the physical NVDIMM in cases where the configuration does not require
>>> >    host kernel interception?  That way QEMU can map the physical
>>> >    NVDIMM's Address Flush Hints directly into the guest.  The hypervisor
>>> >    is bypassed and performance would be good.
>>> >
>>> > I'm not sure there is anything we can do to make the case where the host
>>> > kernel wants an fsync(2) fast :(.
>>>
>>> Good point.
>>>
>>> We can assume flush-CPU-cache-to-make-persistence is always
>>> available on Intel's hardware so that flush-hint-table is not
>>> needed if the vNVDIMM is based on a real Intel's NVDIMM device.
>>>
>>
>> We can let users of qemu (e.g. libvirt) detect whether the backend
>> device supports ADR, and pass 'flush-hint' option to qemu only if ADR
>> is not supported.
>>
>
> There currently is no ACPI mechanism to detect the presence of ADR.
> Also, you still need the flush for fs metadata management.
>
>>> If the vNVDIMM device is based on the regular file, i think
>>> fsync is the bottleneck rather than this mmio-virtualization. :(
>>>
>>
>> Yes, fsync() on the regular file is the bottleneck. We may either
>>
>> 1/ perform the host-side flush in an asynchronous way which will not
>>    block vcpu too long,
>>
>> or
>>
>> 2/ not provide strong durability guarantee for non-NVDIMM backend and
>>    not emulate flush-hint for guest at all. (I know 1/ does not
>>    provide strong durability guarantee either).
>
> or
>
> 3/ Use device-dax as a stop-gap until we can get an efficient fsync()
> overhead reduction (or bypass) mechanism built and accepted for
> filesystem-dax.

I didn't realize we have a bigger problem with host filesystem-fsync
and that WPQ exits will not save us. Applications that use device-dax
in the guest may never trigger a WPQ flush, because userspace flushing
with device-dax is expected to be safe. WPQ flush was never meant to
be a persistency mechanism the way it is proposed here; it's only
meant to minimize the fallout from potential ADR failure. My apologies
for insinuating that it was viable.

So, until we solve this userspace flushing problem virtualization must
not pass through any file except a device-dax instance for any
production workload.

Also these performance overheads seem prohibitive. We really want to
take whatever fsync minimization / bypass mechanism we come up with on
the host into a fast para-virtualized interface for the guest. Guests
need to be able to avoid hypervisor and host syscall overhead in the
fast path.

Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
Posted by Stefan Hajnoczi 6 years, 11 months ago
On Thu, Apr 20, 2017 at 12:49:21PM -0700, Dan Williams wrote:
> On Tue, Apr 11, 2017 at 7:56 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> > [ adding Christoph ]
> >
> > On Tue, Apr 11, 2017 at 1:41 AM, Haozhong Zhang
> > <haozhong.zhang@intel.com> wrote:
> >> On 04/06/17 20:02 +0800, Xiao Guangrong wrote:
> >>>
> >>>
> >>> On 04/06/2017 05:43 PM, Stefan Hajnoczi wrote:
> >>> > On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
> >>> > > This patch series constructs the flush hint address structures for
> >>> > > nvdimm devices in QEMU.
> >>> > >
> >>> > > It's of course not for 2.9. I send it out early in order to get
> >>> > > comments on one point I'm uncertain (see the detailed explanation
> >>> > > below). Thanks for any comments in advance!
> >>> > > Background
> >>> > > ---------------
> >>> >
> >>> > Extra background:
> >>> >
> >>> > Flush Hint Addresses are necessary because:
> >>> >
> >>> > 1. Some hardware configurations may require them.  In other words, a
> >>> >    cache flush instruction is not enough to persist data.
> >>> >
> >>> > 2. The host file system may need fsync(2) calls (e.g. to persist
> >>> >    metadata changes).
> >>> >
> >>> > Without Flush Hint Addresses only some NVDIMM configurations actually
> >>> > guarantee data persistence.
> >>> >
> >>> > > Flush hint address structure is a substructure of NFIT and specifies
> >>> > > one or more addresses, namely Flush Hint Addresses. Software can write
> >>> > > to any one of these flush hint addresses to cause any preceding writes
> >>> > > to the NVDIMM region to be flushed out of the intervening platform
> >>> > > buffers to the targeted NVDIMM. More details can be found in ACPI Spec
> >>> > > 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
> >>> >
> >>> > Do you have performance data?  I'm concerned that Flush Hint Address
> >>> > hardware interface is not virtualization-friendly.
> >>> >
> >>> > In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:
> >>> >
> >>> >   wmb();
> >>> >   for (i = 0; i < nd_region->ndr_mappings; i++)
> >>> >       if (ndrd_get_flush_wpq(ndrd, i, 0))
> >>> >           writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
> >>> >   wmb();
> >>> >
> >>> > That looks pretty lightweight - it's an MMIO write between write
> >>> > barriers.
> >>> >
> >>> > This patch implements the MMIO write like this:
> >>> >
> >>> >   void nvdimm_flush(NVDIMMDevice *nvdimm)
> >>> >   {
> >>> >       if (nvdimm->backend_fd != -1) {
> >>> >           /*
> >>> >            * If the backend store is a physical NVDIMM device, fsync()
> >>> >            * will trigger the flush via the flush hint on the host device.
> >>> >            */
> >>> >           fsync(nvdimm->backend_fd);
> >>> >       }
> >>> >   }
> >>> >
> >>> > The MMIO store instruction turned into a synchronous fsync(2) system
> >>> > call plus vmexit/vmenter and QEMU userspace context switch:
> >>> >
> >>> > 1. The vcpu blocks during the fsync(2) system call.  The MMIO write
> >>> >    instruction has an unexpected and huge latency.
> >>> >
> >>> > 2. The vcpu thread holds the QEMU global mutex so all other threads
> >>> >    (including the monitor) are blocked during fsync(2).  Other vcpu
> >>> >    threads may block if they vmexit.
> >>> >
> >>> > It is hard to implement this efficiently in QEMU.  This is why I said
> >>> > the hardware interface is not virtualization-friendly.  It's cheap on
> >>> > real hardware but expensive under virtualization.
> >>> >
> >>> > We should think about the optimal way of implementing Flush Hint
> >>> > Addresses in QEMU.  But if there is no reasonable way to implement them
> >>> > then I think it's better *not* to implement them, just like the Block
> >>> > Window feature which is also not virtualization-friendly.  Users who
> >>> > want a block device can use virtio-blk.  I don't think NVDIMM Block
> >>> > Window can achieve better performance than virtio-blk under
> >>> > virtualization (although I'm happy to be proven wrong).
> >>> >
> >>> > Some ideas for a faster implementation:
> >>> >
> >>> > 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
> >>> >    global mutex.  Little synchronization is necessary as long as the
> >>> >    NVDIMM device isn't hot unplugged (not yet supported anyway).
> >>> >
> >>> > 2. Can the host kernel provide a way to mmap Address Flush Hints from
> >>> >    the physical NVDIMM in cases where the configuration does not require
> >>> >    host kernel interception?  That way QEMU can map the physical
> >>> >    NVDIMM's Address Flush Hints directly into the guest.  The hypervisor
> >>> >    is bypassed and performance would be good.
> >>> >
> >>> > I'm not sure there is anything we can do to make the case where the host
> >>> > kernel wants an fsync(2) fast :(.
> >>>
> >>> Good point.
> >>>
> >>> We can assume flush-CPU-cache-to-make-persistence is always
> >>> available on Intel's hardware so that flush-hint-table is not
> >>> needed if the vNVDIMM is based on a real Intel's NVDIMM device.
> >>>
> >>
> >> We can let users of qemu (e.g. libvirt) detect whether the backend
> >> device supports ADR, and pass 'flush-hint' option to qemu only if ADR
> >> is not supported.
> >>
> >
> > There currently is no ACPI mechanism to detect the presence of ADR.
> > Also, you still need the flush for fs metadata management.
> >
> >>> If the vNVDIMM device is based on the regular file, i think
> >>> fsync is the bottleneck rather than this mmio-virtualization. :(
> >>>
> >>
> >> Yes, fsync() on the regular file is the bottleneck. We may either
> >>
> >> 1/ perform the host-side flush in an asynchronous way which will not
> >>    block vcpu too long,
> >>
> >> or
> >>
> >> 2/ not provide strong durability guarantee for non-NVDIMM backend and
> >>    not emulate flush-hint for guest at all. (I know 1/ does not
> >>    provide strong durability guarantee either).
> >
> > or
> >
> > 3/ Use device-dax as a stop-gap until we can get an efficient fsync()
> > overhead reduction (or bypass) mechanism built and accepted for
> > filesystem-dax.
> 
> I didn't realize we have a bigger problem with host filesystem-fsync
> and that WPQ exits will not save us. Applications that use device-dax
> in the guest may never trigger a WPQ flush, because userspace flushing
> with device-dax is expected to be safe. WPQ flush was never meant to
> be a persistency mechanism the way it is proposed here, it's only
> meant to minimize the fallout from potential ADR failure. My apologies
> for insinuating that it was viable.
> 
> So, until we solve this userspace flushing problem virtualization must
> not pass through any file except a device-dax instance for any
> production workload.

Okay.  That's what I've assumed up until now and I think distros will
document this limitation.

> Also these performance overheads seem prohibitive. We really want to
> take whatever fsync minimization / bypass mechanism we come up with on
> the host into a fast para-virtualized interface for the guest. Guests
> need to be able to avoid hypervisor and host syscall overhead in the
> fast path.

It's hard to avoid the hypervisor if the host kernel file system needs
an fsync() to persist everything.  There should be a fast path for the
case where the host file is fully preallocated and no fancy file system
features (e.g. deduplication, copy-on-write snapshots) are in use, so
that the host file system doesn't need fsync().
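
As an illustration only (not part of this series), here is a minimal
sketch of the host-side preallocation such a fast path assumes.  The
file name and size are placeholders, and on ext4 preallocation still
leaves unwritten-extent conversion for the file system to sync, so this
approximates the fast path rather than guaranteeing it:

  /* preallocate-img.c: sketch of fully preallocating a vNVDIMM backing file */
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      const char *path = "nvm-img";      /* placeholder backing file name */
      off_t size = (off_t)8 << 30;       /* 8 GiB, as an example size */
      int fd = open(path, O_CREAT | O_RDWR, 0600);

      if (fd < 0) {
          perror("open");
          return EXIT_FAILURE;
      }
      /* posix_fallocate() returns an errno value rather than setting errno */
      int err = posix_fallocate(fd, 0, size);
      if (err) {
          fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
          close(fd);
          return EXIT_FAILURE;
      }
      close(fd);
      return EXIT_SUCCESS;
  }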

Stefan
Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
Posted by Dan Williams 6 years, 11 months ago
[ adding xfs and fsdevel ]

On Fri, Apr 21, 2017 at 6:56 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
[..]
>> >>> If the vNVDIMM device is based on a regular file, I think
>> >>> fsync is the bottleneck rather than this mmio-virtualization. :(
>> >>>
>> >>
>> >> Yes, fsync() on the regular file is the bottleneck. We may either
>> >>
>> >> 1/ perform the host-side flush in an asynchronous way which will not
>> >>    block the vcpu for too long,
>> >>
>> >> or
>> >>
>> >> 2/ not provide a strong durability guarantee for non-NVDIMM backends
>> >>    and not emulate the flush hint for the guest at all. (I know 1/
>> >>    does not provide a strong durability guarantee either.)
>> >
>> > or
>> >
>> > 3/ Use device-dax as a stop-gap until we can get an efficient fsync()
>> > overhead reduction (or bypass) mechanism built and accepted for
>> > filesystem-dax.
>>
>> I didn't realize we have a bigger problem with host filesystem-fsync
>> and that WPQ exits will not save us. Applications that use device-dax
>> in the guest may never trigger a WPQ flush, because userspace flushing
>> with device-dax is expected to be safe. WPQ flush was never meant to
>> be a persistence mechanism the way it is proposed here; it's only
>> meant to minimize the fallout from potential ADR failure. My apologies
>> for insinuating that it was viable.
>>
>> So, until we solve this userspace flushing problem virtualization must
>> not pass through any file except a device-dax instance for any
>> production workload.
>
> Okay.  That's what I've assumed up until now and I think distros will
> document this limitation.
>
>> Also these performance overheads seem prohibitive. We really want to
>> take whatever fsync minimization / bypass mechanism we come up with on
>> the host into a fast para-virtualized interface for the guest. Guests
>> need to be able to avoid hypervisor and host syscall overhead in the
>> fast path.
>
> It's hard to avoid the hypervisor if the host kernel file system needs
> an fsync() to persist everything.  There should be a fast path for the
> case where the host file is fully preallocated and no fancy file system
> features (e.g. deduplication, copy-on-write snapshots) are in use, so
> that the host file system doesn't need fsync().
>

So we've gone around and around on this with XFS folks with various
levels of disagreement about how to achieve synchronous faulting or
disabling metadata updates for a file. I think at some point someone
is going to want some fancy filesystem feature *and* DAX *and* still
want it to be fast.

The current problem is that if you want to checkpoint for persistence
at a high rate (think committing updates to a tree data structure), an
fsync() call is going to burn a lot of cycles just to find out that
there is nothing to do in most cases. Especially when you've only
touched a couple of pointers in a cache line, calling into the kernel
to sync those writes is a ridiculous proposition.
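
To make that concrete, here is a rough sketch (not code from this
thread) of what the userspace fast path looks like when the mapping is
device-dax or otherwise ADR-covered: a store, a cache-line flush, and a
fence, with no system call.  It assumes 'n' points into an mmap()ed
pmem region and that the CPU supports CLFLUSHOPT (compile with
-mclflushopt); feature detection and error handling are omitted:

  #include <emmintrin.h>    /* _mm_sfence() */
  #include <immintrin.h>    /* _mm_clflushopt(), needs -mclflushopt */
  #include <stdint.h>

  struct node {
      uint64_t value;
      struct node *next;
  };

  /* Persist a single pointer update in an mmap()ed pmem region. */
  void persist_pointer_update(struct node *n, struct node *new_next)
  {
      n->next = new_next;           /* the actual data update */
      _mm_clflushopt(&n->next);     /* flush the dirty cache line */
      _mm_sfence();                 /* order the flush before the caller
                                       treats the update as committed */
  }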

One of the current ideas to resolve this is, instead of trying to
implement synchronous faulting and wrestling with the constraints that
puts on fs-fault paths, to have synchronous notification of metadata
dirtying events to userspace. That notification mechanism would be
associated with something like an fsync2() library call that knows how
to bypass sys_fsync in the common case. Of course, this is still in the
idea phase, so until we can get a proof-of-concept on its feet this is
all subject to further debate.
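
Purely to illustrate the shape of that idea (nothing here exists today,
and every name is invented), such an fsync2() could look roughly like
this in userspace: consult a per-file dirty-metadata indication that
the proposed notification mechanism would maintain, and only fall back
to a real fsync(2) when metadata actually changed:

  #include <stdbool.h>
  #include <unistd.h>

  /* Invented structure: 'metadata_dirty' would be maintained by the
   * hypothetical synchronous metadata-dirtying notification. */
  struct pmem_file {
      int fd;
      volatile bool metadata_dirty;
  };

  int fsync2(struct pmem_file *pf)
  {
      if (!pf->metadata_dirty) {
          /* Data was already persisted by userspace cache flushes;
           * skip the system call entirely. */
          return 0;
      }
      pf->metadata_dirty = false;
      return fsync(pf->fd);    /* uncommon case: a real fsync(2) */
  }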

Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
Posted by Dan Williams 6 years, 11 months ago
On Thu, Apr 6, 2017 at 2:43 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
>> This patch series constructs the flush hint address structures for
>> nvdimm devices in QEMU.
>>
>> It's of course not for 2.9. I send it out early in order to get
>> comments on one point I'm uncertain (see the detailed explanation
>> below). Thanks for any comments in advance!
>> Background
>> ---------------
>
> Extra background:
>
> Flush Hint Addresses are necessary because:
>
> 1. Some hardware configurations may require them.  In other words, a
>    cache flush instruction is not enough to persist data.
>
> 2. The host file system may need fsync(2) calls (e.g. to persist
>    metadata changes).
>
> Without Flush Hint Addresses only some NVDIMM configurations actually
> guarantee data persistence.
>
>> Flush hint address structure is a substructure of NFIT and specifies
>> one or more addresses, namely Flush Hint Addresses. Software can write
>> to any one of these flush hint addresses to cause any preceding writes
>> to the NVDIMM region to be flushed out of the intervening platform
>> buffers to the targeted NVDIMM. More details can be found in ACPI Spec
>> 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
>
> Do you have performance data?  I'm concerned that Flush Hint Address
> hardware interface is not virtualization-friendly.
>
> In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:
>
>   wmb();
>   for (i = 0; i < nd_region->ndr_mappings; i++)
>       if (ndrd_get_flush_wpq(ndrd, i, 0))
>           writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
>   wmb();
>
> That looks pretty lightweight - it's an MMIO write between write
> barriers.
>
> This patch implements the MMIO write like this:
>
>   void nvdimm_flush(NVDIMMDevice *nvdimm)
>   {
>       if (nvdimm->backend_fd != -1) {
>           /*
>            * If the backend store is a physical NVDIMM device, fsync()
>            * will trigger the flush via the flush hint on the host device.
>            */
>           fsync(nvdimm->backend_fd);
>       }
>   }
>
> The MMIO store instruction turned into a synchronous fsync(2) system
> call plus vmexit/vmenter and QEMU userspace context switch:
>
> 1. The vcpu blocks during the fsync(2) system call.  The MMIO write
>    instruction has an unexpected and huge latency.
>
> 2. The vcpu thread holds the QEMU global mutex so all other threads
>    (including the monitor) are blocked during fsync(2).  Other vcpu
>    threads may block if they vmexit.
>
> It is hard to implement this efficiently in QEMU.  This is why I said
> the hardware interface is not virtualization-friendly.  It's cheap on
> real hardware but expensive under virtualization.
>
> We should think about the optimal way of implementing Flush Hint
> Addresses in QEMU.  But if there is no reasonable way to implement them
> then I think it's better *not* to implement them, just like the Block
> Window feature which is also not virtualization-friendly.  Users who
> want a block device can use virtio-blk.  I don't think NVDIMM Block
> Window can achieve better performance than virtio-blk under
> virtualization (although I'm happy to be proven wrong).
>
> Some ideas for a faster implementation:
>
> 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
>    global mutex.  Little synchronization is necessary as long as the
>    NVDIMM device isn't hot unplugged (not yet supported anyway).
>
> 2. Can the host kernel provide a way to mmap Address Flush Hints from
>    the physical NVDIMM in cases where the configuration does not require
>    host kernel interception?  That way QEMU can map the physical
>    NVDIMM's Address Flush Hints directly into the guest.  The hypervisor
>    is bypassed and performance would be good.
>
> I'm not sure there is anything we can do to make the case where the host
> kernel wants an fsync(2) fast :(.

A few thoughts:

* You don't need to call fsync() when the host file is a /dev/daxX.Y
device, since there is no intervening filesystem (a detection sketch
follows this list).

* While the host implementation is fast, it still is not meant to be
called very frequently. We could also look at providing a
paravirtualized way for it to complete asynchronously, if the guest
just wants to post a flush request and be notified of completion at a
later time.

* We're still actively looking for ways to make it safe and efficient
to minimize sync overhead if an mmap write did not dirty filesystem
data, so hopefully we can make that fsync call overhead smaller over
time.

* I don't think we can just skip implementing support for this flush.
Yes it's virtualization unfriendly, but it completely defeats the
persistence of persistent memory if the host filesystem is free to
throw away writes that the guest expects to find after a surprise
power loss.
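
For the first point above, here is a minimal sketch of how QEMU-side
code could decide whether the flush-hint write needs an fsync() at all.
The function names are made up, and whether an fstat()/S_ISCHR() check
is sufficient to identify every device-dax backend is an assumption:

  #include <stdbool.h>
  #include <sys/stat.h>
  #include <unistd.h>

  /* A /dev/daxX.Y backend appears as a character device, so there is no
   * filesystem metadata to sync and fsync() can be skipped. */
  static bool backend_needs_fsync(int backend_fd)
  {
      struct stat st;

      if (fstat(backend_fd, &st) < 0) {
          return true;                  /* be conservative on error */
      }
      return !S_ISCHR(st.st_mode);
  }

  void nvdimm_flush_sketch(int backend_fd)
  {
      if (backend_fd != -1 && backend_needs_fsync(backend_fd)) {
          fsync(backend_fd);
      }
  }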

Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
Posted by Stefan Hajnoczi 6 years, 11 months ago
On Thu, Apr 06, 2017 at 07:32:01AM -0700, Dan Williams wrote:
> On Thu, Apr 6, 2017 at 2:43 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
> * I don't think we can just skip implementing support for this flush.
> Yes it's virtualization unfriendly, but it completely defeats the
> persistence of persistent memory if the host filesystem is free to
> throw away writes that the guest expects to find after a surprise
> power loss.

Right, if we don't implement Address Flush Hints then the only safe
option for configurations that require them would be to print an error
and refuse to start.

This synchronous fsync(2) business is like taking a host page fault or
being scheduled out by the host scheduler.  It means the vcpu is stuck
for a little bit.  Hopefully not long enough to upset the guest kernel
or the application.

One of the things that guest software can do to help minimize the
performance impact is to avoid holding locks across the address flush
mmio write instruction.
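
For what it's worth, a tiny illustrative pattern of that (names and
types invented, shown as userspace C rather than guest kernel code):
do the data update under the lock, but issue the flush-hint write after
dropping it, so other CPUs are not blocked while the hypervisor
services the trapped store.

  #include <pthread.h>
  #include <stdint.h>

  struct pmem_log {
      pthread_spinlock_t lock;
      uint64_t tail;
      volatile uint64_t *flush_hint;   /* mapped flush hint address */
  };

  void log_commit(struct pmem_log *log, uint64_t rec)
  {
      pthread_spin_lock(&log->lock);
      log->tail = rec;                 /* stand-in for the real update */
      pthread_spin_unlock(&log->lock);

      *log->flush_hint = 1;            /* potentially slow trapped write,
                                          done with the lock released */
  }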

Stefan
Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
Posted by Haozhong Zhang 6 years, 11 months ago
On 04/06/17 10:43 +0100, Stefan Hajnoczi wrote:
> On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
> > This patch series constructs the flush hint address structures for
> > nvdimm devices in QEMU.
> > 
> > It's of course not for 2.9. I send it out early in order to get
> > comments on one point I'm uncertain (see the detailed explanation
> > below). Thanks for any comments in advance!
> > Background
> > ---------------
> 
> Extra background:
> 
> Flush Hint Addresses are necessary because:
> 
> 1. Some hardware configurations may require them.  In other words, a
>    cache flush instruction is not enough to persist data.
> 
> 2. The host file system may need fsync(2) calls (e.g. to persist
>    metadata changes).
> 
> Without Flush Hint Addresses only some NVDIMM configurations actually
> guarantee data persistence.
> 
> > Flush hint address structure is a substructure of NFIT and specifies
> > one or more addresses, namely Flush Hint Addresses. Software can write
> > to any one of these flush hint addresses to cause any preceding writes
> > to the NVDIMM region to be flushed out of the intervening platform
> > buffers to the targeted NVDIMM. More details can be found in ACPI Spec
> > 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
> 
> Do you have performance data?  I'm concerned that Flush Hint Address
> hardware interface is not virtualization-friendly.

Some performance data below.

Host HW config:
  CPU: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz x 2 sockets w/ HT enabled
  MEM: 64 GB

  As I don't have NVDIMM hardware, I use files in an ext4 fs on a
  normal SATA SSD as the backing storage of the vNVDIMM.


Host SW config:
  Kernel: 4.10.1
  QEMU: commit ea2afcf with this patch series applied.


Guest config:
  For the flush-hint-enabled case, the following QEMU options are used
    -enable-kvm -smp 4 -cpu host -machine pc,nvdimm \
    -m 4G,slots=4,maxmem=128G \
    -object memory-backend-file,id=mem1,share,mem-path=nvm-img,size=8G \
    -device nvdimm,id=nv1,memdev=mem1,reserved-size=4K,flush-hint \
    -hda GUEST_DISK_IMG -serial pty

  For the flush-hint-disabled case, the following QEMU options are used
    -enable-kvm -smp 4 -cpu host -machine pc,nvdimm \
    -m 4G,slots=4,maxmem=128G \
    -object memory-backend-file,id=mem1,share,mem-path=nvm-img,size=8G \
    -device nvdimm,id=nv1,memdev=mem1 \
    -hda GUEST_DISK_IMG -serial pty

  nvm-img used above is created in ext4 fs on the host SSD by
    dd if=/dev/zero of=nvm-img bs=1G count=8

  Guest kernel: 4.11.0-rc4


Benchmark in guest:
  mkfs.ext4 /dev/pmem0
  mount -o dax /dev/pmem0 /mnt
  dd if=/dev/zero of=/mnt/data bs=1G count=7 # warm up EPT mapping
  rm /mnt/data                               #
  dd if=/dev/zero of=/mnt/data bs=1G count=7

  and record the write speed reported by the last 'dd' command.


Result:
  - Flush hint disabled
    Varies from 161 MB/s to 708 MB/s, depending on how many fs/device
    flush operations are performed on the host side during the guest
    'dd'.

  - Flush hint enabled
  
    Varies from 164 MB/s to 546 MB/s, depending on how long fsync() in
    QEMU takes. Usually, there is at least one fsync() during each 'dd'
    run that takes several seconds (the worst one took 39 s).

    Worse, during those long host-side fsync() operations, the guest
    kernel complained about stalls.


Some thoughts:

- If non-NVDIMM hardware is used as the backing store of the vNVDIMM,
  QEMU may perform the host-side flush operations asynchronously with
  the VM, which will not block the VM for too long but sacrifices the
  durability guarantee.

- If a physical NVDIMM is used as the backing store and ADR is
  supported on the host, QEMU can rely on ADR to guarantee data
  durability and will not need to emulate the flush hint for the guest.

- If a physical NVDIMM is used as the backing store and ADR is not
  supported on the host, QEMU will still need to emulate the flush hint
  for the guest and will need a faster approach than fsync() to trigger
  writes to the host flush hint.

  Could the kernel expose an interface that allows userland (i.e. QEMU
  in this case) to write directly to the flush hint of an NVDIMM
  region? (A hypothetical sketch of such a write follows this list.)
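
To make that question concrete, here is a purely hypothetical sketch of
what such a userland interface might look like; the sysfs path is
invented and no kernel support for this exists today.  The idea is that
QEMU maps the region's flush hint page once and turns a guest
flush-hint write into a single host store between fences, mirroring the
kernel nvdimm_flush() snippet quoted later in this message:

  #include <emmintrin.h>    /* _mm_sfence() */
  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
      /* Invented path: assumes the kernel exposed the region's flush
       * hint page as a mappable resource. */
      const char *res = "/sys/bus/nd/devices/region0/flush_hint";
      int fd = open(res, O_RDWR | O_SYNC);
      if (fd < 0) {
          perror("open");
          return 1;
      }

      volatile uint64_t *hint = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                     MAP_SHARED, fd, 0);
      if (hint == MAP_FAILED) {
          perror("mmap");
          close(fd);
          return 1;
      }

      _mm_sfence();      /* order preceding pmem stores (cf. wmb()) */
      *hint = 1;         /* kick the write-pending-queue flush */
      _mm_sfence();

      munmap((void *)hint, 4096);
      close(fd);
      return 0;
  }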


Haozhong

> 
> In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:
> 
>   wmb();
>   for (i = 0; i < nd_region->ndr_mappings; i++)
>       if (ndrd_get_flush_wpq(ndrd, i, 0))
>           writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
>   wmb();
> 
> That looks pretty lightweight - it's an MMIO write between write
> barriers.
> 
> This patch implements the MMIO write like this:
> 
>   void nvdimm_flush(NVDIMMDevice *nvdimm)
>   {
>       if (nvdimm->backend_fd != -1) {
>           /*
>            * If the backend store is a physical NVDIMM device, fsync()
>            * will trigger the flush via the flush hint on the host device.
>            */
>           fsync(nvdimm->backend_fd);
>       }
>   }
> 
> The MMIO store instruction turned into a synchronous fsync(2) system
> call plus vmexit/vmenter and QEMU userspace context switch:
> 
> 1. The vcpu blocks during the fsync(2) system call.  The MMIO write
>    instruction has an unexpected and huge latency.
> 
> 2. The vcpu thread holds the QEMU global mutex so all other threads
>    (including the monitor) are blocked during fsync(2).  Other vcpu
>    threads may block if they vmexit.
> 
> It is hard to implement this efficiently in QEMU.  This is why I said
> the hardware interface is not virtualization-friendly.  It's cheap on
> real hardware but expensive under virtualization.
> 
> We should think about the optimal way of implementing Flush Hint
> Addresses in QEMU.  But if there is no reasonable way to implement them
> then I think it's better *not* to implement them, just like the Block
> Window feature which is also not virtualization-friendly.  Users who
> want a block device can use virtio-blk.  I don't think NVDIMM Block
> Window can achieve better performance than virtio-blk under
> virtualization (although I'm happy to be proven wrong).
> 
> Some ideas for a faster implementation:
> 
> 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
>    global mutex.  Little synchronization is necessary as long as the
>    NVDIMM device isn't hot unplugged (not yet supported anyway).
> 
> 2. Can the host kernel provide a way to mmap Address Flush Hints from
>    the physical NVDIMM in cases where the configuration does not require
>    host kernel interception?  That way QEMU can map the physical
>    NVDIMM's Address Flush Hints directly into the guest.  The hypervisor
>    is bypassed and performance would be good.
> 
> I'm not sure there is anything we can do to make the case where the host
> kernel wants an fsync(2) fast :(.
> 
> Benchmark results would be important for deciding how big the problem
> is.



Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
Posted by Stefan Hajnoczi 6 years, 11 months ago
On Tue, Apr 11, 2017 at 02:34:26PM +0800, Haozhong Zhang wrote:
> On 04/06/17 10:43 +0100, Stefan Hajnoczi wrote:
> > On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
> > > This patch series constructs the flush hint address structures for
> > > nvdimm devices in QEMU.
> > > 
> > > It's of course not for 2.9. I send it out early in order to get
> > > comments on one point I'm uncertain (see the detailed explanation
> > > below). Thanks for any comments in advance!
> > > Background
> > > ---------------
> > 
> > Extra background:
> > 
> > Flush Hint Addresses are necessary because:
> > 
> > 1. Some hardware configurations may require them.  In other words, a
> >    cache flush instruction is not enough to persist data.
> > 
> > 2. The host file system may need fsync(2) calls (e.g. to persist
> >    metadata changes).
> > 
> > Without Flush Hint Addresses only some NVDIMM configurations actually
> > guarantee data persistence.
> > 
> > > Flush hint address structure is a substructure of NFIT and specifies
> > > one or more addresses, namely Flush Hint Addresses. Software can write
> > > to any one of these flush hint addresses to cause any preceding writes
> > > to the NVDIMM region to be flushed out of the intervening platform
> > > buffers to the targeted NVDIMM. More details can be found in ACPI Spec
> > > 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
> > 
> > Do you have performance data?  I'm concerned that Flush Hint Address
> > hardware interface is not virtualization-friendly.
> 
> Some performance data below.
> 
> Host HW config:
>   CPU: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz x 2 sockets w/ HT enabled
>   MEM: 64 GB
> 
>   As I don't have NVDIMM hardware, I use files in an ext4 fs on a
>   normal SATA SSD as the backing storage of the vNVDIMM.
> 
> 
> Host SW config:
>   Kernel: 4.10.1
>   QEMU: commit ea2afcf with this patch series applied.
> 
> 
> Guest config:
>   For the flush-hint-enabled case, the following QEMU options are used
>     -enable-kvm -smp 4 -cpu host -machine pc,nvdimm \
>     -m 4G,slots=4,maxmem=128G \
>     -object memory-backend-file,id=mem1,share,mem-path=nvm-img,size=8G \
>     -device nvdimm,id=nv1,memdev=mem1,reserved-size=4K,flush-hint \
>     -hda GUEST_DISK_IMG -serial pty
> 
>   For the flush-hint-disabled case, the following QEMU options are used
>     -enable-kvm -smp 4 -cpu host -machine pc,nvdimm \
>     -m 4G,slots=4,maxmem=128G \
>     -object memory-backend-file,id=mem1,share,mem-path=nvm-img,size=8G \
>     -device nvdimm,id=nv1,memdev=mem1 \
>     -hda GUEST_DISK_IMG -serial pty
> 
>   nvm-img used above is created in ext4 fs on the host SSD by
>     dd if=/dev/zero of=nvm-img bs=1G count=8
> 
>   Guest kernel: 4.11.0-rc4
> 
> 
> Benchmark in guest:
>   mkfs.ext4 /dev/pmem0
>   mount -o dax /dev/pmem0 /mnt
>   dd if=/dev/zero of=/mnt/data bs=1G count=7 # warm up EPT mapping
>   rm /mnt/data                               #
>   dd if=/dev/zero of=/mnt/data bs=1G count=7
> 
>   and record the write speed reported by the last 'dd' command.
> 
> 
> Result:
>   - Flush hint disabled
>     Varies from 161 MB/s to 708 MB/s, depending on how many fs/device
>     flush operations are performed on the host side during the guest
>     'dd'.
> 
>   - Flush hint enabled
>   
>     Varies from 164 MB/s to 546 MB/s, depending on how long fsync() in
>     QEMU takes. Usually, there is at least one fsync() during each 'dd'
>     run that takes several seconds (the worst one took 39 s).
> 
>     Worse, during those long host-side fsync() operations, the guest
>     kernel complained about stalls.

I'm surprised that maximum throughput was 708 MB/s.  The guest is
DAX-aware and the write(2) syscall is a memcpy.  I expected higher
numbers without flush hints.

Also strange that throughput varied so greatly.  A benchmark that varies
4x is not useful since it's hard to tell if anything <4x indicates a
significant performance difference.  In other words, the noise is huge!

What results do you get on the host?

Dan: Any comments on this benchmark and is there a recommended way to
benchmark NVDIMM?

> Some thoughts:
> 
> - If non-NVDIMM hardware is used as the backing store of the vNVDIMM,
>   QEMU may perform the host-side flush operations asynchronously with
>   the VM, which will not block the VM for too long but sacrifices the
>   durability guarantee.
> 
> - If a physical NVDIMM is used as the backing store and ADR is
>   supported on the host, QEMU can rely on ADR to guarantee data
>   durability and will not need to emulate the flush hint for the guest.
> 
> - If a physical NVDIMM is used as the backing store and ADR is not
>   supported on the host, QEMU will still need to emulate the flush hint
>   for the guest and will need a faster approach than fsync() to trigger
>   writes to the host flush hint.
> 
>   Could the kernel expose an interface that allows userland (i.e. QEMU
>   in this case) to write directly to the flush hint of an NVDIMM region?
> 
> 
> Haozhong
> 
> > 
> > In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:
> > 
> >   wmb();
> >   for (i = 0; i < nd_region->ndr_mappings; i++)
> >       if (ndrd_get_flush_wpq(ndrd, i, 0))
> >           writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
> >   wmb();
> > 
> > That looks pretty lightweight - it's an MMIO write between write
> > barriers.
> > 
> > This patch implements the MMIO write like this:
> > 
> >   void nvdimm_flush(NVDIMMDevice *nvdimm)
> >   {
> >       if (nvdimm->backend_fd != -1) {
> >           /*
> >            * If the backend store is a physical NVDIMM device, fsync()
> >            * will trigger the flush via the flush hint on the host device.
> >            */
> >           fsync(nvdimm->backend_fd);
> >       }
> >   }
> > 
> > The MMIO store instruction turned into a synchronous fsync(2) system
> > call plus vmexit/vmenter and QEMU userspace context switch:
> > 
> > 1. The vcpu blocks during the fsync(2) system call.  The MMIO write
> >    instruction has an unexpected and huge latency.
> > 
> > 2. The vcpu thread holds the QEMU global mutex so all other threads
> >    (including the monitor) are blocked during fsync(2).  Other vcpu
> >    threads may block if they vmexit.
> > 
> > It is hard to implement this efficiently in QEMU.  This is why I said
> > the hardware interface is not virtualization-friendly.  It's cheap on
> > real hardware but expensive under virtualization.
> > 
> > We should think about the optimal way of implementing Flush Hint
> > Addresses in QEMU.  But if there is no reasonable way to implement them
> > then I think it's better *not* to implement them, just like the Block
> > Window feature which is also not virtualization-friendly.  Users who
> > want a block device can use virtio-blk.  I don't think NVDIMM Block
> > Window can achieve better performance than virtio-blk under
> > virtualization (although I'm happy to be proven wrong).
> > 
> > Some ideas for a faster implementation:
> > 
> > 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
> >    global mutex.  Little synchronization is necessary as long as the
> >    NVDIMM device isn't hot unplugged (not yet supported anyway).
> > 
> > 2. Can the host kernel provide a way to mmap Address Flush Hints from
> >    the physical NVDIMM in cases where the configuration does not require
> >    host kernel interception?  That way QEMU can map the physical
> >    NVDIMM's Address Flush Hints directly into the guest.  The hypervisor
> >    is bypassed and performance would be good.
> > 
> > I'm not sure there is anything we can do to make the case where the host
> > kernel wants an fsync(2) fast :(.
> > 
> > Benchmark results would be important for deciding how big the problem
> > is.
> 
>