This patch series constructs the flush hint address structures for
nvdimm devices in QEMU.

It's of course not for 2.9. I send it out early in order to get
comments on one point I'm uncertain about (see the detailed explanation
below). Thanks for any comments in advance!

Background
---------------
Flush hint address structure is a substructure of NFIT and specifies
one or more addresses, namely Flush Hint Addresses. Software can write
to any one of these flush hint addresses to cause any preceding writes
to the NVDIMM region to be flushed out of the intervening platform
buffers to the targeted NVDIMM. More details can be found in ACPI Spec
6.1, Section 5.2.25.8 "Flush Hint Address Structure".

Why is it RFC?
---------------
RFC is added because I'm not sure whether the way this patch series
allocates the guest flush hint addresses is right.

QEMU needs to trap guest accesses (at least writes) to the flush hint
addresses in order to perform the necessary flush on the host back
store. Therefore, QEMU needs to create IO memory regions that cover
those flush hint addresses. In order to create those IO memory regions,
QEMU needs to know the flush hint addresses or their offsets from other
known memory regions in advance. So far, so good.

Flush hint addresses are in the guest address space. Looking at how the
current NVDIMM ACPI in QEMU allocates the DSM buffer, it's natural to
take the same approach for flush hint addresses, i.e. let the guest
firmware allocate from free addresses and patch them into the flush
hint address structure. (*Please correct me if my following
understanding is wrong.*) However, the current allocation and pointer
patching are transparent to QEMU, so QEMU would be unaware of the flush
hint addresses, and consequently would have no way to create the
corresponding IO memory regions in order to trap guest accesses.

Alternatively, this patch series moves the allocation of flush hint
addresses to QEMU:

1. (Patch 1) We reserve an address range after the end address of each
   nvdimm device. Its size is specified by the user via a new pc-dimm
   option 'reserved-size'. For the following example,
       -object memory-backend-file,id=mem0,size=4G,...
       -device nvdimm,id=dimm0,memdev=mem0,reserved-size=4K,...
       -device pc-dimm,id=dimm1,...
   if dimm0 is allocated at addresses N ~ N+4G, the address of dimm1
   will start from N+4G+4K or higher. N+4G ~ N+4G+4K is reserved for
   dimm0.

2. (Patch 4) When the NVDIMM ACPI code builds the flush hint address
   structure for each nvdimm device, it allocates them from the above
   reserved area, e.g. the flush hint addresses of dimm0 above are
   allocated in N+4G ~ N+4G+4K. The addresses are known to QEMU in this
   way, so QEMU can easily create IO memory regions for them.

   If the reserved area is not present or too small, QEMU will report
   errors.

How to test?
---------------
Add the options 'flush-hint' and 'reserved-size' when creating a nvdimm
device, e.g.

    qemu-system-x86_64 -machine pc,nvdimm \
        -m 4G,slots=4,maxmem=128G \
        -object memory-backend-file,id=mem1,share,mem-path=/dev/pmem0 \
        -device nvdimm,id=nv1,memdev=mem1,reserved-size=4K,flush-hint \
        ...

The guest OS should be able to find a flush hint address structure in
NFIT. For guest Linux kernel v4.8 or later, which supports flush hints,
if QEMU is built with NVDIMM_DEBUG = 1 in include/hw/mem/nvdimm.h, it
will print debug messages like

    nvdimm: Write Flush Hint: offset 0x0, data 0x1
    nvdimm: Write Flush Hint: offset 0x4, data 0x0

when Linux performs a flush via a flush hint address.
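On the QEMU side, the handler for a trapped guest write to a flush hint address reduces to an fsync(2) on the backing store's file descriptor. The following is a minimal standalone sketch of that host-side flush; the function name `nvdimm_backend_flush` is illustrative only, not the patch's actual code:

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical sketch of the host-side flush performed when a guest
 * write to a flush hint address is trapped. fsync() pushes preceding
 * writes to stable storage for a file backend; for a physical NVDIMM
 * backend, the host kernel's fsync path triggers the host flush hint.
 */
int nvdimm_backend_flush(int backend_fd)
{
    if (backend_fd < 0) {
        return 0;              /* no backing store to flush */
    }
    return fsync(backend_fd);  /* 0 on success, -1 on error */
}
```

The interesting part is what this costs under virtualization: the guest's single MMIO store becomes a vmexit plus a synchronous system call, which is the concern raised later in this thread.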
Haozhong Zhang (4):
  pc-dimm: add 'reserved-size' to reserve address range after the
    ending address
  nvdimm: add functions to initialize and perform flush on back store
  nvdimm acpi: record the cache line size in AcpiNVDIMMState
  nvdimm acpi: build flush hint address structure if required

 hw/acpi/nvdimm.c         | 111 ++++++++++++++++++++++++++++++++++++++++++++---
 hw/i386/pc.c             |   5 ++-
 hw/i386/pc_piix.c        |   2 +-
 hw/i386/pc_q35.c         |   2 +-
 hw/mem/nvdimm.c          |  48 ++++++++++++++++++++
 hw/mem/pc-dimm.c         |  48 ++++++++++++++++++--
 include/hw/mem/nvdimm.h  |  20 ++++++++-
 include/hw/mem/pc-dimm.h |   2 +
 8 files changed, 224 insertions(+), 14 deletions(-)

-- 
2.10.1
On 31/03/2017 4:41 PM, Haozhong Zhang wrote: > This patch series constructs the flush hint address structures for > nvdimm devices in QEMU. > > It's of course not for 2.9. I send it out early in order to get > comments on one point I'm uncertain (see the detailed explanation > below). Thanks for any comments in advance! > > > Background > --------------- > Flush hint address structure is a substructure of NFIT and specifies > one or more addresses, namely Flush Hint Addresses. Software can write > to any one of these flush hint addresses to cause any preceding writes > to the NVDIMM region to be flushed out of the intervening platform > buffers to the targeted NVDIMM. More details can be found in ACPI Spec > 6.1, Section 5.2.25.8 "Flush Hint Address Structure". > > > Why is it RFC? > --------------- > RFC is added because I'm not sure whether the way in this patch series > that allocates the guest flush hint addresses is right. > > QEMU needs to trap guest accesses (at least for writes) to the flush > hint addresses in order to perform the necessary flush on the host > back store. Therefore, QEMU needs to create IO memory regions that > cover those flush hint addresses. In order to create those IO memory > regions, QEMU needs to know the flush hint addresses or their offsets > to other known memory regions in advance. So far looks good. > > Flush hint addresses are in the guest address space. Looking at how > the current NVDIMM ACPI in QEMU allocates the DSM buffer, it's natural > to take the same way for flush hint addresses, i.e. let the guest > firmware allocate from free addresses and patch them in the flush hint > address structure. (*Please correct me If my following understand is wrong*) > However, the current allocation and pointer patching are transparent > to QEMU, so QEMU will be unaware of the flush hint addresses, and > consequently have no way to create corresponding IO memory regions in > order to trap guest accesses. 
Er, it is awkward, and the flush hint table is static, which may not be
easily patched.

> Alternatively, this patch series moves the allocation of flush hint
> addresses to QEMU:
>
> 1. (Patch 1) We reserve an address range after the end address of each
>    nvdimm device. Its size is specified by the user via a new pc-dimm
>    option 'reserved-size'.

Should we make it work only for nvdimm?

> For the following example,
>     -object memory-backend-file,id=mem0,size=4G,...
>     -device nvdimm,id=dimm0,memdev=mem0,reserved-size=4K,...
>     -device pc-dimm,id=dimm1,...
> if dimm0 is allocated to address N ~ N+4G, the address of dimm1
> will start from N+4G+4K or higher. N+4G ~ N+4G+4K is reserved for
> dimm0.
>
> 2. (Patch 4) When NVDIMM ACPI code builds the flush hint address
>    structure for each nvdimm device, it will allocate them from the
>    above reserved area, e.g. the flush hint addresses of above dimm0
>    are allocated in N+4G ~ N+4G+4K. The addresses are known to QEMU in
>    this way, so QEMU can easily create IO memory regions for them.
>
> If the reserved area is not present or too small, QEMU will report
> errors.

Should we make 'reserved-size' always page-aligned and transparent to
the user, i.e., automatically reserve 4K if 'flush-hint' is specified?
On 04/06/17 17:39 +0800, Xiao Guangrong wrote: > > > On 31/03/2017 4:41 PM, Haozhong Zhang wrote: > > This patch series constructs the flush hint address structures for > > nvdimm devices in QEMU. > > > > It's of course not for 2.9. I send it out early in order to get > > comments on one point I'm uncertain (see the detailed explanation > > below). Thanks for any comments in advance! > > > > > > Background > > --------------- > > Flush hint address structure is a substructure of NFIT and specifies > > one or more addresses, namely Flush Hint Addresses. Software can write > > to any one of these flush hint addresses to cause any preceding writes > > to the NVDIMM region to be flushed out of the intervening platform > > buffers to the targeted NVDIMM. More details can be found in ACPI Spec > > 6.1, Section 5.2.25.8 "Flush Hint Address Structure". > > > > > > Why is it RFC? > > --------------- > > RFC is added because I'm not sure whether the way in this patch series > > that allocates the guest flush hint addresses is right. > > > > QEMU needs to trap guest accesses (at least for writes) to the flush > > hint addresses in order to perform the necessary flush on the host > > back store. Therefore, QEMU needs to create IO memory regions that > > cover those flush hint addresses. In order to create those IO memory > > regions, QEMU needs to know the flush hint addresses or their offsets > > to other known memory regions in advance. So far looks good. > > > > Flush hint addresses are in the guest address space. Looking at how > > the current NVDIMM ACPI in QEMU allocates the DSM buffer, it's natural > > to take the same way for flush hint addresses, i.e. let the guest > > firmware allocate from free addresses and patch them in the flush hint > > address structure. 
(*Please correct me if my following understanding is wrong*) > > However, the current allocation and pointer patching are transparent > > to QEMU, so QEMU will be unaware of the flush hint addresses, and > > consequently have no way to create corresponding IO memory regions in > > order to trap guest accesses. > > Er, it is awkward and flush-hint-table is static which may not be > easily patched. > > > > > Alternatively, this patch series moves the allocation of flush hint > > addresses to QEMU: > > > > 1. (Patch 1) We reserve an address range after the end address of each > > nvdimm device. Its size is specified by the user via a new pc-dimm > > option 'reserved-size'. > > > > We should make it only work for nvdimm? > Yes, we can check whether the machine option 'nvdimm' is present when plugging the nvdimm. > > For the following example, > > -object memory-backend-file,id=mem0,size=4G,... > > -device nvdimm,id=dimm0,memdev=mem0,reserved-size=4K,... > > -device pc-dimm,id=dimm1,... > > if dimm0 is allocated to address N ~ N+4G, the address of dimm1 > > will start from N+4G+4K or higher. N+4G ~ N+4G+4K is reserved for > > dimm0. > > > > 2. (Patch 4) When NVDIMM ACPI code builds the flush hint address > > structure for each nvdimm device, it will allocate them from the > > above reserved area, e.g. the flush hint addresses of above dimm0 > > are allocated in N+4G ~ N+4G+4K. The addresses are known to QEMU in > > this way, so QEMU can easily create IO memory regions for them. > > > > If the reserved area is not present or too small, QEMU will report > > errors. > > > > We should make 'reserved-size' always be page-aligned and should be > transparent to the user, i.e., automatically reserve 4K if 'flush-hint' > is specified? > 4K alignment is already enforced by current memory plug code. About the automatic reservation, is a non-zero default value acceptable under QEMU design/conventions in general? Thanks, Haozhong
On 04/06/2017 05:58 PM, Haozhong Zhang wrote: > On 04/06/17 17:39 +0800, Xiao Guangrong wrote: >> >> >> On 31/03/2017 4:41 PM, Haozhong Zhang wrote: >>> This patch series constructs the flush hint address structures for >>> nvdimm devices in QEMU. >>> >>> It's of course not for 2.9. I send it out early in order to get >>> comments on one point I'm uncertain (see the detailed explanation >>> below). Thanks for any comments in advance! >>> >>> >>> Background >>> --------------- >>> Flush hint address structure is a substructure of NFIT and specifies >>> one or more addresses, namely Flush Hint Addresses. Software can write >>> to any one of these flush hint addresses to cause any preceding writes >>> to the NVDIMM region to be flushed out of the intervening platform >>> buffers to the targeted NVDIMM. More details can be found in ACPI Spec >>> 6.1, Section 5.2.25.8 "Flush Hint Address Structure". >>> >>> >>> Why is it RFC? >>> --------------- >>> RFC is added because I'm not sure whether the way in this patch series >>> that allocates the guest flush hint addresses is right. >>> >>> QEMU needs to trap guest accesses (at least for writes) to the flush >>> hint addresses in order to perform the necessary flush on the host >>> back store. Therefore, QEMU needs to create IO memory regions that >>> cover those flush hint addresses. In order to create those IO memory >>> regions, QEMU needs to know the flush hint addresses or their offsets >>> to other known memory regions in advance. So far looks good. >>> >>> Flush hint addresses are in the guest address space. Looking at how >>> the current NVDIMM ACPI in QEMU allocates the DSM buffer, it's natural >>> to take the same way for flush hint addresses, i.e. let the guest >>> firmware allocate from free addresses and patch them in the flush hint >>> address structure. 
(*Please correct me If my following understand is wrong*) >>> However, the current allocation and pointer patching are transparent >>> to QEMU, so QEMU will be unaware of the flush hint addresses, and >>> consequently have no way to create corresponding IO memory regions in >>> order to trap guest accesses. >> >> Er, it is awkward and flush-hint-table is static which may not be >> easily patched. >> >>> >>> Alternatively, this patch series moves the allocation of flush hint >>> addresses to QEMU: >>> >>> 1. (Patch 1) We reserve an address range after the end address of each >>> nvdimm device. Its size is specified by the user via a new pc-dimm >>> option 'reserved-size'. >>> >> >> We should make it only work for nvdimm? >> > > Yes, we can check whether the machine option 'nvdimm' is present when > plugging the nvdimm. > >>> For the following example, >>> -object memory-backend-file,id=mem0,size=4G,... >>> -device nvdimm,id=dimm0,memdev=mem0,reserved-size=4K,... >>> -device pc-dimm,id=dimm1,... >>> if dimm0 is allocated to address N ~ N+4G, the address of dimm1 >>> will start from N+4G+4K or higher. N+4G ~ N+4G+4K is reserved for >>> dimm0. >>> >>> 2. (Patch 4) When NVDIMM ACPI code builds the flush hint address >>> structure for each nvdimm device, it will allocate them from the >>> above reserved area, e.g. the flush hint addresses of above dimm0 >>> are allocated in N+4G ~ N+4G+4K. The addresses are known to QEMU in >>> this way, so QEMU can easily create IO memory regions for them. >>> >>> If the reserved area is not present or too small, QEMU will report >>> errors. >>> >> >> We should make 'reserved-size' always be page-aligned and should be >> transparent to the user, i.e, automatically reserve 4k if 'flush-hint' >> is specified? >> > > 4K alignment is already enforced by current memory plug code. > > About the automatic reservation, is a non-zero default value > acceptable by qemu design/convention in general? 
There's no need to make it a user-visible parameter; a field in the
dimm-dev struct or nvdimm-dev struct that indicates the reserved size
is enough.
On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
> This patch series constructs the flush hint address structures for
> nvdimm devices in QEMU.
>
> It's of course not for 2.9. I send it out early in order to get
> comments on one point I'm uncertain (see the detailed explanation
> below). Thanks for any comments in advance!
>
> Background
> ---------------

Extra background:

Flush Hint Addresses are necessary because:

1. Some hardware configurations may require them. In other words, a
   cache flush instruction is not enough to persist data.

2. The host file system may need fsync(2) calls (e.g. to persist
   metadata changes).

Without Flush Hint Addresses only some NVDIMM configurations actually
guarantee data persistence.

> Flush hint address structure is a substructure of NFIT and specifies
> one or more addresses, namely Flush Hint Addresses. Software can write
> to any one of these flush hint addresses to cause any preceding writes
> to the NVDIMM region to be flushed out of the intervening platform
> buffers to the targeted NVDIMM. More details can be found in ACPI Spec
> 6.1, Section 5.2.25.8 "Flush Hint Address Structure".

Do you have performance data? I'm concerned that the Flush Hint Address
hardware interface is not virtualization-friendly.

In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:

    wmb();
    for (i = 0; i < nd_region->ndr_mappings; i++)
        if (ndrd_get_flush_wpq(ndrd, i, 0))
            writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
    wmb();

That looks pretty lightweight - it's an MMIO write between write
barriers.

This patch implements the MMIO write like this:

    void nvdimm_flush(NVDIMMDevice *nvdimm)
    {
        if (nvdimm->backend_fd != -1) {
            /*
             * If the backend store is a physical NVDIMM device, fsync()
             * will trigger the flush via the flush hint on the host device.
             */
            fsync(nvdimm->backend_fd);
        }
    }

The MMIO store instruction is turned into a synchronous fsync(2) system
call plus vmexit/vmenter and a QEMU userspace context switch:

1. The vcpu blocks during the fsync(2) system call. The MMIO write
   instruction has an unexpected and huge latency.

2. The vcpu thread holds the QEMU global mutex, so all other threads
   (including the monitor) are blocked during fsync(2). Other vcpu
   threads may block if they vmexit.

It is hard to implement this efficiently in QEMU. This is why I said
the hardware interface is not virtualization-friendly. It's cheap on
real hardware but expensive under virtualization.

We should think about the optimal way of implementing Flush Hint
Addresses in QEMU. But if there is no reasonable way to implement them,
then I think it's better *not* to implement them, just like the Block
Window feature, which is also not virtualization-friendly. Users who
want a block device can use virtio-blk. I don't think NVDIMM Block
Window can achieve better performance than virtio-blk under
virtualization (although I'm happy to be proven wrong).

Some ideas for a faster implementation:

1. Use memory_region_clear_global_locking() to avoid taking the QEMU
   global mutex. Little synchronization is necessary as long as the
   NVDIMM device isn't hot unplugged (not yet supported anyway).

2. Can the host kernel provide a way to mmap Address Flush Hints from
   the physical NVDIMM in cases where the configuration does not
   require host kernel interception? That way QEMU can map the physical
   NVDIMM's Address Flush Hints directly into the guest. The hypervisor
   is bypassed and performance would be good.

I'm not sure there is anything we can do to make the case where the
host kernel wants an fsync(2) fast :(.

Benchmark results would be important for deciding how big the problem
is.
On 04/06/17 10:43 +0100, Stefan Hajnoczi wrote: > On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote: > > This patch series constructs the flush hint address structures for > > nvdimm devices in QEMU. > > > > It's of course not for 2.9. I send it out early in order to get > > comments on one point I'm uncertain (see the detailed explanation > > below). Thanks for any comments in advance! > > Background > > --------------- > > Extra background: > > Flush Hint Addresses are necessary because: > > 1. Some hardware configurations may require them. In other words, a > cache flush instruction is not enough to persist data. > > 2. The host file system may need fsync(2) calls (e.g. to persist > metadata changes). > > Without Flush Hint Addresses only some NVDIMM configurations actually > guarantee data persistence. > > > Flush hint address structure is a substructure of NFIT and specifies > > one or more addresses, namely Flush Hint Addresses. Software can write > > to any one of these flush hint addresses to cause any preceding writes > > to the NVDIMM region to be flushed out of the intervening platform > > buffers to the targeted NVDIMM. More details can be found in ACPI Spec > > 6.1, Section 5.2.25.8 "Flush Hint Address Structure". > > Do you have performance data? I'm concerned that Flush Hint Address > hardware interface is not virtualization-friendly. > I haven't tested how much vNVDIMM performance drops with this patch series. I tested the fsync latency of a regular file on bare metal by writing 1 GB of random data to a file (on an ext4 fs on an SSD) and then performing fsync. The average latency of fsync in that case is 3 ms. I currently don't have NVDIMM hardware, so I cannot get its latency data. Anyway, per your comment below, the latency should be larger for a VM.
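The measurement described above can be reproduced with a small micro-benchmark. This is a hedged sketch of the methodology only (the original used 1 GB of random data; the size here is a parameter, and the function name is illustrative):

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/*
 * Write 'bytes' of data to fd, then measure how long the following
 * fsync(2) takes. Returns the fsync latency in nanoseconds, or -1 on
 * error. A scaled-down sketch of the measurement described above.
 */
long long fsync_latency_ns(int fd, size_t bytes)
{
    char buf[4096];
    memset(buf, 0xab, sizeof(buf));
    for (size_t done = 0; done < bytes; done += sizeof(buf)) {
        size_t n = bytes - done < sizeof(buf) ? bytes - done : sizeof(buf);
        if (write(fd, buf, n) != (ssize_t)n) {
            return -1;
        }
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (fsync(fd) != 0) {       /* the cost a trapped flush-hint write pays */
        return -1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1000000000LL
           + (t1.tv_nsec - t0.tv_nsec);
}
```

Under virtualization, the vmexit/vmenter and context-switch overhead would be added on top of whatever this reports for the host file system.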
> In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does: > > wmb(); > for (i = 0; i < nd_region->ndr_mappings; i++) > if (ndrd_get_flush_wpq(ndrd, i, 0)) > writeq(1, ndrd_get_flush_wpq(ndrd, i, idx)); > wmb(); > > That looks pretty lightweight - it's an MMIO write between write > barriers. > > This patch implements the MMIO write like this: > > void nvdimm_flush(NVDIMMDevice *nvdimm) > { > if (nvdimm->backend_fd != -1) { > /* > * If the backend store is a physical NVDIMM device, fsync() > * will trigger the flush via the flush hint on the host device. > */ > fsync(nvdimm->backend_fd); > } > } > > The MMIO store instruction turned into a synchronous fsync(2) system > call plus vmexit/vmenter and QEMU userspace context switch: > > 1. The vcpu blocks during the fsync(2) system call. The MMIO write > instruction has an unexpected and huge latency. > > 2. The vcpu thread holds the QEMU global mutex so all other threads > (including the monitor) are blocked during fsync(2). Other vcpu > threads may block if they vmexit. > > It is hard to implement this efficiently in QEMU. This is why I said > the hardware interface is not virtualization-friendly. It's cheap on > real hardware but expensive under virtualization. > I don't have the NVDIMM hardware, so I don't know the latency of writing to host flush hint address. Dan, do you have any latency data on the bare metal? > We should think about the optimal way of implementing Flush Hint > Addresses in QEMU. But if there is no reasonable way to implement them > then I think it's better *not* to implement them, just like the Block > Window feature which is also not virtualization-friendly. Users who > want a block device can use virtio-blk. I don't think NVDIMM Block > Window can achieve better performance than virtio-blk under > virtualization (although I'm happy to be proven wrong). > > Some ideas for a faster implementation: > > 1. Use memory_region_clear_global_locking() to avoid taking the QEMU > global mutex. 
Little synchronization is necessary as long as the > NVDIMM device isn't hot unplugged (not yet supported anyway). > The ACPI spec does not say whether it allows or disallows multiple writes to the same flush hint address in parallel. If it does, I think we can remove the global locking requirement for the MMIO memory region of the flush hint address of vNVDIMM. > 2. Can the host kernel provide a way to mmap Address Flush Hints from > the physical NVDIMM in cases where the configuration does not require > host kernel interception? That way QEMU can map the physical > NVDIMM's Address Flush Hints directly into the guest. The hypervisor > is bypassed and performance would be good. > It may work if the backend store of the vNVDIMM is a physical NVDIMM region and the latency of writing to the host flush hint address is much cheaper than performing fsync. However, if the backend store is a regular file, then we still need to use fsync. > I'm not sure there is anything we can do to make the case where the host > kernel wants an fsync(2) fast :(. > > Benchmark results would be important for deciding how big the problem > is. Let me collect performance data w/ and w/o this patch series. Thanks, Haozhong
On Thu, Apr 06, 2017 at 06:31:17PM +0800, Haozhong Zhang wrote: > On 04/06/17 10:43 +0100, Stefan Hajnoczi wrote: > > On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote: > > We should think about the optimal way of implementing Flush Hint > > Addresses in QEMU. But if there is no reasonable way to implement them > > then I think it's better *not* to implement them, just like the Block > > Window feature which is also not virtualization-friendly. Users who > > want a block device can use virtio-blk. I don't think NVDIMM Block > > Window can achieve better performance than virtio-blk under > > virtualization (although I'm happy to be proven wrong). > > > > Some ideas for a faster implementation: > > > > 1. Use memory_region_clear_global_locking() to avoid taking the QEMU > > global mutex. Little synchronization is necessary as long as the > > NVDIMM device isn't hot unplugged (not yet supported anyway). > > > > ACPI spec does not say it allows or disallows multiple writes to the > same flush hint address in parallel. If it can, I think we can remove > the global locking requirement for the MMIO memory region of the flush > hint address of vNVDIMM. The Linux code tries to spread the writes but two CPUs can write to the same address sometimes. It doesn't matter if two vcpus access the same address because QEMU just invokes fsync(2). The host kernel has synchronization to make the fsync safe. Stefan
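Stefan's point above can be illustrated outside QEMU: fsync(2) is safe to issue from multiple threads against the same file descriptor, so concurrent guest writes to the same flush hint need no userspace locking when the handler just calls fsync. A standalone sketch (not QEMU code; all names are illustrative, and it needs -pthread):

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Each "vcpu" thread flushes the shared backend fd several times. */
struct flusher_arg {
    int fd;
    int iterations;
    int failures;
};

static void *flusher(void *opaque)
{
    struct flusher_arg *a = opaque;
    for (int i = 0; i < a->iterations; i++) {
        if (fsync(a->fd) != 0) {   /* safe to call concurrently */
            a->failures++;
        }
    }
    return NULL;
}

/* Run two concurrent flushers against the same fd; return total failures. */
int concurrent_flush_test(int fd, int iterations)
{
    struct flusher_arg a1 = { fd, iterations, 0 };
    struct flusher_arg a2 = { fd, iterations, 0 };
    pthread_t t1, t2;

    pthread_create(&t1, NULL, flusher, &a1);
    pthread_create(&t2, NULL, flusher, &a2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return a1.failures + a2.failures;
}
```

The host kernel serializes the actual flush internally, which is why dropping the QEMU global lock for this MMIO region looks feasible.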
On 04/06/2017 05:43 PM, Stefan Hajnoczi wrote: > On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote: >> This patch series constructs the flush hint address structures for >> nvdimm devices in QEMU. >> >> It's of course not for 2.9. I send it out early in order to get >> comments on one point I'm uncertain (see the detailed explanation >> below). Thanks for any comments in advance! >> Background >> --------------- > > Extra background: > > Flush Hint Addresses are necessary because: > > 1. Some hardware configurations may require them. In other words, a > cache flush instruction is not enough to persist data. > > 2. The host file system may need fsync(2) calls (e.g. to persist > metadata changes). > > Without Flush Hint Addresses only some NVDIMM configurations actually > guarantee data persistence. > >> Flush hint address structure is a substructure of NFIT and specifies >> one or more addresses, namely Flush Hint Addresses. Software can write >> to any one of these flush hint addresses to cause any preceding writes >> to the NVDIMM region to be flushed out of the intervening platform >> buffers to the targeted NVDIMM. More details can be found in ACPI Spec >> 6.1, Section 5.2.25.8 "Flush Hint Address Structure". > > Do you have performance data? I'm concerned that Flush Hint Address > hardware interface is not virtualization-friendly. > > In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does: > > wmb(); > for (i = 0; i < nd_region->ndr_mappings; i++) > if (ndrd_get_flush_wpq(ndrd, i, 0)) > writeq(1, ndrd_get_flush_wpq(ndrd, i, idx)); > wmb(); > > That looks pretty lightweight - it's an MMIO write between write > barriers. > > This patch implements the MMIO write like this: > > void nvdimm_flush(NVDIMMDevice *nvdimm) > { > if (nvdimm->backend_fd != -1) { > /* > * If the backend store is a physical NVDIMM device, fsync() > * will trigger the flush via the flush hint on the host device. 
> */ > fsync(nvdimm->backend_fd); > } > } > > The MMIO store instruction turned into a synchronous fsync(2) system > call plus vmexit/vmenter and QEMU userspace context switch: > > 1. The vcpu blocks during the fsync(2) system call. The MMIO write > instruction has an unexpected and huge latency. > > 2. The vcpu thread holds the QEMU global mutex so all other threads > (including the monitor) are blocked during fsync(2). Other vcpu > threads may block if they vmexit. > > It is hard to implement this efficiently in QEMU. This is why I said > the hardware interface is not virtualization-friendly. It's cheap on > real hardware but expensive under virtualization. > > We should think about the optimal way of implementing Flush Hint > Addresses in QEMU. But if there is no reasonable way to implement them > then I think it's better *not* to implement them, just like the Block > Window feature which is also not virtualization-friendly. Users who > want a block device can use virtio-blk. I don't think NVDIMM Block > Window can achieve better performance than virtio-blk under > virtualization (although I'm happy to be proven wrong). > > Some ideas for a faster implementation: > > 1. Use memory_region_clear_global_locking() to avoid taking the QEMU > global mutex. Little synchronization is necessary as long as the > NVDIMM device isn't hot unplugged (not yet supported anyway). > > 2. Can the host kernel provide a way to mmap Address Flush Hints from > the physical NVDIMM in cases where the configuration does not require > host kernel interception? That way QEMU can map the physical > NVDIMM's Address Flush Hints directly into the guest. The hypervisor > is bypassed and performance would be good. > > I'm not sure there is anything we can do to make the case where the host > kernel wants an fsync(2) fast :(. Good point. 
We can assume that flushing the CPU cache to achieve persistence is always available on Intel hardware, so the flush hint table is not needed if the vNVDIMM is based on a real Intel NVDIMM device. If the vNVDIMM device is based on a regular file, I think fsync is the bottleneck rather than this MMIO virtualization. :(
On 04/06/17 20:02 +0800, Xiao Guangrong wrote: > > > On 04/06/2017 05:43 PM, Stefan Hajnoczi wrote: > > On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote: > > > This patch series constructs the flush hint address structures for > > > nvdimm devices in QEMU. > > > > > > It's of course not for 2.9. I send it out early in order to get > > > comments on one point I'm uncertain (see the detailed explanation > > > below). Thanks for any comments in advance! > > > Background > > > --------------- > > > > Extra background: > > > > Flush Hint Addresses are necessary because: > > > > 1. Some hardware configurations may require them. In other words, a > > cache flush instruction is not enough to persist data. > > > > 2. The host file system may need fsync(2) calls (e.g. to persist > > metadata changes). > > > > Without Flush Hint Addresses only some NVDIMM configurations actually > > guarantee data persistence. > > > > > Flush hint address structure is a substructure of NFIT and specifies > > > one or more addresses, namely Flush Hint Addresses. Software can write > > > to any one of these flush hint addresses to cause any preceding writes > > > to the NVDIMM region to be flushed out of the intervening platform > > > buffers to the targeted NVDIMM. More details can be found in ACPI Spec > > > 6.1, Section 5.2.25.8 "Flush Hint Address Structure". > > > > Do you have performance data? I'm concerned that Flush Hint Address > > hardware interface is not virtualization-friendly. > > > > In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does: > > > > wmb(); > > for (i = 0; i < nd_region->ndr_mappings; i++) > > if (ndrd_get_flush_wpq(ndrd, i, 0)) > > writeq(1, ndrd_get_flush_wpq(ndrd, i, idx)); > > wmb(); > > > > That looks pretty lightweight - it's an MMIO write between write > > barriers. 
> > > > This patch implements the MMIO write like this: > > > > void nvdimm_flush(NVDIMMDevice *nvdimm) > > { > > if (nvdimm->backend_fd != -1) { > > /* > > * If the backend store is a physical NVDIMM device, fsync() > > * will trigger the flush via the flush hint on the host device. > > */ > > fsync(nvdimm->backend_fd); > > } > > } > > > > The MMIO store instruction turned into a synchronous fsync(2) system > > call plus vmexit/vmenter and QEMU userspace context switch: > > > > 1. The vcpu blocks during the fsync(2) system call. The MMIO write > > instruction has an unexpected and huge latency. > > > > 2. The vcpu thread holds the QEMU global mutex so all other threads > > (including the monitor) are blocked during fsync(2). Other vcpu > > threads may block if they vmexit. > > > > It is hard to implement this efficiently in QEMU. This is why I said > > the hardware interface is not virtualization-friendly. It's cheap on > > real hardware but expensive under virtualization. > > > > We should think about the optimal way of implementing Flush Hint > > Addresses in QEMU. But if there is no reasonable way to implement them > > then I think it's better *not* to implement them, just like the Block > > Window feature which is also not virtualization-friendly. Users who > > want a block device can use virtio-blk. I don't think NVDIMM Block > > Window can achieve better performance than virtio-blk under > > virtualization (although I'm happy to be proven wrong). > > > > Some ideas for a faster implementation: > > > > 1. Use memory_region_clear_global_locking() to avoid taking the QEMU > > global mutex. Little synchronization is necessary as long as the > > NVDIMM device isn't hot unplugged (not yet supported anyway). > > > > 2. Can the host kernel provide a way to mmap Address Flush Hints from > > the physical NVDIMM in cases where the configuration does not require > > host kernel interception? 
> >    That way QEMU can map the physical
> >    NVDIMM's Address Flush Hints directly into the guest. The hypervisor
> >    is bypassed and performance would be good.
> >
> > I'm not sure there is anything we can do to make the case where the host
> > kernel wants an fsync(2) fast :(.
>
> Good point.
>
> We can assume flush-CPU-cache-to-make-persistence is always
> available on Intel's hardware so that flush-hint-table is not
> needed if the vNVDIMM is based on a real Intel's NVDIMM device.
>

We can let users of qemu (e.g. libvirt) detect whether the backend
device supports ADR, and pass a 'flush-hint' option to qemu only if
ADR is not supported.

> If the vNVDIMM device is based on the regular file, i think
> fsync is the bottleneck rather than this mmio-virtualization. :(
>

Yes, fsync() on the regular file is the bottleneck. We may either

1/ perform the host-side flush in an asynchronous way which will not
   block the vcpu too long,

or

2/ not provide a strong durability guarantee for a non-NVDIMM backend
   and not emulate the flush hint for the guest at all. (I know 1/
   does not provide a strong durability guarantee either.)

Haozhong
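Option 1/ above, deferring the host-side flush so the vcpu is not blocked for the full fsync(2), could be sketched as follows. This is a self-contained pthreads illustration of the hand-off pattern only, not QEMU code (QEMU would use its own thread-pool/AIO infrastructure and a completion interrupt back to the guest); the `flush_ctx`/`flush_post` names are made up for the sketch.

```c
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

/* One outstanding flush request per backend fd; a real implementation
 * would also need a way to signal completion back to the guest. */
struct flush_ctx {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    int fd;             /* backend file descriptor to fsync */
    bool pending;       /* a flush has been requested */
    bool stop;          /* worker should exit */
    unsigned completed; /* number of flushes done so far */
};

static void *flush_worker(void *opaque)
{
    struct flush_ctx *ctx = opaque;

    pthread_mutex_lock(&ctx->lock);
    for (;;) {
        while (!ctx->pending && !ctx->stop) {
            pthread_cond_wait(&ctx->cond, &ctx->lock);
        }
        if (ctx->stop && !ctx->pending) {
            break;
        }
        ctx->pending = false;
        pthread_mutex_unlock(&ctx->lock);

        /* The slow part runs here, off the vcpu thread. */
        fsync(ctx->fd);

        pthread_mutex_lock(&ctx->lock);
        ctx->completed++;
        pthread_cond_broadcast(&ctx->cond);
    }
    pthread_mutex_unlock(&ctx->lock);
    return NULL;
}

/* Called from the flush-hint MMIO write handler: post the request and
 * return immediately instead of calling fsync() synchronously. */
static void flush_post(struct flush_ctx *ctx)
{
    pthread_mutex_lock(&ctx->lock);
    ctx->pending = true;
    pthread_cond_broadcast(&ctx->cond);
    pthread_mutex_unlock(&ctx->lock);
}
```

As noted above, this trades away the durability guarantee: the MMIO write returns before the data is actually persistent.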
[ adding Christoph ]

On Tue, Apr 11, 2017 at 1:41 AM, Haozhong Zhang
<haozhong.zhang@intel.com> wrote:
> On 04/06/17 20:02 +0800, Xiao Guangrong wrote:
[...]
>> We can assume flush-CPU-cache-to-make-persistence is always
>> available on Intel's hardware so that flush-hint-table is not
>> needed if the vNVDIMM is based on a real Intel's NVDIMM device.
>
> We can let users of qemu (e.g. libvirt) detect whether the backend
> device supports ADR, and pass 'flush-hint' option to qemu only if ADR
> is not supported.

There currently is no ACPI mechanism to detect the presence of ADR.
Also, you still need the flush for fs metadata management.

>> If the vNVDIMM device is based on the regular file, i think
>> fsync is the bottleneck rather than this mmio-virtualization. :(
>
> Yes, fsync() on the regular file is the bottleneck. We may either
>
> 1/ perform the host-side flush in an asynchronous way which will not
>    block vcpu too long,
>
> or
>
> 2/ not provide strong durability guarantee for non-NVDIMM backend and
>    not emulate flush-hint for guest at all. (I know 1/ does not
>    provide strong durability guarantee either).

or

3/ Use device-dax as a stop-gap until we can get an efficient fsync()
   overhead reduction (or bypass) mechanism built and accepted for
   filesystem-dax.
On Tue, Apr 11, 2017 at 7:56 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> [ adding Christoph ]
[...]
> There currently is no ACPI mechanism to detect the presence of ADR.
> Also, you still need the flush for fs metadata management.
[...]
> 3/ Use device-dax as a stop-gap until we can get an efficient fsync()
>    overhead reduction (or bypass) mechanism built and accepted for
>    filesystem-dax.

I didn't realize we have a bigger problem with host filesystem-fsync
and that WPQ exits will not save us.
Applications that use device-dax in the guest may never trigger a WPQ
flush, because userspace flushing with device-dax is expected to be
safe. WPQ flush was never meant to be a persistency mechanism the way
it is proposed here; it's only meant to minimize the fallout from
potential ADR failure. My apologies for insinuating that it was viable.

So, until we solve this userspace flushing problem, virtualization must
not pass through any file except a device-dax instance for any
production workload.

Also these performance overheads seem prohibitive. We really want to
take whatever fsync minimization / bypass mechanism we come up with on
the host into a fast para-virtualized interface for the guest. Guests
need to be able to avoid hypervisor and host syscall overhead in the
fast path.
On Thu, Apr 20, 2017 at 12:49:21PM -0700, Dan Williams wrote:
[...]
> I didn't realize we have a bigger problem with host filesystem-fsync
> and that WPQ exits will not save us. Applications that use device-dax
> in the guest may never trigger a WPQ flush, because userspace flushing
> with device-dax is expected to be safe. WPQ flush was never meant to
> be a persistency mechanism the way it is proposed here, it's only
> meant to minimize the fallout from potential ADR failure. My apologies
> for insinuating that it was viable.
>
> So, until we solve this userspace flushing problem virtualization must
> not pass through any file except a device-dax instance for any
> production workload.

Okay. That's what I've assumed up until now and I think distros will
document this limitation.

> Also these performance overheads seem prohibitive. We really want to
> take whatever fsync minimization / bypass mechanism we come up with on
> the host into a fast para-virtualized interface for the guest. Guests
> need to be able to avoid hypervisor and host syscall overhead in the
> fast path.

It's hard to avoid the hypervisor if the host kernel file system needs
an fsync() to persist everything. There should be a fast path where the
host file is preallocated and no fancy file system features are in use
(e.g. deduplication, copy-on-write snapshots) where host file systems
don't need fsync().

Stefan
[ adding xfs and fsdevel ]

On Fri, Apr 21, 2017 at 6:56 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
[..]
>> So, until we solve this userspace flushing problem virtualization must
>> not pass through any file except a device-dax instance for any
>> production workload.
>
> Okay. That's what I've assumed up until now and I think distros will
> document this limitation.
>
>> Also these performance overheads seem prohibitive. We really want to
>> take whatever fsync minimization / bypass mechanism we come up with on
>> the host into a fast para-virtualized interface for the guest. Guests
>> need to be able to avoid hypervisor and host syscall overhead in the
>> fast path.
> It's hard to avoid the hypervisor if the host kernel file system needs
> an fsync() to persist everything. There should be a fast path where the
> host file is preallocated and no fancy file system features are in use
> (e.g. deduplication, copy-on-write snapshots) where host file systems
> don't need fsync().

So we've gone around and around on this with the XFS folks, with
various levels of disagreement about how to achieve synchronous
faulting or disabling metadata updates for a file.

I think at some point someone is going to want some fancy filesystem
feature *and* DAX *and* still want it to be fast. The current problem
is that if you want to checkpoint persistence at a high rate, think
committing updates to a tree data structure, an fsync() call is going
to burn a lot of cycles just to find out that there is nothing to do in
most cases. Especially when you've only touched a couple pointers in a
cache line, calling into the kernel to sync those writes is a
ridiculous proposition.

One of the current ideas to resolve this, instead of trying to
implement synchronous faulting and wrestling with the constraints that
puts on fs-fault paths, is to have synchronous notification of metadata
dirtying events to userspace. That notification mechanism would be
associated with something like an fsync2() library call that knows how
to bypass sys_fsync in the common case. Of course, this is still in the
idea phase, so until we can get a proof-of-concept on its feet this is
all subject to further debate.
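The fsync2() idea might look something like the sketch below. Everything here is hypothetical, since the proposed kernel notification mechanism does not exist: the `metadata_dirty` flag stands in for whatever per-file state the notification would maintain, and `pmem_flush_range()` stands in for a userspace CPU-cache flush (e.g. a CLWB loop).

```c
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical per-file state, which the proposed synchronous
 * notification of metadata-dirtying events would keep up to date.
 * Here it is just a flag set by hand to illustrate the control flow. */
struct fsync2_state {
    int fd;
    bool metadata_dirty;
    unsigned syscalls;   /* how many times we fell back to fsync(2) */
};

/* Userspace flush of CPU caches for the mmap'd range would go here;
 * modeled as a no-op in this sketch. */
static void pmem_flush_range(void *addr, size_t len)
{
    (void)addr;
    (void)len;
}

/* fsync2(): skip the syscall in the common case where only data in an
 * already-allocated, mmap'd range was written. */
static int fsync2(struct fsync2_state *st, void *addr, size_t len)
{
    pmem_flush_range(addr, len);
    if (!st->metadata_dirty) {
        return 0;               /* fast path: no kernel entry */
    }
    st->metadata_dirty = false;
    st->syscalls++;
    return fsync(st->fd);       /* slow path */
}
```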
On Thu, Apr 6, 2017 at 2:43 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
[...]
> I'm not sure there is anything we can do to make the case where the host
> kernel wants an fsync(2) fast :(.

A few thoughts:

* You don't need to call fsync in the case that the host file is a
  /dev/daxX.Y device, since there is no intervening filesystem.
* While the host implementation is fast, it still is not meant to be
  called very frequently. We could also look to make a paravirtualized
  way for it to complete asynchronously if the guest just wants to post
  a flush request and get notified of a completion at a later time.

* We're still actively looking for ways to make it safe and efficient
  to minimize sync overhead if an mmap write did not dirty filesystem
  data, so hopefully we can make that fsync call overhead smaller over
  time.

* I don't think we can just skip implementing support for this flush.
  Yes, it's virtualization unfriendly, but it completely defeats the
  persistence of persistent memory if the host filesystem is free to
  throw away writes that the guest expects to find after a surprise
  power loss.
On Thu, Apr 06, 2017 at 07:32:01AM -0700, Dan Williams wrote:
> On Thu, Apr 6, 2017 at 2:43 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
> * I don't think we can just skip implementing support for this flush.
>   Yes it's virtualization unfriendly, but it completely defeats the
>   persistence of persistent memory if the host filesystem is free to
>   throw away writes that the guest expects to find after a surprise
>   power loss.

Right, if we don't implement Address Flush Hints then the only safe
option for configurations that require them would be to print an error
and refuse to start.

This synchronous fsync(2) business is like taking a host page fault or
being scheduled out by the host scheduler. It means the vcpu is stuck
for a little bit. Hopefully not long enough to upset the guest kernel
or the application.

One of the things that guest software can do to help minimize the
performance impact is to avoid holding locks across the address flush
mmio write instruction.

Stefan
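That last point, don't hold locks across the flush write, is a guest-side pattern like the following sketch. Plain pthreads stand in for guest-kernel locking, and a volatile store stands in for the writeq() to the mapped flush hint address; none of this is code from the patch series.

```c
#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t region_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile uint64_t flush_hint;  /* stand-in for the mapped MMIO page */

/* Bad: the flush (possibly seconds long under virtualization) happens
 * with the lock held, stalling every other user of the region. */
static void commit_locked(uint64_t *slot, uint64_t val)
{
    pthread_mutex_lock(&region_lock);
    *slot = val;
    flush_hint = 1;                   /* writeq() -> vmexit -> fsync() */
    pthread_mutex_unlock(&region_lock);
}

/* Better: publish under the lock, drop it, then flush. Other threads
 * can proceed while this vcpu waits out the emulated WPQ flush. */
static void commit_unlocked_flush(uint64_t *slot, uint64_t val)
{
    pthread_mutex_lock(&region_lock);
    *slot = val;
    pthread_mutex_unlock(&region_lock);
    flush_hint = 1;
}
```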
On 04/06/17 10:43 +0100, Stefan Hajnoczi wrote:
> On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
[...]
> Do you have performance data? I'm concerned that Flush Hint Address
> hardware interface is not virtualization-friendly.

Some performance data below.

Host HW config:
  CPU: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz x 2 sockets w/ HT enabled
  MEM: 64 GB

  As I don't have NVDIMM hardware, I use files in an ext4 fs on a
  normal SATA SSD as the back storage of vNVDIMM.

Host SW config:
  Kernel: 4.10.1
  QEMU: commit ea2afcf with this patch series applied.
Guest config:
  For the flush hint enabled case, the following QEMU options are used:

    -enable-kvm -smp 4 -cpu host -machine pc,nvdimm \
    -m 4G,slots=4,maxmem=128G \
    -object memory-backend-file,id=mem1,share,mem-path=nvm-img,size=8G \
    -device nvdimm,id=nv1,memdev=mem1,reserved-size=4K,flush-hint \
    -hda GUEST_DISK_IMG -serial pty

  For the flush hint disabled case, the following QEMU options are used:

    -enable-kvm -smp 4 -cpu host -machine pc,nvdimm \
    -m 4G,slots=4,maxmem=128G \
    -object memory-backend-file,id=mem1,share,mem-path=nvm-img,size=8G \
    -device nvdimm,id=nv1,memdev=mem1 \
    -hda GUEST_DISK_IMG -serial pty

  nvm-img used above is created in the ext4 fs on the host SSD by
    dd if=/dev/zero of=nvm-img bs=1G count=8

  Guest kernel: 4.11.0-rc4

Benchmark in guest:
  mkfs.ext4 /dev/pmem0
  mount -o dax /dev/pmem0 /mnt
  dd if=/dev/zero of=/mnt/data bs=1G count=7   # warm up EPT mapping
  rm /mnt/data
  dd if=/dev/zero of=/mnt/data bs=1G count=7

and record the write speed reported by the last 'dd' command.

Result:
  - Flush hint disabled
    Varies from 161 MB/s to 708 MB/s, depending on how many fs/device
    flush operations are performed on the host side during the guest
    'dd'.

  - Flush hint enabled
    Varies from 164 MB/s to 546 MB/s, depending on how long fsync() in
    QEMU takes. Usually, there is at least one fsync() during one 'dd'
    command that takes several seconds (the worst one takes 39 s).
    Worse, during those long host-side fsync() operations, the guest
    kernel complained of stalls.

Some thoughts:

- If non-NVDIMM hardware is used as the back store of vNVDIMM, QEMU
  may perform the host-side flush operations asynchronously with the
  VM, which will not block the VM too long but sacrifices the
  durability guarantee.

- If a physical NVDIMM is used as the back store and ADR is supported
  on the host, QEMU can rely on ADR to guarantee data durability and
  will not need to emulate the flush hint for the guest.
- If a physical NVDIMM is used as the back store and ADR is not
  supported on the host, QEMU will still need to emulate the flush
  hint for the guest and will need a faster approach than fsync() to
  trigger writes to the host flush hint.

  Could the kernel expose an interface to allow userland (i.e. QEMU in
  this case) to directly write to the flush hint of an NVDIMM region?

Haozhong

> In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:
>
>     wmb();
>     for (i = 0; i < nd_region->ndr_mappings; i++)
>         if (ndrd_get_flush_wpq(ndrd, i, 0))
>             writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
>     wmb();
>
> That looks pretty lightweight - it's an MMIO write between write
> barriers.
>
> This patch implements the MMIO write like this:
>
>     void nvdimm_flush(NVDIMMDevice *nvdimm)
>     {
>         if (nvdimm->backend_fd != -1) {
>             /*
>              * If the backend store is a physical NVDIMM device, fsync()
>              * will trigger the flush via the flush hint on the host device.
>              */
>             fsync(nvdimm->backend_fd);
>         }
>     }
>
> The MMIO store instruction turned into a synchronous fsync(2) system
> call plus vmexit/vmenter and QEMU userspace context switch:
>
> 1. The vcpu blocks during the fsync(2) system call. The MMIO write
>    instruction has an unexpected and huge latency.
>
> 2. The vcpu thread holds the QEMU global mutex so all other threads
>    (including the monitor) are blocked during fsync(2). Other vcpu
>    threads may block if they vmexit.
>
> It is hard to implement this efficiently in QEMU. This is why I said
> the hardware interface is not virtualization-friendly. It's cheap on
> real hardware but expensive under virtualization.
>
> We should think about the optimal way of implementing Flush Hint
> Addresses in QEMU. But if there is no reasonable way to implement them
> then I think it's better *not* to implement them, just like the Block
> Window feature which is also not virtualization-friendly. Users who
> want a block device can use virtio-blk.
> I don't think NVDIMM Block Window can achieve better performance than
> virtio-blk under virtualization (although I'm happy to be proven
> wrong).
>
> Some ideas for a faster implementation:
>
> 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
>    global mutex. Little synchronization is necessary as long as the
>    NVDIMM device isn't hot unplugged (not yet supported anyway).
>
> 2. Can the host kernel provide a way to mmap Address Flush Hints from
>    the physical NVDIMM in cases where the configuration does not
>    require host kernel interception? That way QEMU can map the
>    physical NVDIMM's Address Flush Hints directly into the guest.
>    The hypervisor is bypassed and performance would be good.
>
> I'm not sure there is anything we can do to make the case where the
> host kernel wants an fsync(2) fast :(.
>
> Benchmark results would be important for deciding how big the problem
> is.
On Tue, Apr 11, 2017 at 02:34:26PM +0800, Haozhong Zhang wrote:
> On 04/06/17 10:43 +0100, Stefan Hajnoczi wrote:
> > On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:
> > > This patch series constructs the flush hint address structures for
> > > nvdimm devices in QEMU.
> > >
> > > It's of course not for 2.9. I send it out early in order to get
> > > comments on one point I'm uncertain about (see the detailed
> > > explanation below). Thanks for any comments in advance!
> >
> > > Background
> > > ---------------
> >
> > Extra background:
> >
> > Flush Hint Addresses are necessary because:
> >
> > 1. Some hardware configurations may require them. In other words, a
> >    cache flush instruction is not enough to persist data.
> >
> > 2. The host file system may need fsync(2) calls (e.g. to persist
> >    metadata changes).
> >
> > Without Flush Hint Addresses only some NVDIMM configurations actually
> > guarantee data persistence.
> >
> > > Flush hint address structure is a substructure of NFIT and specifies
> > > one or more addresses, namely Flush Hint Addresses. Software can write
> > > to any one of these flush hint addresses to cause any preceding writes
> > > to the NVDIMM region to be flushed out of the intervening platform
> > > buffers to the targeted NVDIMM. More details can be found in ACPI Spec
> > > 6.1, Section 5.2.25.8 "Flush Hint Address Structure".
> >
> > Do you have performance data? I'm concerned that Flush Hint Address
> > hardware interface is not virtualization-friendly.
>
> Some performance data below.
>
> Host HW config:
>   CPU: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz x 2 sockets w/ HT enabled
>   MEM: 64 GB
>
> As I don't have NVDIMM hardware, I use files in an ext4 fs on a normal
> SATA SSD as the back storage of vNVDIMM.
>
> Host SW config:
>   Kernel: 4.10.1
>   QEMU: commit ea2afcf with this patch series applied.
> Guest config:
>   For the flush hint enabled case, the following QEMU options are used
>     -enable-kvm -smp 4 -cpu host -machine pc,nvdimm \
>     -m 4G,slots=4,maxmem=128G \
>     -object memory-backend-file,id=mem1,share,mem-path=nvm-img,size=8G \
>     -device nvdimm,id=nv1,memdev=mem1,reserved-size=4K,flush-hint \
>     -hda GUEST_DISK_IMG -serial pty
>
>   For the flush hint disabled case, the following QEMU options are used
>     -enable-kvm -smp 4 -cpu host -machine pc,nvdimm \
>     -m 4G,slots=4,maxmem=128G \
>     -object memory-backend-file,id=mem1,share,mem-path=nvm-img,size=8G \
>     -device nvdimm,id=nv1,memdev=mem1 \
>     -hda GUEST_DISK_IMG -serial pty
>
>   nvm-img used above is created in the ext4 fs on the host SSD by
>     dd if=/dev/zero of=nvm-img bs=1G count=8
>
> Guest kernel: 4.11.0-rc4
>
> Benchmark in guest:
>   mkfs.ext4 /dev/pmem0
>   mount -o dax /dev/pmem0 /mnt
>   dd if=/dev/zero of=/mnt/data bs=1G count=7   # warm up EPT mapping
>   rm /mnt/data
>   dd if=/dev/zero of=/mnt/data bs=1G count=7
>
> and record the write speed reported by the last 'dd' command.
>
> Result:
> - Flush hint disabled
>   Varies from 161 MB/s to 708 MB/s, depending on how many fs/device
>   flush operations are performed on the host side during the guest
>   'dd'.
>
> - Flush hint enabled
>   Varies from 164 MB/s to 546 MB/s, depending on how long fsync() in
>   QEMU takes. Usually, there is at least one fsync() during one 'dd'
>   command that takes several seconds (the worst one takes 39 s).
>
>   Worse, during those long host-side fsync() operations, the guest
>   kernel complained about stalls.

I'm surprised that maximum throughput was 708 MB/s. The guest is
DAX-aware and the write(2) syscall is a memcpy. I expected higher
numbers without flush hints.

Also strange that throughput varied so greatly. A benchmark that
varies 4x is not useful since it's hard to tell if anything <4x
indicates a significant performance difference. In other words, the
noise is huge!

What results do you get on the host?
Dan: Any comments on this benchmark and is there a recommended way to
benchmark NVDIMM?

> Some thoughts:
>
> - If non-NVDIMM hardware is used as the back store of vNVDIMM, QEMU
>   may perform the host-side flush operations asynchronously with the
>   VM, which will not block the VM too long but sacrifices the
>   durability guarantee.
>
> - If a physical NVDIMM is used as the back store and ADR is supported
>   on the host, QEMU can rely on ADR to guarantee data durability and
>   will not need to emulate the flush hint for the guest.
>
> - If a physical NVDIMM is used as the back store and ADR is not
>   supported on the host, QEMU will still need to emulate the flush
>   hint for the guest and will need a faster approach than fsync() to
>   trigger writes to the host flush hint.
>
>   Could the kernel expose an interface to allow userland (i.e. QEMU
>   in this case) to directly write to the flush hint of an NVDIMM
>   region?
>
> Haozhong
>
> > In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:
> >
> >     wmb();
> >     for (i = 0; i < nd_region->ndr_mappings; i++)
> >         if (ndrd_get_flush_wpq(ndrd, i, 0))
> >             writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
> >     wmb();
> >
> > That looks pretty lightweight - it's an MMIO write between write
> > barriers.
> >
> > This patch implements the MMIO write like this:
> >
> >     void nvdimm_flush(NVDIMMDevice *nvdimm)
> >     {
> >         if (nvdimm->backend_fd != -1) {
> >             /*
> >              * If the backend store is a physical NVDIMM device, fsync()
> >              * will trigger the flush via the flush hint on the host device.
> >              */
> >             fsync(nvdimm->backend_fd);
> >         }
> >     }
> >
> > The MMIO store instruction turned into a synchronous fsync(2) system
> > call plus vmexit/vmenter and QEMU userspace context switch:
> >
> > 1. The vcpu blocks during the fsync(2) system call. The MMIO write
> >    instruction has an unexpected and huge latency.
> >
> > 2. The vcpu thread holds the QEMU global mutex so all other threads
> >    (including the monitor) are blocked during fsync(2).
> >    Other vcpu threads may block if they vmexit.
> >
> > It is hard to implement this efficiently in QEMU. This is why I said
> > the hardware interface is not virtualization-friendly. It's cheap on
> > real hardware but expensive under virtualization.
> >
> > We should think about the optimal way of implementing Flush Hint
> > Addresses in QEMU. But if there is no reasonable way to implement
> > them then I think it's better *not* to implement them, just like the
> > Block Window feature which is also not virtualization-friendly.
> > Users who want a block device can use virtio-blk. I don't think
> > NVDIMM Block Window can achieve better performance than virtio-blk
> > under virtualization (although I'm happy to be proven wrong).
> >
> > Some ideas for a faster implementation:
> >
> > 1. Use memory_region_clear_global_locking() to avoid taking the QEMU
> >    global mutex. Little synchronization is necessary as long as the
> >    NVDIMM device isn't hot unplugged (not yet supported anyway).
> >
> > 2. Can the host kernel provide a way to mmap Address Flush Hints
> >    from the physical NVDIMM in cases where the configuration does
> >    not require host kernel interception? That way QEMU can map the
> >    physical NVDIMM's Address Flush Hints directly into the guest.
> >    The hypervisor is bypassed and performance would be good.
> >
> > I'm not sure there is anything we can do to make the case where the
> > host kernel wants an fsync(2) fast :(.
> >
> > Benchmark results would be important for deciding how big the
> > problem is.