[Qemu-devel] [PATCH 0/7] KVM: MMU: fast write protect
Posted by guangrong.xiao@gmail.com 6 years, 10 months ago
From: Xiao Guangrong <xiaoguangrong@tencent.com>

Background
==========
The original idea of this patchset is from Avi, who raised it on
the mailing list during my vMMU development some years ago

This patchset introduces an extremely fast way to write protect all
guest memory. Compared with the ordinary algorithm, which write protects
last-level sptes one by one based on the rmap, it simply updates the
generation number to ask all vCPUs to reload their root page tables; in
particular, this can be done outside of mmu-lock, so it does not hurt the
vMMU's parallelism. It is an O(1) algorithm that depends on neither the
capacity of the guest's memory nor the number of the guest's vCPUs

Implementation
==============
When write protection for all guest memory is required, we update
the global generation number and ask the vCPUs to reload their root
page tables by calling kvm_reload_remote_mmus(); the global number
is protected by slots_lock

When reloading its root page table, the vCPU compares the root page
table's generation number with the current global number; if they do
not match, it makes all the entries in the shadow page readonly and
goes directly back to the guest. So read accesses still proceed
smoothly without KVM's involvement, while write accesses trigger
page faults

If the page fault is triggered by a write operation, KVM moves the
write protection from the upper level to the lower level page, by
making all the entries in the lower page readonly first and then making
the upper level entry writable; this operation is repeated until we
reach the last spte

In order to speed up the process of making all entries readonly, we
introduce possible_writable_spte_bitmap, which indicates the writable
sptes, and possiable_writable_sptes, a counter of the number of writable
sptes in the shadow page. They work very efficiently, as in the worst
case usually only one entry in the PML4 (< 512G), a few entries in the
PDPT (one entry covers 1G of memory), and a few PDEs and PTEs need to be
write protected. Note that the number of page faults and TLB flushes is
the same as with the ordinary algorithm
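
To make the flow concrete, here is a small standalone C model of the
global indicator and the lazy per-page check (a toy sketch only, not
the kernel code in this series; names and types are simplified):

#include <stdbool.h>
#include <stdint.h>

#define WP_ALL_ENABLE_BIT  63
#define WP_ALL_ENABLE_MASK (1ull << WP_ALL_ENABLE_BIT)
#define WP_ALL_GEN_MASK    (~0ull & ~WP_ALL_ENABLE_MASK)

struct toy_shadow_page {
	uint64_t wp_all_gen;      /* generation this page was last synced to */
	unsigned int nr_writable; /* counter of possibly-writable entries */
};

/* write-protect-all request: O(1), only bump the global indicator */
static uint64_t wp_all_enable(uint64_t indicator)
{
	uint64_t gen = (indicator & WP_ALL_GEN_MASK) + 1;

	return WP_ALL_ENABLE_MASK | (gen & WP_ALL_GEN_MASK);
}

/* on root reload or upper-level link: does this page need to be synced? */
static bool wp_all_needs_sync(const struct toy_shadow_page *sp,
			      uint64_t indicator)
{
	if (!(indicator & WP_ALL_ENABLE_MASK))
		return false;            /* write-protect-all is off */
	if (sp->wp_all_gen == (indicator & WP_ALL_GEN_MASK))
		return false;            /* already synced to this generation */
	return sp->nr_writable != 0;     /* only touch pages with writable entries */
}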

Performance Data
================
Case 1) For a VM with 3G of memory and 12 vCPUs, we noticed that:
   a: the time required for the dirty log (ns)
       before           after
       64289121         137654      +46603%

   b: the performance of memory writes after the dirty log, i.e., the
      dirty log path is not running in parallel with page faults; the time
      required for all vCPUs in the VM to write all 3G of memory (ns):
       before           after
       281735017        291150923   -3%
      We think the 3% impact is acceptable, particularly as mmu-lock
      contention is not taken into account in this case

Case 2) For a VM with 30G of memory and 8 vCPUs, we do live migration;
   at the same time, a test case greedily and repeatedly writes 3000M of
   memory in the VM.

   2.1) for a newly booted VM, i.e., page faults are required to map guest
        memory in, we noticed that:
        a: the dirty page rate (pages):
            before       after
            333092       497266     +49%
	that means the performance of the VM being migrated is hugely
	improved, as the contention on mmu-lock is reduced

	b: the time to complete live migration (ms):
	    before       after
	    12532        18467     -47%
	not surprisingly, the time required to complete live migration
	increases, as the VM is able to generate more dirty pages

   2.2) pre-write the VM first, then run the test case and do live
        migration, i.e., not many page faults are needed to map guest
        memory in; we noticed that:
	a: the dirty page rate (pages):
	    before       after
	    447435       449284  +0%
	
	b: the time to complete live migration (ms)
	    before       after
	    31068        28310  +10%
	in this case, we also noticed that the first dirty-log pass took
	156 ms before the patchset, while afterwards only 6 ms is needed
   
The patch applied to QEMU
=========================
A draft patch to enable this functionality in QEMU follows:

diff --git a/kvm-all.c b/kvm-all.c
index 90b8573..9ebe1ac 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -122,6 +122,7 @@ bool kvm_direct_msi_allowed;
 bool kvm_ioeventfd_any_length_allowed;
 bool kvm_msi_use_devid;
 static bool kvm_immediate_exit;
+static bool kvm_write_protect_all;
 
 static const KVMCapabilityInfo kvm_required_capabilites[] = {
     KVM_CAP_INFO(USER_MEMORY),
@@ -440,6 +441,26 @@ static int kvm_get_dirty_pages_log_range(MemoryRegionSection *section,
 
 #define ALIGN(x, y)  (((x)+(y)-1) & ~((y)-1))
 
+static bool kvm_write_protect_all_is_supported(KVMState *s)
+{
+	return kvm_check_extension(s, KVM_CAP_X86_WRITE_PROTECT_ALL_MEM) &&
+		kvm_check_extension(s, KVM_CAP_X86_DIRTY_LOG_WITHOUT_WRITE_PROTECT);
+}
+
+static void kvm_write_protect_all_mem(bool write)
+{
+	int ret;
+
+	if (!kvm_write_protect_all)
+		return;
+
+	ret = kvm_vm_ioctl(kvm_state, KVM_WRITE_PROTECT_ALL_MEM, !!write);
+	if (ret < 0) {
+	  printf("ioctl failed %d\n", errno);
+	  abort();
+	}
+}
+
 /**
  * kvm_physical_sync_dirty_bitmap - Grab dirty bitmap from kernel space
  * This function updates qemu's dirty bitmap using
@@ -490,6 +511,7 @@ static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
         memset(d.dirty_bitmap, 0, allocated_size);
 
         d.slot = mem->slot | (kml->as_id << 16);
+        d.flags = kvm_write_protect_all ? KVM_DIRTY_LOG_WITHOUT_WRITE_PROTECT : 0;
         if (kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &d) == -1) {
             DPRINTF("ioctl failed %d\n", errno);
             ret = -1;
@@ -1622,6 +1644,9 @@ static int kvm_init(MachineState *ms)
     }
 
     kvm_immediate_exit = kvm_check_extension(s, KVM_CAP_IMMEDIATE_EXIT);
+    kvm_write_protect_all = kvm_write_protect_all_is_supported(s);
+    printf("Write protect all is %s.\n", kvm_write_protect_all ? "supported" : "unsupported");
+    memory_register_write_protect_all(kvm_write_protect_all_mem);
     s->nr_slots = kvm_check_extension(s, KVM_CAP_NR_MEMSLOTS);
 
     /* If unspecified, use the default value */
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index 4e082a8..7c056ef 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -443,9 +443,12 @@ struct kvm_interrupt {
 };
 
 /* for KVM_GET_DIRTY_LOG */
+
+#define KVM_DIRTY_LOG_WITHOUT_WRITE_PROTECT	0x1
+
 struct kvm_dirty_log {
 	__u32 slot;
-	__u32 padding1;
+	__u32 flags;
 	union {
 		void *dirty_bitmap; /* one bit per page */
 		__u64 padding2;
@@ -884,6 +887,9 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PPC_MMU_HASH_V3 135
 #define KVM_CAP_IMMEDIATE_EXIT 136
 
+#define KVM_CAP_X86_WRITE_PROTECT_ALL_MEM 144
+#define KVM_CAP_X86_DIRTY_LOG_WITHOUT_WRITE_PROTECT 145
+
 #ifdef KVM_CAP_IRQ_ROUTING
 
 struct kvm_irq_routing_irqchip {
@@ -1126,6 +1132,7 @@ enum kvm_device_type {
 					struct kvm_userspace_memory_region)
 #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
 #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
+#define KVM_WRITE_PROTECT_ALL_MEM _IO(KVMIO,  0x49)
 
 /* enable ucontrol for s390 */
 struct kvm_s390_ucas_mapping {
diff --git a/memory.c b/memory.c
index 4c95aaf..b836675 100644
--- a/memory.c
+++ b/memory.c
@@ -809,6 +809,13 @@ static void address_space_update_ioeventfds(AddressSpace *as)
     flatview_unref(view);
 }
 
+static write_protect_all_fn write_func;
+void memory_register_write_protect_all(write_protect_all_fn func)
+{
+	printf("Write function is being registering...\n");
+	write_func = func;
+}
+
 static void address_space_update_topology_pass(AddressSpace *as,
                                                const FlatView *old_view,
                                                const FlatView *new_view,
@@ -859,6 +866,8 @@ static void address_space_update_topology_pass(AddressSpace *as,
                     MEMORY_LISTENER_UPDATE_REGION(frnew, as, Reverse, log_stop,
                                                   frold->dirty_log_mask,
                                                   frnew->dirty_log_mask);
+			if (write_func)
+				write_func(false);
                 }
             }
 
@@ -2267,6 +2276,9 @@ void memory_global_dirty_log_sync(void)
         }
         flatview_unref(view);
     }
+
+    if (write_func)
+        write_func(true);
 }

Xiao Guangrong (7):
  KVM: MMU: correct the behavior of mmu_spte_update_no_track
  KVM: MMU: introduce possible_writable_spte_bitmap
  KVM: MMU: introduce kvm_mmu_write_protect_all_pages
  KVM: MMU: enable KVM_WRITE_PROTECT_ALL_MEM
  KVM: MMU: allow dirty log without write protect
  KVM: MMU: clarify fast_pf_fix_direct_spte
  KVM: MMU: stop using mmu_spte_get_lockless under mmu-lock

 arch/x86/include/asm/kvm_host.h |  25 +++-
 arch/x86/kvm/mmu.c              | 267 ++++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu.h              |   1 +
 arch/x86/kvm/paging_tmpl.h      |  13 +-
 arch/x86/kvm/x86.c              |   7 ++
 include/uapi/linux/kvm.h        |   8 +-
 virt/kvm/kvm_main.c             |  15 ++-
 7 files changed, 317 insertions(+), 19 deletions(-)

-- 
2.9.3


Re: [Qemu-devel] [PATCH 0/7] KVM: MMU: fast write protect
Posted by Paolo Bonzini 6 years, 10 months ago
So if I understand correctly this relies on userspace doing:

	1) KVM_GET_DIRTY_LOG without write protect
	2) KVM_WRITE_PROTECT_ALL_MEM
	<only look now at the dirty log snapshot>

Writes may happen between 1 and 2; they are not represented in the live
dirty bitmap but it's okay because they are in the snapshot and will
only be used after 2.  This is similar to what the dirty page ring
buffer patches do; in fact, the KVM_WRITE_PROTECT_ALL_MEM ioctl is very
similar to KVM_RESET_DIRTY_PAGES in those patches.
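
In userspace that sequence would look roughly like the sketch below. It
assumes the uapi additions from this series (the kvm_dirty_log "flags"
field, KVM_DIRTY_LOG_WITHOUT_WRITE_PROTECT and KVM_WRITE_PROTECT_ALL_MEM)
are present in <linux/kvm.h>; error handling is elided:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static void sync_dirty_log_snapshots(int vm_fd, int nr_slots, void *bitmaps[])
{
	struct kvm_dirty_log log;
	int slot;

	/* 1) grab each slot's snapshot without dropping write access */
	for (slot = 0; slot < nr_slots; slot++) {
		memset(&log, 0, sizeof(log));
		log.slot = slot;
		log.flags = KVM_DIRTY_LOG_WITHOUT_WRITE_PROTECT;
		log.dirty_bitmap = bitmaps[slot];
		ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);
	}

	/* 2) re-arm write protection for all guest memory in one go */
	ioctl(vm_fd, KVM_WRITE_PROTECT_ALL_MEM, 1);

	/* only now is it safe to consume the snapshots in bitmaps[] */
}

QEMU's kvm_physical_sync_dirty_bitmap() above does step 1 per slot, and
the draft patch issues step 2 at the end of memory_global_dirty_log_sync().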

On 03/05/2017 12:52, guangrong.xiao@gmail.com wrote:
> Compared with the ordinary algorithm, which write protects last-level
> sptes one by one based on the rmap, it simply updates the generation
> number to ask all vCPUs to reload their root page tables; in
> particular, this can be done outside of mmu-lock, so it does not hurt
> the vMMU's parallelism.

This is clever.

For processors that have PML, write protecting is only done on large
pages and only for splitting purposes; not for dirty page tracking
process at 4k granularity.  In this case, I think that you should do
nothing in the new write-protect-all ioctl?

Also, I wonder how the alternative write protection mechanism would
affect performance of the dirty page ring buffer patches.  You would do
the write protection of all memory at the end of
kvm_vm_ioctl_reset_dirty_pages.  You wouldn't even need a separate
ioctl, which is nice.  On the other hand, checkpoints would be more
frequent and most pages would be write-protected, so it would be more
expensive to rebuild the shadow page tables...

Thanks,

Paolo

> @@ -490,6 +511,7 @@ static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
>          memset(d.dirty_bitmap, 0, allocated_size);
>  
>          d.slot = mem->slot | (kml->as_id << 16);
> +        d.flags = kvm_write_protect_all ? KVM_DIRTY_LOG_WITHOUT_WRITE_PROTECT : 0;
>          if (kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &d) == -1) {
>              DPRINTF("ioctl failed %d\n", errno);
>              ret = -1;

How would this work when kvm_physical_sync_dirty_bitmap is called from
memory_region_sync_dirty_bitmap rather than
memory_region_global_dirty_log_sync?

Thanks,

Paolo

Re: [Qemu-devel] [PATCH 0/7] KVM: MMU: fast write protect
Posted by Xiao Guangrong 6 years, 10 months ago

On 05/03/2017 08:28 PM, Paolo Bonzini wrote:
> So if I understand correctly this relies on userspace doing:
> 
> 	1) KVM_GET_DIRTY_LOG without write protect
> 	2) KVM_WRITE_PROTECT_ALL_MEM
> 	<only look now at the dirty log snapshot>
> 
> Writes may happen between 1 and 2; they are not represented in the live
> dirty bitmap but it's okay because they are in the snapshot and will
> only be used after 2.  This is similar to what the dirty page ring
> buffer patches do; in fact, the KVM_WRITE_PROTECT_ALL_MEM ioctl is very
> similar to KVM_RESET_DIRTY_PAGES in those patches.
> 

You are right. After 1) and 2), any page that has been modified is
either in the bitmap returned to userspace or in the memslot's bitmap,
i.e., no dirty page is lost.

> On 03/05/2017 12:52, guangrong.xiao@gmail.com wrote:
>> Compared with the ordinary algorithm, which write protects last-level
>> sptes one by one based on the rmap, it simply updates the generation
>> number to ask all vCPUs to reload their root page tables; in
>> particular, this can be done outside of mmu-lock, so it does not hurt
>> the vMMU's parallelism.
> 
> This is clever.
> 
> For processors that have PML, write protecting is only done on large
> pages and only for splitting purposes; not for dirty page tracking
> process at 4k granularity.  In this case, I think that you should do
> nothing in the new write-protect-all ioctl?

Good point, thanks for pointing it out.
Doing nothing in write-protect-all() is not acceptable as it breaks
its semantics. :(

Furthermore, userspace has no knowledge of whether PML is enabled (it
could be queried via sysfs, but that is not a good approach for QEMU),
so it is difficult for userspace to know when to use write-protect-all.
Maybe we can make KVM_CAP_X86_WRITE_PROTECT_ALL_MEM return false if
PML is enabled?
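
i.e., something along these lines in kvm_vm_ioctl_check_extension()
(just a sketch of the idea, not a posted patch; pml_enabled() stands in
for whatever predicate KVM would actually use to see if PML is active):

	case KVM_CAP_X86_WRITE_PROTECT_ALL_MEM:
		/* hide the capability when PML already handles dirty tracking */
		r = !pml_enabled(kvm);
		break;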

> 
> Also, I wonder how the alternative write protection mechanism would
> affect performance of the dirty page ring buffer patches.  You would do
> the write protection of all memory at the end of
> kvm_vm_ioctl_reset_dirty_pages.  You wouldn't even need a separate
> ioctl, which is nice.  On the other hand, checkpoints would be more
> frequent and most pages would be write-protected, so it would be more
> expensive to rebuild the shadow page tables...

Yup, write-protect-all can indeed improve reset_dirty_pages; I will
apply your idea after reset_dirty_pages is merged.

However, we still prefer to have a separate ioctl for write-protect-all
which cooperates with KVM_GET_DIRTY_LOG to improve live migration,
which should not always depend on checkpointing.

> 
> Thanks,
> 
> Paolo
> 
>> @@ -490,6 +511,7 @@ static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
>>           memset(d.dirty_bitmap, 0, allocated_size);
>>   
>>           d.slot = mem->slot | (kml->as_id << 16);
>> +        d.flags = kvm_write_protect_all ? KVM_DIRTY_LOG_WITHOUT_WRITE_PROTECT : 0;
>>           if (kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &d) == -1) {
>>               DPRINTF("ioctl failed %d\n", errno);
>>               ret = -1;
> 
> How would this work when kvm_physical_sync_dirty_bitmap is called from
> memory_region_sync_dirty_bitmap rather than
> memory_region_global_dirty_log_sync?

You are right, we did not consider all the cases carefully; we will fix
it when the patch is formally pushed to QEMU.

Thank you, Paolo!


Re: [Qemu-devel] [PATCH 0/7] KVM: MMU: fast write protect
Posted by Paolo Bonzini 6 years, 10 months ago

On 03/05/2017 16:50, Xiao Guangrong wrote:
> Furthermore, userspace has no knowledge of whether PML is enabled (it
> could be queried via sysfs, but that is not a good approach for QEMU),
> so it is difficult for userspace to know when to use write-protect-all.
> Maybe we can make KVM_CAP_X86_WRITE_PROTECT_ALL_MEM return false if
> PML is enabled?

Yes, that's a good idea.  Though it's a pity that, with PML, setting the
dirty bit will still do the massive walk of the rmap.  At least with
reset_dirty_pages it's done a little bit at a time.

>> Also, I wonder how the alternative write protection mechanism would
>> affect performance of the dirty page ring buffer patches.  You would do
>> the write protection of all memory at the end of
>> kvm_vm_ioctl_reset_dirty_pages.  You wouldn't even need a separate
>> ioctl, which is nice.  On the other hand, checkpoints would be more
>> frequent and most pages would be write-protected, so it would be more
>> expensive to rebuild the shadow page tables...
> 
> Yup, write-protect-all can indeed improve reset_dirty_pages; I will
> apply your idea after reset_dirty_pages is merged.
> 
> However, we still prefer to have a separate ioctl for write-protect-all
> which cooperates with KVM_GET_DIRTY_LOG to improve live migration,
> which should not always depend on checkpointing.

Ok, I plan to merge the dirty ring pages early in 4.13 development.

Paolo

Re: [Qemu-devel] [PATCH 0/7] KVM: MMU: fast write protect
Posted by Xiao Guangrong 6 years, 10 months ago

On 05/03/2017 10:57 PM, Paolo Bonzini wrote:
> 
> 
> On 03/05/2017 16:50, Xiao Guangrong wrote:
>> Furthermore, userspace has no knowledge of whether PML is enabled (it
>> could be queried via sysfs, but that is not a good approach for QEMU),
>> so it is difficult for userspace to know when to use write-protect-all.
>> Maybe we can make KVM_CAP_X86_WRITE_PROTECT_ALL_MEM return false if
>> PML is enabled?
> 
> Yes, that's a good idea.  Though it's a pity that, with PML, setting the
> dirty bit will still do the massive walk of the rmap.  At least with
> reset_dirty_pages it's done a little bit at a time.
> 
>>> Also, I wonder how the alternative write protection mechanism would
>>> affect performance of the dirty page ring buffer patches.  You would do
>>> the write protection of all memory at the end of
>>> kvm_vm_ioctl_reset_dirty_pages.  You wouldn't even need a separate
>>> ioctl, which is nice.  On the other hand, checkpoints would be more
>>> frequent and most pages would be write-protected, so it would be more
>>> expensive to rebuild the shadow page tables...
>>
>> Yup, write-protect-all can indeed improve reset_dirty_pages; I will
>> apply your idea after reset_dirty_pages is merged.
>>
>> However, we still prefer to have a separate ioctl for write-protect-all
>> which cooperates with KVM_GET_DIRTY_LOG to improve live migration,
>> which should not always depend on checkpointing.
> 
> Ok, I plan to merge the dirty ring pages early in 4.13 development.

Great.

As there is no conflict between these two patchsets, except that dirty
ring pages benefits from write-protect-all, I think they can be
developed and iterated independently, right?

Or do you prefer to merge dirty ring pages first and then review the
new version of this patchset later?

Thanks!


Re: [Qemu-devel] [PATCH 0/7] KVM: MMU: fast write protect
Posted by Paolo Bonzini 6 years, 10 months ago

On 04/05/2017 05:36, Xiao Guangrong wrote:
> Great.
> 
> As there is no conflict between these two patchsets, except that dirty
> ring pages benefits from write-protect-all, I think they can be
> developed and iterated independently, right?

I can certainly start reviewing this one.

Paolo

> Or do you prefer to merge dirty ring pages first and then review the
> new version of this patchset later?

Re: [Qemu-devel] [PATCH 0/7] KVM: MMU: fast write protect
Posted by Xiao Guangrong 6 years, 10 months ago
Ping...

Sorry to disturb you; this is just to make sure this patchset does not get missed. :)

On 05/04/2017 03:06 PM, Paolo Bonzini wrote:
> 
> 
> On 04/05/2017 05:36, Xiao Guangrong wrote:
>> Great.
>>
>> As there is no conflict between these two patchsets, except that dirty
>> ring pages benefits from write-protect-all, I think they can be
>> developed and iterated independently, right?
> 
> I can certainly start reviewing this one.
> 
> Paolo
> 
>> Or do you prefer to merge dirty ring pages first and then review the
>> new version of this patchset later?

Re: [Qemu-devel] [PATCH 0/7] KVM: MMU: fast write protect
Posted by Paolo Bonzini 6 years, 10 months ago

On 23/05/2017 04:23, Xiao Guangrong wrote:
> 
> Ping...
> 
> Sorry to disturb you; this is just to make sure this patchset does not get missed. :)

It won't. :)  I'm going to look at it and the dirty page ring buffer
this week.

Paolo

Re: [Qemu-devel] [PATCH 0/7] KVM: MMU: fast write protect
Posted by Xiao Guangrong 6 years, 9 months ago

On 05/30/2017 12:48 AM, Paolo Bonzini wrote:
> 
> 
> On 23/05/2017 04:23, Xiao Guangrong wrote:
>>
>> Ping...
>>
>> Sorry to disturb you; this is just to make sure this patchset does not get missed. :)
> 
> It won't. :)  I'm going to look at it and the dirty page ring buffer
> this week.

Ping.. :)

Re: [Qemu-devel] [PATCH 0/7] KVM: MMU: fast write protect
Posted by Jay Zhou 6 years, 9 months ago
On 2017/5/3 18:52, guangrong.xiao@gmail.com wrote:
> [...]
> diff --git a/memory.c b/memory.c
> index 4c95aaf..b836675 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -809,6 +809,13 @@ static void address_space_update_ioeventfds(AddressSpace *as)
>       flatview_unref(view);
>   }
>
> +static write_protect_all_fn write_func;

I think there should be a declaration in memory.h,

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 7fc3f48..31f3098 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1152,6 +1152,9 @@ void memory_global_dirty_log_start(void);
   */
  void memory_global_dirty_log_stop(void);

+typedef void (*write_protect_all_fn)(bool write);
+void memory_register_write_protect_all(write_protect_all_fn func);
+
  void mtree_info(fprintf_function mon_printf, void *f);

-- 
Best Regards,
Jay Zhou


Re: [Qemu-devel] [PATCH 0/7] KVM: MMU: fast write protect
Posted by Xiao Guangrong 6 years, 9 months ago

On 06/05/2017 03:36 PM, Jay Zhou wrote:

>>   /* enable ucontrol for s390 */
>>   struct kvm_s390_ucas_mapping {
>> diff --git a/memory.c b/memory.c
>> index 4c95aaf..b836675 100644
>> --- a/memory.c
>> +++ b/memory.c
>> @@ -809,6 +809,13 @@ static void address_space_update_ioeventfds(AddressSpace *as)
>>       flatview_unref(view);
>>   }
>>
>> +static write_protect_all_fn write_func;
> 
> I think there should be a declaration in memory.h,
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 7fc3f48..31f3098 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -1152,6 +1152,9 @@ void memory_global_dirty_log_start(void);
>    */
>   void memory_global_dirty_log_stop(void);
> 
> +typedef void (*write_protect_all_fn)(bool write);
> +void memory_register_write_protect_all(write_protect_all_fn func);
> +
>   void mtree_info(fprintf_function mon_printf, void *f);
> 

Thanks for your suggestion, Jay!

This code just demonstrates how to enable this feature in QEMU; I will
carefully consider it and merge your suggestion when the formal patch
is posted.


[Qemu-devel] [PATCH 1/7] KVM: MMU: correct the behavior of mmu_spte_update_no_track
Posted by guangrong.xiao@gmail.com 6 years, 10 months ago
From: Xiao Guangrong <xiaoguangrong@tencent.com>

The current behavior of mmu_spte_update_no_track() does not match
its _no_track() name, as the A/D bits are actually tracked and
returned to the caller

This patch introduces a real _no_track() function that updates the
spte regardless of the A/D bits and renames the original function
to _track()

The _no_track() function will be used by later patches to update
upper-level sptes, which indeed do not need to care about the A/D bits

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 arch/x86/kvm/mmu.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5586765..ba8e7af 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -583,10 +583,29 @@ static void mmu_spte_set(u64 *sptep, u64 new_spte)
 }
 
 /*
- * Update the SPTE (excluding the PFN), but do not track changes in its
+ * Update the SPTE (excluding the PFN) regardless of accessed/dirty
+ * status which is used to update the upper level spte.
+ */
+static void mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
+{
+	u64 old_spte = *sptep;
+
+	WARN_ON(!is_shadow_present_pte(new_spte));
+
+	if (!is_shadow_present_pte(old_spte)) {
+		mmu_spte_set(sptep, new_spte);
+		return;
+	}
+
+	__update_clear_spte_fast(sptep, new_spte);
+}
+
+/*
+ * Update the SPTE (excluding the PFN), the original value is
+ * returned, based on it, the caller can track changes of its
  * accessed/dirty status.
  */
-static u64 mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
+static u64 mmu_spte_update_track(u64 *sptep, u64 new_spte)
 {
 	u64 old_spte = *sptep;
 
@@ -621,7 +640,7 @@ static u64 mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
 static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 {
 	bool flush = false;
-	u64 old_spte = mmu_spte_update_no_track(sptep, new_spte);
+	u64 old_spte = mmu_spte_update_track(sptep, new_spte);
 
 	if (!is_shadow_present_pte(old_spte))
 		return false;
-- 
2.9.3


[Qemu-devel] [PATCH 2/7] KVM: MMU: introduce possible_writable_spte_bitmap
Posted by guangrong.xiao@gmail.com 6 years, 10 months ago
From: Xiao Guangrong <xiaoguangrong@tencent.com>

possible_writable_spte_bitmap is used to track the possibly writable
sptes in a shadow page: a bit is set to 1 for each spte that is already
writable or can be locklessly made writable on the fast_page_fault
path. A counter of the number of possibly writable sptes is also
introduced to speed up walking the bitmap

A later patch gets its good performance by using this bitmap and
counter to quickly find the writable sptes and write protect them

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 arch/x86/include/asm/kvm_host.h |  6 ++++-
 arch/x86/kvm/mmu.c              | 53 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 84c8489..4872ae7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -114,6 +114,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
 #define KVM_MIN_ALLOC_MMU_PAGES 64
 #define KVM_MMU_HASH_SHIFT 12
 #define KVM_NUM_MMU_PAGES (1 << KVM_MMU_HASH_SHIFT)
+#define KVM_MMU_SP_ENTRY_NR 512
 #define KVM_MIN_FREE_MMU_PAGES 5
 #define KVM_REFILL_PAGES 25
 #define KVM_MAX_CPUID_ENTRIES 80
@@ -287,12 +288,15 @@ struct kvm_mmu_page {
 	bool unsync;
 	int root_count;          /* Currently serving as active root */
 	unsigned int unsync_children;
+	unsigned int possiable_writable_sptes;
 	struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
 
 	/* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen.  */
 	unsigned long mmu_valid_gen;
 
-	DECLARE_BITMAP(unsync_child_bitmap, 512);
+	DECLARE_BITMAP(unsync_child_bitmap, KVM_MMU_SP_ENTRY_NR);
+
+	DECLARE_BITMAP(possible_writable_spte_bitmap, KVM_MMU_SP_ENTRY_NR);
 
 #ifdef CONFIG_X86_32
 	/*
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ba8e7af..8a20e4f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -570,6 +570,49 @@ static bool is_dirty_spte(u64 spte)
 				 : spte & PT_WRITABLE_MASK;
 }
 
+static bool is_possible_writable_spte(u64 spte)
+{
+	if (!is_shadow_present_pte(spte))
+		return false;
+
+	if (is_writable_pte(spte))
+		return true;
+
+	if (spte_can_locklessly_be_made_writable(spte))
+		return true;
+
+	/*
+	 * although is_access_track_spte() sptes can be updated out of
+	 * mmu-lock, we need not take them into account as access_track
+	 * drops writable bit for them
+	 */
+	return false;
+}
+
+static void
+mmu_log_possible_writable_spte(u64 *sptep, u64 old_spte, u64 new_spte)
+{
+	struct kvm_mmu_page *sp = page_header(__pa(sptep));
+	bool old_state, new_state;
+
+	old_state = is_possible_writable_spte(old_spte);
+	new_state = is_possible_writable_spte(new_spte);
+
+	if (old_state == new_state)
+		return;
+
+	/* a possible writable spte is dropped */
+	if (old_state) {
+		sp->possiable_writable_sptes--;
+		__clear_bit(sptep - sp->spt, sp->possible_writable_spte_bitmap);
+		return;
+	}
+
+	/* a new possible writable spte is set */
+	sp->possiable_writable_sptes++;
+	__set_bit(sptep - sp->spt, sp->possible_writable_spte_bitmap);
+}
+
 /* Rules for using mmu_spte_set:
  * Set the sptep from nonpresent to present.
  * Note: the sptep being assigned *must* be either not present
@@ -580,6 +623,7 @@ static void mmu_spte_set(u64 *sptep, u64 new_spte)
 {
 	WARN_ON(is_shadow_present_pte(*sptep));
 	__set_spte(sptep, new_spte);
+	mmu_log_possible_writable_spte(sptep, 0ull, new_spte);
 }
 
 /*
@@ -598,6 +642,7 @@ static void mmu_spte_update_no_track(u64 *sptep, u64 new_spte)
 	}
 
 	__update_clear_spte_fast(sptep, new_spte);
+	mmu_log_possible_writable_spte(sptep, old_spte, new_spte);
 }
 
 /*
@@ -623,6 +668,7 @@ static u64 mmu_spte_update_track(u64 *sptep, u64 new_spte)
 
 	WARN_ON(spte_to_pfn(old_spte) != spte_to_pfn(new_spte));
 
+	mmu_log_possible_writable_spte(sptep, old_spte, new_spte);
 	return old_spte;
 }
 
@@ -688,6 +734,8 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
 	else
 		old_spte = __update_clear_spte_slow(sptep, 0ull);
 
+	mmu_log_possible_writable_spte(sptep, old_spte, 0ull);
+
 	if (!is_shadow_present_pte(old_spte))
 		return 0;
 
@@ -716,7 +764,10 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
  */
 static void mmu_spte_clear_no_track(u64 *sptep)
 {
+	u64 old_spte = *sptep;
+
 	__update_clear_spte_fast(sptep, 0ull);
+	mmu_log_possible_writable_spte(sptep, old_spte, 0ull);
 }
 
 static u64 mmu_spte_get_lockless(u64 *sptep)
@@ -1988,7 +2039,7 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
 {
 	int i, ret, nr_unsync_leaf = 0;
 
-	for_each_set_bit(i, sp->unsync_child_bitmap, 512) {
+	for_each_set_bit(i, sp->unsync_child_bitmap, KVM_MMU_SP_ENTRY_NR) {
 		struct kvm_mmu_page *child;
 		u64 ent = sp->spt[i];
 
-- 
2.9.3


[Qemu-devel] [PATCH 3/7] KVM: MMU: introduce kvm_mmu_write_protect_all_pages
Posted by guangrong.xiao@gmail.com 6 years, 10 months ago
From: Xiao Guangrong <xiaoguangrong@tencent.com>

The original idea is from Avi. kvm_mmu_write_protect_all_pages() is
an extremely fast way to write protect all guest memory. Compared with
the ordinary algorithm, which write protects last-level sptes one by one
based on the rmap, it simply updates the generation number to ask all
vCPUs to reload their root page tables; in particular, this can be done
outside of mmu-lock, so it does not hurt the vMMU's parallelism. It is
an O(1) algorithm that depends on neither the capacity of the guest's
memory nor the number of the guest's vCPUs

When reloading its root page table, the vCPU compares the root page
table's generation number with the current global number; if they do
not match, it makes all the entries in the page readonly and goes
directly back to the guest. Read accesses therefore continue smoothly
without KVM's involvement, while a write access triggers a page fault;
KVM then moves the write protection from the upper level to the lower
level page, by making all the entries in the lower page readonly first
and then making the upper level entry writable. This operation is
repeated until we reach the last spte

In order to speed up the process of making all entries readonly, we
introduce possible_writable_spte_bitmap, which indicates the writable
sptes, and possiable_writable_sptes, a counter of the number of writable
sptes; they work very efficiently, as in the worst case usually only one
entry in the PML4 (< 512G), a few entries in the PDPT (one entry covers
1G of memory), and a few PDEs and PTEs need to be write protected.
Note that the number of page faults and TLB flushes is the same as with
the ordinary algorithm. During our tests, for a VM with 3G of memory and
12 vCPUs, we benchmarked the performance of pure memory writes after
write protection and noticed only a 3% drop; we also benchmarked running
a pure memory-write test case in a newly booted VM (i.e., #PFs are
triggered to map memory) while live migration was going on, and noticed
that the dirty page ratio increased by ~50%, which means the memory
performance of the VM being migrated is hugely improved during live
migration

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 arch/x86/include/asm/kvm_host.h |  19 +++++
 arch/x86/kvm/mmu.c              | 176 ++++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu.h              |   1 +
 arch/x86/kvm/paging_tmpl.h      |  13 ++-
 4 files changed, 201 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4872ae7..663d88e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -294,6 +294,13 @@ struct kvm_mmu_page {
 	/* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen.  */
 	unsigned long mmu_valid_gen;
 
+	/*
+	 * The generation number of write protection for all guest memory
+	 * which is synced with kvm_arch.mmu_write_protect_all_indicator
+	 * whenever it is linked into upper entry.
+	 */
+	u64 mmu_write_protect_all_gen;
+
 	DECLARE_BITMAP(unsync_child_bitmap, KVM_MMU_SP_ENTRY_NR);
 
 	DECLARE_BITMAP(possible_writable_spte_bitmap, KVM_MMU_SP_ENTRY_NR);
@@ -743,6 +750,18 @@ struct kvm_arch {
 	unsigned int n_max_mmu_pages;
 	unsigned int indirect_shadow_pages;
 	unsigned long mmu_valid_gen;
+
+	/*
+	 * The indicator of write protection for all guest memory.
+	 *
+	 * The top bit indicates if the write-protect is enabled,
+	 * remaining bits are used as a generation number which is
+	 * increased whenever write-protect is enabled.
+	 *
+	 * The enable bit and generation number are squeezed into
+	 * a single u64 so that it can be accessed atomically.
+	 */
+	atomic64_t mmu_write_protect_all_indicator;
 	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
 	/*
 	 * Hash table of struct kvm_mmu_page.
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8a20e4f..ad6ee46 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -344,6 +344,34 @@ void kvm_mmu_clear_all_pte_masks(void)
 	shadow_present_mask = 0;
 	shadow_acc_track_mask = 0;
 }
+/* see the comments in struct kvm_arch. */
+#define WP_ALL_ENABLE_BIT	(63)
+#define WP_ALL_ENABLE_MASK	(1ull << WP_ALL_ENABLE_BIT)
+#define WP_ALL_GEN_MASK		(~0ull & ~WP_ALL_ENABLE_MASK)
+
+static bool is_write_protect_all_enabled(u64 indicator)
+{
+	return !!(indicator & WP_ALL_ENABLE_MASK);
+}
+
+static u64 get_write_protect_all_gen(u64 indicator)
+{
+	return indicator & WP_ALL_GEN_MASK;
+}
+
+static u64 get_write_protect_all_indicator(struct kvm *kvm)
+{
+	return atomic64_read(&kvm->arch.mmu_write_protect_all_indicator);
+}
+
+static void
+set_write_protect_all_indicator(struct kvm *kvm, bool enable, u64 generation)
+{
+	u64 value = (u64)(!!enable) << WP_ALL_ENABLE_BIT;
+
+	value |= generation & WP_ALL_GEN_MASK;
+	atomic64_set(&kvm->arch.mmu_write_protect_all_indicator, value);
+}
 
 static int is_cpuid_PSE36(void)
 {
@@ -2312,6 +2340,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 					     int direct,
 					     unsigned access)
 {
+	u64 write_protect_indicator;
 	union kvm_mmu_page_role role;
 	unsigned quadrant;
 	struct kvm_mmu_page *sp;
@@ -2386,6 +2415,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			flush |= kvm_sync_pages(vcpu, gfn, &invalid_list);
 	}
 	sp->mmu_valid_gen = vcpu->kvm->arch.mmu_valid_gen;
+	write_protect_indicator = get_write_protect_all_indicator(vcpu->kvm);
+	sp->mmu_write_protect_all_gen =
+			get_write_protect_all_gen(write_protect_indicator);
 	clear_page(sp->spt);
 	trace_kvm_mmu_get_page(sp, true);
 
@@ -2948,6 +2980,70 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
 	__direct_pte_prefetch(vcpu, sp, sptep);
 }
 
+static bool mmu_load_shadow_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+	unsigned int offset;
+	u64 wp_all_indicator = get_write_protect_all_indicator(kvm);
+	u64 kvm_wp_all_gen = get_write_protect_all_gen(wp_all_indicator);
+	bool flush = false;
+
+	if (!is_write_protect_all_enabled(wp_all_indicator))
+		return false;
+
+	if (sp->mmu_write_protect_all_gen == kvm_wp_all_gen)
+		return false;
+
+	if (!sp->possiable_writable_sptes)
+		return false;
+
+	for_each_set_bit(offset, sp->possible_writable_spte_bitmap,
+	      KVM_MMU_SP_ENTRY_NR) {
+		u64 *sptep = sp->spt + offset, spte = *sptep;
+
+		if (!sp->possiable_writable_sptes)
+			break;
+
+		if (is_last_spte(spte, sp->role.level)) {
+			flush |= spte_write_protect(sptep, false);
+			continue;
+		}
+
+		mmu_spte_update_no_track(sptep, spte & ~PT_WRITABLE_MASK);
+		flush = true;
+	}
+
+	sp->mmu_write_protect_all_gen = kvm_wp_all_gen;
+	return flush;
+}
+
+static bool
+handle_readonly_upper_spte(struct kvm *kvm, u64 *sptep, int write_fault)
+{
+	u64 spte = *sptep;
+	struct kvm_mmu_page *child = page_header(spte & PT64_BASE_ADDR_MASK);
+	bool flush;
+
+	/*
+	 * delay the spte update to the point when write permission is
+	 * really needed.
+	 */
+	if (!write_fault)
+		return false;
+
+	/*
+	 * if it is already writable, that means the write-protection has
+	 * been moved to lower level.
+	 */
+	if (is_writable_pte(spte))
+		return false;
+
+	flush = mmu_load_shadow_page(kvm, child);
+
+	/* needn't flush tlb if the spte is changed from RO to RW. */
+	mmu_spte_update_no_track(sptep, spte | PT_WRITABLE_MASK);
+	return flush;
+}
+
 static int __direct_map(struct kvm_vcpu *vcpu, int write, int map_writable,
 			int level, gfn_t gfn, kvm_pfn_t pfn, bool prefault)
 {
@@ -2955,6 +3051,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, int write, int map_writable,
 	struct kvm_mmu_page *sp;
 	int emulate = 0;
 	gfn_t pseudo_gfn;
+	bool flush = false;
 
 	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
 		return 0;
@@ -2977,10 +3074,19 @@ static int __direct_map(struct kvm_vcpu *vcpu, int write, int map_writable,
 			pseudo_gfn = base_addr >> PAGE_SHIFT;
 			sp = kvm_mmu_get_page(vcpu, pseudo_gfn, iterator.addr,
 					      iterator.level - 1, 1, ACC_ALL);
+			if (write)
+				flush |= mmu_load_shadow_page(vcpu->kvm, sp);
 
 			link_shadow_page(vcpu, iterator.sptep, sp);
+			continue;
 		}
+
+		flush |= handle_readonly_upper_spte(vcpu->kvm, iterator.sptep,
+						    write);
 	}
+
+	if (flush)
+		kvm_flush_remote_tlbs(vcpu->kvm);
 	return emulate;
 }
 
@@ -3182,11 +3288,20 @@ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level,
 	do {
 		u64 new_spte;
 
-		for_each_shadow_entry_lockless(vcpu, gva, iterator, spte)
+		for_each_shadow_entry_lockless(vcpu, gva, iterator, spte) {
 			if (!is_shadow_present_pte(spte) ||
 			    iterator.level < level)
 				break;
 
+			/*
+			 * the fast path can not fix the upper spte which
+			 * is readonly.
+			 */
+			if ((error_code & PFERR_WRITE_MASK) &&
+			      !is_writable_pte(spte))
+				break;
+		}
+
 		sp = page_header(__pa(iterator.sptep));
 		if (!is_last_spte(spte, sp->role.level))
 			break;
@@ -3390,23 +3505,32 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 		spin_lock(&vcpu->kvm->mmu_lock);
 		make_mmu_pages_available(vcpu);
 		sp = kvm_mmu_get_page(vcpu, 0, 0, PT64_ROOT_LEVEL, 1, ACC_ALL);
+		if (mmu_load_shadow_page(vcpu->kvm, sp))
+			kvm_flush_remote_tlbs(vcpu->kvm);
+
 		++sp->root_count;
 		spin_unlock(&vcpu->kvm->mmu_lock);
 		vcpu->arch.mmu.root_hpa = __pa(sp->spt);
 	} else if (vcpu->arch.mmu.shadow_root_level == PT32E_ROOT_LEVEL) {
+		bool flush = false;
+
+		spin_lock(&vcpu->kvm->mmu_lock);
 		for (i = 0; i < 4; ++i) {
 			hpa_t root = vcpu->arch.mmu.pae_root[i];
 
 			MMU_WARN_ON(VALID_PAGE(root));
-			spin_lock(&vcpu->kvm->mmu_lock);
 			make_mmu_pages_available(vcpu);
 			sp = kvm_mmu_get_page(vcpu, i << (30 - PAGE_SHIFT),
 					i << 30, PT32_ROOT_LEVEL, 1, ACC_ALL);
+			flush |= mmu_load_shadow_page(vcpu->kvm, sp);
 			root = __pa(sp->spt);
 			++sp->root_count;
-			spin_unlock(&vcpu->kvm->mmu_lock);
 			vcpu->arch.mmu.pae_root[i] = root | PT_PRESENT_MASK;
 		}
+
+		if (flush)
+			kvm_flush_remote_tlbs(vcpu->kvm);
+		spin_unlock(&vcpu->kvm->mmu_lock);
 		vcpu->arch.mmu.root_hpa = __pa(vcpu->arch.mmu.pae_root);
 	} else
 		BUG();
@@ -3420,6 +3544,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	u64 pdptr, pm_mask;
 	gfn_t root_gfn;
 	int i;
+	bool flush = false;
 
 	root_gfn = vcpu->arch.mmu.get_cr3(vcpu) >> PAGE_SHIFT;
 
@@ -3439,6 +3564,9 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 		make_mmu_pages_available(vcpu);
 		sp = kvm_mmu_get_page(vcpu, root_gfn, 0, PT64_ROOT_LEVEL,
 				      0, ACC_ALL);
+		if (mmu_load_shadow_page(vcpu->kvm, sp))
+			kvm_flush_remote_tlbs(vcpu->kvm);
+
 		root = __pa(sp->spt);
 		++sp->root_count;
 		spin_unlock(&vcpu->kvm->mmu_lock);
@@ -3455,6 +3583,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 	if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_LEVEL)
 		pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
 
+	spin_lock(&vcpu->kvm->mmu_lock);
 	for (i = 0; i < 4; ++i) {
 		hpa_t root = vcpu->arch.mmu.pae_root[i];
 
@@ -3466,19 +3595,25 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 				continue;
 			}
 			root_gfn = pdptr >> PAGE_SHIFT;
-			if (mmu_check_root(vcpu, root_gfn))
+			if (mmu_check_root(vcpu, root_gfn)) {
+				if (flush)
+					kvm_flush_remote_tlbs(vcpu->kvm);
+				spin_unlock(&vcpu->kvm->mmu_lock);
 				return 1;
+			}
 		}
-		spin_lock(&vcpu->kvm->mmu_lock);
 		make_mmu_pages_available(vcpu);
 		sp = kvm_mmu_get_page(vcpu, root_gfn, i << 30, PT32_ROOT_LEVEL,
 				      0, ACC_ALL);
+		flush |= mmu_load_shadow_page(vcpu->kvm, sp);
 		root = __pa(sp->spt);
 		++sp->root_count;
-		spin_unlock(&vcpu->kvm->mmu_lock);
-
 		vcpu->arch.mmu.pae_root[i] = root | pm_mask;
 	}
+
+	if (flush)
+		kvm_flush_remote_tlbs(vcpu->kvm);
+	spin_unlock(&vcpu->kvm->mmu_lock);
 	vcpu->arch.mmu.root_hpa = __pa(vcpu->arch.mmu.pae_root);
 
 	/*
@@ -5254,6 +5389,33 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, struct kvm_memslots *slots)
 	}
 }
 
+void kvm_mmu_write_protect_all_pages(struct kvm *kvm, bool write_protect)
+{
+	u64 wp_all_indicator, kvm_wp_all_gen;
+
+	mutex_lock(&kvm->slots_lock);
+	wp_all_indicator = get_write_protect_all_indicator(kvm);
+	kvm_wp_all_gen = get_write_protect_all_gen(wp_all_indicator);
+
+	/*
+	 * whenever it is enabled, we increase the generation to
+	 * update shadow pages.
+	 */
+	if (write_protect)
+		kvm_wp_all_gen++;
+
+	set_write_protect_all_indicator(kvm, write_protect, kvm_wp_all_gen);
+
+	/*
+	 * if it is enabled, we need to sync the root page tables
+	 * immediately, otherwise, the write protection is dropped
+	 * on demand, i.e, when page fault is triggered.
+	 */
+	if (write_protect)
+		kvm_reload_remote_mmus(kvm);
+	mutex_unlock(&kvm->slots_lock);
+}
+
 static unsigned long
 mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index d8ccb32..5a398aa 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -202,4 +202,5 @@ void kvm_mmu_gfn_disallow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
 void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 				    struct kvm_memory_slot *slot, u64 gfn);
+void kvm_mmu_write_protect_all_pages(struct kvm *kvm, bool write_protect);
 #endif
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 314d207..8bac8e9 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -582,6 +582,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 	struct kvm_shadow_walk_iterator it;
 	unsigned direct_access, access = gw->pt_access;
 	int top_level, emulate;
+	bool flush = false;
 
 	direct_access = gw->pte_access;
 
@@ -613,6 +614,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 			table_gfn = gw->table_gfn[it.level - 2];
 			sp = kvm_mmu_get_page(vcpu, table_gfn, addr, it.level-1,
 					      false, access);
+			if (write_fault)
+				flush |= mmu_load_shadow_page(vcpu->kvm, sp);
 		}
 
 		/*
@@ -624,6 +627,9 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 
 		if (sp)
 			link_shadow_page(vcpu, it.sptep, sp);
+		else
+			flush |= handle_readonly_upper_spte(vcpu->kvm, it.sptep,
+							    write_fault);
 	}
 
 	for (;
@@ -636,13 +642,18 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 
 		drop_large_spte(vcpu, it.sptep);
 
-		if (is_shadow_present_pte(*it.sptep))
+		if (is_shadow_present_pte(*it.sptep)) {
+			flush |= handle_readonly_upper_spte(vcpu->kvm,
+							it.sptep, write_fault);
 			continue;
+		}
 
 		direct_gfn = gw->gfn & ~(KVM_PAGES_PER_HPAGE(it.level) - 1);
 
 		sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1,
 				      true, direct_access);
+		if (write_fault)
+			flush |= mmu_load_shadow_page(vcpu->kvm, sp);
 		link_shadow_page(vcpu, it.sptep, sp);
 	}
 
-- 
2.9.3


[Qemu-devel] [PATCH 4/7] KVM: MMU: enable KVM_WRITE_PROTECT_ALL_MEM
Posted by guangrong.xiao@gmail.com 6 years, 10 months ago
From: Xiao Guangrong <xiaoguangrong@tencent.com>

The functionality of write protection for all guest memory is ready;
it is time to make it usable by userspace, which is indicated by
KVM_CAP_X86_WRITE_PROTECT_ALL_MEM

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 arch/x86/kvm/x86.c       | 6 ++++++
 include/uapi/linux/kvm.h | 2 ++
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index be2ade5..dcbeaf4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2669,6 +2669,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_SET_BOOT_CPU_ID:
  	case KVM_CAP_SPLIT_IRQCHIP:
 	case KVM_CAP_IMMEDIATE_EXIT:
+	case KVM_CAP_X86_WRITE_PROTECT_ALL_MEM:
 		r = 1;
 		break;
 	case KVM_CAP_ADJUST_CLOCK:
@@ -4204,6 +4205,11 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		r = 0;
 		break;
 	}
+	case KVM_WRITE_PROTECT_ALL_MEM: {
+		kvm_mmu_write_protect_all_pages(kvm, !!arg);
+		r = 0;
+		break;
+	}
 	case KVM_ENABLE_CAP: {
 		struct kvm_enable_cap cap;
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 577429a..7d4a395 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -895,6 +895,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_SPAPR_TCE_VFIO 142
 #define KVM_CAP_X86_GUEST_MWAIT 143
 #define KVM_CAP_ARM_USER_IRQ 144
+#define KVM_CAP_X86_WRITE_PROTECT_ALL_MEM 145
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1144,6 +1145,7 @@ struct kvm_vfio_spapr_tce {
 					struct kvm_userspace_memory_region)
 #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
 #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
+#define KVM_WRITE_PROTECT_ALL_MEM _IO(KVMIO,  0x49)
 
 /* enable ucontrol for s390 */
 struct kvm_s390_ucas_mapping {
-- 
2.9.3


[Qemu-devel] [PATCH 5/7] KVM: MMU: allow dirty log without write protect
Posted by guangrong.xiao@gmail.com 6 years, 10 months ago
From: Xiao Guangrong <xiaoguangrong@tencent.com>

A new flag, KVM_DIRTY_LOG_WITHOUT_WRITE_PROTECT, is introduced; it
indicates that userspace just wants to get a snapshot of the dirty
bitmap

During live migration, after all the dirty bitmap snapshots have been
fetched from KVM, the guest memory can be write protected by calling
KVM_WRITE_PROTECT_ALL_MEM

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 arch/x86/kvm/x86.c       |  1 +
 include/uapi/linux/kvm.h |  6 +++++-
 virt/kvm/kvm_main.c      | 15 ++++++++++++---
 3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dcbeaf4..524c96b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2670,6 +2670,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
  	case KVM_CAP_SPLIT_IRQCHIP:
 	case KVM_CAP_IMMEDIATE_EXIT:
 	case KVM_CAP_X86_WRITE_PROTECT_ALL_MEM:
+	case KVM_CAP_X86_DIRTY_LOG_WITHOUT_WRITE_PROTECT:
 		r = 1;
 		break;
 	case KVM_CAP_ADJUST_CLOCK:
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 7d4a395..e0f348c 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -443,9 +443,12 @@ struct kvm_interrupt {
 };
 
 /* for KVM_GET_DIRTY_LOG */
+
+#define KVM_DIRTY_LOG_WITHOUT_WRITE_PROTECT	0x1
+
 struct kvm_dirty_log {
 	__u32 slot;
-	__u32 padding1;
+	__u32 flags;
 	union {
 		void __user *dirty_bitmap; /* one bit per page */
 		__u64 padding2;
@@ -896,6 +899,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_X86_GUEST_MWAIT 143
 #define KVM_CAP_ARM_USER_IRQ 144
 #define KVM_CAP_X86_WRITE_PROTECT_ALL_MEM 145
+#define KVM_CAP_X86_DIRTY_LOG_WITHOUT_WRITE_PROTECT 146
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 035bc51..c82e4d1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1169,6 +1169,12 @@ int kvm_get_dirty_log_protect(struct kvm *kvm,
 	unsigned long n;
 	unsigned long *dirty_bitmap;
 	unsigned long *dirty_bitmap_buffer;
+	bool write_protect;
+
+	if (log->flags & ~KVM_DIRTY_LOG_WITHOUT_WRITE_PROTECT)
+		return -EINVAL;
+
+	write_protect = !(log->flags & KVM_DIRTY_LOG_WITHOUT_WRITE_PROTECT);
 
 	as_id = log->slot >> 16;
 	id = (u16)log->slot;
@@ -1196,11 +1202,14 @@ int kvm_get_dirty_log_protect(struct kvm *kvm,
 		if (!dirty_bitmap[i])
 			continue;
 
-		*is_dirty = true;
-
 		mask = xchg(&dirty_bitmap[i], 0);
 		dirty_bitmap_buffer[i] = mask;
 
+		if (!write_protect)
+			continue;
+
+		*is_dirty = true;
+
 		if (mask) {
 			offset = i * BITS_PER_LONG;
 			kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
@@ -3155,7 +3164,7 @@ static long kvm_vm_compat_ioctl(struct file *filp,
 				   sizeof(compat_log)))
 			return -EFAULT;
 		log.slot	 = compat_log.slot;
-		log.padding1	 = compat_log.padding1;
+		log.flags	 = compat_log.padding1;
 		log.padding2	 = compat_log.padding2;
 		log.dirty_bitmap = compat_ptr(compat_log.dirty_bitmap);
 
-- 
2.9.3


[Qemu-devel] [PATCH 6/7] KVM: MMU: clarify fast_pf_fix_direct_spte
Posted by guangrong.xiao@gmail.com 6 years, 10 months ago
From: Xiao Guangrong <xiaoguangrong@tencent.com>

A writable spte cannot be locklessly fixed, so add a WARN_ON() to
trigger a warning if something unexpected happens; this is useful for
tracking whether the logging of a writable spte is missed on the
fast path

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 arch/x86/kvm/mmu.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ad6ee46..f6a74e7 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3225,6 +3225,15 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	WARN_ON(!sp->role.direct);
 
 	/*
+	 * the original spte can not be writable as only the spte which
+	 * fulfills is_access_track_spte() or
+	 * spte_can_locklessly_be_made_writable() can be locklessly fixed,
+	 * for the former, the W bit is always cleared, for the latter,
+	 * there is nothing to do if it is already writable.
+	 */
+	WARN_ON(is_writable_pte(old_spte));
+
+	/*
 	 * Theoretically we could also set dirty bit (and flush TLB) here in
 	 * order to eliminate unnecessary PML logging. See comments in
 	 * set_spte. But fast_page_fault is very unlikely to happen with PML
@@ -3239,7 +3248,7 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	if (cmpxchg64(sptep, old_spte, new_spte) != old_spte)
 		return false;
 
-	if (is_writable_pte(new_spte) && !is_writable_pte(old_spte)) {
+	if (is_writable_pte(new_spte)) {
 		/*
 		 * The gfn of direct spte is stable since it is
 		 * calculated by sp->gfn.
-- 
2.9.3


[Qemu-devel] [PATCH 7/7] KVM: MMU: stop using mmu_spte_get_lockless under mmu-lock
Posted by guangrong.xiao@gmail.com 6 years, 10 months ago
From: Xiao Guangrong <xiaoguangrong@tencent.com>

mmu_spte_age() is called under the protection of mmu-lock, so there is
no reason to use mmu_spte_get_lockless()

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 arch/x86/kvm/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f6a74e7..a8b91ee 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -852,7 +852,7 @@ static u64 restore_acc_track_spte(u64 spte)
 /* Returns the Accessed status of the PTE and resets it at the same time. */
 static bool mmu_spte_age(u64 *sptep)
 {
-	u64 spte = mmu_spte_get_lockless(sptep);
+	u64 spte = *sptep;
 
 	if (!is_accessed_spte(spte))
 		return false;
-- 
2.9.3