From nobody Sun Jun 21 10:07:14 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E2A9CC433EF for ; Tue, 29 Mar 2022 15:35:39 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238830AbiC2PhU (ORCPT ); Tue, 29 Mar 2022 11:37:20 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58160 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238809AbiC2PhN (ORCPT ); Tue, 29 Mar 2022 11:37:13 -0400 Received: from mail-pf1-x42b.google.com (mail-pf1-x42b.google.com [IPv6:2607:f8b0:4864:20::42b]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EB73824F2A1; Tue, 29 Mar 2022 08:35:29 -0700 (PDT) Received: by mail-pf1-x42b.google.com with SMTP id t2so16203269pfj.10; Tue, 29 Mar 2022 08:35:29 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=ZaW1ShDM/2OnGyJMP0bodXnVQNIFn7WzT1GlIkRksjI=; b=aaN6MeHg94Te4I7Cta6tWpMAP48QzzbdXtyKtQcmJhXBSppoWEBehBQxN3CiutZyKk 25MgVQg7jcZIGeWDsuiou57WsZGo8lRLVLLkRyrCHHuvcuCiepsoEYwuDNRQ8oXhJJTP LcvswQ1ojfMcbVzgX939FhqAfnXo3D92dJWvyOtypFfUET7zNMKep2y1NsHeZhoJLCag E9OG7aQJSwXBM3d0JAU7BRIjcVYCOMR6Gfp9ipt2Gq/YfJIG/y1Piyn93joxp+f0vnm/ 6H+mih7UgzIL2lbVci6bgxGto6K61ZF2QU0BdeTMnwQlycKznLX+NZr0zdABGXe0IaP8 auvw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=ZaW1ShDM/2OnGyJMP0bodXnVQNIFn7WzT1GlIkRksjI=; b=vz1qzv/RATkdG2I9ViIUz/uoBv2MjEVWG/saqym+Nn/JOvSoGvTsPCw8h0gmSPPNYU oiulyr817ybQN1KNIyGm7YA+XFkS3oxnujUOHTaGwmmdtryhHdLR4jOsgWOgb4FKmEax 
WB7HcpWCbB20J69Z2aMqA2B4sW/Dfm83D920x0H3Tg2SA4U2FWwp2LobDRWFEdki5oga 4oAT8l4FZ0stGxiPmhH1mziOxGeueTPv1200EopYXHyYh1bne4rWJepLC0EA6MrweSsy 0NmFRHd3DxarXWVpPNUKnkmY0phZz3WK9ZPrWr0ofs6ztoKZ7+QimSih3YDtdhqxTJeM 1f0g== X-Gm-Message-State: AOAM533iv7zoGuHl9pl4YT9oewjGaONZ5pGF+O15wxV0ljtpHMg7vTmv mHxrOt1ajHV1LtLVjOyUU9bVi6D/h6o= X-Google-Smtp-Source: ABdhPJzMZGcLUB+u0PAc7F/aUR6QHUYJvodEUQrvUOixwzoxts5ogPz/PzbJAPFg4ZixNSganJluWw== X-Received: by 2002:a63:7b4a:0:b0:398:1337:e304 with SMTP id k10-20020a637b4a000000b003981337e304mr2407519pgn.371.1648568129045; Tue, 29 Mar 2022 08:35:29 -0700 (PDT) Received: from localhost ([47.251.3.230]) by smtp.gmail.com with ESMTPSA id s4-20020a056a00194400b004fb358ffe84sm12474241pfk.104.2022.03.29.08.35.27 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Tue, 29 Mar 2022 08:35:28 -0700 (PDT) From: Lai Jiangshan To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, Paolo Bonzini , Sean Christopherson Cc: Lai Jiangshan , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H. Peter Anvin" Subject: [RFC PATCH V2 1/4] KVM: X86: Add arguement gfn and role to kvm_mmu_alloc_page() Date: Tue, 29 Mar 2022 23:36:01 +0800 Message-Id: <20220329153604.507475-2-jiangshanlai@gmail.com> X-Mailer: git-send-email 2.19.1.6.gb485710b In-Reply-To: <20220329153604.507475-1-jiangshanlai@gmail.com> References: <20220329153604.507475-1-jiangshanlai@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8"

From: Lai Jiangshan

kvm_mmu_alloc_page() will need to access more bits of the role.

Signed-off-by: Lai Jiangshan --- arch/x86/kvm/mmu/mmu.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index a7cb877f3784..8449ae089593 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -1706,13 +1706,14 @@ static void drop_parent_pte(struct kvm_mmu_page *sp, mmu_spte_clear_no_track(parent_pte); } =20 -static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int = direct) +static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, gfn_= t gfn, + union kvm_mmu_page_role role) { struct kvm_mmu_page *sp; =20 sp =3D kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache); sp->spt =3D kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache); - if (!direct) + if (!role.direct) sp->gfns =3D kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache); set_page_private(virt_to_page(sp->spt), (unsigned long)sp); =20 @@ -1724,6 +1725,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct= kvm_vcpu *vcpu, int direct sp->mmu_valid_gen =3D vcpu->kvm->arch.mmu_valid_gen; list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages); kvm_mod_used_mmu_pages(vcpu->kvm, +1); + sp->gfn =3D gfn; + sp->role =3D role; return sp; } =20 @@ -2107,10 +2110,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct = kvm_vcpu *vcpu, =20 ++vcpu->kvm->stat.mmu_cache_miss; =20 - sp =3D kvm_mmu_alloc_page(vcpu, direct); - - sp->gfn =3D gfn; - sp->role =3D role; + sp =3D kvm_mmu_alloc_page(vcpu, gfn, role); hlist_add_head(&sp->hash_link, sp_list); if (!direct) { account_shadowed(vcpu->kvm, sp); --=20 2.19.1.6.gb485710b From nobody Sun Jun 21 10:07:14 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C84FFC4332F for ; Tue, 29 Mar 2022 15:35:43 +0000 (UTC) Received: (majordomo@vger.kernel.org) 
by vger.kernel.org via listexpand id S238859AbiC2PhZ (ORCPT ); Tue, 29 Mar 2022 11:37:25 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58828 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238854AbiC2PhV (ORCPT ); Tue, 29 Mar 2022 11:37:21 -0400 Received: from mail-pj1-x1033.google.com (mail-pj1-x1033.google.com [IPv6:2607:f8b0:4864:20::1033]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8D29C25514C; Tue, 29 Mar 2022 08:35:36 -0700 (PDT) Received: by mail-pj1-x1033.google.com with SMTP id v4so17858278pjh.2; Tue, 29 Mar 2022 08:35:36 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=YJoiV/BDnKM8S0dlkNbqY02WyI79kKm9a6oF1D+VZjQ=; b=UrCaRCfqodS38TgZBTOsVU5UhhrzbNBrSxUDrKKDh7JXN/xiZhruQzxh2syrGKHg5Q c52Uk7aWKy9IK9KApxF7iqD2M14Sy4XrFuSUlen3UTtgIopEJAaGeUghoO++Es2mFFYQ qdUM0LX+bSqi9nKx2IT1Bt3Xqtyp/9xwjZhDPX0f3E3jWAPwGU6oLCPlKWsnGjLuZN1e fiWhQVRn11x7D34YyyvtrCEuvHAazd6zUtx+5TK/TrE63YiTDdEUXmpWVpA8tXm+ywEo G++rZZ0JvX0tTGwy2OtmvbUBc1djwRMsPDJDV0TdltXqf0duaFGFF76jgRq/Lzv/3d91 KuzQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=YJoiV/BDnKM8S0dlkNbqY02WyI79kKm9a6oF1D+VZjQ=; b=T6q1uCsrb6ZVJlJ3pDsu4zmkzxZ8LMZQ0NRhOQkGtat0gKvxS0j93+IGf53DRHwucL 4+jJqJpi2Cw1ZXxL+STRGNSy0f3GK+TcsB7zyjuAmkUdJbiWB7mOWueQU5igCwFOMaNe 6XJnu6L7ESpJsBriEQtug9cFoKwv5GCOfbp/TZsRCirqkqdqr8b/w/WzSdSIOPE47d0z w2P5TeVGWKCdXFhqcaVqIe+25jrXZChxilzzdaowU53rHfvYs+ayFBBTaFo+HTySctH2 GPgDzwSr1H12aTPKq3EPop1JqmowJZjVECjn7AYH/ITIex2hndWfA1uWAK9Co9t3eYZt XNgw== X-Gm-Message-State: AOAM531+hiLVEqFlCLqCqA/1JBomoBW8CXgMGJwsmzESiBEplOL/VY37 BToXy9jQ5oSJK+fQ1dBGoBsjjdTziR0= X-Google-Smtp-Source: 
ABdhPJx9GgUl+CMMmUAJQRuIAr/zeL3tXlL0PgMjjmyjpzecsTU9ObGOoLFvD5kkDaSUCi6zoU6BoQ== X-Received: by 2002:a17:902:e40a:b0:155:d894:79a3 with SMTP id m10-20020a170902e40a00b00155d89479a3mr21715995ple.73.1648568135557; Tue, 29 Mar 2022 08:35:35 -0700 (PDT) Received: from localhost ([47.251.3.230]) by smtp.gmail.com with ESMTPSA id s10-20020a63a30a000000b003987eaef296sm2618230pge.44.2022.03.29.08.35.34 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Tue, 29 Mar 2022 08:35:35 -0700 (PDT) From: Lai Jiangshan To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, Paolo Bonzini , Sean Christopherson Cc: Lai Jiangshan , Jonathan Corbet , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H. Peter Anvin" , linux-doc@vger.kernel.org Subject: [RFC PATCH V2 2/4] KVM: X86: Introduce role.passthrough for level expanded pagetable Date: Tue, 29 Mar 2022 23:36:02 +0800 Message-Id: <20220329153604.507475-3-jiangshanlai@gmail.com> X-Mailer: git-send-email 2.19.1.6.gb485710b In-Reply-To: <20220329153604.507475-1-jiangshanlai@gmail.com> References: <20220329153604.507475-1-jiangshanlai@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8"

From: Lai Jiangshan

Level expansion occurs when mmu->shadow_root_level > mmu->root_level.

There are several cases that can cause level expansion:

shadow mmu (shadow paging for 32 bit guest):
	case1:	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=0

shadow nested NPT (for 32bit L1 hypervisor):
	case2:	gCR0_PG=1,gEFER_LMA=0,gCR4_PSE=0,hEFER_LMA=0
	case3:	gCR0_PG=1,gEFER_LMA=0,hEFER_LMA=1

shadow nested NPT (for 64bit L1 hypervisor):
	case4:	gEFER_LMA=1,gCR4_LA57=0,hEFER_LMA=1,hCR4_LA57=1

When level expansion occurs for a 32bit guest (case1-3), special roots are often used. But case4 does not use special roots.
It uses an ordinary shadow page without being fully aware of this special situation. It happens to work, by accident:

1) The root page (root_sp->spt) is allocated with level = 5, and root_sp->spt[0] is allocated with the same gfn and the same role except role.level = 4. Luckily, they are different shadow pages.
2) FNAME(walk_addr_generic) sets walker->table_gfn[4] and walker->pt_access[4], which are normally unused when mmu->shadow_root_level == mmu->root_level == 4, so that FNAME(fetch) can use them to allocate the shadow page for root_sp->spt[0] and link them when shadow_root_level == 5.

But it has problems. If the guest switches from gCR4_LA57=0 to gCR4_LA57=1 (or vice versa) and uses the same gfn as the root for the nNPT before and after switching gCR4_LA57, the host (hCR4_LA57=1) would use the same root_sp for the guest even though the guest has switched gCR4_LA57. The guest will see unexpected pages mapped, and L2 can hurt L1. Luckily, the problem can't hurt L0.

The root_sp should sometimes behave like role.direct=1: its contents are not backed by gptes, and root_sp->gfns is meaningless. For a normal high-level sp, sp->gfns is often unused and kept zero, but it can be relevant and meaningful when it is used, because its entries are backed by concrete gptes. For the expanded root_sp described above, root_sp is just a portal that contributes root_sp->spt[0]; root_sp should not have root_sp->gfns, and root_sp->spt[0] should not be dropped if gpte[0] of the root gfn is changed.

This patch adds role.passthrough to address the two problems. role.passthrough is set for an expanded shadow pagetable: role.level > gMMU.level.

An alternative way to fix the problem of case4 is to also use the special root pml5_root for it. But that would require changing many other places, because the code assumes that special roots are only used for 32bit guests.

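The effect of the new role bit can be illustrated with a small standalone sketch. It is not the kernel code: union sketch_role, struct sketch_sp and both helpers are hypothetical, trimmed-down stand-ins, and the 9-bit shift in the direct case is assumed from 64-bit paging. It only shows that an expanded root (shadow level 5 over guest level 4) and a non-expanded root get different role words, and that a passthrough page reports its own gfn for every index.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for union kvm_mmu_page_role. */
union sketch_role {
	uint32_t word;
	struct {
		unsigned level:4;
		unsigned direct:1;
		unsigned passthrough:1;
	};
};

/* Hypothetical stand-in for struct kvm_mmu_page. */
struct sketch_sp {
	union sketch_role role;
	uint64_t gfn;
	uint64_t *gfns;	/* only meaningful when !direct && !passthrough */
};

/* Root role for shadow nested NPT: passthrough iff the level is expanded. */
static union sketch_role sketch_npt_root_role(int shadow_root_level,
					      int guest_root_level)
{
	union sketch_role role = { .word = 0 };

	assert(shadow_root_level >= guest_root_level);
	role.level = shadow_root_level;
	role.passthrough = shadow_root_level > guest_root_level;
	return role;
}

/* A passthrough page is not backed by gptes, so every index maps to sp->gfn. */
static uint64_t sketch_get_gfn(const struct sketch_sp *sp, int index)
{
	if (sp->role.passthrough)
		return sp->gfn;
	if (!sp->role.direct)
		return sp->gfns[index];
	/* Assumption: 9 index bits per level, as in 64-bit paging. */
	return sp->gfn + ((uint64_t)index << ((sp->role.level - 1) * 9));
}
```

Because the role word takes part in the shadow page lookup, the two roots that previously collided on the same gfn no longer match each other in this model.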
This patch also paves the way to use passthrough shadow pages for case1-3, but that requires special handling of PAE paging, so the extensive usage of it is left to later patches.

Signed-off-by: Lai Jiangshan
---
 Documentation/virt/kvm/mmu.rst  |  5 +++++
 arch/x86/include/asm/kvm_host.h |  5 +++--
 arch/x86/kvm/mmu/mmu.c          | 19 ++++++++++++++++---
 arch/x86/kvm/mmu/paging_tmpl.h  |  1 +
 4 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/Documentation/virt/kvm/mmu.rst b/Documentation/virt/kvm/mmu.rst
index 5b1ebad24c77..60c4057ef625 100644
--- a/Documentation/virt/kvm/mmu.rst
+++ b/Documentation/virt/kvm/mmu.rst
@@ -202,6 +202,11 @@ Shadow pages contain the following information:
     Is 1 if the MMU instance cannot use A/D bits.  EPT did not have A/D
     bits before Haswell; shadow EPT page tables also cannot use A/D bits
     if the L1 hypervisor does not enable them.
+  role.passthrough:
+    Is 1 if role.level > guest paging level when shadow paging level is
+    larger than guest paging level; passthrough shadow page tables must
+    be created on the top.  Like when role.has_4_byte_gpte or shadow NPT
+    for 32 bit L1 or 5-level shadow NPT for 4-level NPT L1.
   gfn:
     Either the guest page table containing the translations shadowed by this
     page, or the base page frame for linear translations.  See role.direct.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9694dd5e6ccc..1e6bf563b939 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -314,7 +314,7 @@ struct kvm_kernel_irq_routing_entry;
  *     cr0_wp=0, therefore these three bits only give rise to 5 possibilities.
  *
  * Therefore, the maximum number of possible upper-level shadow pages for a
- * single gfn is a bit less than 2^13.
+ * single gfn is a bit less than 2^14.
*/ union kvm_mmu_page_role { u32 word; @@ -331,7 +331,8 @@ union kvm_mmu_page_role { unsigned smap_andnot_wp:1; unsigned ad_disabled:1; unsigned guest_mode:1; - unsigned :6; + unsigned passthrough:1; + unsigned :5; =20 /* * This is left at the top of the word so that diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 8449ae089593..54c7db7c9608 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -737,6 +737,9 @@ static void mmu_free_pte_list_desc(struct pte_list_desc= *pte_list_desc) =20 static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index) { + if (sp->role.passthrough) + return sp->gfn; + if (!sp->role.direct) return sp->gfns[index]; =20 @@ -745,6 +748,11 @@ static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page = *sp, int index) =20 static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t= gfn) { + if (sp->role.passthrough) { + WARN_ON_ONCE(gfn !=3D sp->gfn); + return; + } + if (!sp->role.direct) { sp->gfns[index] =3D gfn; return; @@ -1674,8 +1682,7 @@ static void kvm_mmu_free_page(struct kvm_mmu_page *sp) hlist_del(&sp->hash_link); list_del(&sp->link); free_page((unsigned long)sp->spt); - if (!sp->role.direct) - free_page((unsigned long)sp->gfns); + free_page((unsigned long)sp->gfns); kmem_cache_free(mmu_page_header_cache, sp); } =20 @@ -1713,7 +1720,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct= kvm_vcpu *vcpu, gfn_t gfn, =20 sp =3D kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache); sp->spt =3D kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache); - if (!role.direct) + if (!role.direct && !role.passthrough) sp->gfns =3D kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache); set_page_private(virt_to_page(sp->spt), (unsigned long)sp); =20 @@ -2054,6 +2061,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct k= vm_vcpu *vcpu, quadrant &=3D (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1; role.quadrant =3D quadrant; } + if (level <=3D 
vcpu->arch.mmu->root_level) + role.passthrough =3D 0; =20 sp_list =3D &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)]; for_each_valid_sp(vcpu->kvm, sp, sp_list) { @@ -4882,6 +4891,8 @@ kvm_calc_shadow_npt_root_page_role(struct kvm_vcpu *v= cpu, =20 role.base.direct =3D false; role.base.level =3D kvm_mmu_get_tdp_level(vcpu); + if (role.base.level > role_regs_to_root_level(regs)) + role.base.passthrough =3D 1; =20 return role; } @@ -5312,6 +5323,8 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, = gpa_t gpa, ++vcpu->kvm->stat.mmu_pte_write; =20 for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn) { + if (sp->role.passthrough) + continue; if (detect_write_misaligned(sp, gpa, bytes) || detect_write_flooding(sp)) { kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list); diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index 8621188b46df..c1b975fb85a2 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -1042,6 +1042,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, st= ruct kvm_mmu_page *sp) .level =3D 0xf, .access =3D 0x7, .quadrant =3D 0x3, + .passthrough =3D 0x1, }; =20 /* --=20 2.19.1.6.gb485710b From nobody Sun Jun 21 10:07:14 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B6DBEC433FE for ; Tue, 29 Mar 2022 15:35:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238863AbiC2Phf (ORCPT ); Tue, 29 Mar 2022 11:37:35 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59292 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238851AbiC2Ph0 (ORCPT ); Tue, 29 Mar 2022 11:37:26 -0400 Received: from mail-pf1-x42b.google.com (mail-pf1-x42b.google.com [IPv6:2607:f8b0:4864:20::42b]) by lindbergh.monkeyblade.net (Postfix) 
with ESMTPS id CDF6225667B; Tue, 29 Mar 2022 08:35:42 -0700 (PDT) Received: by mail-pf1-x42b.google.com with SMTP id h19so15227441pfv.1; Tue, 29 Mar 2022 08:35:42 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=7J0hAZ+76i4t5nkDrXP5ltSgXTZne12nv3FBQdb+CYA=; b=Y9kPkFicdNqYgPSzkGxTsuZkhUm78b0EuoKZbKn8GrNG0X9SOgUmiJP9NoTjeRvXxd WySF8g2dMUfo1l0Jszg+pMzbPNFnkzUb5YLeHVa1t34Y0Ddm7U67Lziw9U3eIZTJWJNr YYcCwy0+RJHnKLaUKK8UqCMtYw6lJ0aONPXEG3PBahblpnH2Blb1opbd6XWS8w4okbws V9/NhfvXztKGUhllynD6YtKY3OV1KMC7hua/VrLjutDyvgqUTgWoP/lq9nxYAumh7uF/ hf3Dg6Ony/0fQK8Vji1Rdw2LP3WQ5fI0o9J/9N/VrCkYnL05PxLlSWtmbl/PBVAKdhma ky6Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=7J0hAZ+76i4t5nkDrXP5ltSgXTZne12nv3FBQdb+CYA=; b=LPEycr0u342Egknm3jwe2cqynWYGCYdwa45y3IcVDljL3F3ntf6XE3gjo9+H+7YQna I3PSgFhaI5wrQnQ73D6Bfayl0BLyMGerczll6S8ch965ZnpwwNiJDFK8FeUFiijWhYtK fGJC7wp5J72CTc/7Sm5YICtK05h5pcxQ94ZBAZ8lp2Yt+65iglDt8m7ixg5yEvR36ffq a9qqBkiuJJl2e7RvNtSLqC2OqHyciztbODEfkCWSdHU44Mk7NfOseQP9689yXaIEWrwA Nx7v/+jbZ/fK9qCWuJqDHHuAsaskjKIFynx0kBhkiLTfVTrCoTxHb65cHSboyWtKYwl5 ZdFQ== X-Gm-Message-State: AOAM532TaYj+h8SfnFBdwzTTFjwwEckzREK1AmjTT7+yw5c4DqkQ8BYv 0A/MCaxaBgTssTJ/e0Id2vjP4CFPVRE= X-Google-Smtp-Source: ABdhPJxFtsGrhjQEKYK1Ov8/M2YQa+kPtM/4I6Xx3o5I/eIrx3co/l6pPxF8SkqFe6QERN3rDcPbnw== X-Received: by 2002:a63:43c4:0:b0:381:10:45b8 with SMTP id q187-20020a6343c4000000b00381001045b8mr2387667pga.588.1648568142011; Tue, 29 Mar 2022 08:35:42 -0700 (PDT) Received: from localhost ([47.251.3.230]) by smtp.gmail.com with ESMTPSA id f16-20020a056a001ad000b004fb358ffe86sm12047553pfv.137.2022.03.29.08.35.40 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Tue, 29 Mar 2022 08:35:41 -0700 (PDT) 
From: Lai Jiangshan To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, Paolo Bonzini , Sean Christopherson Cc: Lai Jiangshan , Jonathan Corbet , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H. Peter Anvin" , linux-doc@vger.kernel.org Subject: [RFC PATCH V2 3/4] KVM: X86: Alloc role.pae_root shadow page Date: Tue, 29 Mar 2022 23:36:03 +0800 Message-Id: <20220329153604.507475-4-jiangshanlai@gmail.com> X-Mailer: git-send-email 2.19.1.6.gb485710b In-Reply-To: <20220329153604.507475-1-jiangshanlai@gmail.com> References: <20220329153604.507475-1-jiangshanlai@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8"

From: Lai Jiangshan

Currently pae_root is a special root page. This patch adds the facility to allow kvm_mmu_get_page() to allocate a pae_root shadow page.

When kvm_mmu_get_page() is called for role.level == PT32E_ROOT_LEVEL and vcpu->arch.mmu->shadow_root_level == PT32E_ROOT_LEVEL, it will get a PAE root pagetable and set role.pae_root=1, which is needed when the page is freed.

The role.pae_root bit is needed in the page role because:
  o PAE roots must be allocated below 4GB (for kvm_mmu_get_page())
  o PAE roots can not be encrypted (for kvm_mmu_get_page())
  o PAE roots must be re-encrypted when freed (for kvm_mmu_free_page())
  o A PAE root's PDPTE is special (for link_shadow_page())
  o The decrypted low-address pagetable must not be shared with non-PAE-root ones, or vice versa (for kvm_mmu_get_page(); this is the crucial reason)

The uses of role.pae_root in link_shadow_page() and in kvm_mmu_get_page() could possibly be changed to use shadow_root_level and role.level instead. But kvm_mmu_free_page() can't use vcpu->arch.mmu->shadow_root_level.

PAE roots must be allocated below 4GB (CR3 has only 32 bits), so a cache is introduced (mmu_pae_root_cache).

No functional change is intended, since this code is not yet activated: when vcpu->arch.mmu->shadow_root_level == PT32E_ROOT_LEVEL, kvm_mmu_get_page() is currently only called for level == 1 or 2.

Signed-off-by: Lai Jiangshan
---
 Documentation/virt/kvm/mmu.rst  |  2 +
 arch/x86/include/asm/kvm_host.h |  9 +++-
 arch/x86/kvm/mmu/mmu.c          | 78 +++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu/paging_tmpl.h  |  1 +
 4 files changed, 86 insertions(+), 4 deletions(-)

diff --git a/Documentation/virt/kvm/mmu.rst b/Documentation/virt/kvm/mmu.rst
index 60c4057ef625..dbeb6462c6b0 100644
--- a/Documentation/virt/kvm/mmu.rst
+++ b/Documentation/virt/kvm/mmu.rst
@@ -207,6 +207,8 @@ Shadow pages contain the following information:
     larger than guest paging level; passthrough shadow page tables must
     be created on the top.  Like when role.has_4_byte_gpte or shadow NPT
     for 32 bit L1 or 5-level shadow NPT for 4-level NPT L1.
+  role.pae_root:
+    Is 1 if it is a PAE root.
   gfn:
     Either the guest page table containing the translations shadowed by this
     page, or the base page frame for linear translations.  See role.direct.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1e6bf563b939..bc31c0104eca 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -313,6 +313,11 @@ struct kvm_kernel_irq_routing_entry;
  *   - on top of this, smep_andnot_wp and smap_andnot_wp are only set if
  *     cr0_wp=0, therefore these three bits only give rise to 5 possibilities.
  *
+ *   - pae_root can only be set when level=3, so combinations for level and
+ *     pae_root can be seen as 2/3/3-page_root/4/5, a.k.a 5 possibilities.
+ *     Combined with cr0_wp, smep_andnot_wp and smap_andnot_wp, it will be
+ *     5X5 = 25 < 2^5.
+ *
  * Therefore, the maximum number of possible upper-level shadow pages for a
  * single gfn is a bit less than 2^14.
*/ @@ -332,7 +337,8 @@ union kvm_mmu_page_role { unsigned ad_disabled:1; unsigned guest_mode:1; unsigned passthrough:1; - unsigned :5; + unsigned pae_root:1; + unsigned :4; =20 /* * This is left at the top of the word so that @@ -699,6 +705,7 @@ struct kvm_vcpu_arch { struct kvm_mmu_memory_cache mmu_shadow_page_cache; struct kvm_mmu_memory_cache mmu_gfn_array_cache; struct kvm_mmu_memory_cache mmu_page_header_cache; + void *mmu_pae_root_cache; =20 /* * QEMU userspace and the guest each have their own FPU state. diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 54c7db7c9608..42046bff3c49 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -694,6 +694,35 @@ static void walk_shadow_page_lockless_end(struct kvm_v= cpu *vcpu) } } =20 +static int mmu_topup_pae_root_cache(struct kvm_vcpu *vcpu) +{ + struct page *page; + + if (vcpu->arch.mmu->shadow_root_level !=3D PT32E_ROOT_LEVEL) + return 0; + if (vcpu->arch.mmu_pae_root_cache) + return 0; + + page =3D alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_DMA32); + if (!page) + return -ENOMEM; + vcpu->arch.mmu_pae_root_cache =3D page_address(page); + + /* + * CR3 is only 32 bits when PAE paging is used, thus it's impossible to + * get the CPU to treat the PDPTEs as encrypted. Decrypt the page so + * that KVM's writes and the CPU's reads get along. Note, this is + * only necessary when using shadow paging, as 64-bit NPT can get at + * the C-bit even when shadowing 32-bit NPT, and SME isn't supported + * by 32-bit kernels (when KVM itself uses 32-bit NPT). 
+ */ + if (!tdp_enabled) + set_memory_decrypted((unsigned long)vcpu->arch.mmu_pae_root_cache, 1); + else + WARN_ON_ONCE(shadow_me_mask); + return 0; +} + static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indir= ect) { int r; @@ -705,6 +734,9 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcp= u, bool maybe_indirect) return r; r =3D kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache, PT64_ROOT_MAX_LEVEL); + if (r) + return r; + r =3D mmu_topup_pae_root_cache(vcpu); if (r) return r; if (maybe_indirect) { @@ -717,12 +749,23 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *v= cpu, bool maybe_indirect) PT64_ROOT_MAX_LEVEL); } =20 +static void mmu_free_pae_root(void *root_pt) +{ + if (!tdp_enabled) + set_memory_encrypted((unsigned long)root_pt, 1); + free_page((unsigned long)root_pt); +} + static void mmu_free_memory_caches(struct kvm_vcpu *vcpu) { kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache); kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache); kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache); kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache); + if (vcpu->arch.mmu_pae_root_cache) { + mmu_free_pae_root(vcpu->arch.mmu_pae_root_cache); + vcpu->arch.mmu_pae_root_cache =3D NULL; + } } =20 static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_vcpu *vcpu) @@ -1681,7 +1724,10 @@ static void kvm_mmu_free_page(struct kvm_mmu_page *s= p) MMU_WARN_ON(!is_empty_shadow_page(sp->spt)); hlist_del(&sp->hash_link); list_del(&sp->link); - free_page((unsigned long)sp->spt); + if (sp->role.pae_root) + mmu_free_pae_root(sp->spt); + else + free_page((unsigned long)sp->spt); free_page((unsigned long)sp->gfns); kmem_cache_free(mmu_page_header_cache, sp); } @@ -1719,7 +1765,12 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struc= t kvm_vcpu *vcpu, gfn_t gfn, struct kvm_mmu_page *sp; =20 sp =3D kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache); - sp->spt =3D 
kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache); + if (!role.pae_root) { + sp->spt =3D kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache= ); + } else { + sp->spt =3D vcpu->arch.mmu_pae_root_cache; + vcpu->arch.mmu_pae_root_cache =3D NULL; + } if (!role.direct && !role.passthrough) sp->gfns =3D kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache); set_page_private(virt_to_page(sp->spt), (unsigned long)sp); @@ -2063,6 +2114,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct k= vm_vcpu *vcpu, } if (level <=3D vcpu->arch.mmu->root_level) role.passthrough =3D 0; + if (level !=3D PT32E_ROOT_LEVEL) + role.pae_root =3D 0; =20 sp_list =3D &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)]; for_each_valid_sp(vcpu->kvm, sp, sp_list) { @@ -2198,14 +2251,26 @@ static void shadow_walk_next(struct kvm_shadow_walk= _iterator *iterator) __shadow_walk_next(iterator, *iterator->sptep); } =20 +static u64 make_pae_pdpte(u64 *child_pt) +{ + /* The only ignore bits in PDPTE are 11:9. 
*/ + BUILD_BUG_ON(!(GENMASK(11,9) & SPTE_MMU_PRESENT_MASK)); + return __pa(child_pt) | PT_PRESENT_MASK | SPTE_MMU_PRESENT_MASK | + shadow_me_mask; +} + static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep, struct kvm_mmu_page *sp) { + struct kvm_mmu_page *parent_sp =3D sptep_to_sp(sptep); u64 spte; =20 BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK !=3D PT_WRITABLE_MASK); =20 - spte =3D make_nonleaf_spte(sp->spt, sp_ad_disabled(sp)); + if (!parent_sp->role.pae_root) + spte =3D make_nonleaf_spte(sp->spt, sp_ad_disabled(sp)); + else + spte =3D make_pae_pdpte(sp->spt); =20 mmu_spte_set(sptep, spte); =20 @@ -4781,6 +4846,8 @@ kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu, role.base.level =3D kvm_mmu_get_tdp_level(vcpu); role.base.direct =3D true; role.base.has_4_byte_gpte =3D false; + if (role.base.level =3D=3D PT32E_ROOT_LEVEL) + role.base.pae_root =3D 1; =20 return role; } @@ -4846,6 +4913,9 @@ kvm_calc_shadow_mmu_root_page_role(struct kvm_vcpu *v= cpu, else role.base.level =3D PT64_ROOT_4LEVEL; =20 + if (role.base.level =3D=3D PT32E_ROOT_LEVEL) + role.base.pae_root =3D 1; + return role; } =20 @@ -4893,6 +4963,8 @@ kvm_calc_shadow_npt_root_page_role(struct kvm_vcpu *v= cpu, role.base.level =3D kvm_mmu_get_tdp_level(vcpu); if (role.base.level > role_regs_to_root_level(regs)) role.base.passthrough =3D 1; + if (role.base.level =3D=3D PT32E_ROOT_LEVEL) + role.base.pae_root =3D 1; =20 return role; } diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index c1b975fb85a2..2062ac25b7e5 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -1043,6 +1043,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, st= ruct kvm_mmu_page *sp) .access =3D 0x7, .quadrant =3D 0x3, .passthrough =3D 0x1, + .pae_root =3D 0x1, }; =20 /* --=20 2.19.1.6.gb485710b From nobody Sun Jun 21 10:07:14 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from 
vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3BBEEC433EF for ; Tue, 29 Mar 2022 15:36:02 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232127AbiC2Phn (ORCPT ); Tue, 29 Mar 2022 11:37:43 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59826 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238885AbiC2Phe (ORCPT ); Tue, 29 Mar 2022 11:37:34 -0400 Received: from mail-pg1-x533.google.com (mail-pg1-x533.google.com [IPv6:2607:f8b0:4864:20::533]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B389825A49F; Tue, 29 Mar 2022 08:35:49 -0700 (PDT) Received: by mail-pg1-x533.google.com with SMTP id q19so15146802pgm.6; Tue, 29 Mar 2022 08:35:49 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=vgt97cy17DrZAeHvnjs0ULK9B4RJYZTKqS9ejS64zbc=; b=qsheBP6t8HkzZTRg2kVBQfo66Gsnp8LFreF07SIFqbY+B1moJFLof9ImZ0Ly3/MQa/ whXERgvx5BLbZPb/r6UcRLS75JO0NF15Er7XP2228Pdtol76/8FA9cbS8OdKEyTslQfL CPy7RvsE8WFxTtbKcJ4wBUvPRTb/v2gsXFBv3y+tdkTpiO8rjxY7ByrWsOPiIAFAj+CP DRBCSSb8QbIUmXeu6Wps58eVD9EflHkf9l/VbI8qHkiCCeWouPPUELRwOMmTUiRPKZgK x7Ct1RM5uk2l+A9VsjEhfPJX8MwUsNXEWOsmkkXp2LSwMlPslUp/+1xU1ml70GoQyksV AsTQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=vgt97cy17DrZAeHvnjs0ULK9B4RJYZTKqS9ejS64zbc=; b=yRB3vmj9h0m+r91O8zeuYhv1CLcQxwbqAsC4ACsoNY1Qg3yZ3lPTBN62dFLkJm92eQ OJ1qpX7z38fA5ySzlhhT2QTBTFiPvRnsX4/5nfXtxz0yOLlplo8GZkoyT1fNCqjKheHj nWqPiroCqnEGWImO+/fGT67MHS2ybb+QljZWKo1fl6PyNnCjvu09QGB7szSCe8nKfb4U APE+dB4aOFIvNvOfNRsiYeZ1c+ks6SBaVIJDznYsz0hSrOFKGpr7gca4NPWxp2D0izgf x/T1bzqxyXlmAbwkv5Z6SQRuTC8twZgX/JVv2gOKR3DJQNicS2KFUjbTmEdEV7jjyClo a6Mg== 
X-Gm-Message-State: AOAM5314qHz2TI9rOEYGWhomt3/vxJ7MtPgtH8IQSfq43aeClIkpDEE0 YNxgJTBoLLOIfc3L0fal3RnWUPcwimo= X-Google-Smtp-Source: ABdhPJwt/IP7wfQ3SZkdRYO5v/DyfhJQTPAQXs/TtP2KmSjlQ7wnsOB37v1J8a8lYYdTEwD3KX5Avw== X-Received: by 2002:a63:5810:0:b0:381:6562:46fb with SMTP id m16-20020a635810000000b00381656246fbmr2369604pgb.567.1648568148692; Tue, 29 Mar 2022 08:35:48 -0700 (PDT) Received: from localhost ([47.251.3.230]) by smtp.gmail.com with ESMTPSA id j8-20020a17090a060800b001c7936791d1sm3390631pjj.7.2022.03.29.08.35.47 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Tue, 29 Mar 2022 08:35:48 -0700 (PDT) From: Lai Jiangshan To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, Paolo Bonzini , Sean Christopherson Cc: Lai Jiangshan , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H. Peter Anvin" Subject: [RFC PATCH V2 4/4] KVM: X86: Use passthrough and pae_root shadow page for 32bit guests Date: Tue, 29 Mar 2022 23:36:04 +0800 Message-Id: <20220329153604.507475-5-jiangshanlai@gmail.com> X-Mailer: git-send-email 2.19.1.6.gb485710b In-Reply-To: <20220329153604.507475-1-jiangshanlai@gmail.com> References: <20220329153604.507475-1-jiangshanlai@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8"

From: Lai Jiangshan

Use role.pae_root = 1 for shadow_root_level == 3, no matter whether it is a shadow MMU or not. For a shadow MMU, level expansion might occur, in which case role.passthrough = 1 is used for the expanded shadow pagetable.

And remove the unneeded special roots. Now all the root pages and pagetables pointed to by a present spte in kvm_mmu are backed by struct kvm_mmu_page, and to_shadow_page() is guaranteed to be not NULL.

shadow_walk() and the initialization of shadow pages are much simplified since there are no special roots.

Affected cases:

direct mmu (nonpaging for 32 bit guest):
    gCR0_PG=3D0 (pae_root=3D1)
shadow mmu (shadow paging for 32 bit guest):
    gCR0_PG=3D1,gEFER_LMA=3D0,gCR4_PSE=3D0 (pae_root=3D1,passthrough=3D1)
    gCR0_PG=3D1,gEFER_LMA=3D0,gCR4_PSE=3D1 (pae_root=3D1,passthrough=3D0)
direct mmu (NPT for 32bit host):
    hEFER_LMA=3D0 (pae_root=3D1)
shadow nested NPT (for 32bit L1 hypervisor):
    gCR0_PG=3D1,gEFER_LMA=3D0,gCR4_PSE=3D0,hEFER_LMA=3D0 (pae_root=3D1,passthrough=3D1)
    gCR0_PG=3D1,gEFER_LMA=3D0,gCR4_PSE=3D1,hEFER_LMA=3D0 (pae_root=3D1,passthrough=3D0)
    gCR0_PG=3D1,gEFER_LMA=3D0,gCR4_PSE=3D{0|1},hEFER_LMA=3D1,hCR4_LA57=3D{0|1} (pae_root=3D0,passthrough=3D1)
    (default_pae_pdpte is not used even if the guest is using PAE paging)

Shadow nested NPT for a 64bit L1 hypervisor is already handled:
    gEFER_LMA=3D1,gCR4_LA57=3D0,hEFER_LMA=3D1,hCR4_LA57=3D1 (pae_root=3D0,passthrough=3D1)

FNAME(walk_addr_generic) adds initialization code for shadow nested NPT for a 32bit L1 hypervisor, where the level increment might be more than one, for example 2->4, 2->5, 3->5.

After this patch, the PAE Page-Directory-Pointer-Table is also write-protected (including NPT's).

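For readers who want to check the first cases in the list by hand, here is a minimal user-space sketch of the condition this patch adds to kvm_calc_shadow_mmu_root_page_role(). It is not kernel code: struct sketch_role, sketch_shadow_root_role() and SKETCH_PT32E_ROOT_LEVEL are hypothetical names made up for illustration, and the sketch covers only the cases where the shadow root level is 3. The condition on CR0.PG and CR4.PSE is copied as written in the diff.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's PT32E_ROOT_LEVEL (3-level PAE). */
#define SKETCH_PT32E_ROOT_LEVEL 3u

/* Hypothetical, reduced view of the role bits this patch touches. */
struct sketch_role {
	unsigned int level;
	bool pae_root;
	bool passthrough;
};

/*
 * Mirrors the hunk in kvm_calc_shadow_mmu_root_page_role():
 * a level-3 root always gets pae_root, and it additionally gets
 * passthrough when guest paging is on and CR4.PSE is clear.
 */
struct sketch_role sketch_shadow_root_role(bool cr0_pg, bool cr4_pse,
					   unsigned int level)
{
	struct sketch_role role = {
		.level = level,
		.pae_root = false,
		.passthrough = false,
	};

	if (level == SKETCH_PT32E_ROOT_LEVEL) {
		role.pae_root = true;
		if (cr0_pg && !cr4_pse)
			role.passthrough = true;
	}
	return role;
}

/* Walks the level-3 cases from the list above. */
void sketch_check_affected_cases(void)
{
	struct sketch_role r;

	r = sketch_shadow_root_role(false, false, 3); /* gCR0_PG=0 */
	assert(r.pae_root && !r.passthrough);

	r = sketch_shadow_root_role(true, false, 3);  /* gCR0_PG=1,gCR4_PSE=0 */
	assert(r.pae_root && r.passthrough);

	r = sketch_shadow_root_role(true, true, 3);   /* gCR0_PG=1,gCR4_PSE=1 */
	assert(r.pae_root && !r.passthrough);
}
```

The hEFER_LMA=1 nested NPT cases (pae_root=0, passthrough=1) are outside this hunk and are deliberately not modeled here.
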
Signed-off-by: Lai Jiangshan --- arch/x86/include/asm/kvm_host.h | 4 - arch/x86/kvm/mmu/mmu.c | 295 ++------------------------------ arch/x86/kvm/mmu/paging_tmpl.h | 13 +- 3 files changed, 24 insertions(+), 288 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_hos= t.h index bc31c0104eca..82eb96b7578d 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -467,10 +467,6 @@ struct kvm_mmu { */ u32 pkru_mask; =20 - u64 *pae_root; - u64 *pml4_root; - u64 *pml5_root; - /* * check zero bits on shadow page table entries, these * bits include not only hardware reserved bits but also diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 42046bff3c49..40832a35e184 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -2195,26 +2195,6 @@ static void shadow_walk_init_using_root(struct kvm_s= hadow_walk_iterator *iterato iterator->addr =3D addr; iterator->shadow_addr =3D root; iterator->level =3D vcpu->arch.mmu->shadow_root_level; - - if (iterator->level >=3D PT64_ROOT_4LEVEL && - vcpu->arch.mmu->root_level < PT64_ROOT_4LEVEL && - !vcpu->arch.mmu->direct_map) - iterator->level =3D PT32E_ROOT_LEVEL; - - if (iterator->level =3D=3D PT32E_ROOT_LEVEL) { - /* - * prev_root is currently only used for 64-bit hosts. So only - * the active root_hpa is valid here. 
- */ - BUG_ON(root !=3D vcpu->arch.mmu->root.hpa); - - iterator->shadow_addr - =3D vcpu->arch.mmu->pae_root[(addr >> 30) & 3]; - iterator->shadow_addr &=3D PT64_BASE_ADDR_MASK; - --iterator->level; - if (!iterator->shadow_addr) - iterator->level =3D 0; - } } =20 static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator, @@ -3327,18 +3307,7 @@ void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_= mmu *mmu, &invalid_list); =20 if (free_active_root) { - if (to_shadow_page(mmu->root.hpa)) { - mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list); - } else if (mmu->pae_root) { - for (i =3D 0; i < 4; ++i) { - if (!IS_VALID_PAE_ROOT(mmu->pae_root[i])) - continue; - - mmu_free_root_page(kvm, &mmu->pae_root[i], - &invalid_list); - mmu->pae_root[i] =3D INVALID_PAE_ROOT; - } - } + mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list); mmu->root.hpa =3D INVALID_PAGE; mmu->root.pgd =3D 0; } @@ -3403,7 +3372,6 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vc= pu) struct kvm_mmu *mmu =3D vcpu->arch.mmu; u8 shadow_root_level =3D mmu->shadow_root_level; hpa_t root; - unsigned i; int r; =20 write_lock(&vcpu->kvm->mmu_lock); @@ -3414,24 +3382,9 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *v= cpu) if (is_tdp_mmu_enabled(vcpu->kvm)) { root =3D kvm_tdp_mmu_get_vcpu_root_hpa(vcpu); mmu->root.hpa =3D root; - } else if (shadow_root_level >=3D PT64_ROOT_4LEVEL) { + } else if (shadow_root_level >=3D PT32E_ROOT_LEVEL) { root =3D mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true); mmu->root.hpa =3D root; - } else if (shadow_root_level =3D=3D PT32E_ROOT_LEVEL) { - if (WARN_ON_ONCE(!mmu->pae_root)) { - r =3D -EIO; - goto out_unlock; - } - - for (i =3D 0; i < 4; ++i) { - WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i])); - - root =3D mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), - i << 30, PT32_ROOT_LEVEL, true); - mmu->pae_root[i] =3D root | PT_PRESENT_MASK | - shadow_me_mask; - } - mmu->root.hpa =3D __pa(mmu->pae_root); } else { WARN_ONCE(1, "Bad TDP root level 
=3D %d\n", shadow_root_level); r =3D -EIO; @@ -3509,10 +3462,8 @@ static int mmu_first_shadow_root_alloc(struct kvm *k= vm) static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu) { struct kvm_mmu *mmu =3D vcpu->arch.mmu; - u64 pdptrs[4], pm_mask; gfn_t root_gfn, root_pgd; hpa_t root; - unsigned i; int r; =20 root_pgd =3D mmu->get_guest_pgd(vcpu); @@ -3521,21 +3472,6 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *v= cpu) if (mmu_check_root(vcpu, root_gfn)) return 1; =20 - /* - * On SVM, reading PDPTRs might access guest memory, which might fault - * and thus might sleep. Grab the PDPTRs before acquiring mmu_lock. - */ - if (mmu->root_level =3D=3D PT32E_ROOT_LEVEL) { - for (i =3D 0; i < 4; ++i) { - pdptrs[i] =3D mmu->get_pdptr(vcpu, i); - if (!(pdptrs[i] & PT_PRESENT_MASK)) - continue; - - if (mmu_check_root(vcpu, pdptrs[i] >> PAGE_SHIFT)) - return 1; - } - } - r =3D mmu_first_shadow_root_alloc(vcpu->kvm); if (r) return r; @@ -3545,70 +3481,9 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *v= cpu) if (r < 0) goto out_unlock; =20 - /* - * Do we shadow a long mode page table? If so we need to - * write-protect the guests page table root. - */ - if (mmu->root_level >=3D PT64_ROOT_4LEVEL) { - root =3D mmu_alloc_root(vcpu, root_gfn, 0, - mmu->shadow_root_level, false); - mmu->root.hpa =3D root; - goto set_root_pgd; - } - - if (WARN_ON_ONCE(!mmu->pae_root)) { - r =3D -EIO; - goto out_unlock; - } - - /* - * We shadow a 32 bit page table. This may be a legacy 2-level - * or a PAE 3-level page table. In either case we need to be aware that - * the shadow page table may be a PAE or a long mode page table. 
- */ - pm_mask =3D PT_PRESENT_MASK | shadow_me_mask; - if (mmu->shadow_root_level >=3D PT64_ROOT_4LEVEL) { - pm_mask |=3D PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK; - - if (WARN_ON_ONCE(!mmu->pml4_root)) { - r =3D -EIO; - goto out_unlock; - } - mmu->pml4_root[0] =3D __pa(mmu->pae_root) | pm_mask; - - if (mmu->shadow_root_level =3D=3D PT64_ROOT_5LEVEL) { - if (WARN_ON_ONCE(!mmu->pml5_root)) { - r =3D -EIO; - goto out_unlock; - } - mmu->pml5_root[0] =3D __pa(mmu->pml4_root) | pm_mask; - } - } - - for (i =3D 0; i < 4; ++i) { - WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i])); - - if (mmu->root_level =3D=3D PT32E_ROOT_LEVEL) { - if (!(pdptrs[i] & PT_PRESENT_MASK)) { - mmu->pae_root[i] =3D INVALID_PAE_ROOT; - continue; - } - root_gfn =3D pdptrs[i] >> PAGE_SHIFT; - } - - root =3D mmu_alloc_root(vcpu, root_gfn, i << 30, - PT32_ROOT_LEVEL, false); - mmu->pae_root[i] =3D root | pm_mask; - } - - if (mmu->shadow_root_level =3D=3D PT64_ROOT_5LEVEL) - mmu->root.hpa =3D __pa(mmu->pml5_root); - else if (mmu->shadow_root_level =3D=3D PT64_ROOT_4LEVEL) - mmu->root.hpa =3D __pa(mmu->pml4_root); - else - mmu->root.hpa =3D __pa(mmu->pae_root); - -set_root_pgd: + root =3D mmu_alloc_root(vcpu, root_gfn, 0, + mmu->shadow_root_level, false); + mmu->root.hpa =3D root; mmu->root.pgd =3D root_pgd; out_unlock: write_unlock(&vcpu->kvm->mmu_lock); @@ -3616,77 +3491,6 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *v= cpu) return r; } =20 -static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu) -{ - struct kvm_mmu *mmu =3D vcpu->arch.mmu; - bool need_pml5 =3D mmu->shadow_root_level > PT64_ROOT_4LEVEL; - u64 *pml5_root =3D NULL; - u64 *pml4_root =3D NULL; - u64 *pae_root; - - /* - * When shadowing 32-bit or PAE NPT with 64-bit NPT, the PML4 and PDP - * tables are allocated and initialized at root creation as there is no - * equivalent level in the guest's NPT to shadow. Allocate the tables - * on demand, as running a 32-bit L1 VMM on 64-bit KVM is very rare. 
- */ - if (mmu->direct_map || mmu->root_level >=3D PT64_ROOT_4LEVEL || - mmu->shadow_root_level < PT64_ROOT_4LEVEL) - return 0; - - /* - * NPT, the only paging mode that uses this horror, uses a fixed number - * of levels for the shadow page tables, e.g. all MMUs are 4-level or - * all MMus are 5-level. Thus, this can safely require that pml5_root - * is allocated if the other roots are valid and pml5 is needed, as any - * prior MMU would also have required pml5. - */ - if (mmu->pae_root && mmu->pml4_root && (!need_pml5 || mmu->pml5_root)) - return 0; - - /* - * The special roots should always be allocated in concert. Yell and - * bail if KVM ends up in a state where only one of the roots is valid. - */ - if (WARN_ON_ONCE(!tdp_enabled || mmu->pae_root || mmu->pml4_root || - (need_pml5 && mmu->pml5_root))) - return -EIO; - - /* - * Unlike 32-bit NPT, the PDP table doesn't need to be in low mem, and - * doesn't need to be decrypted. - */ - pae_root =3D (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT); - if (!pae_root) - return -ENOMEM; - -#ifdef CONFIG_X86_64 - pml4_root =3D (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT); - if (!pml4_root) - goto err_pml4; - - if (need_pml5) { - pml5_root =3D (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT); - if (!pml5_root) - goto err_pml5; - } -#endif - - mmu->pae_root =3D pae_root; - mmu->pml4_root =3D pml4_root; - mmu->pml5_root =3D pml5_root; - - return 0; - -#ifdef CONFIG_X86_64 -err_pml5: - free_page((unsigned long)pml4_root); -err_pml4: - free_page((unsigned long)pae_root); - return -ENOMEM; -#endif -} - static bool is_unsync_root(hpa_t root) { struct kvm_mmu_page *sp; @@ -3724,8 +3528,7 @@ static bool is_unsync_root(hpa_t root) =20 void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu) { - int i; - struct kvm_mmu_page *sp; + hpa_t root =3D vcpu->arch.mmu->root.hpa; =20 if (vcpu->arch.mmu->direct_map) return; @@ -3735,31 +3538,11 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu) =20 vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY); =20 - if 
(vcpu->arch.mmu->root_level >=3D PT64_ROOT_4LEVEL) { - hpa_t root =3D vcpu->arch.mmu->root.hpa; - sp =3D to_shadow_page(root); - - if (!is_unsync_root(root)) - return; - - write_lock(&vcpu->kvm->mmu_lock); - mmu_sync_children(vcpu, sp, true); - write_unlock(&vcpu->kvm->mmu_lock); + if (!is_unsync_root(root)) return; - } =20 write_lock(&vcpu->kvm->mmu_lock); - - for (i =3D 0; i < 4; ++i) { - hpa_t root =3D vcpu->arch.mmu->pae_root[i]; - - if (IS_VALID_PAE_ROOT(root)) { - root &=3D PT64_BASE_ADDR_MASK; - sp =3D to_shadow_page(root); - mmu_sync_children(vcpu, sp, true); - } - } - + mmu_sync_children(vcpu, to_shadow_page(root), true); write_unlock(&vcpu->kvm->mmu_lock); } =20 @@ -4913,8 +4696,11 @@ kvm_calc_shadow_mmu_root_page_role(struct kvm_vcpu *= vcpu, else role.base.level =3D PT64_ROOT_4LEVEL; =20 - if (role.base.level =3D=3D PT32E_ROOT_LEVEL) + if (role.base.level =3D=3D PT32E_ROOT_LEVEL) { role.base.pae_root =3D 1; + if (____is_cr0_pg(regs) && !____is_cr4_pse(regs)) + role.base.passthrough =3D 1; + } =20 return role; } @@ -5162,9 +4948,6 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu) int r; =20 r =3D mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->direct_map); - if (r) - goto out; - r =3D mmu_alloc_special_roots(vcpu); if (r) goto out; if (vcpu->arch.mmu->direct_map) @@ -5635,65 +5418,14 @@ slot_handle_level_4k(struct kvm *kvm, const struct = kvm_memory_slot *memslot, PG_LEVEL_4K, flush_on_yield); } =20 -static void free_mmu_pages(struct kvm_mmu *mmu) -{ - if (!tdp_enabled && mmu->pae_root) - set_memory_encrypted((unsigned long)mmu->pae_root, 1); - free_page((unsigned long)mmu->pae_root); - free_page((unsigned long)mmu->pml4_root); - free_page((unsigned long)mmu->pml5_root); -} - static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu) { - struct page *page; int i; =20 mmu->root.hpa =3D INVALID_PAGE; mmu->root.pgd =3D 0; for (i =3D 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) mmu->prev_roots[i] =3D KVM_MMU_ROOT_INFO_INVALID; - - /* vcpu->arch.guest_mmu isn't 
used when !tdp_enabled. */ - if (!tdp_enabled && mmu =3D=3D &vcpu->arch.guest_mmu) - return 0; - - /* - * When using PAE paging, the four PDPTEs are treated as 'root' pages, - * while the PDP table is a per-vCPU construct that's allocated at MMU - * creation. When emulating 32-bit mode, cr3 is only 32 bits even on - * x86_64. Therefore we need to allocate the PDP table in the first - * 4GB of memory, which happens to fit the DMA32 zone. TDP paging - * generally doesn't use PAE paging and can skip allocating the PDP - * table. The main exception, handled here, is SVM's 32-bit NPT. The - * other exception is for shadowing L1's 32-bit or PAE NPT on 64-bit - * KVM; that horror is handled on-demand by mmu_alloc_special_roots(). - */ - if (tdp_enabled && kvm_mmu_get_tdp_level(vcpu) > PT32E_ROOT_LEVEL) - return 0; - - page =3D alloc_page(GFP_KERNEL_ACCOUNT | __GFP_DMA32); - if (!page) - return -ENOMEM; - - mmu->pae_root =3D page_address(page); - - /* - * CR3 is only 32 bits when PAE paging is used, thus it's impossible to - * get the CPU to treat the PDPTEs as encrypted. Decrypt the page so - * that KVM's writes and the CPU's reads get along. Note, this is - * only necessary when using shadow paging, as 64-bit NPT can get at - * the C-bit even when shadowing 32-bit NPT, and SME isn't supported - * by 32-bit kernels (when KVM itself uses 32-bit NPT). 
- */ - if (!tdp_enabled) - set_memory_decrypted((unsigned long)mmu->pae_root, 1); - else - WARN_ON_ONCE(shadow_me_mask); - - for (i =3D 0; i < 4; ++i) - mmu->pae_root[i] =3D INVALID_PAE_ROOT; - return 0; } =20 @@ -5722,7 +5454,6 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu) =20 return ret; fail_allocate_root: - free_mmu_pages(&vcpu->arch.guest_mmu); return ret; } =20 @@ -6363,8 +6094,6 @@ int kvm_mmu_module_init(void) void kvm_mmu_destroy(struct kvm_vcpu *vcpu) { kvm_mmu_unload(vcpu); - free_mmu_pages(&vcpu->arch.root_mmu); - free_mmu_pages(&vcpu->arch.guest_mmu); mmu_free_memory_caches(vcpu); } =20 diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index 2062ac25b7e5..c5bf9b619c51 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -365,6 +365,16 @@ static int FNAME(walk_addr_generic)(struct guest_walke= r *walker, pte =3D mmu->get_guest_pgd(vcpu); have_ad =3D PT_HAVE_ACCESSED_DIRTY(mmu); =20 + /* kvm_mmu_get_page() might use this values for allocating passthrough + * shadow page. + */ + walker->table_gfn[4] =3D gpte_to_gfn(pte); + walker->pt_access[4] =3D ACC_ALL; + walker->table_gfn[3] =3D gpte_to_gfn(pte); + walker->pt_access[3] =3D ACC_ALL; + walker->table_gfn[2] =3D gpte_to_gfn(pte); + walker->pt_access[2] =3D ACC_ALL; + #if PTTYPE =3D=3D 64 walk_nx_mask =3D 1ULL << PT64_NX_SHIFT; if (walker->level =3D=3D PT32E_ROOT_LEVEL) { @@ -710,7 +720,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct k= vm_page_fault *fault, * Verify that the gpte in the page we've just write * protected is still there. */ - if (FNAME(gpte_changed)(vcpu, gw, it.level - 1)) + if (it.level - 1 < top_level && + FNAME(gpte_changed)(vcpu, gw, it.level - 1)) goto out_gpte_changed; =20 if (sp) --=20 2.19.1.6.gb485710b
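
The pre-initialization that the paging_tmpl.h hunk adds to FNAME(walk_addr_generic) can be sketched in isolation as follows. This is a user-space illustration under stated assumptions, not the kernel implementation: struct sketch_walker, sketch_gpte_to_gfn(), sketch_prefill_upper_levels(), the address mask and the SKETCH_ACC_ALL value are hypothetical stand-ins for guest_walker, gpte_to_gfn() and ACC_ALL.

```c
#include <stdint.h>

#define SKETCH_MAX_LEVEL 5
#define SKETCH_PAGE_SHIFT 12
/* Assumed placeholder for ACC_ALL (exec | write | user). */
#define SKETCH_ACC_ALL 7u
/* Assumed physical-address mask: bits 12..51 of the entry. */
#define SKETCH_BASE_ADDR_MASK 0x000ffffffffff000ULL

/* Hypothetical, reduced version of struct guest_walker. */
struct sketch_walker {
	uint64_t table_gfn[SKETCH_MAX_LEVEL];
	unsigned int pt_access[SKETCH_MAX_LEVEL];
};

/* Stand-in for gpte_to_gfn(): strip flag bits, convert to a frame number. */
uint64_t sketch_gpte_to_gfn(uint64_t pte)
{
	return (pte & SKETCH_BASE_ADDR_MASK) >> SKETCH_PAGE_SHIFT;
}

/*
 * Fill the slots for levels 3, 4 and 5 (array indices 2, 3 and 4) with
 * the guest root's gfn and full access, as the added hunk does. Slots
 * for levels that the guest walk really visits are overwritten later
 * by the walk itself; the remaining ones describe passthrough pages.
 */
void sketch_prefill_upper_levels(struct sketch_walker *walker,
				 uint64_t guest_pgd)
{
	int idx;

	for (idx = SKETCH_MAX_LEVEL - 1; idx >= 2; idx--) {
		walker->table_gfn[idx] = sketch_gpte_to_gfn(guest_pgd);
		walker->pt_access[idx] = SKETCH_ACC_ALL;
	}
}
```

With a 2-level guest and a 5-level shadow root (the 2->5 case), all three prefilled slots stay as written, so every expanded level is allocated against the same guest root gfn.
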