From nobody Sat Apr 11 18:33:10 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D24B9C00140 for ; Mon, 8 Aug 2022 14:57:46 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S243536AbiHHO5o (ORCPT ); Mon, 8 Aug 2022 10:57:44 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:47272 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S243428AbiHHO5g (ORCPT ); Mon, 8 Aug 2022 10:57:36 -0400 Received: from mga17.intel.com (mga17.intel.com [192.55.52.151]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7434211C26 for ; Mon, 8 Aug 2022 07:57:34 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1659970654; x=1691506654; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=RqWLLHXdKq7i3jEZDCWPo2WblS5wC3PZ6h5tyOrVLyQ=; b=UiuO/XnqNLMdIWjXlxmA9GKV+SM1/jLVckdm2zMs+eNM7LVsO4iDnGqH LN6JZ10mtS/MkIamTkzRID1yOOV0QnWBNVBCEBJ4Ck49TO/4Ro0DBPu1A ZPoQGnyK1EZhqW1DFJg2wYsibL+8QWh1f+23NXmtg/xT42NAIJK+9busm iJPAp2eCqD6P4nLk8jmEbuh6p08hD2qdFfyGTDwJp0Hfl87RFklgR7QSr m67nsXyM+Ru4dKNm8/Nrh6FOzcEq/2pguv0St1cDEJalta6mwSZvb7Igy gaTVQ81eI/pBWqPRJzwrwRwor8r+8RjLlSurvBr5k7HL9k7+TSigNSYwn w==; X-IronPort-AV: E=McAfee;i="6400,9594,10433"; a="270996886" X-IronPort-AV: E=Sophos;i="5.93,222,1654585200"; d="scan'208";a="270996886" Received: from fmsmga008.fm.intel.com ([10.253.24.58]) by fmsmga107.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Aug 2022 07:57:34 -0700 X-IronPort-AV: E=Sophos;i="5.93,222,1654585200"; d="scan'208";a="663980492" Received: from ziqianlu-desk2.sh.intel.com ([10.238.2.76]) by fmsmga008-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Aug 2022 07:57:32 -0700 From: 
Aaron Lu
To: Dave Hansen, Rick Edgecombe
Cc: Song Liu, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [RFC PATCH 1/4] x86/mm/cpa: restore global bit when page is present
Date: Mon, 8 Aug 2022 22:56:46 +0800
Message-Id: <20220808145649.2261258-2-aaron.lu@intel.com>
X-Mailer: git-send-email 2.37.1
In-Reply-To: <20220808145649.2261258-1-aaron.lu@intel.com>
References: <20220808145649.2261258-1-aaron.lu@intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"

On configs that don't have PTI enabled, or on CPUs that don't need the
Meltdown mitigation, the current kernel can lose the GLOBAL bit after a
page goes through a cycle of present -> not present -> present.

It happens like this (__vunmap() does this in vm_remove_mappings()):

original page protection: 0x8000000000000163 (NX/G/D/A/RW/P)
set_memory_np(page, 1):   0x8000000000000062 (NX/D/A/RW) lost G and P
set_memory_p(page, 1):    0x8000000000000063 (NX/D/A/RW/P) restored P

In the end, this page's protection no longer has the Global bit set, which
would create a problem for the small-mapping merge feature in this series.

For this reason, restore the Global bit when the page is present on systems
that do not have PTI enabled.
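To make the bit arithmetic in the trace concrete, here is a minimal
user-space sketch (not kernel code) that models the present -> not present
-> present cycle. The two bit positions are the architectural x86 PTE bits.
fixup_prot() is a hypothetical stand-in for pgprot_clear_protnone_bits(),
and default_mask stands in for __default_kernel_pte_mask, which is assumed
here to have the Global bit cleared when PTI is enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86 PTE bits that matter for the trace. */
#define PAGE_PRESENT (UINT64_C(1) << 0)
#define PAGE_GLOBAL  (UINT64_C(1) << 8)

/*
 * Hypothetical model of pgprot_clear_protnone_bits(): drop G on a
 * non-present entry; with the patch applied, put G back on a present
 * entry if the default mask allows it.
 */
uint64_t fixup_prot(uint64_t prot, uint64_t default_mask, bool patched)
{
	if (!(prot & PAGE_PRESENT))
		prot &= ~PAGE_GLOBAL;
	else if (patched)
		prot |= PAGE_GLOBAL & default_mask;
	return prot;
}

/* Model of set_memory_np() followed by set_memory_p() on one page. */
uint64_t np_then_p(uint64_t prot, uint64_t default_mask, bool patched)
{
	prot = fixup_prot(prot & ~PAGE_PRESENT, default_mask, patched);
	return fixup_prot(prot | PAGE_PRESENT, default_mask, patched);
}

/* Replays the trace from the commit message. */
void global_bit_self_check(void)
{
	const uint64_t orig = UINT64_C(0x8000000000000163);

	/* Unpatched: G is lost for good. */
	assert(np_then_p(orig, ~UINT64_C(0), false) ==
	       UINT64_C(0x8000000000000063));
	/* Patched, mask keeps G (no PTI): G comes back. */
	assert(np_then_p(orig, ~UINT64_C(0), true) == orig);
	/* Patched, PTI-like mask with G cleared: G stays off. */
	assert(np_then_p(orig, ~PAGE_GLOBAL, true) ==
	       UINT64_C(0x8000000000000063));
}
```

Calling global_bit_self_check() from a small main() replays the three
cases. Only the second one, patched with a mask that keeps G, ends with the
original protection value.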
(pgprot_clear_protnone_bits() deserves a better name if this patch is
acceptable, but first I would like feedback on whether this is the right
way to solve the problem, so I have not bothered with the name yet.)

Signed-off-by: Aaron Lu
---
 arch/x86/mm/pat/set_memory.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 1abd5438f126..33657a54670a 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -758,6 +758,8 @@ static pgprot_t pgprot_clear_protnone_bits(pgprot_t prot)
 	 */
 	if (!(pgprot_val(prot) & _PAGE_PRESENT))
 		pgprot_val(prot) &= ~_PAGE_GLOBAL;
+	else
+		pgprot_val(prot) |= _PAGE_GLOBAL & __default_kernel_pte_mask;
 
 	return prot;
 }
-- 
2.37.1

From nobody Sat Apr 11 18:33:10 2026
From: Aaron Lu
To: Dave Hansen, Rick Edgecombe
Cc: Song Liu, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [RFC PATCH 2/4] x86/mm/cpa: merge splitted direct mapping when possible
Date: Mon, 8 Aug 2022 22:56:47 +0800
Message-Id: <20220808145649.2261258-3-aaron.lu@intel.com>
X-Mailer: git-send-email 2.37.1
In-Reply-To: <20220808145649.2261258-1-aaron.lu@intel.com>
References: <20220808145649.2261258-1-aaron.lu@intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"

On x86_64, Linux has a direct mapping of almost all physical memory. For
performance reasons, this mapping is usually set up with large pages (2M or
1G, depending on hardware capability) with read, write and non-execute
protection.

There are cases where some pages have to change their protection to RO and
eXecutable, such as pages that host module code or bpf progs. When the
protection of these pages is changed, the large mapping that covers them
has to be split into 4K mappings first, and then each individual 4K page's
protection is changed accordingly, i.e.
unaffected pages keep their original protection of RW and NX, while the
affected pages' protection is changed to RO and X.

There is a problem with this split: the large mapping remains split even
after the affected pages' protection is changed back to RW and NX, e.g.
when the module is unloaded or the bpf progs are freed. After the system
has run for a long time, more and more large mappings end up split, causing
more and more dTLB misses and hurting overall system performance.

This patch tries to restore split large mappings by tracking how many
entries of the split small-mapping page table have the same protection
bits. Once that number becomes PTRS_PER_PTE, the small-mapping page table
can be released and its upper-level page table entry can point directly to
a large page.

Testing: see patch 4 for detailed testing.

Signed-off-by: Aaron Lu
---
 arch/x86/mm/pat/set_memory.c | 184 +++++++++++++++++++++++++++++++++--
 include/linux/mm_types.h     |   6 ++
 include/linux/page-flags.h   |   6 ++
 3 files changed, 189 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 33657a54670a..fea2c70ff37f 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -718,13 +718,89 @@ phys_addr_t slow_virt_to_phys(void *__virt_addr)
 }
 EXPORT_SYMBOL_GPL(slow_virt_to_phys);
 
+static void merge_splitted_mapping(struct page *pgt, int level);
+static void set_pte_adjust_nr_same_prot(pte_t *kpte, int level, pte_t pte)
+{
+	struct page *pgt = virt_to_page(kpte);
+	pgprot_t old_prot, new_prot;
+	int i;
+
+	/* The purpose of tracking entries with same_prot is to hopefully
+	 * merge split small mappings to large ones. Since only 2M and
+	 * 1G mapping are supported, there is no need tracking for page
+	 * tables of level > 2M.
+	 */
+	if (!PageSplitpgt(pgt) || level > PG_LEVEL_2M) {
+		set_pte(kpte, pte);
+		return;
+	}
+
+	/* get old protection before kpte is updated */
+	if (level == PG_LEVEL_4K) {
+		old_prot = pte_pgprot(*kpte);
+		new_prot = pte_pgprot(pte);
+	} else {
+		old_prot = pmd_pgprot(*(pmd_t *)kpte);
+		new_prot = pmd_pgprot(*(pmd_t *)&pte);
+	}
+
+	set_pte(kpte, pte);
+
+	if (pgprot_val(pgt->same_prot) != pgprot_val(old_prot) &&
+	    pgprot_val(pgt->same_prot) == pgprot_val(new_prot))
+		pgt->nr_same_prot++;
+
+	if (pgprot_val(pgt->same_prot) == pgprot_val(old_prot) &&
+	    pgprot_val(pgt->same_prot) != pgprot_val(new_prot))
+		pgt->nr_same_prot--;
+
+	if (unlikely(pgt->nr_same_prot == 0)) {
+		pte_t *entry = page_address(pgt);
+
+		/*
+		 * Now all entries' prot have changed, check again
+		 * to see if all entries have the same new prot.
+		 * Use the 1st entry's prot as the new pgt->same_prot.
+		 */
+		if (level == PG_LEVEL_4K)
+			pgt->same_prot = pte_pgprot(*entry);
+		else
+			pgt->same_prot = pmd_pgprot(*(pmd_t *)entry);
+
+		for (i = 0; i < PTRS_PER_PTE; i++, entry++) {
+			pgprot_t prot;
+
+			if (level == PG_LEVEL_4K)
+				prot = pte_pgprot(*entry);
+			else
+				prot = pmd_pgprot(*(pmd_t *)entry);
+
+			if (pgprot_val(prot) == pgprot_val(pgt->same_prot))
+				pgt->nr_same_prot++;
+		}
+	}
+
+	/*
+	 * If this split page table's entries all have the same
+	 * protection now, try to merge it. Note that for a PMD level
+	 * page table, if all entries are pointing to PTE page table,
+	 * no merge can be done.
+	 */
+	if (unlikely(pgt->nr_same_prot == PTRS_PER_PTE &&
+		     (pgprot_val(pgt->same_prot) & _PAGE_PRESENT) &&
+		     (level == PG_LEVEL_4K ||
+		      pgprot_val(pgt->same_prot) & _PAGE_PSE)))
+		merge_splitted_mapping(pgt, level);
+
+}
+
 /*
  * Set the new pmd in all the pgds we know about:
  */
-static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
+static void __set_pmd_pte(pte_t *kpte, int level, unsigned long address, pte_t pte)
 {
 	/* change init_mm */
-	set_pte_atomic(kpte, pte);
+	set_pte_adjust_nr_same_prot(kpte, level, pte);
 #ifdef CONFIG_X86_32
 	if (!SHARED_KERNEL_PMD) {
 		struct page *page;
@@ -739,12 +815,68 @@ static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
 			p4d = p4d_offset(pgd, address);
 			pud = pud_offset(p4d, address);
 			pmd = pmd_offset(pud, address);
-			set_pte_atomic((pte_t *)pmd, pte);
+			set_pte_adjust_nr_same_prot((pte_t *)pmd, level, pte);
 		}
 	}
 #endif
 }
 
+static void merge_splitted_mapping(struct page *pgt, int level)
+{
+	pte_t *kpte = page_address(pgt);
+	pgprot_t pte_prot, pmd_prot;
+	unsigned long address;
+	unsigned long pfn;
+	pte_t pte;
+	pud_t pud;
+
+	switch (level) {
+	case PG_LEVEL_4K:
+		pte_prot = pte_pgprot(*kpte);
+		pmd_prot = pgprot_4k_2_large(pte_prot);
+		pgprot_val(pmd_prot) |= _PAGE_PSE;
+		pfn = pte_pfn(*kpte);
+		pte = pfn_pte(pfn, pmd_prot);
+
+		/*
+		 * update upper level kpte.
+		 * Note that further merge can happen if all PMD table's
+		 * entries have the same protection bits after this change.
+		 */
+		address = (unsigned long)page_address(pfn_to_page(pfn));
+		__set_pmd_pte(pgt->upper_kpte, level + 1, address, pte);
+		break;
+	case PG_LEVEL_2M:
+		pfn = pmd_pfn(*(pmd_t *)kpte);
+		pmd_prot = pmd_pgprot(*(pmd_t *)kpte);
+		pud = pfn_pud(pfn, pmd_prot);
+		set_pud(pgt->upper_kpte, pud);
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		return;
+	}
+
+	/*
+	 * Current kernel did flush_tlb_all() when splitting a large page
+	 * inside pgd_lock because:
+	 * - an errata of Atom AAH41; as well as
+	 * - avoid another cpu simultaneously changing the just split
+	 *   large page's attr.
+	 * The first does not require a full tlb flush according to
+	 * commit 211b3d03c7400("x86: work around Fedora-11 x86-32 kernel
+	 * failures on Intel Atom CPUs") while the 2nd can be already
+	 * achieved by cpa_lock. commit c0a759abf5a68("x86/mm/cpa: Move
+	 * flush_tlb_all()") simplified the code by doing a full tlb flush
+	 * inside pgd_lock. For the same reason, I also did a full tlb
+	 * flush inside pgd_lock after doing a merge.
+	 */
+	flush_tlb_all();
+
+	__ClearPageSplitpgt(pgt);
+	__free_page(pgt);
+}
+
 static pgprot_t pgprot_clear_protnone_bits(pgprot_t prot)
 {
 	/*
@@ -901,9 +1033,10 @@ static int __should_split_large_page(pte_t *kpte, unsigned long address,
 
 	/* All checks passed. Update the large page mapping. */
 	new_pte = pfn_pte(old_pfn, new_prot);
-	__set_pmd_pte(kpte, address, new_pte);
+	__set_pmd_pte(kpte, level, address, new_pte);
 	cpa->flags |= CPA_FLUSHTLB;
 	cpa_inc_lp_preserved(level);
+
 	return 0;
 }
 
@@ -1023,6 +1156,11 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
 	for (i = 0; i < PTRS_PER_PTE; i++, pfn += pfninc, lpaddr += lpinc)
 		split_set_pte(cpa, pbase + i, pfn, ref_prot, lpaddr, lpinc);
 
+	__SetPageSplitpgt(base);
+	base->upper_kpte = kpte;
+	base->same_prot = ref_prot;
+	base->nr_same_prot = PTRS_PER_PTE;
+
 	if (virt_addr_valid(address)) {
 		unsigned long pfn = PFN_DOWN(__pa(address));
 
@@ -1037,7 +1175,7 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
 	 * pagetable protections, the actual ptes set above control the
 	 * primary protection behavior:
 	 */
-	__set_pmd_pte(kpte, address, mk_pte(base, __pgprot(_KERNPG_TABLE)));
+	__set_pmd_pte(kpte, level, address, mk_pte(base, __pgprot(_KERNPG_TABLE)));
 
 	/*
 	 * Do a global flush tlb after splitting the large page
@@ -1508,6 +1646,23 @@ static int __cpa_process_fault(struct cpa_data *cpa, unsigned long vaddr,
 	}
 }
 
+/*
+ * When debug_pagealloc_enabled():
+ * - direct map will not use large page mapping;
+ * - kernel highmap can still use large mapping.
+ * When !debug_pagealloc_enabled(): both direct map and kernel highmap
+ * can use large page mapping.
+ *
+ * When large page mapping is used, it can be split due to reasons
+ * like protection change and thus, it is also possible a merge can
+ * happen for that split small mapping page table page.
+ */
+static bool subject_to_merge(unsigned long addr)
+{
+	return !debug_pagealloc_enabled() ||
+		within(addr, (unsigned long)_text, _brk_end);
+}
+
 static int __change_page_attr(struct cpa_data *cpa, int primary)
 {
 	unsigned long address;
@@ -1526,10 +1681,23 @@ static int __change_page_attr(struct cpa_data *cpa, int primary)
 		return __cpa_process_fault(cpa, address, primary);
 
 	if (level == PG_LEVEL_4K) {
-		pte_t new_pte;
+		pte_t new_pte, *tmp;
 		pgprot_t new_prot = pte_pgprot(old_pte);
 		unsigned long pfn = pte_pfn(old_pte);
 
+		if (subject_to_merge(address)) {
+			spin_lock(&pgd_lock);
+			/*
+			 * Check for races, another CPU might have merged
+			 * this page up already.
+			 */
+			tmp = _lookup_address_cpa(cpa, address, &level);
+			if (tmp != kpte) {
+				spin_unlock(&pgd_lock);
+				goto repeat;
+			}
+		}
+
 		pgprot_val(new_prot) &= ~pgprot_val(cpa->mask_clr);
 		pgprot_val(new_prot) |= pgprot_val(cpa->mask_set);
 
@@ -1551,10 +1719,12 @@ static int __change_page_attr(struct cpa_data *cpa, int primary)
 		 * Do we really change anything ?
 		 */
 		if (pte_val(old_pte) != pte_val(new_pte)) {
-			set_pte_atomic(kpte, new_pte);
+			set_pte_adjust_nr_same_prot(kpte, level, new_pte);
 			cpa->flags |= CPA_FLUSHTLB;
 		}
 		cpa->numpages = 1;
+		if (subject_to_merge(address))
+			spin_unlock(&pgd_lock);
 		return 0;
 	}
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c29ab4c0cd5c..6124c575fdad 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -160,6 +160,12 @@ struct page {
 			spinlock_t ptl;
 #endif
 		};
+		struct {	/* split page table pages */
+			void *upper_kpte;	/* compound_head */
+			int nr_same_prot;
+			unsigned long _split_pt_pad;	/* mapping */
+			pgprot_t same_prot;
+		};
 		struct {	/* ZONE_DEVICE pages */
 			/** @pgmap: Points to the hosting device page map. */
 			struct dev_pagemap *pgmap;
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e66f7aa3191d..3fe395dd7dfc 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -942,6 +942,7 @@ static inline bool is_page_hwpoison(struct page *page)
 #define PG_offline	0x00000100
 #define PG_table	0x00000200
 #define PG_guard	0x00000400
+#define PG_splitpgt	0x00000800
 
 #define PageType(page, flag)						\
 	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
@@ -1012,6 +1013,11 @@ PAGE_TYPE_OPS(Table, table)
  */
 PAGE_TYPE_OPS(Guard, guard)
 
+/*
+ * Marks pages in use as split page tables
+ */
+PAGE_TYPE_OPS(Splitpgt, splitpgt)
+
 extern bool is_free_buddy_page(struct page *page);
 
 PAGEFLAG(Isolated, isolated, PF_ANY);
-- 
2.37.1

From nobody Sat Apr 11 18:33:10 2026
From: Aaron Lu
To: Dave Hansen, Rick Edgecombe
Cc: Song Liu, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [RFC PATCH 3/4] x86/mm/cpa: add merge event counter
Date: Mon, 8 Aug 2022 22:56:48 +0800
Message-Id: <20220808145649.2261258-4-aaron.lu@intel.com>
X-Mailer: git-send-email 2.37.1
In-Reply-To: <20220808145649.2261258-1-aaron.lu@intel.com>
References: <20220808145649.2261258-1-aaron.lu@intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"

Like the existing split event counters, this patch adds counters for merge
events.
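One way to watch the new counters from user space is to read them from
/proc/vmstat before and after a test run. The sketch below is only an
illustration: the helper names are made up for this example, and the
counter names are the strings this patch adds to vmstat_text[]. It relies
on the "name value" line format of /proc/vmstat.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Parse one "name value" line; true only on an exact name match. */
bool parse_vmstat_line(const char *line, const char *name,
		       unsigned long long *value)
{
	size_t len = strlen(name);

	if (strncmp(line, name, len) != 0 || line[len] != ' ')
		return false;
	return sscanf(line + len, "%llu", value) == 1;
}

/* Scan a vmstat-style stream for one counter (hypothetical helper). */
bool read_vmstat_counter(FILE *f, const char *name,
			 unsigned long long *value)
{
	char line[256];

	while (fgets(line, sizeof(line), f))
		if (parse_vmstat_line(line, name, value))
			return true;
	return false;
}

/* Sanity checks on made-up sample lines, not real measurements. */
void vmstat_self_check(void)
{
	unsigned long long v = 0;

	assert(parse_vmstat_line("direct_map_level1_merges 42\n",
				 "direct_map_level1_merges", &v));
	assert(v == 42);
	assert(!parse_vmstat_line("direct_map_level2_merges 7\n",
				  "direct_map_level1_merges", &v));
}
```

In practice one would pass fopen("/proc/vmstat", "r") to
read_vmstat_counter() and compare "direct_map_level1_merges" and
"direct_map_level2_merges" across a run. The lookup returns false on
kernels that do not carry this patch.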
Signed-off-by: Aaron Lu
---
 arch/x86/mm/pat/set_memory.c  | 19 +++++++++++++++++++
 include/linux/vm_event_item.h |  2 ++
 mm/vmstat.c                   |  2 ++
 3 files changed, 23 insertions(+)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index fea2c70ff37f..1be9aab42c79 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -105,6 +105,23 @@ static void split_page_count(int level)
 	direct_pages_count[level - 1] += PTRS_PER_PTE;
 }
 
+static void merge_page_count(int level)
+{
+	if (direct_pages_count[level] < PTRS_PER_PTE) {
+		WARN_ON_ONCE(1);
+		return;
+	}
+
+	direct_pages_count[level] -= PTRS_PER_PTE;
+	if (system_state == SYSTEM_RUNNING) {
+		if (level == PG_LEVEL_4K)
+			count_vm_event(DIRECT_MAP_LEVEL1_MERGE);
+		else if (level == PG_LEVEL_2M)
+			count_vm_event(DIRECT_MAP_LEVEL2_MERGE);
+	}
+	direct_pages_count[level + 1]++;
+}
+
 void arch_report_meminfo(struct seq_file *m)
 {
 	seq_printf(m, "DirectMap4k: %8lu kB\n",
@@ -875,6 +892,8 @@ static void merge_splitted_mapping(struct page *pgt, int level)
 
 	__ClearPageSplitpgt(pgt);
 	__free_page(pgt);
+
+	merge_page_count(level);
 }
 
 static pgprot_t pgprot_clear_protnone_bits(pgprot_t prot)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 404024486fa5..00a9a435af49 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -143,6 +143,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_X86
 		DIRECT_MAP_LEVEL2_SPLIT,
 		DIRECT_MAP_LEVEL3_SPLIT,
+		DIRECT_MAP_LEVEL1_MERGE,
+		DIRECT_MAP_LEVEL2_MERGE,
 #endif
 		NR_VM_EVENT_ITEMS
 };
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 373d2730fcf2..1a4287a4d614 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1403,6 +1403,8 @@ const char * const vmstat_text[] = {
 #ifdef CONFIG_X86
 	"direct_map_level2_splits",
 	"direct_map_level3_splits",
+	"direct_map_level1_merges",
+	"direct_map_level2_merges",
 #endif
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
 };
-- 
2.37.1

From nobody Sat Apr 11 18:33:10 2026
From: Aaron Lu
To: Dave Hansen, Rick Edgecombe
Cc: Song Liu, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [TEST NOT_FOR_MERGE 4/4] x86/mm/cpa: add a test interface to split direct map
Date: Mon, 8 Aug 2022 22:56:49 +0800
Message-Id: <20220808145649.2261258-5-aaron.lu@intel.com>
X-Mailer: git-send-email 2.37.1
In-Reply-To: <20220808145649.2261258-1-aaron.lu@intel.com>
References: <20220808145649.2261258-1-aaron.lu@intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"

To test this functionality, a debugfs interface is added:
/sys/kernel/debug/x86/split_mapping

There are three test modes.
mode 0: allocate $page_nr pages and set each page's protection first
to RO and X and then back to RW and NX. This is used to test multiple
CPUs dealing with different address ranges.
mode 1: allocate several pages and create $nr_cpu kthreads to
simultaneously change those pages' protection with a fixed pattern.
This is used to test multiple CPUs dealing with the same address range.
mode 2: same as mode 0 except that it uses alloc_pages() instead of
vmalloc(), because the vmalloc space is too small on x86_32/pae.

On an x86_64 VM, I started mode0.sh and mode1.sh at the same time:

mode0.sh:
mode=0
page_nr=200000
nr_cpu=16

function test_one()
{
	echo $mode $page_nr > /sys/kernel/debug/x86/split_mapping
}

while true; do
	for i in `seq $nr_cpu`; do
		test_one &
	done
	wait
done

mode1.sh:
mode=1
page_nr=1
echo $mode $page_nr > /sys/kernel/debug/x86/split_mapping

After 5 hours, no problem occurred with some millions of splits and merges.

For x86_32 and x86_pae, the mode 2 test was used and no problem was found
either.
Signed-off-by: Aaron Lu
---
 arch/x86/mm/pat/set_memory.c | 206 +++++++++++++++++++++++++++++++++++
 1 file changed, 206 insertions(+)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 1be9aab42c79..4deea4de73e7 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -20,6 +20,9 @@
 #include
 #include
 #include
+#include
+#include
+#include
 
 #include
 #include
@@ -2556,6 +2559,209 @@ int __init kernel_unmap_pages_in_pgd(pgd_t *pgd, unsigned long address,
 	return retval;
 }
 
+static int split_mapping_mode0_test(int page_nr)
+{
+	void **addr_buff;
+	void *addr;
+	int i, j;
+
+	addr_buff = kvmalloc(sizeof(void *) * page_nr, GFP_KERNEL);
+	if (!addr_buff) {
+		pr_err("addr_buff: no memory\n");
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < page_nr; i++) {
+		addr = vmalloc(PAGE_SIZE);
+		if (!addr) {
+			pr_err("no memory\n");
+			break;
+		}
+
+		set_memory_ro((unsigned long)addr, 1);
+		set_memory_x((unsigned long)addr, 1);
+
+		addr_buff[i] = addr;
+	}
+
+	for (j = 0; j < i; j++) {
+		set_memory_nx((unsigned long)addr_buff[j], 1);
+		set_memory_rw((unsigned long)addr_buff[j], 1);
+		vfree(addr_buff[j]);
+	}
+
+	kvfree(addr_buff);
+
+	return 0;
+}
+
+struct split_mapping_mode1_data {
+	unsigned long addr;
+	int page_nr;
+};
+
+static int split_mapping_set_prot(void *data)
+{
+	struct split_mapping_mode1_data *d = data;
+	unsigned long addr = d->addr;
+	int page_nr = d->page_nr;
+	int m;
+
+	m = get_random_int() % 100;
+	msleep(m);
+
+	while (!kthread_should_stop()) {
+		set_memory_ro(addr, page_nr);
+		set_memory_x(addr, page_nr);
+		set_memory_rw(addr, page_nr);
+		set_memory_nx(addr, page_nr);
+		cond_resched();
+	}
+
+	return 0;
+}
+
+static int split_mapping_mode1_test(int page_nr)
+{
+	int nr_kthreads = num_online_cpus();
+	struct split_mapping_mode1_data d;
+	struct task_struct **kthreads;
+	int i, j, ret;
+	void *addr;
+
+	addr = vmalloc(PAGE_SIZE * page_nr);
+	if (!addr)
+		return -ENOMEM;
+
+	kthreads = kmalloc(nr_kthreads * sizeof(struct task_struct *), GFP_KERNEL);
+	if (!kthreads) {
+		vfree(addr);
+		return -ENOMEM;
+	}
+
+	d.addr = (unsigned long)addr;
+	d.page_nr = page_nr;
+	for (i = 0; i < nr_kthreads; i++) {
+		kthreads[i] = kthread_run(split_mapping_set_prot, &d, "split_mappingd%d", i);
+		if (IS_ERR(kthreads[i])) {
+			for (j = 0; j < i; j++)
+				kthread_stop(kthreads[j]);
+			ret = PTR_ERR(kthreads[i]);
+			goto out;
+		}
+	}
+
+	while (1) {
+		if (signal_pending(current)) {
+			for (i = 0; i < nr_kthreads; i++)
+				kthread_stop(kthreads[i]);
+			ret = 0;
+			break;
+		}
+		msleep(1000);
+	}
+
+out:
+	kfree(kthreads);
+	vfree(addr);
+	return ret;
+}
+
+static int split_mapping_mode2_test(int page_nr)
+{
+	struct page *p, *t;
+	unsigned long addr;
+	int i;
+
+	LIST_HEAD(head);
+
+	for (i = 0; i < page_nr; i++) {
+		p = alloc_pages(GFP_KERNEL | GFP_DMA32, 0);
+		if (!p) {
+			pr_err("no memory\n");
+			break;
+		}
+
+		addr = (unsigned long)page_address(p);
+		BUG_ON(!addr);
+
+		set_memory_ro(addr, 1);
+		set_memory_x(addr, 1);
+
+		list_add(&p->lru, &head);
+	}
+
+	list_for_each_entry_safe(p, t, &head, lru) {
+		addr = (unsigned long)page_address(p);
+		set_memory_nx(addr, 1);
+		set_memory_rw(addr, 1);
+
+		list_del(&p->lru);
+		__free_page(p);
+	}
+
+	return 0;
+}
+static ssize_t split_mapping_write_file(struct file *file, const char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	unsigned int mode = 0, page_nr = 0;
+	char buffer[64];
+	int ret;
+
+	if (count > 64)
+		return -EINVAL;
+
+	if (copy_from_user(buffer, buf, count))
+		return -EFAULT;
+	sscanf(buffer, "%u %u", &mode, &page_nr);
+
+	/*
+	 * There are 3 test modes.
+	 * mode 0: each thread allocates $page_nr pages and set each page's
+	 *         protection first to RO and X and then back to RW and NX.
+	 *         This is used to test multiple CPUs dealing with different
+	 *         pages.
+	 * mode 1: allocate several pages and create $nr_cpu kthreads to
+	 *         simultaneously change those pages protection to a fixed
+	 *         pattern. This is used to test multiple CPUs dealing with
+	 *         some same page's protection.
+	 * mode 2: like mode 0 but directly use alloc_pages() because vmalloc
+	 *         area on x86_32 is too small, only 128M.
+	 */
+	if (mode > 2)
+		return -EINVAL;
+
+	if (page_nr == 0)
+		return -EINVAL;
+
+	if (mode == 0)
+		ret = split_mapping_mode0_test(page_nr);
+	else if (mode == 1)
+		ret = split_mapping_mode1_test(page_nr);
+	else
+		ret = split_mapping_mode2_test(page_nr);
+
+	return ret ? ret : count;
+}
+
+static const struct file_operations split_mapping_fops = {
+	.write = split_mapping_write_file,
+};
+
+static int __init split_mapping_init(void)
+{
+	struct dentry *d = debugfs_create_file("split_mapping", S_IWUSR, arch_debugfs_dir, NULL,
+			&split_mapping_fops);
+	if (IS_ERR(d)) {
+		pr_err("create split_mapping failed: %ld\n", PTR_ERR(d));
+		return PTR_ERR(d);
+	}
+
+	return 0;
+}
+late_initcall(split_mapping_init);
+
 /*
  * The testcases use internal knowledge of the implementation that shouldn't
  * be exposed to the rest of the kernel. Include these directly here.
-- 
2.37.1
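
The same_prot / nr_same_prot bookkeeping that patch 2/4 adds to
set_pte_adjust_nr_same_prot() can be modelled in user space. This is a
simplified sketch under stated assumptions, not the kernel implementation:
struct split_table and both functions are hypothetical names, each entry
holds protection bits only (no PFN), and the _PAGE_PRESENT and _PAGE_PSE
checks that the patch performs before merging are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_ENTRIES 512	/* PTRS_PER_PTE on x86_64 */

/* Hypothetical stand-in for the fields added to struct page. */
struct split_table {
	uint64_t entry[NR_ENTRIES];	/* protection bits only */
	uint64_t same_prot;		/* reference protection */
	int nr_same_prot;		/* entries equal to same_prot */
};

/* Mirrors the initialisation done right after a split. */
void table_init(struct split_table *t, uint64_t ref_prot)
{
	for (int i = 0; i < NR_ENTRIES; i++)
		t->entry[i] = ref_prot;
	t->same_prot = ref_prot;
	t->nr_same_prot = NR_ENTRIES;
}

/*
 * Update one entry and adjust the counter. Returns true when every
 * entry carries the same protection, i.e. when the table would become
 * a merge candidate in this simplified model.
 */
bool table_set_prot(struct split_table *t, int idx, uint64_t new_prot)
{
	uint64_t old_prot = t->entry[idx];

	t->entry[idx] = new_prot;

	if (t->same_prot != old_prot && t->same_prot == new_prot)
		t->nr_same_prot++;
	if (t->same_prot == old_prot && t->same_prot != new_prot)
		t->nr_same_prot--;

	/* No entry matches the reference any more: pick a new one. */
	if (t->nr_same_prot == 0) {
		t->same_prot = t->entry[0];
		for (int i = 0; i < NR_ENTRIES; i++)
			if (t->entry[i] == t->same_prot)
				t->nr_same_prot++;
	}

	return t->nr_same_prot == NR_ENTRIES;
}
```

With this model, changing one entry of a freshly initialised table to a
different protection drops nr_same_prot to NR_ENTRIES - 1, and changing it
back makes the table a merge candidate again. Changing all entries to a new
protection one by one triggers the recount on the last update, after which
same_prot is the new protection.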