clear_pages_rep(), clear_pages_erms() clear using string instructions.
While clearing extents of more than a single page, we can use these
more effectively by explicitly advertising the region-size to the
processor.
This can be used as a hint by the processor-uarch to optimize the
clearing (ex. to avoid polluting one or more levels of the data-cache.)
As a secondary benefit, string instructions are typically microcoded,
and so it's a good idea to amortize the cost of the decode across larger
regions.
Accordingly, clear_huge_page() now does huge-page clearing in three
parts: the neighbourhood of the faulting address, the left, and the
right region of the neighbourhood.
The local neighbourhood is cleared last to keep its cachelines hot.
Performance
==
Use mmap(MAP_HUGETLB) to demand fault a 128GB region (on the local
NUMA node):
Milan (EPYC 7J13, boost=1):
mm/clear_huge_page x86/clear_huge_page change
(GB/s) (GB/s)
pg-sz=2MB 14.55 19.29 +32.5%
pg-sz=1GB 19.34 49.60 +156.4%
Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
allocation, so we see a dropoff in cacheline-allocations for
pg-sz=1GB:
pg-sz=1GB:
-23,088,001,347 cycles # 3.487 GHz ( +- 0.08% ) (35.68%)
- 4,680,678,939 L1-dcache-loads # 706.831 M/sec ( +- 0.02% ) (35.74%)
- 2,150,395,280 L1-dcache-load-misses # 45.93% of all L1-dcache accesses ( +- 0.01% ) (35.74%)
+ 8,983,798,764 cycles # 3.489 GHz ( +- 0.05% ) (35.59%)
+ 18,294,725 L1-dcache-loads # 7.104 M/sec ( +- 18.88% ) (35.78%)
+ 6,677,565 L1-dcache-load-misses # 30.48% of all L1-dcache accesses ( +- 20.72% ) (35.78%)
That's not the case with pg-sz=2MB, where we perform better but the
number of cacheline allocations remain the same:
pg-sz=2MB:
-31,087,683,852 cycles # 3.494 GHz ( +- 0.17% ) (35.72%)
- 4,898,684,886 L1-dcache-loads # 550.627 M/sec ( +- 0.03% ) (35.71%)
- 2,161,434,236 L1-dcache-load-misses # 44.11% of all L1-dcache accesses ( +- 0.01% ) (35.71%)
+23,368,914,596 cycles # 3.480 GHz ( +- 0.27% ) (35.72%)
+ 4,481,808,430 L1-dcache-loads # 667.382 M/sec ( +- 0.03% ) (35.71%)
+ 2,170,453,309 L1-dcache-load-misses # 48.41% of all L1-dcache accesses ( +- 0.06% ) (35.71%)
Icelakex (Platinum 8358, no_turbo=0):
mm/clear_huge_page x86/clear_huge_page change
(GB/s) (GB/s)
pg-sz=2MB 9.19 12.94 +40.8%
pg-sz=1GB 9.36 12.97 +38.5%
For both page-sizes, Icelakex, behaves similarly to Milan pg-sz=2MB: we
see a drop in cycles but there's no drop in cacheline allocation.
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
arch/x86/mm/hugetlbpage.c | 54 +++++++++++++++++++++++++++++++++++++++
1 file changed, 54 insertions(+)
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 5804bbae4f01..0b9f7a6dad93 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -148,6 +148,60 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
return hugetlb_get_unmapped_area_topdown(file, addr, len,
pgoff, flags);
}
+
+#ifndef CONFIG_HIGHMEM
+static void clear_contig_region(struct page *page, unsigned int npages)
+{
+ clear_pages(page_address(page), npages);
+}
+
+/*
+ * clear_huge_page(): multi-page clearing variant of clear_huge_page().
+ *
+ * Taking inspiration from the common code variant, we split the zeroing in
+ * three parts: left of the fault, right of the fault, and up to 5 pages
+ * in the immediate neighbourhood of the target page.
+ *
+ * Cleared in that order to keep cache lines of the target region hot.
+ *
+ * For gigantic pages, there is no expectation of cache locality so we do a
+ * straight zeroing.
+ */
+void clear_huge_page(struct page *page,
+ unsigned long addr_hint, unsigned int pages_per_huge_page)
+{
+ unsigned long addr = addr_hint &
+ ~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
+ const long pgidx = (addr_hint - addr) / PAGE_SIZE;
+ const int first_pg = 0, last_pg = pages_per_huge_page - 1;
+ const int width = 2; /* pages cleared last on either side */
+ int sidx[3], eidx[3];
+ int i, n;
+
+ if (pages_per_huge_page > MAX_ORDER_NR_PAGES)
+ return clear_contig_region(page, pages_per_huge_page);
+
+ /*
+ * Neighbourhood of the fault. Cleared at the end to ensure
+ * it sticks around in the cache.
+ */
+ n = 2;
+ sidx[n] = (pgidx - width) < first_pg ? first_pg : (pgidx - width);
+ eidx[n] = (pgidx + width) > last_pg ? last_pg : (pgidx + width);
+
+ sidx[0] = first_pg; /* Region to the left of the fault */
+ eidx[0] = sidx[n] - 1;
+
+ sidx[1] = eidx[n] + 1; /* Region to the right of the fault */
+ eidx[1] = last_pg;
+
+ for (i = 0; i <= 2; i++) {
+ if (eidx[i] >= sidx[i])
+ clear_contig_region(page + sidx[i],
+ eidx[i] - sidx[i] + 1);
+ }
+}
+#endif /* CONFIG_HIGHMEM */
#endif /* CONFIG_HUGETLB_PAGE */
#ifdef CONFIG_X86_64
--
2.31.1
On Wed, Aug 30, 2023 at 11:49:55AM -0700, Ankur Arora wrote:
> +#ifndef CONFIG_HIGHMEM
I'm thinking this wants to be: #ifdef CONFIG_X86_64. All the previous
stuff was 64bit specific.
> +static void clear_contig_region(struct page *page, unsigned int npages)
> +{
> + clear_pages(page_address(page), npages);
> +}
I'm not sure about the naming of this helper -- but whatever.
> +
> +/*
> + * clear_huge_page(): multi-page clearing variant of clear_huge_page().
> + *
> + * Taking inspiration from the common code variant, we split the zeroing in
> + * three parts: left of the fault, right of the fault, and up to 5 pages
> + * in the immediate neighbourhood of the target page.
> + *
> + * Cleared in that order to keep cache lines of the target region hot.
> + *
> + * For gigantic pages, there is no expectation of cache locality so we do a
> + * straight zeroing.
> + */
> +void clear_huge_page(struct page *page,
> + unsigned long addr_hint, unsigned int pages_per_huge_page)
> +{
> + unsigned long addr = addr_hint &
> + ~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
> + const long pgidx = (addr_hint - addr) / PAGE_SIZE;
> + const int first_pg = 0, last_pg = pages_per_huge_page - 1;
> + const int width = 2; /* pages cleared last on either side */
> + int sidx[3], eidx[3];
> + int i, n;
> +
> + if (pages_per_huge_page > MAX_ORDER_NR_PAGES)
> + return clear_contig_region(page, pages_per_huge_page);
> +
> + /*
> + * Neighbourhood of the fault. Cleared at the end to ensure
> + * it sticks around in the cache.
> + */
> + n = 2;
> + sidx[n] = (pgidx - width) < first_pg ? first_pg : (pgidx - width);
> + eidx[n] = (pgidx + width) > last_pg ? last_pg : (pgidx + width);
> +
> + sidx[0] = first_pg; /* Region to the left of the fault */
> + eidx[0] = sidx[n] - 1;
> +
> + sidx[1] = eidx[n] + 1; /* Region to the right of the fault */
> + eidx[1] = last_pg;
> +
> + for (i = 0; i <= 2; i++) {
> + if (eidx[i] >= sidx[i])
> + clear_contig_region(page + sidx[i],
> + eidx[i] - sidx[i] + 1);
Since the if has a multi-line statement it needs { } per coding style.
> + }
> +}
> +#endif /* CONFIG_HIGHMEM */
> #endif /* CONFIG_HUGETLB_PAGE */
>
> #ifdef CONFIG_X86_64
> --
> 2.31.1
>
Hi Ankur,
kernel test robot noticed the following build errors:
[auto build test ERROR on tip/x86/core]
[also build test ERROR on akpm-mm/mm-everything tip/x86/mm tip/sched/core tip/core/entry linus/master v6.5 next-20230831]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Ankur-Arora/mm-clear_huge_page-allow-arch-override-for-clear_huge_page/20230831-030156
base: tip/x86/core
patch link: https://lore.kernel.org/r/20230830184958.2333078-7-ankur.a.arora%40oracle.com
patch subject: [PATCH v2 6/9] x86/clear_huge_page: multi-page clearing
config: i386-buildonly-randconfig-006-20230831 (https://download.01.org/0day-ci/archive/20230901/202309010241.u7o4IETu-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230901/202309010241.u7o4IETu-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202309010241.u7o4IETu-lkp@intel.com/
All errors (new ones prefixed by >>):
arch/x86/mm/hugetlbpage.c: In function 'clear_contig_region':
>> arch/x86/mm/hugetlbpage.c:155:9: error: implicit declaration of function 'clear_pages'; did you mean 'clear_page'? [-Werror=implicit-function-declaration]
155 | clear_pages(page_address(page), npages);
| ^~~~~~~~~~~
| clear_page
cc1: some warnings being treated as errors
vim +155 arch/x86/mm/hugetlbpage.c
151
152 #ifndef CONFIG_HIGHMEM
153 static void clear_contig_region(struct page *page, unsigned int npages)
154 {
> 155 clear_pages(page_address(page), npages);
156 }
157
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
© 2016 - 2025 Red Hat, Inc.