[PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization

alexlzhu@fb.com posted 1 patch 6 days, 5 hours ago
Documentation/admin-guide/mm/transhuge.rst |   5 +
Documentation/filesystems/proc.rst         |  30 ++++
include/linux/huge_mm.h                    |   2 +
mm/huge_memory.c                           | 167 +++++++++++++++++++++
4 files changed, 204 insertions(+)
[PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Posted by alexlzhu@fb.com 6 days, 5 hours ago
From: Alexander Zhu <alexlzhu@fb.com>

THPs have historically been enabled on a per application basis due to
performance increase or decrease depending on how the particular
application uses physical memory. When THPs are heavily utilized,
application performance improves due to fewer TLB cache misses.
It has long been suspected that performance regressions when THP
is enabled happens due to heavily underutilized anonymous THPs.

Previously there was no way to track how much of a THP is
actually being used. With this change, we seek to gain visibility
into the utilization of THPs in order to make more intelligent
decisions regarding paging.

This change introduces a tool that scans through all of physical
memory for anonymous THPs and groups them into buckets based
on utilization. It also includes an interface under
/proc/thp_utilization.

Utilization of a THP is defined as the percentage of nonzero
pages in the THP. The worker thread will scan through all
of physical memory and obtain utilization of all anonymous
THPs. It will gather this information by periodically scanning
through all of physical memory for anonymous THPs, group them
into buckets based on utilization, and report utilization
information through /proc/thp_utilization.

Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
---
 Documentation/admin-guide/mm/transhuge.rst |   5 +
 Documentation/filesystems/proc.rst         |  30 ++++
 include/linux/huge_mm.h                    |   2 +
 mm/huge_memory.c                           | 167 +++++++++++++++++++++
 4 files changed, 204 insertions(+)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index c9c37f16eef8..46ebdd8ea0d1 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -297,6 +297,11 @@ To identify what applications are mapping file transparent huge pages, it
 is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields
 for each mapping.
 
+The utilization of transparent hugepages can be viewed by reading
+``/proc/thp_utilization``. This shows the number of THPs per
+utilization bucket and the number of zero pages in each bucket. The
+last two lines show the time and the duration of the last scan.
+
 Note that reading the smaps file is expensive and reading it
 frequently will incur overhead.
 
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 1bc91fb8c321..bb49e97cc469 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -720,6 +720,7 @@ files are there, and which are missing.
  sys          See chapter 2
  sysvipc      Info of SysVIPC Resources (msg, sem, shm)		(2.4)
  tty 	      Info of tty drivers
+ thp_utilization 	      Info on thp utilization
  uptime       Wall clock since boot, combined idle time of all cpus
  version      Kernel version
  video 	      bttv info of video resources			(2.4)
@@ -1158,6 +1159,35 @@ DirectMap4k, DirectMap2M, DirectMap1G
               Breakdown of page table sizes used in the kernel's
               identity mapping of RAM
 
+thp_utilization
+~~~~~~~~~~~~~~~
+
+Provides information on the utilization of Transparent Hugepages. The
+utilization of a THP is defined as the ratio of non zero filled 4kb
+pages to the total number of pages in a THP. The buckets are labelled
+by the range of total utilized 4kb pages with one line per utilization
+bucket. Each line contains the total number of THPs in that bucket
+and the total number of zero filled 4kb pages summed over all THPs
+in that bucket. The last two lines show the timestamp and duration
+respectively of the most recent scan over all of physical memory.
+
+::
+
+    > cat /proc/thp_utilization
+    Utilized[0-50]: 435 217667
+    Utilized[51-101]: 58 25394
+    Utilized[102-152]: 51 19605
+    Utilized[153-203]: 43 14169
+    Utilized[204-255]: 54 15300
+    Utilized[256-306]: 55 12537
+    Utilized[307-357]: 49 8675
+    Utilized[358-408]: 67 8601
+    Utilized[409-459]: 82 6259
+    Utilized[460-512]: 189 2613
+    Last Scan Time: 1202.83
+    Last Scan Duration: 70.72
+
+
 vmallocinfo
 ~~~~~~~~~~~
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index de29821231c9..0ece9d53ab48 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -196,6 +196,8 @@ bool transparent_hugepage_active(struct vm_area_struct *vma);
 unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 		unsigned long len, unsigned long pgoff, unsigned long flags);
 
+int thp_number_utilized_pages(struct page *page);
+
 void prep_transhuge_page(struct page *page);
 void free_transhuge_page(struct page *page);
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 834f288b3769..a393fa491632 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -30,6 +30,7 @@
 #include <linux/hashtable.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/page_idle.h>
+#include <linux/proc_fs.h>
 #include <linux/shmem_fs.h>
 #include <linux/oom.h>
 #include <linux/numa.h>
@@ -44,6 +45,16 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/thp.h>
 
+/*
+ * The number of utilization buckets THPs will be grouped in
+ * under /proc/thp_utilization.
+ */
+#define THP_UTIL_BUCKET_NR 10
+/*
+ * The number of addresses to scan through on each periodic
+ * run of the scanner that generates /proc/thp_utilization.
+ */
+#define THP_UTIL_SCAN_SIZE 256
 /*
  * By default, transparent hugepage support is disabled in order to avoid
  * risking an increased memory footprint for applications that are not
@@ -69,6 +80,52 @@ static atomic_t huge_zero_refcount;
 struct page *huge_zero_page __read_mostly;
 unsigned long huge_zero_pfn __read_mostly = ~0UL;
 
+static void thp_utilization_workfn(struct work_struct *work);
+static DECLARE_DELAYED_WORK(thp_utilization_work, thp_utilization_workfn);
+
+struct thp_scan_info_bucket {
+	int nr_thps;
+	int nr_zero_pages;
+};
+
+struct thp_scan_info {
+	struct thp_scan_info_bucket buckets[THP_UTIL_BUCKET_NR];
+	struct zone *scan_zone;
+	struct timespec64 last_scan_duration;
+	struct timespec64 last_scan_time;
+	unsigned long pfn;
+};
+
+static struct thp_scan_info thp_scan_proc;
+static struct thp_scan_info thp_scan;
+
+static int thp_utilization_show(struct seq_file *seqf, void *pos)
+{
+	int i;
+	int start;
+	int end;
+
+	for (i = 0; i < THP_UTIL_BUCKET_NR; i++) {
+		start = i * HPAGE_PMD_NR / THP_UTIL_BUCKET_NR;
+		end = (i + 1 == THP_UTIL_BUCKET_NR)
+			   ? HPAGE_PMD_NR
+			   : ((i + 1) * HPAGE_PMD_NR / THP_UTIL_BUCKET_NR - 1);
+		/* The last bucket will need to contain 100 */
+		seq_printf(seqf, "Utilized[%d-%d]: %d %d\n", start, end,
+			   thp_scan_proc.buckets[i].nr_thps,
+			   thp_scan_proc.buckets[i].nr_zero_pages);
+	}
+	seq_printf(seqf, "Last Scan Time: %lu.%02lu\n",
+		   (unsigned long)thp_scan_proc.last_scan_time.tv_sec,
+		   (thp_scan_proc.last_scan_time.tv_nsec / (NSEC_PER_SEC / 100)));
+
+	seq_printf(seqf, "Last Scan Duration: %lu.%02lu\n",
+		   (unsigned long)thp_scan_proc.last_scan_duration.tv_sec,
+		   (thp_scan_proc.last_scan_duration.tv_nsec / (NSEC_PER_SEC / 100)));
+
+	return 0;
+}
+
 bool transparent_hugepage_active(struct vm_area_struct *vma)
 {
 	/* The addr is used to check if the vma size fits */
@@ -423,6 +480,9 @@ static int __init hugepage_init(void)
 	if (err)
 		goto err_slab;
 
+	proc_create_single("thp_utilization", 0, NULL, &thp_utilization_show);
+	schedule_delayed_work(&thp_utilization_work, HZ);
+
 	err = register_shrinker(&huge_zero_page_shrinker);
 	if (err)
 		goto err_hzp_shrinker;
@@ -537,6 +597,12 @@ static inline bool is_transparent_hugepage(struct page *page)
 	       page[1].compound_dtor == TRANSHUGE_PAGE_DTOR;
 }
 
+bool is_anon_transparent_hugepage(struct page *page)
+{
+	return PageAnon(page) && is_transparent_hugepage(page);
+}
+EXPORT_SYMBOL_GPL(is_anon_transparent_hugepage);
+
 static unsigned long __thp_get_unmapped_area(struct file *filp,
 		unsigned long addr, unsigned long len,
 		loff_t off, unsigned long flags, unsigned long size)
@@ -587,6 +653,34 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 }
 EXPORT_SYMBOL_GPL(thp_get_unmapped_area);
 
+int thp_number_utilized_pages(struct page *page)
+{
+	unsigned long page_index, page_offset, value;
+	int thp_nr_utilized_pages = HPAGE_PMD_NR;
+	int step_size = sizeof(unsigned long);
+	bool is_all_zeroes;
+	void *kaddr;
+
+	if (!page || !is_anon_transparent_hugepage(page))
+		return -1;
+
+	kaddr = kmap_local_page(page);
+	for (page_index = 0; page_index < HPAGE_PMD_NR; page_index++) {
+		is_all_zeroes = true;
+		for (page_offset = 0; page_offset < PAGE_SIZE; page_offset += step_size) {
+			value = *(unsigned long *)(kaddr + page_index * PAGE_SIZE + page_offset);
+			if (value != 0) {
+				is_all_zeroes = false;
+				break;
+			}
+		}
+		if (is_all_zeroes)
+			thp_nr_utilized_pages--;
+	}
+	kunmap_local(kaddr);
+	return thp_nr_utilized_pages;
+}
+
 static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 			struct page *page, gfp_t gfp)
 {
@@ -3167,3 +3261,76 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 	trace_remove_migration_pmd(address, pmd_val(pmde));
 }
 #endif
+
+static void thp_scan_next_zone(void)
+{
+	struct timespec64 current_time;
+	int i;
+	bool update_proc;
+
+	thp_scan.scan_zone = next_zone(thp_scan.scan_zone);
+	update_proc = !thp_scan.scan_zone;
+	thp_scan.scan_zone = update_proc ? (first_online_pgdat())->node_zones
+			: thp_scan.scan_zone;
+	thp_scan.pfn = (thp_scan.scan_zone->zone_start_pfn + HPAGE_PMD_NR - 1)
+			& ~(HPAGE_PMD_SIZE - 1);
+	if (!update_proc)
+		return;
+
+	ktime_get_ts64(&current_time);
+	thp_scan_proc.last_scan_duration = timespec64_sub(current_time,
+							  thp_scan_proc.last_scan_time);
+	thp_scan_proc.last_scan_time = current_time;
+
+	for (i = 0; i < THP_UTIL_BUCKET_NR; i++) {
+		thp_scan_proc.buckets[i].nr_thps = thp_scan.buckets[i].nr_thps;
+		thp_scan_proc.buckets[i].nr_zero_pages = thp_scan.buckets[i].nr_zero_pages;
+		thp_scan.buckets[i].nr_thps = 0;
+		thp_scan.buckets[i].nr_zero_pages = 0;
+	}
+}
+
+static void thp_util_scan(unsigned long pfn_end)
+{
+	struct page *page, *head = NULL;
+	int bucket, num_utilized_pages, current_pfn;
+	int i;
+
+	for (i = 0; i < THP_UTIL_SCAN_SIZE; i++) {
+		current_pfn = thp_scan.pfn;
+		thp_scan.pfn += HPAGE_PMD_NR;
+		if (current_pfn >= pfn_end)
+			return;
+		if (!pfn_valid(current_pfn))
+			continue;
+
+		page = pfn_to_page(current_pfn);
+		num_utilized_pages = thp_number_utilized_pages(page);
+		if (num_utilized_pages < 0)
+			continue;
+
+		head = compound_head(page);
+		bucket = num_utilized_pages * THP_UTIL_BUCKET_NR / HPAGE_PMD_NR;
+		bucket = min(bucket, THP_UTIL_BUCKET_NR - 1);
+		thp_scan.buckets[bucket].nr_thps++;
+		thp_scan.buckets[bucket].nr_zero_pages += (HPAGE_PMD_NR - num_utilized_pages);
+	}
+}
+
+static void thp_utilization_workfn(struct work_struct *work)
+{
+	unsigned long pfn_end;
+
+	if (!thp_scan.scan_zone)
+		thp_scan.scan_zone = (first_online_pgdat())->node_zones;
+
+	pfn_end = (thp_scan.scan_zone->zone_start_pfn +
+			thp_scan.scan_zone->spanned_pages + HPAGE_PMD_NR - 1)
+			& ~(HPAGE_PMD_SIZE - 1);
+	if (!populated_zone(thp_scan.scan_zone) || thp_scan.pfn >= pfn_end)
+		thp_scan_next_zone();
+	else
+		thp_util_scan(pfn_end);
+
+	schedule_delayed_work(&thp_utilization_work, HZ);
+}
-- 
2.30.2
Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Posted by Yang Shi 3 days, 6 hours ago
On Fri, Aug 5, 2022 at 11:40 AM <alexlzhu@fb.com> wrote:
>
> From: Alexander Zhu <alexlzhu@fb.com>
>
> THPs have historically been enabled on a per application basis due to
> performance increase or decrease depending on how the particular
> application uses physical memory. When THPs are heavily utilized,
> application performance improves due to fewer TLB cache misses.
> It has long been suspected that performance regressions when THP
> is enabled happens due to heavily underutilized anonymous THPs.
>
> Previously there was no way to track how much of a THP is
> actually being used. With this change, we seek to gain visibility
> into the utilization of THPs in order to make more intelligent
> decisions regarding paging.
>
> This change introduces a tool that scans through all of physical
> memory for anonymous THPs and groups them into buckets based
> on utilization. It also includes an interface under
> /proc/thp_utilization.
>
> Utilization of a THP is defined as the percentage of nonzero
> pages in the THP. The worker thread will scan through all
> of physical memory and obtain utilization of all anonymous
> THPs. It will gather this information by periodically scanning
> through all of physical memory for anonymous THPs, group them
> into buckets based on utilization, and report utilization
> information through /proc/thp_utilization.
>
> Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
> ---
>  Documentation/admin-guide/mm/transhuge.rst |   5 +
>  Documentation/filesystems/proc.rst         |  30 ++++
>  include/linux/huge_mm.h                    |   2 +
>  mm/huge_memory.c                           | 167 +++++++++++++++++++++
>  4 files changed, 204 insertions(+)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index c9c37f16eef8..46ebdd8ea0d1 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -297,6 +297,11 @@ To identify what applications are mapping file transparent huge pages, it
>  is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields
>  for each mapping.
>
> +The utilization of transparent hugepages can be viewed by reading
> +``/proc/thp_utilization``. This shows the number of THPs per
> +utilization bucket and the number of zero pages in each bucket. The
> +last two lines show the time and the duration of the last scan.
> +
>  Note that reading the smaps file is expensive and reading it
>  frequently will incur overhead.
>
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 1bc91fb8c321..bb49e97cc469 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -720,6 +720,7 @@ files are there, and which are missing.
>   sys          See chapter 2
>   sysvipc      Info of SysVIPC Resources (msg, sem, shm)                (2.4)
>   tty         Info of tty drivers
> + thp_utilization             Info on thp utilization
>   uptime       Wall clock since boot, combined idle time of all cpus
>   version      Kernel version
>   video               bttv info of video resources                      (2.4)
> @@ -1158,6 +1159,35 @@ DirectMap4k, DirectMap2M, DirectMap1G
>                Breakdown of page table sizes used in the kernel's
>                identity mapping of RAM
>
> +thp_utilization
> +~~~~~~~~~~~~~~~
> +
> +Provides information on the utilization of Transparent Hugepages. The
> +utilization of a THP is defined as the ratio of non zero filled 4kb
> +pages to the total number of pages in a THP. The buckets are labelled
> +by the range of total utilized 4kb pages with one line per utilization
> +bucket. Each line contains the total number of THPs in that bucket
> +and the total number of zero filled 4kb pages summed over all THPs
> +in that bucket. The last two lines show the timestamp and duration
> +respectively of the most recent scan over all of physical memory.
> +
> +::
> +
> +    > cat /proc/thp_utilization
> +    Utilized[0-50]: 435 217667
> +    Utilized[51-101]: 58 25394
> +    Utilized[102-152]: 51 19605
> +    Utilized[153-203]: 43 14169
> +    Utilized[204-255]: 54 15300
> +    Utilized[256-306]: 55 12537
> +    Utilized[307-357]: 49 8675
> +    Utilized[358-408]: 67 8601
> +    Utilized[409-459]: 82 6259
> +    Utilized[460-512]: 189 2613
> +    Last Scan Time: 1202.83
> +    Last Scan Duration: 70.72

I'm not sure how useful the system level THP utilization is. IMHO the
THP utilization is quite application specific. So the per task stats
might be more useful than system level IMHO.

> +
> +
>  vmallocinfo
>  ~~~~~~~~~~~
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index de29821231c9..0ece9d53ab48 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -196,6 +196,8 @@ bool transparent_hugepage_active(struct vm_area_struct *vma);
>  unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
>                 unsigned long len, unsigned long pgoff, unsigned long flags);
>
> +int thp_number_utilized_pages(struct page *page);
> +
>  void prep_transhuge_page(struct page *page);
>  void free_transhuge_page(struct page *page);
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 834f288b3769..a393fa491632 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -30,6 +30,7 @@
>  #include <linux/hashtable.h>
>  #include <linux/userfaultfd_k.h>
>  #include <linux/page_idle.h>
> +#include <linux/proc_fs.h>
>  #include <linux/shmem_fs.h>
>  #include <linux/oom.h>
>  #include <linux/numa.h>
> @@ -44,6 +45,16 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/thp.h>
>
> +/*
> + * The number of utilization buckets THPs will be grouped in
> + * under /proc/thp_utilization.
> + */
> +#define THP_UTIL_BUCKET_NR 10
> +/*
> + * The number of addresses to scan through on each periodic
> + * run of the scanner that generates /proc/thp_utilization.
> + */
> +#define THP_UTIL_SCAN_SIZE 256
>  /*
>   * By default, transparent hugepage support is disabled in order to avoid
>   * risking an increased memory footprint for applications that are not
> @@ -69,6 +80,52 @@ static atomic_t huge_zero_refcount;
>  struct page *huge_zero_page __read_mostly;
>  unsigned long huge_zero_pfn __read_mostly = ~0UL;
>
> +static void thp_utilization_workfn(struct work_struct *work);
> +static DECLARE_DELAYED_WORK(thp_utilization_work, thp_utilization_workfn);
> +
> +struct thp_scan_info_bucket {
> +       int nr_thps;
> +       int nr_zero_pages;
> +};
> +
> +struct thp_scan_info {
> +       struct thp_scan_info_bucket buckets[THP_UTIL_BUCKET_NR];
> +       struct zone *scan_zone;
> +       struct timespec64 last_scan_duration;
> +       struct timespec64 last_scan_time;
> +       unsigned long pfn;
> +};
> +
> +static struct thp_scan_info thp_scan_proc;
> +static struct thp_scan_info thp_scan;
> +
> +static int thp_utilization_show(struct seq_file *seqf, void *pos)
> +{
> +       int i;
> +       int start;
> +       int end;
> +
> +       for (i = 0; i < THP_UTIL_BUCKET_NR; i++) {
> +               start = i * HPAGE_PMD_NR / THP_UTIL_BUCKET_NR;
> +               end = (i + 1 == THP_UTIL_BUCKET_NR)
> +                          ? HPAGE_PMD_NR
> +                          : ((i + 1) * HPAGE_PMD_NR / THP_UTIL_BUCKET_NR - 1);
> +               /* The last bucket will need to contain 100 */
> +               seq_printf(seqf, "Utilized[%d-%d]: %d %d\n", start, end,
> +                          thp_scan_proc.buckets[i].nr_thps,
> +                          thp_scan_proc.buckets[i].nr_zero_pages);
> +       }
> +       seq_printf(seqf, "Last Scan Time: %lu.%02lu\n",
> +                  (unsigned long)thp_scan_proc.last_scan_time.tv_sec,
> +                  (thp_scan_proc.last_scan_time.tv_nsec / (NSEC_PER_SEC / 100)));
> +
> +       seq_printf(seqf, "Last Scan Duration: %lu.%02lu\n",
> +                  (unsigned long)thp_scan_proc.last_scan_duration.tv_sec,
> +                  (thp_scan_proc.last_scan_duration.tv_nsec / (NSEC_PER_SEC / 100)));
> +
> +       return 0;
> +}
> +
>  bool transparent_hugepage_active(struct vm_area_struct *vma)
>  {
>         /* The addr is used to check if the vma size fits */
> @@ -423,6 +480,9 @@ static int __init hugepage_init(void)
>         if (err)
>                 goto err_slab;
>
> +       proc_create_single("thp_utilization", 0, NULL, &thp_utilization_show);
> +       schedule_delayed_work(&thp_utilization_work, HZ);
> +
>         err = register_shrinker(&huge_zero_page_shrinker);
>         if (err)
>                 goto err_hzp_shrinker;
> @@ -537,6 +597,12 @@ static inline bool is_transparent_hugepage(struct page *page)
>                page[1].compound_dtor == TRANSHUGE_PAGE_DTOR;
>  }
>
> +bool is_anon_transparent_hugepage(struct page *page)
> +{
> +       return PageAnon(page) && is_transparent_hugepage(page);
> +}
> +EXPORT_SYMBOL_GPL(is_anon_transparent_hugepage);
> +
>  static unsigned long __thp_get_unmapped_area(struct file *filp,
>                 unsigned long addr, unsigned long len,
>                 loff_t off, unsigned long flags, unsigned long size)
> @@ -587,6 +653,34 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
>  }
>  EXPORT_SYMBOL_GPL(thp_get_unmapped_area);
>
> +int thp_number_utilized_pages(struct page *page)
> +{
> +       unsigned long page_index, page_offset, value;
> +       int thp_nr_utilized_pages = HPAGE_PMD_NR;
> +       int step_size = sizeof(unsigned long);
> +       bool is_all_zeroes;
> +       void *kaddr;
> +
> +       if (!page || !is_anon_transparent_hugepage(page))
> +               return -1;
> +
> +       kaddr = kmap_local_page(page);
> +       for (page_index = 0; page_index < HPAGE_PMD_NR; page_index++) {
> +               is_all_zeroes = true;
> +               for (page_offset = 0; page_offset < PAGE_SIZE; page_offset += step_size) {
> +                       value = *(unsigned long *)(kaddr + page_index * PAGE_SIZE + page_offset);
> +                       if (value != 0) {
> +                               is_all_zeroes = false;
> +                               break;
> +                       }
> +               }
> +               if (is_all_zeroes)
> +                       thp_nr_utilized_pages--;
> +       }
> +       kunmap_local(kaddr);
> +       return thp_nr_utilized_pages;
> +}
> +
>  static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>                         struct page *page, gfp_t gfp)
>  {
> @@ -3167,3 +3261,76 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
>         trace_remove_migration_pmd(address, pmd_val(pmde));
>  }
>  #endif
> +
> +static void thp_scan_next_zone(void)
> +{
> +       struct timespec64 current_time;
> +       int i;
> +       bool update_proc;
> +
> +       thp_scan.scan_zone = next_zone(thp_scan.scan_zone);
> +       update_proc = !thp_scan.scan_zone;
> +       thp_scan.scan_zone = update_proc ? (first_online_pgdat())->node_zones
> +                       : thp_scan.scan_zone;
> +       thp_scan.pfn = (thp_scan.scan_zone->zone_start_pfn + HPAGE_PMD_NR - 1)
> +                       & ~(HPAGE_PMD_SIZE - 1);
> +       if (!update_proc)
> +               return;
> +
> +       ktime_get_ts64(&current_time);
> +       thp_scan_proc.last_scan_duration = timespec64_sub(current_time,
> +                                                         thp_scan_proc.last_scan_time);
> +       thp_scan_proc.last_scan_time = current_time;
> +
> +       for (i = 0; i < THP_UTIL_BUCKET_NR; i++) {
> +               thp_scan_proc.buckets[i].nr_thps = thp_scan.buckets[i].nr_thps;
> +               thp_scan_proc.buckets[i].nr_zero_pages = thp_scan.buckets[i].nr_zero_pages;
> +               thp_scan.buckets[i].nr_thps = 0;
> +               thp_scan.buckets[i].nr_zero_pages = 0;
> +       }
> +}
> +
> +static void thp_util_scan(unsigned long pfn_end)
> +{
> +       struct page *page, *head = NULL;
> +       int bucket, num_utilized_pages, current_pfn;
> +       int i;
> +
> +       for (i = 0; i < THP_UTIL_SCAN_SIZE; i++) {
> +               current_pfn = thp_scan.pfn;
> +               thp_scan.pfn += HPAGE_PMD_NR;
> +               if (current_pfn >= pfn_end)
> +                       return;
> +               if (!pfn_valid(current_pfn))
> +                       continue;
> +
> +               page = pfn_to_page(current_pfn);
> +               num_utilized_pages = thp_number_utilized_pages(page);
> +               if (num_utilized_pages < 0)
> +                       continue;
> +
> +               head = compound_head(page);
> +               bucket = num_utilized_pages * THP_UTIL_BUCKET_NR / HPAGE_PMD_NR;
> +               bucket = min(bucket, THP_UTIL_BUCKET_NR - 1);
> +               thp_scan.buckets[bucket].nr_thps++;
> +               thp_scan.buckets[bucket].nr_zero_pages += (HPAGE_PMD_NR - num_utilized_pages);
> +       }
> +}
> +
> +static void thp_utilization_workfn(struct work_struct *work)
> +{
> +       unsigned long pfn_end;
> +
> +       if (!thp_scan.scan_zone)
> +               thp_scan.scan_zone = (first_online_pgdat())->node_zones;
> +
> +       pfn_end = (thp_scan.scan_zone->zone_start_pfn +
> +                       thp_scan.scan_zone->spanned_pages + HPAGE_PMD_NR - 1)
> +                       & ~(HPAGE_PMD_SIZE - 1);
> +       if (!populated_zone(thp_scan.scan_zone) || thp_scan.pfn >= pfn_end)
> +               thp_scan_next_zone();
> +       else
> +               thp_util_scan(pfn_end);
> +
> +       schedule_delayed_work(&thp_utilization_work, HZ);
> +}
> --
> 2.30.2
>
>
Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Posted by kernel test robot 4 days, 17 hours ago
Hi,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v5.19]
[cannot apply to akpm-mm/mm-everything next-20220805]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/alexlzhu-fb-com/mm-add-thp_utilization-metrics-to-proc-thp_utilization/20220806-024150
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 9e2f40233670c70c25e0681cb66d50d1e2742829
config: i386-randconfig-a002 (https://download.01.org/0day-ci/archive/20220807/202208071415.Aav6LUtf-lkp@intel.com/config)
compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 5f1c7e2cc5a3c07cbc2412e851a7283c1841f520)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/c3896ed4c7d2319ac0880860a6bfc98e6eed3d66
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review alexlzhu-fb-com/mm-add-thp_utilization-metrics-to-proc-thp_utilization/20220806-024150
        git checkout c3896ed4c7d2319ac0880860a6bfc98e6eed3d66
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> mm/huge_memory.c:601:6: warning: no previous prototype for function 'is_anon_transparent_hugepage' [-Wmissing-prototypes]
   bool is_anon_transparent_hugepage(struct page *page)
        ^
   mm/huge_memory.c:601:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   bool is_anon_transparent_hugepage(struct page *page)
   ^
   static 
   mm/huge_memory.c:3300:22: warning: variable 'head' set but not used [-Wunused-but-set-variable]
           struct page *page, *head = NULL;
                               ^
   2 warnings generated.


vim +/is_anon_transparent_hugepage +601 mm/huge_memory.c

   600	
 > 601	bool is_anon_transparent_hugepage(struct page *page)
   602	{
   603		return PageAnon(page) && is_transparent_hugepage(page);
   604	}
   605	EXPORT_SYMBOL_GPL(is_anon_transparent_hugepage);
   606	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp
Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Posted by kernel test robot 4 days, 17 hours ago
Hi,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v5.19]
[cannot apply to akpm-mm/mm-everything next-20220805]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/alexlzhu-fb-com/mm-add-thp_utilization-metrics-to-proc-thp_utilization/20220806-024150
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 9e2f40233670c70c25e0681cb66d50d1e2742829
config: i386-randconfig-a013 (https://download.01.org/0day-ci/archive/20220807/202208071442.EIrjgLbV-lkp@intel.com/config)
compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 5f1c7e2cc5a3c07cbc2412e851a7283c1841f520)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/c3896ed4c7d2319ac0880860a6bfc98e6eed3d66
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review alexlzhu-fb-com/mm-add-thp_utilization-metrics-to-proc-thp_utilization/20220806-024150
        git checkout c3896ed4c7d2319ac0880860a6bfc98e6eed3d66
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> mm/huge_memory.c:601:6: warning: no previous prototype for function 'is_anon_transparent_hugepage' [-Wmissing-prototypes]
   bool is_anon_transparent_hugepage(struct page *page)
        ^
   mm/huge_memory.c:601:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   bool is_anon_transparent_hugepage(struct page *page)
   ^
   static 
   mm/huge_memory.c:3300:22: warning: variable 'head' set but not used [-Wunused-but-set-variable]
           struct page *page, *head = NULL;
                               ^
   2 warnings generated.


vim +/is_anon_transparent_hugepage +601 mm/huge_memory.c

   600	
 > 601	bool is_anon_transparent_hugepage(struct page *page)
   602	{
   603		return PageAnon(page) && is_transparent_hugepage(page);
   604	}
   605	EXPORT_SYMBOL_GPL(is_anon_transparent_hugepage);
   606	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp
Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Posted by kernel test robot 4 days, 18 hours ago
Hi,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v5.19]
[cannot apply to akpm-mm/mm-everything next-20220805]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/alexlzhu-fb-com/mm-add-thp_utilization-metrics-to-proc-thp_utilization/20220806-024150
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 9e2f40233670c70c25e0681cb66d50d1e2742829
config: i386-randconfig-a001 (https://download.01.org/0day-ci/archive/20220807/202208071343.bwmuBz9l-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-3) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/c3896ed4c7d2319ac0880860a6bfc98e6eed3d66
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review alexlzhu-fb-com/mm-add-thp_utilization-metrics-to-proc-thp_utilization/20220806-024150
        git checkout c3896ed4c7d2319ac0880860a6bfc98e6eed3d66
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> mm/huge_memory.c:601:6: warning: no previous prototype for 'is_anon_transparent_hugepage' [-Wmissing-prototypes]
     601 | bool is_anon_transparent_hugepage(struct page *page)
         |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
   mm/huge_memory.c: In function 'thp_util_scan':
>> mm/huge_memory.c:3300:29: warning: variable 'head' set but not used [-Wunused-but-set-variable]
    3300 |         struct page *page, *head = NULL;
         |                             ^~~~


vim +/is_anon_transparent_hugepage +601 mm/huge_memory.c

   600	
 > 601	bool is_anon_transparent_hugepage(struct page *page)
   602	{
   603		return PageAnon(page) && is_transparent_hugepage(page);
   604	}
   605	EXPORT_SYMBOL_GPL(is_anon_transparent_hugepage);
   606	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp
Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Posted by Matthew Wilcox 6 days, 5 hours ago
On Fri, Aug 05, 2022 at 11:40:16AM -0700, alexlzhu@fb.com wrote:
> THPs have historically been enabled on a per application basis due to
> performance increase or decrease depending on how the particular
> application uses physical memory. When THPs are heavily utilized,
> application performance improves due to fewer TLB cache misses.
> It has long been suspected that performance regressions when THP
> is enabled happens due to heavily underutilized anonymous THPs.
> 
> Previously there was no way to track how much of a THP is
> actually being used. With this change, we seek to gain visibility
> into the utilization of THPs in order to make more intelligent
> decisions regarding paging.
> 
> This change introduces a tool that scans through all of physical
> memory for anonymous THPs and groups them into buckets based
> on utilization. It also includes an interface under
> /proc/thp_utilization.

OK, so I understand why we want to do the scanning, but why do we want to
report it to userspace at all?  And if we do, why do we want to do it in
this format?  AFAIK, there's nothing userspace can do with the information
"93% of your THPs are underutilised".  If there was something userspace
could do about it, wouldn't it need to know which ones?

Isn't the real solution here entirely in-kernel?  This scanning thread
you've created should be the one splitting the THPs.  And anyway, isn't
this exactly the kind of thing that DAMON was invented for?
Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Posted by Rik van Riel 6 days, 5 hours ago
On Fri, 2022-08-05 at 19:50 +0100, Matthew Wilcox wrote:
> > 
> > This change introduces a tool that scans through all of physical
> > memory for anonymous THPs and groups them into buckets based
> > on utilization. It also includes an interface under
> > /proc/thp_utilization.
> 
> OK, so I understand why we want to do the scanning, but why do we
> want to
> report it to userspace at all?  And if we do, why do we want to do it
> in
> this format?  AFAIK, there's nothing userspace can do with the
> information
> "93% of your THPs are underutilised".  If there was something
> userspace
> could do about it, wouldn't it need to know which ones?
> 
> Isn't the real solution here entirely in-kernel?  This scanning
> thread
> you've created should be the one splitting the THPs.  And anyway,
> isn't
> this exactly the kind of thing that DAMON was invented for?

Alex does have an (in kernel) shrinker that can reclaim
underutilized THPs in order to free memory.

This is a regular shrinker called from kswapd. I am not
sure a shrinker going through the DAMON infrastructure
would be any smaller, faster, or better. OTOH, DAMON
does allow for more flexible policy...

Getting some info on the effectiveness of the shrinker
seems useful, though. Could debugfs be a better place?
Should this be resubmitted together with the shrinker
code that makes this more useful?
Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Posted by Matthew Wilcox 6 days, 4 hours ago
On Fri, Aug 05, 2022 at 07:04:30PM +0000, Rik van Riel wrote:
> On Fri, 2022-08-05 at 19:50 +0100, Matthew Wilcox wrote:
> > > 
> > > This change introduces a tool that scans through all of physical
> > > memory for anonymous THPs and groups them into buckets based
> > > on utilization. It also includes an interface under
> > > /proc/thp_utilization.
> > 
> > OK, so I understand why we want to do the scanning, but why do we
> > want to
> > report it to userspace at all?  And if we do, why do we want to do it
> > in
> > this format?  AFAIK, there's nothing userspace can do with the
> > information
> > "93% of your THPs are underutilised".  If there was something
> > userspace
> > could do about it, wouldn't it need to know which ones?
> > 
> > Isn't the real solution here entirely in-kernel?  This scanning
> > thread
> > you've created should be the one splitting the THPs.  And anyway,
> > isn't
> > this exactly the kind of thing that DAMON was invented for?
> 
> Alex does have an (in kernel) shrinker that can reclaim
> underutilized THPs in order to free memory.

Ah!  So when that exists, this interface tells us "how well" we're doing.

> This is a regular shrinker called from kswapd. I am not
> sure a shrinker going through the DAMON infrastructure
> would be any smaller, faster, or better. OTOH, DAMON
> does allow for more flexible policy...
> 
> Getting some info on the effectiveness of the shrinker
> seems useful, though. Could debugfs be a better place?
> Should this be resubmitted together with the shrinker
> code that makes this more useful?

Yeah, debugfs seems like a better place.  And I'd love to see the shrinker
code.  Before you mentioned that I was having all kinds of peculiar
feelings about this code.  For example, suppose you have incredibly hot
256kB of data, but the other 1792kB of data are never touched ... that
could cause us to do entirely the wrong thing and break up this THP.
Having it as a shrinker makes sense because the hot 256kB will keep the
THP from reaching the end of the list and being reclaimed.
Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Posted by Alex Zhu (Kernel) 6 days, 4 hours ago
> 
> Ah!  So when that exists, this interface tells us "how well" we're doing.
> 

Yes, exactly. 
> 
> Yeah, debugfs seems like a better place.  And I'd love to see the shrinker
> code.  Before you mentioned that I was having all kinds of peculiar
> feelings about this code.  For example, suppose you have incredibly hot
> 256kB of data, but the other 1792kB of data are never touched ... that
> could cause us to do entirely the wrong thing and break up this THP.
> Having it as a shrinker makes sense because the hot 256kB will keep the
> THP from reaching the end of the list and being reclaimed.

Sounds good, I’ll move this to debugfs then. Will follow up with the shrinker code
in another patch. The shrinker relies on this scanning thread to figure out which
THPs to reclaim. 

What are your thoughts regarding integrating this with DAMON? 
Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Posted by Yang Shi 3 days, 6 hours ago
On Fri, Aug 5, 2022 at 12:52 PM Alex Zhu (Kernel) <alexlzhu@fb.com> wrote:
>
>
> >
> > Ah!  So when that exists, this interface tells us "how well" we're doing.
> >
>
> Yes, exactly.
> >
> > Yeah, debugfs seems like a better place.  And I'd love to see the shrinker
> > code.  Before you mentioned that I was having all kinds of peculiar
> > feelings about this code.  For example, suppose you have incredibly hot
> > 256kB of data, but the other 1792kB of data are never touched ... that
> > could cause us to do entirely the wrong thing and break up this THP.
> > Having it as a shrinker makes sense because the hot 256kB will keep the
> > THP from reaching the end of the list and being reclaimed.
>
> Sounds good, I’ll move this to debugfs then. Will follow up with the shrinker code
> in another patch. The shrinker relies on this scanning thread to figure out which
> THPs to reclaim.

I'm wondering whether you could reuse the THP deferred split shrinker
or not. It is already memcg aware.

>
> What are your thoughts regarding integrating this with DAMON?
Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Posted by Rik van Riel 3 days, 5 hours ago
On Mon, 2022-08-08 at 10:55 -0700, Yang Shi wrote:
> 
> On Fri, Aug 5, 2022 at 12:52 PM Alex Zhu (Kernel) <alexlzhu@fb.com>
> wrote:
> > 
> > Sounds good, I’ll move this to debugfs then. Will follow up with
> > the shrinker code
> > in another patch. The shrinker relies on this scanning thread to
> > figure out which
> > THPs to reclaim.
> 
> I'm wondering whether you could reuse the THP deferred split shrinker
> or not. It is already memcg aware.
> 
I'm not convinced that will buy much, since there is
very little code duplication between the two.

Merging the two might also bring about another bit of
extra complexity, due to the deferred split shrinker
wanting to shrink every single THP on its "to split"
list, while for Alex's shrinker we probably want to
split just one (or a few) THPs at a time, depending on
memory pressure.

Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Posted by Yang Shi 2 days, 6 hours ago
On Mon, Aug 8, 2022 at 11:35 AM Rik van Riel <riel@fb.com> wrote:
>
> On Mon, 2022-08-08 at 10:55 -0700, Yang Shi wrote:
> >
> > On Fri, Aug 5, 2022 at 12:52 PM Alex Zhu (Kernel) <alexlzhu@fb.com>
> > wrote:
> > >
> > > Sounds good, I’ll move this to debugfs then. Will follow up with
> > > the shrinker code
> > > in another patch. The shrinker relies on this scanning thread to
> > > figure out which
> > > THPs to reclaim.
> >
> > I'm wondering whether you could reuse the THP deferred split shrinker
> > or not. It is already memcg aware.
> >
> I'm not convinced that will buy much, since there is
> very little code duplication between the two.
>
> Merging the two might also bring about another bit of
> extra complexity, due to the deferred split shrinker
> wanting to shrink every single THP on its "to split"
> list, while for Alex's shrinker we probably want to
> split just one (or a few) THPs at a time, depending on
> memory pressure.

OK, it is hard to tell what it looks like now. But the THPs on the
deferred split list may be on the "low utilization split" list too?
IIUC the major difference is to replace zero-filled subpage to special
zero page, so you implemented another THP split function to handle it?

Anyway the code should answer the most questions.

>
Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Posted by Alex Zhu (Kernel) 2 days, 6 hours ago
> OK, it is hard to tell what it looks like now. But the THPs on the
> deferred split list may be on the "low utilization split" list too?
> IIUC the major difference is to replace zero-filled subpage to special
> zero page, so you implemented another THP split function to handle it?
> 
> Anyway the code should answer the most questions.

They can indeed end up on both lists. This did have to be handled when 
implementing the shrinker. 

We free the zero filled subpages, while modifying the existing split_huge_page()
function. Will follow up that change in another patch.
Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Posted by Yu Zhao 2 days ago
On Tue, Aug 9, 2022 at 11:16 AM Alex Zhu (Kernel) <alexlzhu@fb.com> wrote:
>
>
> > OK, it is hard to tell what it looks like now. But the THPs on the
> > deferred split list may be on the "low utilization split" list too?
> > IIUC the major difference is to replace zero-filled subpage to special
> > zero page, so you implemented another THP split function to handle it?
> >
> > Anyway the code should answer the most questions.
>
> They can indeed end up on both lists. This did have to be handled when
> implementing the shrinker.
>
> We free the zero filled subpages, while modifying the existing split_huge_page()
> function. Will follow up that change in another patch.

FYI. This series does it:

https://lore.kernel.org/r/20210731063938.1391602-1-yuzhao@google.com/

And this one:

https://lore.kernel.org/r/1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com/
Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Posted by Yang Shi 1 day, 7 hours ago
On Tue, Aug 9, 2022 at 4:36 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Tue, Aug 9, 2022 at 11:16 AM Alex Zhu (Kernel) <alexlzhu@fb.com> wrote:
> >
> >
> > > OK, it is hard to tell what it looks like now. But the THPs on the
> > > deferred split list may be on the "low utilization split" list too?
> > > IIUC the major difference is to replace zero-filled subpage to special
> > > zero page, so you implemented another THP split function to handle it?
> > >
> > > Anyway the code should answer the most questions.
> >
> > They can indeed end up on both lists. This did have to be handled when
> > implementing the shrinker.
> >
> > We free the zero filled subpages, while modifying the existing split_huge_page()
> > function. Will follow up that change in another patch.
>
> FYI. This series does it:
>
> https://lore.kernel.org/r/20210731063938.1391602-1-yuzhao@google.com/
>
> And this one:
>
> https://lore.kernel.org/r/1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com/

Thanks, Yu. I totally forgot about these series. It is time to refresh
my memory.
Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Posted by Alex Zhu (Kernel) 1 day, 6 hours ago
> On Aug 10, 2022, at 10:07 AM, Yang Shi <shy828301@gmail.com> wrote:
> 
> On Tue, Aug 9, 2022 at 4:36 PM Yu Zhao <yuzhao@google.com> wrote:
>> 
>> On Tue, Aug 9, 2022 at 11:16 AM Alex Zhu (Kernel) <alexlzhu@fb.com> wrote:
>>> 
>>> 
>>>> OK, it is hard to tell what it looks like now. But the THPs on the
>>>> deferred split list may be on the "low utilization split" list too?
>>>> IIUC the major difference is to replace zero-filled subpage to special
>>>> zero page, so you implemented another THP split function to handle it?
>>>> 
>>>> Anyway the code should answer the most questions.
>>> 
>>> They can indeed end up on both lists. This did have to be handled when
>>> implementing the shrinker.
>>> 
>>> We free the zero filled subpages, while modifying the existing split_huge_page()
>>> function. Will follow up that change in another patch.
>> 
>> FYI. This series does it:
>> 
>> https://lore.kernel.org/r/20210731063938.1391602-1-yuzhao@google.com/
>> 
>> And this one:
>> 
>> https://lore.kernel.org/r/1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com/
> 
> Thanks, Yu. I totally forgot about these series. It is time to refresh
> my memory.

I looked through these patches yesterday. There are indeed parts that are very similar, but the approach
taken seems overly complicated compared to what I have written. What’s the status of work on this since last year?  
Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Posted by Yu Zhao 1 day, 6 hours ago
On Wed, Aug 10, 2022 at 11:15 AM Alex Zhu (Kernel) <alexlzhu@fb.com> wrote:
>
>
> > On Aug 10, 2022, at 10:07 AM, Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Tue, Aug 9, 2022 at 4:36 PM Yu Zhao <yuzhao@google.com> wrote:
> >>
> >> On Tue, Aug 9, 2022 at 11:16 AM Alex Zhu (Kernel) <alexlzhu@fb.com> wrote:
> >>>
> >>>
> >>>> OK, it is hard to tell what it looks like now. But the THPs on the
> >>>> deferred split list may be on the "low utilization split" list too?
> >>>> IIUC the major difference is to replace zero-filled subpage to special
> >>>> zero page, so you implemented another THP split function to handle it?
> >>>>
> >>>> Anyway the code should answer the most questions.
> >>>
> >>> They can indeed end up on both lists. This did have to be handled when
> >>> implementing the shrinker.
> >>>
> >>> We free the zero filled subpages, while modifying the existing split_huge_page()
> >>> function. Will follow up that change in another patch.
> >>
> >> FYI. This series does it:
> >>
> >> https://lore.kernel.org/r/20210731063938.1391602-1-yuzhao@google.com/
> >>
> >> And this one:
> >>
> >> https://lore.kernel.org/r/1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com/
> >
> > Thanks, Yu. I totally forgot about these series. It is time to refresh
> > my memory.
>
> I looked through these patches yesterday. There are indeed parts that are very similar, but the approach
> taken seems overly complicated compared to what I have written. What’s the status of work on this since last year?

Overly complicated... which patches and how?

At a minimum, you'd need 1 & 3 from the first series and this patch:

https://lore.kernel.org/r/20220608141432.23258-1-linmiaohe@huawei.com/
Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Posted by Alex Zhu (Kernel) 1 day, 2 hours ago

> On Aug 10, 2022, at 10:54 AM, Yu Zhao <yuzhao@google.com> wrote:
> 
> !-------------------------------------------------------------------|
>  This Message Is From an External Sender
> 
> |-------------------------------------------------------------------!
> 
> On Wed, Aug 10, 2022 at 11:15 AM Alex Zhu (Kernel) <alexlzhu@fb.com> wrote:
>> 
>> 
>>> On Aug 10, 2022, at 10:07 AM, Yang Shi <shy828301@gmail.com> wrote:
>>> 
>>> On Tue, Aug 9, 2022 at 4:36 PM Yu Zhao <yuzhao@google.com> wrote:
>>>> 
>>>> On Tue, Aug 9, 2022 at 11:16 AM Alex Zhu (Kernel) <alexlzhu@fb.com> wrote:
>>>>> 
>>>>> 
>>>>>> OK, it is hard to tell what it looks like now. But the THPs on the
>>>>>> deferred split list may be on the "low utilization split" list too?
>>>>>> IIUC the major difference is to replace zero-filled subpage to special
>>>>>> zero page, so you implemented another THP split function to handle it?
>>>>>> 
>>>>>> Anyway the code should answer the most questions.
>>>>> 
>>>>> They can indeed end up on both lists. This did have to be handled when
>>>>> implementing the shrinker.
>>>>> 
>>>>> We free the zero filled subpages, while modifying the existing split_huge_page()
>>>>> function. Will follow up that change in another patch.
>>>> 
>>>> FYI. This series does it:
>>>> 
>>>> https://lore.kernel.org/r/20210731063938.1391602-1-yuzhao@google.com/
>>>> 
>>>> And this one:
>>>> 
>>>> https://lore.kernel.org/r/1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com/
>>> 
>>> Thanks, Yu. I totally forgot about these series. It is time to refresh
>>> my memory.
>> 
>> I looked through these patches yesterday. There are indeed parts that are very similar, but the approach
>> taken seems overly complicated compared to what I have written. What’s the status of work on this since last year?
> 
> Overly complicated... which patches and how?
> 
> At a minimum, you'd need 1 & 3 from the first series and this patch:
> 
> https://lore.kernel.org/r/20220608141432.23258-1-linmiaohe@huawei.com/

The changes from the previous patches implement freeing of THPs as part of memcgroup and reclaim. Zero tail pages are disposed of via
lruvec as part of reclaim. 

Our approach is a thp utilization worker thread scanning through physical memory adding under utilized THPs to a shrinker that calls split_huge_page(). We free zero tail pages within split_huge_page(). Reclaim will trigger the shrinker. 

There is some overlap between the implementations, in particular creating a linked list in the third tail page and methods to check for zero pages. 
(I believe the previous patches have a cleaner method for identifying zero pages). However, looking through the code I do believe our approach is simpler. 

We chose to free within split_huge_page(), but it’s worth discussing whether to free zero pages immediately or to add to lruvec to free eventually. 

I believe the split_huge_page() changes could be valuable by as a patch by itself though. Will send that out shortly. 

Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Posted by Yu Zhao 1 day, 2 hours ago
On Wed, Aug 10, 2022 at 3:39 PM Alex Zhu (Kernel) <alexlzhu@fb.com> wrote:
>
>
>
> > On Aug 10, 2022, at 10:54 AM, Yu Zhao <yuzhao@google.com> wrote:
> >
> > !-------------------------------------------------------------------|
> >  This Message Is From an External Sender
> >
> > |-------------------------------------------------------------------!
> >
> > On Wed, Aug 10, 2022 at 11:15 AM Alex Zhu (Kernel) <alexlzhu@fb.com> wrote:
> >>
> >>
> >>> On Aug 10, 2022, at 10:07 AM, Yang Shi <shy828301@gmail.com> wrote:
> >>>
> >>> On Tue, Aug 9, 2022 at 4:36 PM Yu Zhao <yuzhao@google.com> wrote:
> >>>>
> >>>> On Tue, Aug 9, 2022 at 11:16 AM Alex Zhu (Kernel) <alexlzhu@fb.com> wrote:
> >>>>>
> >>>>>
> >>>>>> OK, it is hard to tell what it looks like now. But the THPs on the
> >>>>>> deferred split list may be on the "low utilization split" list too?
> >>>>>> IIUC the major difference is to replace zero-filled subpage to special
> >>>>>> zero page, so you implemented another THP split function to handle it?
> >>>>>>
> >>>>>> Anyway the code should answer the most questions.
> >>>>>
> >>>>> They can indeed end up on both lists. This did have to be handled when
> >>>>> implementing the shrinker.
> >>>>>
> >>>>> We free the zero filled subpages, while modifying the existing split_huge_page()
> >>>>> function. Will follow up that change in another patch.
> >>>>
> >>>> FYI. This series does it:
> >>>>
> >>>> https://lore.kernel.org/r/20210731063938.1391602-1-yuzhao@google.com/
> >>>>
> >>>> And this one:
> >>>>
> >>>> https://lore.kernel.org/r/1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com/
> >>>
> >>> Thanks, Yu. I totally forgot about these series. It is time to refresh
> >>> my memory.
> >>
> >> I looked through these patches yesterday. There are indeed parts that are very similar, but the approach
> >> taken seems overly complicated compared to what I have written. What’s the status of work on this since last year?
> >
> > Overly complicated... which patches and how?
> >
> > At a minimum, you'd need 1 & 3 from the first series and this patch:
> >
> > https://lore.kernel.org/r/20220608141432.23258-1-linmiaohe@huawei.com/
>
> The changes from the previous patches implement freeing of THPs as part of memcgroup and reclaim. Zero tail pages are disposed of via
> lruvec as part of reclaim.

Which series are you talking about? I listed two series and they are
very different on the code level.

> Our approach is a thp utilization worker thread scanning through physical memory adding under utilized THPs to a shrinker that calls split_huge_page(). We free zero tail pages within split_huge_page(). Reclaim will trigger the shrinker.
>
> There is some overlap between the implementations, in particular creating a linked list in the third tail page and methods to check for zero pages.
> (I believe the previous patches have a cleaner method for identifying zero pages). However, looking through the code I do believe our approach is simpler.
>
> We chose to free within split_huge_page()

The 2nd patch from the first series does exactly this.

> but it’s worth discussing whether to free zero pages immediately or to add to lruvec to free eventually.

And that patch can be omitted if the third link (a single patch, not a
series) is used, which makes the workflow "add to lruvec to free
eventually".

> I believe the split_huge_page() changes could be valuable by as a patch by itself though. Will send that out shortly.
Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Posted by Alex Zhu (Kernel) 1 day ago
> Which series are you talking about? I listed two series and they are
> very different on the code level.
> 

I was referring to the second patch: https://lore.kernel.org/all/1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com/. 

This patchset adds the THP shrinking as part of shrink_lruvec in mm/vmscan.c. We create a new shrinker that shrinks THPs based off the results
of the scanning implemented in this thp_utilization patch. We also do not have any of the additional knobs for controlling THP reclaim that the patchset above has. That seems unnecessary in the initial patch as shrinking THPs that are almost entirely zero pages should only improve performance. 

I believe the resulting implementation we have is simpler and easier to understand than the above patchset. By identifying and freeing underutilized THPs we hope to eventually deprecate madvise entirely and have THP always enabled. 

> The 2nd patch from the first series does exactly this.
> 
>> but it’s worth discussing whether to free zero pages immediately or to add to lruvec to free eventually.
> 
> And that patch can be omitted if the third link (a single patch, not a
> series) is used, which makes the workflow "add to lruvec to free
> eventually".
> 
>> I believe the split_huge_page() changes could be valuable by as a patch by itself though. Will send that out shortly.

Referring to this patch: https://lore.kernel.org/r/20210731063938.1391602-1-yuzhao@google.com/. 

We do indeed do something similar to patches 1 and 3. We may be able to make use of this instead, I’ll take a closer look. 


Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Posted by Yu Zhao 22 hours ago
On Wed, Aug 10, 2022 at 6:00 PM Alex Zhu (Kernel) <alexlzhu@fb.com> wrote:
>
>
> > Which series are you talking about? I listed two series and they are
> > very different on the code level.
> >
>
> I was referring to the second patch: https://lore.kernel.org/all/1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com/.

You mean the second patch*set* or series, the link doesn't point to a
single patch :)  "the second patch" could mean the second patch in
that series.

> This patchset adds the THP shrinking as part of shrink_lruvec in mm/vmscan.c. We create a new shrinker that shrinks THPs based off the results
> of the scanning implemented in this thp_utilization patch. We also do not have any of the additional knobs for controlling THP reclaim that the patchset above has. That seems unnecessary in the initial patch as shrinking THPs that are almost entirely zero pages should only improve performance.
>
> I believe the resulting implementation we have is simpler and easier to understand than the above patchset. By identifying and freeing underutilized THPs we hope to eventually deprecate madvise entirely and have THP always enabled.
>
> > The 2nd patch from the first series does exactly this.
> >
> >> but it’s worth discussing whether to free zero pages immediately or to add to lruvec to free eventually.
> >
> > And that patch can be omitted if the third link (a single patch, not a
> > series) is used, which makes the workflow "add to lruvec to free
> > eventually".
> >
> >> I believe the split_huge_page() changes could be valuable by as a patch by itself though. Will send that out shortly.
>
> Referring to this patch: https://lore.kernel.org/r/20210731063938.1391602-1-yuzhao@google.com/.
>
> We do indeed do something similar to patches 1 and 3. We may be able to make use of this instead, I’ll take a closer look.

Please do.

Based on what you said ("chose to free within split_huge_page()"), I
very much suspect you do something similar to patch 2 as well. IIRC,
that location is the best location to free subpages that only contain
zeros because it covers multiple scenarios.
Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Posted by Alex Zhu (Kernel) 22 hours ago

> On Aug 10, 2022, at 6:15 PM, Yu Zhao <yuzhao@google.com> wrote:
> 
> On Wed, Aug 10, 2022 at 6:00 PM Alex Zhu (Kernel) <alexlzhu@fb.com> wrote:
>> 
>> 
>>> Which series are you talking about? I listed two series and they are
>>> very different on the code level.
>>> 
>> 
>> I was referring to the second patch: https://lore.kernel.org/all/1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com/.
> 
> You mean the second patch*set* or series, the link doesn't point to a
> single patch :)  "the second patch" could mean the second patch in
> that series.

Yes, that’s what I meant. Thanks :)

Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Posted by Alex Zhu (Kernel) 4 hours ago
Hi Yu,

I’ve updated your patch set from last year to work with folio and am testing it now. The functionality in split_huge_page() is the same as what I have. Was there any follow up work done later? 

If not, I would like to incorporate this into what I have, and then resubmit. Will reference the original patchset. We need this functionality for the shrinker, but even the changes to split_huge_page() by itself it should show some performance improvement when used by the existing deferred_split_huge_page(). 

Thanks,
Alex
Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Posted by Yu Zhao 2 hours ago
On Thu, Aug 11, 2022 at 1:20 PM Alex Zhu (Kernel) <alexlzhu@fb.com> wrote:
>
> Hi Yu,
>
> I’ve updated your patch set from last year to work with folio and am testing it now. The functionality in split_huge_page() is the same as what I have. Was there any follow up work done later?

Yes, but it won't change the landscape any time soon (see below). So
please feel free to continue along your current direction.

> If not, I would like to incorporate this into what I have, and then resubmit. Will reference the original patchset. We need this functionality for the shrinker, but even the changes to split_huge_page() by itself it should show some performance improvement when used by the existing deferred_split_huge_page().

SGTM. Thanks!

A side note:

I'm working on a new mode: THP=auto, meaning the kernel will detect
internal fragmentation of 2MB compound pages to decide whether to map
them by PMDs or split them under memory pressure. The general workflow
of this new mode is as follows.

In the page fault path:
1. Compound pages are allocated as usual.
2. Each is mapped by 512 consecutive PTEs rather than a PMD.
3. There will be more TLB misses but the same number of page faults.
4. TLB coalescing can mitigate the performance degradation.

In khugepaged:
1. Check the dirty bit in the PTEs mapping a compound page, to
determine its utilization.
2. Remap compound pages that meet a certain utilization threshold by
PMDs in place, i.e., no migrations.

In the reclaim path, e.g., MGLRU page table scanning:
1. Decide whether compound pages mapped by PTEs should be split based
on their utilizations and memory pressure, e.g., reclaim priority.
2. Clean subpages should be freed directly after split, rather than swapped out.

N.B.
1. This workflow relies on the dirty bit rather examining the content of a page.
2. Sampling can be done by periodically switching between a PMD and
512 consecutive PTEs.
3. It only needs to hold mmap_lock for read because this special mode
(512 consecutive PTEs) is not considered the split mode.
4. Don't hold your breath :)

Other references:
1. https://www.usenix.org/system/files/atc20-zhu-weixi_0.pdf
2. https://www.usenix.org/system/files/osdi21-hunter.pdf
Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Posted by Yang Shi an hour ago
On Thu, Aug 11, 2022 at 2:55 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Thu, Aug 11, 2022 at 1:20 PM Alex Zhu (Kernel) <alexlzhu@fb.com> wrote:
> >
> > Hi Yu,
> >
> > I’ve updated your patch set from last year to work with folio and am testing it now. The functionality in split_huge_page() is the same as what I have. Was there any follow up work done later?
>
> Yes, but it won't change the landscape any time soon (see below). So
> please feel free to continue along your current direction.
>
> > If not, I would like to incorporate this into what I have, and then resubmit. Will reference the original patchset. We need this functionality for the shrinker, but even the changes to split_huge_page() by itself it should show some performance improvement when used by the existing deferred_split_huge_page().
>
> SGTM. Thanks!
>
> A side note:
>
> I'm working on a new mode: THP=auto, meaning the kernel will detect
> internal fragmentation of 2MB compound pages to decide whether to map
> them by PMDs or split them under memory pressure. The general workflow
> of this new mode is as follows.

I tend to agree that avoiding allocating THP in the first place is the
preferred way to avoid internal fragmentation. But I got some
questions about your design/implementation:

>
> In the page fault path:
> 1. Compound pages are allocated as usual.
> 2. Each is mapped by 512 consecutive PTEs rather than a PMD.
> 3. There will be more TLB misses but the same number of page faults.
> 4. TLB coalescing can mitigate the performance degradation.

Why not just allocate base pages in the first place? Khugepaged has
max_pte_none tunable to detect internal fragmentation. If you worry
about zero page, you could add max_pte_zero tunable.

Or did you investigate whether the new MADV_COLLAPSE may be helpful or
not? It leaves the decision to the userspace.

>
> In khugepaged:
> 1. Check the dirty bit in the PTEs mapping a compound page, to
> determine its utilization.
> 2. Remap compound pages that meet a certain utilization threshold by
> PMDs in place, i.e., no migrations.
>
> In the reclaim path, e.g., MGLRU page table scanning:
> 1. Decide whether compound pages mapped by PTEs should be split based
> on their utilizations and memory pressure, e.g., reclaim priority.
> 2. Clean subpages should be freed directly after split, rather than swapped out.
>
> N.B.
> 1. This workflow relies on the dirty bit rather examining the content of a page.
> 2. Sampling can be done by periodically switching between a PMD and
> 512 consecutive PTEs.
> 3. It only needs to hold mmap_lock for read because this special mode
> (512 consecutive PTEs) is not considered the split mode.
> 4. Don't hold your breath :)
>
> Other references:
> 1. https://www.usenix.org/system/files/atc20-zhu-weixi_0.pdf
> 2. https://www.usenix.org/system/files/osdi21-hunter.pdf
Re: [PATCH v3] mm: add thp_utilization metrics to /proc/thp_utilization
Posted by Yu Zhao an hour ago
On Thu, Aug 11, 2022 at 4:12 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Aug 11, 2022 at 2:55 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Thu, Aug 11, 2022 at 1:20 PM Alex Zhu (Kernel) <alexlzhu@fb.com> wrote:
> > >
> > > Hi Yu,
> > >
> > > I’ve updated your patch set from last year to work with folio and am testing it now. The functionality in split_huge_page() is the same as what I have. Was there any follow up work done later?
> >
> > Yes, but it won't change the landscape any time soon (see below). So
> > please feel free to continue along your current direction.
> >
> > > If not, I would like to incorporate this into what I have, and then resubmit. Will reference the original patchset. We need this functionality for the shrinker, but even the changes to split_huge_page() by itself it should show some performance improvement when used by the existing deferred_split_huge_page().
> >
> > SGTM. Thanks!
> >
> > A side note:
> >
> > I'm working on a new mode: THP=auto, meaning the kernel will detect
> > internal fragmentation of 2MB compound pages to decide whether to map
> > them by PMDs or split them under memory pressure. The general workflow
> > of this new mode is as follows.
>
> I tend to agree that avoiding allocating THP in the first place is the
> preferred way to avoid internal fragmentation. But I got some
> questions about your design/implementation:
>
> >
> > In the page fault path:
> > 1. Compound pages are allocated as usual.
> > 2. Each is mapped by 512 consecutive PTEs rather than a PMD.
> > 3. There will be more TLB misses but the same number of page faults.
> > 4. TLB coalescing can mitigate the performance degradation.
>
> Why not just allocate base pages in the first place? Khugepaged has
> max_pte_none tunable to detect internal fragmentation. If you worry
> about zero page, you could add max_pte_zero tunable.
>
> Or did you investigate whether the new MADV_COLLAPSE may be helpful or
> not? It leaves the decision to the userspace.

There are two problems we have to workaround.
1. External fragmentation that prevents later compound page allocations.
2. The cost of taking mmap_lock for write.

IIRC, the first reference I listed describes the first problem. (It
uses a similar reservation technique.) From a very high level, smaller
page allocations add more entropy than larger ones and accelerate the
system toward equilibrium, in which state the system can't allocate
more THPs without exerting additional force (compaction).

Reserving compound pages delays the equilibrium whereas MADV_COLLAPSE
tries to reverse the equilibrium. The latter has a higher cost. In
addition, it needs to take mmap_lock for write.

> > In khugepaged:
> > 1. Check the dirty bit in the PTEs mapping a compound page, to
> > determine its utilization.
> > 2. Remap compound pages that meet a certain utilization threshold by
> > PMDs in place, i.e., no migrations.
> >
> > In the reclaim path, e.g., MGLRU page table scanning:
> > 1. Decide whether compound pages mapped by PTEs should be split based
> > on their utilizations and memory pressure, e.g., reclaim priority.
> > 2. Clean subpages should be freed directly after split, rather than swapped out.
> >
> > N.B.
> > 1. This workflow relies on the dirty bit rather examining the content of a page.
> > 2. Sampling can be done by periodically switching between a PMD and
> > 512 consecutive PTEs.
> > 3. It only needs to hold mmap_lock for read because this special mode
> > (512 consecutive PTEs) is not considered the split mode.
> > 4. Don't hold your breath :)
> >
> > Other references:
> > 1. https://www.usenix.org/system/files/atc20-zhu-weixi_0.pdf
> > 2. https://www.usenix.org/system/files/osdi21-hunter.pdf