Date: Sat, 4 May 2024 00:30:06 -0700
In-Reply-To: <20240504073011.4000534-1-yuanchu@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org
References: <20240504073011.4000534-1-yuanchu@google.com>
X-Mailer: git-send-email 2.45.0.rc1.225.g2a3ae87e7f-goog
Message-ID: <20240504073011.4000534-3-yuanchu@google.com>
Subject: [PATCH v1 2/7] mm: aggregate working set information into histograms
From: Yuanchu Xie
To: David Hildenbrand, "Aneesh Kumar K.V", Khalid Aziz, Henry Huang,
	Yu Zhao, Dan Williams, Gregory Price, Huang Ying
Cc: Kalesh Singh, Wei Xu, David Rientjes, Greg Kroah-Hartman,
	"Rafael J. Wysocki", Andrew Morton, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Shuah Khan, Yosry Ahmed, Matthew Wilcox,
	Sudarshan Rajagopalan, Kairui Song, "Michael S. Tsirkin",
	Vasily Averin, Nhat Pham, Miaohe Lin, Qi Zheng, Abel Wu,
	"Vishal Moola (Oracle)", Kefeng Wang, Yuanchu Xie,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	cgroups@vger.kernel.org, linux-kselftest@vger.kernel.org

Hierarchically aggregate all memcgs' MGLRU generations and their
page counts into working set page age histograms.
The histograms break down the system's working set per-node,
per-anon/file.

The sysfs interfaces are as follows:
/sys/devices/system/node/nodeX/workingset_report/page_age
	A per-node page age histogram, showing an aggregate of the
	node's lruvecs. The information is extracted from MGLRU's
	per-generation page counters. Reading this file triggers a
	hierarchical aging of all lruvecs, which scans pages and
	creates a new generation in each lruvec.
	Each line gives a bin's upper idle-age bound in milliseconds
	and the anon and file sizes in bytes; the final bin is
	unbounded.
	For example:
		1000 anon=0 file=0
		2000 anon=0 file=0
		100000 anon=5533696 file=5566464
		18446744073709551615 anon=0 file=0

/sys/devices/system/node/nodeX/workingset_report/page_age_intervals
	A comma-separated list of times in milliseconds that sets the
	bin boundaries the page age histogram uses for aggregation.

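For illustration only (this is not part of the patch): a minimal
userspace sketch, assuming the "<idle_age> anon=<bytes> file=<bytes>"
line format shown in the example above, of how one histogram line could
be parsed. The struct and function names are hypothetical.

```c
#include <stdio.h>

/* Hypothetical userspace view of one histogram line. */
struct page_age_bin {
	unsigned long long idle_age_ms;	/* upper bound; all-ones in the last bin */
	unsigned long long anon_bytes;
	unsigned long long file_bytes;
};

/* Parse "<idle_age> anon=<bytes> file=<bytes>"; return 0 on success. */
static int parse_page_age_line(const char *line, struct page_age_bin *bin)
{
	int n = sscanf(line, "%llu anon=%llu file=%llu",
		       &bin->idle_age_ms, &bin->anon_bytes, &bin->file_bytes);

	return n == 3 ? 0 : -1;
}
```

A caller would read the page_age file line by line (for instance with
fgets()) and hand each line to the parser. Keep in mind that every read
of the file triggers an aging pass, so polling it frequently is not free.
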
Signed-off-by: Yuanchu Xie
---
 drivers/base/node.c               |   6 +
 include/linux/mmzone.h            |   9 +
 include/linux/workingset_report.h |  79 ++++++
 mm/Kconfig                        |   9 +
 mm/Makefile                       |   1 +
 mm/internal.h                     |   9 +
 mm/memcontrol.c                   |   2 +
 mm/mm_init.c                      |   2 +
 mm/mmzone.c                       |   2 +
 mm/vmscan.c                       |  32 +++
 mm/workingset_report.c            | 438 ++++++++++++++++++++++++++++++
 11 files changed, 589 insertions(+)
 create mode 100644 include/linux/workingset_report.h
 create mode 100644 mm/workingset_report.c

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 1c05640461dd..81bf0c68efca 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -20,6 +20,8 @@
 #include
 #include
 #include
+#include
+#include
 
 static const struct bus_type node_subsys = {
 	.name = "node",
@@ -625,6 +627,7 @@ static int register_node(struct node *node, int num)
 	} else {
 		hugetlb_register_node(node);
 		compaction_register_node(node);
+		wsr_init_sysfs(node);
 	}
 
 	return error;
@@ -641,6 +644,9 @@ void unregister_node(struct node *node)
 {
 	hugetlb_unregister_node(node);
 	compaction_unregister_node(node);
+	wsr_remove_sysfs(node);
+	wsr_destroy_lruvec(mem_cgroup_lruvec(NULL, NODE_DATA(node->dev.id)));
+	wsr_destroy_pgdat(NODE_DATA(node->dev.id));
 	node_remove_accesses(node);
 	node_remove_caches(node);
 	device_unregister(&node->dev);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a497f189d988..3e94d76c8f29 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include
 
 /* Free memory management - zoned buddy allocator.  */
 #ifndef CONFIG_ARCH_FORCE_MAX_ORDER
@@ -625,6 +626,9 @@ struct lruvec {
 	struct lru_gen_mm_state mm_state;
 #endif
 #endif /* CONFIG_LRU_GEN */
+#ifdef CONFIG_WORKINGSET_REPORT
+	struct wsr_state wsr;
+#endif /* CONFIG_WORKINGSET_REPORT */
 #ifdef CONFIG_MEMCG
 	struct pglist_data *pgdat;
 #endif
@@ -1398,6 +1402,11 @@ typedef struct pglist_data {
 	struct lru_gen_memcg memcg_lru;
 #endif
 
+#ifdef CONFIG_WORKINGSET_REPORT
+	struct mutex wsr_update_mutex;
+	struct wsr_report_bins __rcu *wsr_page_age_bins;
+#endif
+
 	CACHELINE_PADDING(_pad2_);
 
 	/* Per-node vmstats */
diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h
new file mode 100644
index 000000000000..d7c2ee14ec87
--- /dev/null
+++ b/include/linux/workingset_report.h
@@ -0,0 +1,79 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_WORKINGSET_REPORT_H
+#define _LINUX_WORKINGSET_REPORT_H
+
+#include
+#include
+
+struct mem_cgroup;
+struct pglist_data;
+struct node;
+struct lruvec;
+
+#ifdef CONFIG_WORKINGSET_REPORT
+
+#define WORKINGSET_REPORT_MIN_NR_BINS 2
+#define WORKINGSET_REPORT_MAX_NR_BINS 32
+
+#define WORKINGSET_INTERVAL_MAX ((unsigned long)-1)
+#define ANON_AND_FILE 2
+
+struct wsr_report_bin {
+	unsigned long idle_age;
+	unsigned long nr_pages[ANON_AND_FILE];
+};
+
+struct wsr_report_bins {
+	/* excludes the WORKINGSET_INTERVAL_MAX bin */
+	unsigned long nr_bins;
+	/* last bin contains WORKINGSET_INTERVAL_MAX */
+	unsigned long idle_age[WORKINGSET_REPORT_MAX_NR_BINS];
+	struct rcu_head rcu;
+};
+
+struct wsr_page_age_histo {
+	unsigned long timestamp;
+	struct wsr_report_bin bins[WORKINGSET_REPORT_MAX_NR_BINS];
+};
+
+struct wsr_state {
+	/* breakdown of workingset by page age */
+	struct mutex page_age_lock;
+	struct wsr_page_age_histo *page_age;
+};
+
+void wsr_init_lruvec(struct lruvec *lruvec);
+void wsr_destroy_lruvec(struct lruvec *lruvec);
+void wsr_init_pgdat(struct pglist_data *pgdat);
+void wsr_destroy_pgdat(struct pglist_data *pgdat);
+void wsr_init_sysfs(struct node *node);
+void wsr_remove_sysfs(struct node *node);
+
+/*
+ * Returns true if the wsr is configured to be refreshed.
+ * The next refresh time is stored in refresh_time.
+ */
+bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root,
+			struct pglist_data *pgdat);
+#else
+static inline void wsr_init_lruvec(struct lruvec *lruvec)
+{
+}
+static inline void wsr_destroy_lruvec(struct lruvec *lruvec)
+{
+}
+static inline void wsr_init_pgdat(struct pglist_data *pgdat)
+{
+}
+static inline void wsr_destroy_pgdat(struct pglist_data *pgdat)
+{
+}
+static inline void wsr_init_sysfs(struct node *node)
+{
+}
+static inline void wsr_remove_sysfs(struct node *node)
+{
+}
+#endif /* CONFIG_WORKINGSET_REPORT */
+
+#endif /* _LINUX_WORKINGSET_REPORT_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index ffc3a2ba3a8c..212f203b10b9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1261,6 +1261,15 @@ config LOCK_MM_AND_FIND_VMA
 config IOMMU_MM_DATA
 	bool
 
+config WORKINGSET_REPORT
+	bool "Working set reporting"
+	depends on LRU_GEN && SYSFS
+	help
+	  Report system and per-memcg working set to userspace.
+
+	  This option exports stats and events giving the user more insight
+	  into its memory working set.
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index e4b5b75aaec9..57093657030d 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
 obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
+obj-$(CONFIG_WORKINGSET_REPORT) += workingset_report.o
 ifdef CONFIG_SWAP
 obj-$(CONFIG_MEMCG) += swap_cgroup.o
 endif
diff --git a/mm/internal.h b/mm/internal.h
index f309a010d50f..5e0caba64ee4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -198,12 +198,21 @@ extern unsigned long highest_memmap_pfn;
 /*
  * in mm/vmscan.c:
  */
+struct scan_control;
 bool isolate_lru_page(struct page *page);
 bool folio_isolate_lru(struct folio *folio);
 void putback_lru_page(struct page *page);
 void folio_putback_lru(struct folio *folio);
 extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason);
 
+#ifdef CONFIG_WORKINGSET_REPORT
+/*
+ * in mm/wsr.c
+ */
+/* Requires wsr->page_age_lock held */
+void wsr_refresh_scan(struct lruvec *lruvec);
+#endif
+
 /*
  * in mm/rmap.c:
  */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1ed40f9d3a27..b5b67c93c287 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -65,6 +65,7 @@
 #include
 #include
 #include
+#include
 #include "internal.h"
 #include
 #include
@@ -5457,6 +5458,7 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 	if (!pn)
 		return;
 
+	wsr_destroy_lruvec(&pn->lruvec);
 	free_percpu(pn->lruvec_stats_percpu);
 	kfree(pn);
 }
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 2c19f5515e36..c741c3f1e3db 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -27,6 +27,7 @@
 #include
 #include
 #include
+#include
 #include "internal.h"
 #include "slab.h"
 #include "shuffle.h"
@@ -1368,6 +1369,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 
 	pgdat_page_ext_init(pgdat);
 	lruvec_init(&pgdat->__lruvec);
+	wsr_init_pgdat(pgdat);
 }
 
 static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx, int nid,
diff --git a/mm/mmzone.c b/mm/mmzone.c
index c01896eca736..477cd5ac1d78 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -90,6 +90,8 @@ void lruvec_init(struct lruvec *lruvec)
 	 */
 	list_del(&lruvec->lists[LRU_UNEVICTABLE]);
 
+	wsr_init_lruvec(lruvec);
+
 	lru_gen_init_lruvec(lruvec);
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1a7c7d537db6..9af6793a6534 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -56,6 +56,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -5606,6 +5607,8 @@ static int __init init_lru_gen(void)
 	if (sysfs_create_group(mm_kobj, &lru_gen_attr_group))
 		pr_err("lru_gen: failed to create sysfs group\n");
 
+	wsr_init_sysfs(NULL);
+
 	debugfs_create_file("lru_gen", 0644, NULL, NULL, &lru_gen_rw_fops);
 	debugfs_create_file("lru_gen_full", 0444, NULL, NULL, &lru_gen_ro_fops);
 
@@ -5613,6 +5616,35 @@ static int __init init_lru_gen(void)
 };
 late_initcall(init_lru_gen);
 
+/******************************************************************************
+ *                          workingset reporting
+ ******************************************************************************/
+#ifdef CONFIG_WORKINGSET_REPORT
+void wsr_refresh_scan(struct lruvec *lruvec)
+{
+	DEFINE_MAX_SEQ(lruvec);
+	struct scan_control sc = {
+		.may_writepage = true,
+		.may_unmap = true,
+		.may_swap = true,
+		.proactive = true,
+		.reclaim_idx = MAX_NR_ZONES - 1,
+		.gfp_mask = GFP_KERNEL,
+	};
+	unsigned int flags;
+
+	set_task_reclaim_state(current, &sc.reclaim_state);
+	flags = memalloc_noreclaim_save();
+	/*
+	 * setting can_swap=true and force_scan=true ensures
+	 * proper workingset stats when the system cannot swap.
+	 */
+	try_to_inc_max_seq(lruvec, max_seq, &sc, true, true);
+	memalloc_noreclaim_restore(flags);
+	set_task_reclaim_state(current, NULL);
+}
+#endif /* CONFIG_WORKINGSET_REPORT */
+
 #else /* !CONFIG_LRU_GEN */
 
 static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
diff --git a/mm/workingset_report.c b/mm/workingset_report.c
new file mode 100644
index 000000000000..7b872b9fa7da
--- /dev/null
+++ b/mm/workingset_report.c
@@ -0,0 +1,438 @@
+// SPDX-License-Identifier: GPL-2.0
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "internal.h"
+
+void wsr_init_pgdat(struct pglist_data *pgdat)
+{
+	mutex_init(&pgdat->wsr_update_mutex);
+	RCU_INIT_POINTER(pgdat->wsr_page_age_bins, NULL);
+}
+
+void wsr_destroy_pgdat(struct pglist_data *pgdat)
+{
+	struct wsr_report_bins __rcu *bins;
+
+	mutex_lock(&pgdat->wsr_update_mutex);
+	bins = rcu_replace_pointer(pgdat->wsr_page_age_bins, NULL,
+				   lockdep_is_held(&pgdat->wsr_update_mutex));
+	kfree_rcu(bins, rcu);
+	mutex_unlock(&pgdat->wsr_update_mutex);
+	mutex_destroy(&pgdat->wsr_update_mutex);
+}
+
+void wsr_init_lruvec(struct lruvec *lruvec)
+{
+	struct wsr_state *wsr = &lruvec->wsr;
+
+	memset(wsr, 0, sizeof(*wsr));
+	mutex_init(&wsr->page_age_lock);
+}
+
+void wsr_destroy_lruvec(struct lruvec *lruvec)
+{
+	struct wsr_state *wsr = &lruvec->wsr;
+
+	mutex_destroy(&wsr->page_age_lock);
+	kfree(wsr->page_age);
+	memset(wsr, 0, sizeof(*wsr));
+}
+
+static int workingset_report_intervals_parse(char *src,
+					     struct wsr_report_bins *bins)
+{
+	int err = 0, i = 0;
+	char *cur, *next = strim(src);
+
+	if (*next == '\0')
+		return 0;
+
+	while ((cur = strsep(&next, ","))) {
+		unsigned int interval;
+
+		err = kstrtouint(cur, 0, &interval);
+		if (err)
+			goto out;
+
+		bins->idle_age[i] = msecs_to_jiffies(interval);
+		if (i > 0 && bins->idle_age[i] <= bins->idle_age[i - 1]) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		if (++i == WORKINGSET_REPORT_MAX_NR_BINS) {
+			err = -ERANGE;
+			goto out;
+		}
+	}
+
+	if (i && i < WORKINGSET_REPORT_MIN_NR_BINS - 1) {
+		err = -ERANGE;
+		goto out;
+	}
+
+	bins->nr_bins = i;
+	bins->idle_age[i] = WORKINGSET_INTERVAL_MAX;
+out:
+	return err ?: i;
+}
+
+static unsigned long get_gen_start_time(const struct lru_gen_folio *lrugen,
+					unsigned long seq,
+					unsigned long max_seq,
+					unsigned long curr_timestamp)
+{
+	int younger_gen;
+
+	if (seq == max_seq)
+		return curr_timestamp;
+	younger_gen = lru_gen_from_seq(seq + 1);
+	return READ_ONCE(lrugen->timestamps[younger_gen]);
+}
+
+static void collect_page_age_type(const struct lru_gen_folio *lrugen,
+				  struct wsr_report_bin *bin,
+				  unsigned long max_seq, unsigned long min_seq,
+				  unsigned long curr_timestamp, int type)
+{
+	unsigned long seq;
+
+	for (seq = max_seq; seq + 1 > min_seq; seq--) {
+		int gen, zone;
+		unsigned long gen_end, gen_start, size = 0;
+
+		gen = lru_gen_from_seq(seq);
+
+		for (zone = 0; zone < MAX_NR_ZONES; zone++)
+			size += max(
+				READ_ONCE(lrugen->nr_pages[gen][type][zone]),
+				0L);
+
+		gen_start = get_gen_start_time(lrugen, seq, max_seq,
+					       curr_timestamp);
+		gen_end = READ_ONCE(lrugen->timestamps[gen]);
+
+		while (bin->idle_age != WORKINGSET_INTERVAL_MAX &&
+		       time_before(gen_end + bin->idle_age, curr_timestamp)) {
+			unsigned long gen_in_bin = (long)gen_start -
+						   (long)curr_timestamp +
+						   (long)bin->idle_age;
+			unsigned long gen_len = (long)gen_start - (long)gen_end;
+
+			if (!gen_len)
+				break;
+			if (gen_in_bin) {
+				unsigned long split_bin =
+					size / gen_len * gen_in_bin;
+
+				bin->nr_pages[type] += split_bin;
+				size -= split_bin;
+			}
+			gen_start = curr_timestamp - bin->idle_age;
+			bin++;
+		}
+		bin->nr_pages[type] += size;
+	}
+}
+
+/*
+ * proportionally aggregate Multi-gen LRU bins into a working set report
+ * MGLRU generations:
+ * current time
+ * |         max_seq timestamp
+ * |         |     max_seq - 1 timestamp
+ * |         |     |               unbounded
+ * |         |     |               |
+ * --------------------------------
+ * | max_seq | ... | ... | min_seq
+ * --------------------------------
+ *
+ * Bins:
+ *
+ * current time
+ * |       current - idle_age[0]
+ * |       |     current - idle_age[1]
+ * |       |     |               unbounded
+ * |       |     |               |
+ * ------------------------------
+ * | bin 0 | ... | ... | bin n-1
+ * ------------------------------
+ *
+ * Assume the heuristic that pages are in the MGLRU generation
+ * through uniform accesses, so we can aggregate them
+ * proportionally into bins.
+ */
+static void collect_page_age(struct wsr_page_age_histo *page_age,
+			     const struct lruvec *lruvec)
+{
+	int type;
+	const struct lru_gen_folio *lrugen = &lruvec->lrugen;
+	unsigned long curr_timestamp = jiffies;
+	unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq);
+	unsigned long min_seq[ANON_AND_FILE] = {
+		READ_ONCE(lruvec->lrugen.min_seq[LRU_GEN_ANON]),
+		READ_ONCE(lruvec->lrugen.min_seq[LRU_GEN_FILE]),
+	};
+	struct wsr_report_bin *bin = &page_age->bins[0];
+
+	for (type = 0; type < ANON_AND_FILE; type++)
+		collect_page_age_type(lrugen, bin, max_seq, min_seq[type],
+				      curr_timestamp, type);
+}
+
+/* First step: hierarchically scan child memcgs. */
+static void refresh_scan(struct wsr_state *wsr, struct mem_cgroup *root,
+			 struct pglist_data *pgdat)
+{
+	struct mem_cgroup *memcg;
+
+	memcg = mem_cgroup_iter(root, NULL, NULL);
+	do {
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
+		wsr_refresh_scan(lruvec);
+		cond_resched();
+	} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
+}
+
+/* Second step: aggregate child memcgs into the page age histogram. */
+static void refresh_aggregate(struct wsr_page_age_histo *page_age,
+			      struct mem_cgroup *root,
+			      struct pglist_data *pgdat)
+{
+	struct mem_cgroup *memcg;
+	struct wsr_report_bin *bin;
+
+	for (bin = page_age->bins;
+	     bin->idle_age != WORKINGSET_INTERVAL_MAX; bin++) {
+		bin->nr_pages[0] = 0;
+		bin->nr_pages[1] = 0;
+	}
+	/* the last used bin has idle_age == WORKINGSET_INTERVAL_MAX. */
+	bin->nr_pages[0] = 0;
+	bin->nr_pages[1] = 0;
+
+	memcg = mem_cgroup_iter(root, NULL, NULL);
+	do {
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
+		collect_page_age(page_age, lruvec);
+		cond_resched();
+	} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
+	WRITE_ONCE(page_age->timestamp, jiffies);
+}
+
+static void copy_node_bins(struct pglist_data *pgdat,
+			   struct wsr_page_age_histo *page_age)
+{
+	struct wsr_report_bins *node_page_age_bins;
+	int i = 0;
+
+	rcu_read_lock();
+	node_page_age_bins = rcu_dereference(pgdat->wsr_page_age_bins);
+	if (!node_page_age_bins)
+		goto nocopy;
+	for (i = 0; i < node_page_age_bins->nr_bins; ++i)
+		page_age->bins[i].idle_age = node_page_age_bins->idle_age[i];
+
+nocopy:
+	page_age->bins[i].idle_age = WORKINGSET_INTERVAL_MAX;
+	rcu_read_unlock();
+}
+
+bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root,
+			struct pglist_data *pgdat)
+{
+	struct wsr_page_age_histo *page_age;
+
+	if (!READ_ONCE(wsr->page_age))
+		return false;
+
+	refresh_scan(wsr, root, pgdat);
+	mutex_lock(&wsr->page_age_lock);
+	page_age = READ_ONCE(wsr->page_age);
+	if (page_age) {
+		copy_node_bins(pgdat, page_age);
+		refresh_aggregate(page_age, root, pgdat);
+	}
+	mutex_unlock(&wsr->page_age_lock);
+	return !!page_age;
+}
+EXPORT_SYMBOL_GPL(wsr_refresh_report);
+
+static struct pglist_data *kobj_to_pgdat(struct kobject *kobj)
+{
+	int nid = IS_ENABLED(CONFIG_NUMA) ? kobj_to_dev(kobj)->id :
+					    first_memory_node;
+
+	return NODE_DATA(nid);
+}
+
+static struct wsr_state *kobj_to_wsr(struct kobject *kobj)
+{
+	return &mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj))->wsr;
+}
+
+static ssize_t page_age_intervals_show(struct kobject *kobj,
+				       struct kobj_attribute *attr, char *buf)
+{
+	struct wsr_report_bins *bins;
+	int len = 0;
+	struct pglist_data *pgdat = kobj_to_pgdat(kobj);
+
+	rcu_read_lock();
+	bins = rcu_dereference(pgdat->wsr_page_age_bins);
+	if (bins) {
+		int i;
+		int nr_bins = bins->nr_bins;
+
+		for (i = 0; i < bins->nr_bins; ++i) {
+			len += sysfs_emit_at(
+				buf, len, "%u",
+				jiffies_to_msecs(bins->idle_age[i]));
+			if (i + 1 < nr_bins)
+				len += sysfs_emit_at(buf, len, ",");
+		}
+	}
+	len += sysfs_emit_at(buf, len, "\n");
+	rcu_read_unlock();
+
+	return len;
+}
+
+static ssize_t page_age_intervals_store(struct kobject *kobj,
+					struct kobj_attribute *attr,
+					const char *src, size_t len)
+{
+	struct wsr_report_bins *bins = NULL, __rcu *old;
+	char *buf = NULL;
+	int err = 0;
+	struct pglist_data *pgdat = kobj_to_pgdat(kobj);
+
+	buf = kstrdup(src, GFP_KERNEL);
+	if (!buf) {
+		err = -ENOMEM;
+		goto failed;
+	}
+
+	bins =
+		kzalloc(sizeof(struct wsr_report_bins), GFP_KERNEL);
+
+	if (!bins) {
+		err = -ENOMEM;
+		goto failed;
+	}
+
+	err = workingset_report_intervals_parse(buf, bins);
+	if (err < 0)
+		goto failed;
+
+	if (err == 0) {
+		kfree(bins);
+		bins = NULL;
+	}
+
+	mutex_lock(&pgdat->wsr_update_mutex);
+	old = rcu_replace_pointer(pgdat->wsr_page_age_bins, bins,
+				  lockdep_is_held(&pgdat->wsr_update_mutex));
+	mutex_unlock(&pgdat->wsr_update_mutex);
+	kfree_rcu(old, rcu);
+	kfree(buf);
+	return len;
+failed:
+	kfree(bins);
+	kfree(buf);
+
+	return err;
+}
+
+static struct kobj_attribute page_age_intervals_attr =
+	__ATTR_RW(page_age_intervals);
+
+static ssize_t page_age_show(struct kobject *kobj, struct kobj_attribute *attr,
+			     char *buf)
+{
+	struct wsr_report_bin *bin;
+	int ret = 0;
+	struct wsr_state *wsr = kobj_to_wsr(kobj);
+
+
+	mutex_lock(&wsr->page_age_lock);
+	if (!wsr->page_age)
+		wsr->page_age =
+			kzalloc(sizeof(struct wsr_page_age_histo), GFP_KERNEL);
+	mutex_unlock(&wsr->page_age_lock);
+
+	wsr_refresh_report(wsr, NULL, kobj_to_pgdat(kobj));
+
+	mutex_lock(&wsr->page_age_lock);
+	if (!wsr->page_age)
+		goto unlock;
+	for (bin = wsr->page_age->bins;
+	     bin->idle_age != WORKINGSET_INTERVAL_MAX; bin++)
+		ret += sysfs_emit_at(buf, ret, "%u anon=%lu file=%lu\n",
+				     jiffies_to_msecs(bin->idle_age),
+				     bin->nr_pages[0] * PAGE_SIZE,
+				     bin->nr_pages[1] * PAGE_SIZE);
+
+	ret += sysfs_emit_at(buf, ret, "%lu anon=%lu file=%lu\n",
+			     WORKINGSET_INTERVAL_MAX,
+			     bin->nr_pages[0] * PAGE_SIZE,
+			     bin->nr_pages[1] * PAGE_SIZE);
+
+unlock:
+	mutex_unlock(&wsr->page_age_lock);
+	return ret;
+}
+
+static struct kobj_attribute page_age_attr = __ATTR_RO(page_age);
+
+static struct attribute *workingset_report_attrs[] = {
+	&page_age_intervals_attr.attr, &page_age_attr.attr, NULL
+};
+
+static const struct attribute_group workingset_report_attr_group = {
+	.name = "workingset_report",
+	.attrs = workingset_report_attrs,
+};
+
+void wsr_init_sysfs(struct node *node)
+{
+	struct kobject *kobj = node ? &node->dev.kobj : mm_kobj;
+	struct wsr_state *wsr;
+
+	if (IS_ENABLED(CONFIG_NUMA) && !node)
+		return;
+
+	wsr = kobj_to_wsr(kobj);
+
+	if (sysfs_create_group(kobj, &workingset_report_attr_group))
+		pr_warn("Workingset report failed to create sysfs files\n");
+}
+EXPORT_SYMBOL_GPL(wsr_init_sysfs);
+
+void wsr_remove_sysfs(struct node *node)
+{
+	struct kobject *kobj = &node->dev.kobj;
+	struct wsr_state *wsr;
+
+	if (IS_ENABLED(CONFIG_NUMA) && !node)
+		return;
+
+	wsr = kobj_to_wsr(kobj);
+	sysfs_remove_group(kobj, &workingset_report_attr_group);
+}
+EXPORT_SYMBOL_GPL(wsr_remove_sysfs);
-- 
2.45.0.rc1.225.g2a3ae87e7f-goog
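
Not part of the patch: a simplified userspace model of the proportional
split described in the comment above collect_page_age(), assuming plain
increasing timestamps rather than jiffies. All names here are
hypothetical. Two details are choices of this sketch, not of the patch:
it multiplies before dividing (the patch computes
size / gen_len * gen_in_bin), and it skips past a bin the generation
does not overlap instead of breaking out of the loop.

```c
#include <stddef.h>

#define IDLE_AGE_MAX ((unsigned long)-1)

struct bin {
	unsigned long idle_age;	/* upper idle-age bound; IDLE_AGE_MAX ends the array */
	unsigned long nr_pages;
};

struct gen {
	unsigned long birth;	/* creation time of the generation */
	unsigned long nr_pages;
};

/*
 * gens[] is ordered youngest first. Generation i spans [birth, start),
 * where start is `now` for i == 0 and gens[i - 1].birth otherwise.
 * Pages are assumed to be spread uniformly over that span.
 */
static void split_into_bins(struct bin *bin, const struct gen *gens,
			    size_t nr_gens, unsigned long now)
{
	size_t i;

	for (i = 0; i < nr_gens; i++) {
		unsigned long start = i ? gens[i - 1].birth : now;
		unsigned long end = gens[i].birth;
		unsigned long size = gens[i].nr_pages;

		/* Advance while the generation reaches past this bin's bound. */
		while (bin->idle_age != IDLE_AGE_MAX && bin->idle_age < now &&
		       end < now - bin->idle_age) {
			unsigned long bound = now - bin->idle_age;

			if (start > bound) {
				/* Share of the remaining span inside this bin. */
				unsigned long part = (unsigned long long)size *
						     (start - bound) / (start - end);

				bin->nr_pages += part;
				size -= part;
				start = bound;
			}
			bin++;
		}
		/* Whatever is left belongs to the bin the walk stopped at. */
		bin->nr_pages += size;
	}
}
```

As a worked case: with now = 10000, bin bounds of 1000 and 2000 plus the
unbounded bin, and two generations (born at 9500 holding 100 pages, born
at 7500 holding 400 pages), the bins end up with 200, 200 and 100 pages.
As in the patch, the bin cursor is not reset between generations, which
relies on generations being visited from youngest to oldest.
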