Date: Wed, 10 Sep 2025 16:51:20 -0700
In-Reply-To: <20250910235121.2544928-1-kinseyho@google.com>
References: <20250910235121.2544928-1-kinseyho@google.com>
Message-ID: <20250910235121.2544928-2-kinseyho@google.com>
Subject: [RFC PATCH v2 1/2] mm: mglru: generalize page table walk
From: Kinsey Ho <kinseyho@google.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Jonathan.Cameron@huawei.com, dave.hansen@intel.com, gourry@gourry.net,
    hannes@cmpxchg.org, mgorman@techsingularity.net, mingo@redhat.com,
    peterz@infradead.org, raghavendra.kt@amd.com, riel@surriel.com,
    rientjes@google.com, sj@kernel.org, weixugc@google.com,
    willy@infradead.org, ying.huang@linux.alibaba.com, ziy@nvidia.com,
    dave@stgolabs.net, nifan.cxl@gmail.com, xuezhengchu@huawei.com,
    yiannis@zptcorp.com, akpm@linux-foundation.org, david@redhat.com,
    byungchul@sk.com, kinseyho@google.com, joshua.hahnjy@gmail.com,
    yuanchu@google.com, balbirs@nvidia.com, alok.rathore@samsung.com,
    lorenzo.stoakes@oracle.com, axelrasmussen@google.com,
    Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
    surenb@google.com, mhocko@suse.com, zhengqi.arch@bytedance.com,
    shakeel.butt@linux.dev

Refactor the existing MGLRU page table walking logic to make it
resumable. Additionally, introduce two hooks into the MGLRU page table
walk: an accessed callback and a flush callback. The accessed callback
is called for each accessed page detected via the scanned accessed
bit; the flush callback is called whenever the accessed callback
reports that a flush is required, which allows accessed pages to be
processed in batches for efficiency.

With the generalized page table walk, introduce a new scan function,
lru_gen_scan_lruvec(), which repeatedly scans the same young
generation and does not add a new young generation.
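To make the callback contract concrete, a minimal hypothetical consumer
might look as follows (illustration only, not part of the diff; it
assumes a struct lru_gen_mm_walk has been installed via
set_task_reclaim_state(), as the klruscand consumer in patch 2/2 does):

	/* hypothetical consumer; illustration only */
	static unsigned long batch[64];
	static int nr;

	/* called for each accessed PFN; return true to request a flush */
	static bool example_accessed_cb(unsigned long pfn)
	{
		/* the walk pauses once we return true, so nr never overflows */
		batch[nr++] = pfn;
		return nr == ARRAY_SIZE(batch);
	}

	/* called each time the walk pauses or finishes an mm */
	static void example_flush_cb(void)
	{
		/* consume batch[0..nr) here, then reset the batch */
		nr = 0;
	}

	static void example_scan(struct lruvec *lruvec, unsigned long max_seq)
	{
		/* rescans the current young generation; never creates one */
		lru_gen_scan_lruvec(lruvec, max_seq,
				    example_accessed_cb, example_flush_cb);
	}

When the accessed callback returns true, the walk records its position
in walk->next_addr and unwinds with -EAGAIN; lru_gen_scan_lruvec() then
invokes the flush callback and resumes the same mm from the saved
address.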
Signed-off-by: Kinsey Ho <kinseyho@google.com>
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
---
 include/linux/mmzone.h |   5 ++
 mm/internal.h          |   4 +
 mm/vmscan.c            | 181 +++++++++++++++++++++++++++++++----------
 3 files changed, 145 insertions(+), 45 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f7094babed10..a30fbdbb1a57 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -533,6 +533,8 @@ struct lru_gen_mm_walk {
 	unsigned long seq;
 	/* the next address within an mm to scan */
 	unsigned long next_addr;
+	/* called for each accessed pte/pmd */
+	bool (*accessed_cb)(unsigned long pfn);
 	/* to batch promoted pages */
 	int nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
 	/* to batch the mm stats */
@@ -540,6 +542,9 @@ struct lru_gen_mm_walk {
 	int mm_stats[NR_MM_STATS];
 	/* total batched items */
 	int batched;
 	int swappiness;
+	/* for the pmd under scanning */
+	int nr_young_pte;
+	int nr_total_pte;
 	bool force_scan;
 };
 
diff --git a/mm/internal.h b/mm/internal.h
index 45b725c3dc03..f8ca128f61ab 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -538,6 +538,10 @@ extern unsigned long highest_memmap_pfn;
 bool folio_isolate_lru(struct folio *folio);
 void folio_putback_lru(struct folio *folio);
 extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason);
+void set_task_reclaim_state(struct task_struct *task,
+			    struct reclaim_state *rs);
+void lru_gen_scan_lruvec(struct lruvec *lruvec, unsigned long seq,
+		bool (*accessed_cb)(unsigned long), void (*flush_cb)(void));
 #ifdef CONFIG_NUMA
 int user_proactive_reclaim(char *buf, struct mem_cgroup *memcg,
 			   pg_data_t *pgdat);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a48aec8bfd92..88db10f1aee2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -289,7 +289,7 @@ static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
 			continue;					\
 		else
 
-static void set_task_reclaim_state(struct task_struct *task,
+void set_task_reclaim_state(struct task_struct *task,
 				   struct reclaim_state *rs)
 {
 	/* Check for an overwrite */
@@ -3092,7 +3092,7 @@ static bool iterate_mm_list(struct lru_gen_mm_walk *walk, struct mm_struct **iter)
 
 	VM_WARN_ON_ONCE(mm_state->seq + 1 < walk->seq);
 
-	if (walk->seq <= mm_state->seq)
+	if (!walk->accessed_cb && walk->seq <= mm_state->seq)
 		goto done;
 
 	if (!mm_state->head)
@@ -3518,16 +3518,14 @@ static void walk_update_folio(struct lru_gen_mm_walk *walk, struct folio *folio,
 	}
 }
 
-static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
-			   struct mm_walk *args)
+static int walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
+			  struct mm_walk *args, bool *suitable)
 {
 	int i;
 	bool dirty;
 	pte_t *pte;
 	spinlock_t *ptl;
 	unsigned long addr;
-	int total = 0;
-	int young = 0;
 	struct folio *last = NULL;
 	struct lru_gen_mm_walk *walk = args->private;
 	struct mem_cgroup *memcg = lruvec_memcg(walk->lruvec);
@@ -3535,19 +3533,24 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 	DEFINE_MAX_SEQ(walk->lruvec);
 	int gen = lru_gen_from_seq(max_seq);
 	pmd_t pmdval;
+	int err = 0;
 
 	pte = pte_offset_map_rw_nolock(args->mm, pmd, start & PMD_MASK, &pmdval, &ptl);
-	if (!pte)
-		return false;
+	if (!pte) {
+		*suitable = false;
+		return err;
+	}
 
 	if (!spin_trylock(ptl)) {
 		pte_unmap(pte);
-		return true;
+		*suitable = true;
+		return err;
 	}
 
 	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
 		pte_unmap_unlock(pte, ptl);
-		return false;
+		*suitable = false;
+		return err;
 	}
 
 	arch_enter_lazy_mmu_mode();
@@ -3556,8 +3559,9 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 		unsigned long pfn;
 		struct folio *folio;
 		pte_t ptent = ptep_get(pte + i);
+		bool do_flush;
 
-		total++;
+		walk->nr_total_pte++;
 		walk->mm_stats[MM_LEAF_TOTAL]++;
 
 		pfn = get_pte_pfn(ptent, args->vma, addr, pgdat);
@@ -3581,23 +3585,36 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 		if (pte_dirty(ptent))
 			dirty = true;
 
-		young++;
+		walk->nr_young_pte++;
 		walk->mm_stats[MM_LEAF_YOUNG]++;
+
+		if (!walk->accessed_cb)
+			continue;
+
+		do_flush = walk->accessed_cb(pfn);
+		if (do_flush) {
+			walk->next_addr = addr + PAGE_SIZE;
+
+			err = -EAGAIN;
+			break;
+		}
 	}
 
 	walk_update_folio(walk, last, gen, dirty);
 	last = NULL;
 
-	if (i < PTRS_PER_PTE && get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end))
+	if (!err && i < PTRS_PER_PTE &&
+	    get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end))
 		goto restart;
 
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte, ptl);
 
-	return suitable_to_scan(total, young);
+	*suitable = suitable_to_scan(walk->nr_total_pte, walk->nr_young_pte);
+	return err;
 }
 
-static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area_struct *vma,
+static int walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area_struct *vma,
 			   struct mm_walk *args, unsigned long *bitmap, unsigned long *first)
 {
 	int i;
@@ -3610,6 +3627,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
 	struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
 	DEFINE_MAX_SEQ(walk->lruvec);
 	int gen = lru_gen_from_seq(max_seq);
+	int err = 0;
 
 	VM_WARN_ON_ONCE(pud_leaf(*pud));
 
@@ -3617,13 +3635,13 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
 	if (*first == -1) {
 		*first = addr;
 		bitmap_zero(bitmap, MIN_LRU_BATCH);
-		return;
+		return err;
 	}
 
 	i = addr == -1 ? 0 : pmd_index(addr) - pmd_index(*first);
 	if (i && i <= MIN_LRU_BATCH) {
 		__set_bit(i - 1, bitmap);
-		return;
+		return err;
 	}
 
 	pmd = pmd_offset(pud, *first);
@@ -3637,6 +3655,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
 	do {
 		unsigned long pfn;
 		struct folio *folio;
+		bool do_flush;
 
 		/* don't round down the first address */
 		addr = i ? (*first & PMD_MASK) + i * PMD_SIZE : *first;
@@ -3673,6 +3692,17 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
 			dirty = true;
 
 		walk->mm_stats[MM_LEAF_YOUNG]++;
+		if (!walk->accessed_cb)
+			goto next;
+
+		do_flush = walk->accessed_cb(pfn);
+		if (do_flush) {
+			i = find_next_bit(bitmap, MIN_LRU_BATCH, i) + 1;
+
+			walk->next_addr = (*first & PMD_MASK) + i * PMD_SIZE;
+			err = -EAGAIN;
+			break;
+		}
 next:
 		i = i > MIN_LRU_BATCH ? 0 : find_next_bit(bitmap, MIN_LRU_BATCH, i) + 1;
 	} while (i <= MIN_LRU_BATCH);
@@ -3683,9 +3713,10 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
 	spin_unlock(ptl);
 done:
 	*first = -1;
+	return err;
 }
 
-static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
+static int walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
 			  struct mm_walk *args)
 {
 	int i;
@@ -3697,6 +3728,7 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
 	unsigned long first = -1;
 	struct lru_gen_mm_walk *walk = args->private;
 	struct lru_gen_mm_state *mm_state = get_mm_state(walk->lruvec);
+	int err = 0;
 
 	VM_WARN_ON_ONCE(pud_leaf(*pud));
 
@@ -3710,6 +3742,7 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
 	/* walk_pte_range() may call get_next_vma() */
 	vma = args->vma;
 	for (i = pmd_index(start), addr = start; addr != end; i++, addr = next) {
+		bool suitable;
 		pmd_t val = pmdp_get_lockless(pmd + i);
 
 		next = pmd_addr_end(addr, end);
@@ -3726,7 +3759,10 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
 			walk->mm_stats[MM_LEAF_TOTAL]++;
 
 			if (pfn != -1)
-				walk_pmd_range_locked(pud, addr, vma, args, bitmap, &first);
+				err = walk_pmd_range_locked(pud, addr, vma, args,
+							    bitmap, &first);
+			if (err)
+				return err;
 			continue;
 		}
 
@@ -3735,33 +3771,51 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
 			if (!pmd_young(val))
 				continue;
 
-			walk_pmd_range_locked(pud, addr, vma, args, bitmap, &first);
+			err = walk_pmd_range_locked(pud, addr, vma, args,
+						    bitmap, &first);
+			if (err)
+				return err;
 		}
 
 		if (!walk->force_scan && !test_bloom_filter(mm_state, walk->seq, pmd + i))
 			continue;
 
+		err = walk_pte_range(&val, addr, next, args, &suitable);
+		if (err && walk->next_addr < next && first == -1)
+			return err;
+
+		walk->nr_total_pte = 0;
+		walk->nr_young_pte = 0;
+
 		walk->mm_stats[MM_NONLEAF_FOUND]++;
 
-		if (!walk_pte_range(&val, addr, next, args))
-			continue;
+		if (!suitable)
+			goto next;
 
 		walk->mm_stats[MM_NONLEAF_ADDED]++;
 
 		/* carry over to the next generation */
 		update_bloom_filter(mm_state, walk->seq + 1, pmd + i);
+next:
+		if (err) {
+			walk->next_addr = first;
+			return err;
+		}
 	}
 
-	walk_pmd_range_locked(pud, -1, vma, args, bitmap, &first);
+	err = walk_pmd_range_locked(pud, -1, vma, args, bitmap, &first);
 
-	if (i < PTRS_PER_PMD && get_next_vma(PUD_MASK, PMD_SIZE, args, &start, &end))
+	if (!err && i < PTRS_PER_PMD &&
+	    get_next_vma(PUD_MASK, PMD_SIZE, args, &start, &end))
 		goto restart;
+
+	return err;
 }
 
 static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
 			  struct mm_walk *args)
 {
-	int i;
+	int i, err;
 	pud_t *pud;
 	unsigned long addr;
 	unsigned long next;
@@ -3779,7 +3833,9 @@ static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
 		if (!pud_present(val) || WARN_ON_ONCE(pud_leaf(val)))
 			continue;
 
-		walk_pmd_range(&val, addr, next, args);
+		err = walk_pmd_range(&val, addr, next, args);
+		if (err)
+			return err;
 
 		if (need_resched() || walk->batched >= MAX_LRU_BATCH) {
 			end = (addr | ~PUD_MASK) + 1;
@@ -3800,40 +3856,48 @@ static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
 	return -EAGAIN;
 }
 
-static void walk_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+static int try_walk_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
 {
+	int err;
 	static const struct mm_walk_ops mm_walk_ops = {
 		.test_walk = should_skip_vma,
 		.p4d_entry = walk_pud_range,
 		.walk_lock = PGWALK_RDLOCK,
 	};
-	int err;
 	struct lruvec *lruvec = walk->lruvec;
 
-	walk->next_addr = FIRST_USER_ADDRESS;
+	DEFINE_MAX_SEQ(lruvec);
 
-	do {
-		DEFINE_MAX_SEQ(lruvec);
+	err = -EBUSY;
 
-		err = -EBUSY;
+	/* another thread might have called inc_max_seq() */
+	if (walk->seq != max_seq)
+		return err;
 
-		/* another thread might have called inc_max_seq() */
-		if (walk->seq != max_seq)
-			break;
+	/* the caller might be holding the lock for write */
+	if (mmap_read_trylock(mm)) {
+		err = walk_page_range(mm, walk->next_addr, ULONG_MAX,
+				      &mm_walk_ops, walk);
 
-		/* the caller might be holding the lock for write */
-		if (mmap_read_trylock(mm)) {
-			err = walk_page_range(mm, walk->next_addr, ULONG_MAX, &mm_walk_ops, walk);
+		mmap_read_unlock(mm);
+	}
 
-			mmap_read_unlock(mm);
-		}
+	if (walk->batched) {
+		spin_lock_irq(&lruvec->lru_lock);
+		reset_batch_size(walk);
+		spin_unlock_irq(&lruvec->lru_lock);
+	}
 
-		if (walk->batched) {
-			spin_lock_irq(&lruvec->lru_lock);
-			reset_batch_size(walk);
-			spin_unlock_irq(&lruvec->lru_lock);
-		}
+	return err;
+}
+
+static void walk_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+{
+	int err;
 
+	walk->next_addr = FIRST_USER_ADDRESS;
+	do {
+		err = try_walk_mm(mm, walk);
 		cond_resched();
 	} while (err == -EAGAIN);
 }
@@ -4045,6 +4109,33 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq, int swappiness
 	return success;
 }
 
+void lru_gen_scan_lruvec(struct lruvec *lruvec, unsigned long seq,
+		bool (*accessed_cb)(unsigned long), void (*flush_cb)(void))
+{
+	struct lru_gen_mm_walk *walk = current->reclaim_state->mm_walk;
+	struct mm_struct *mm = NULL;
+
+	walk->lruvec = lruvec;
+	walk->seq = seq;
+	walk->accessed_cb = accessed_cb;
+	walk->swappiness = MAX_SWAPPINESS;
+
+	do {
+		int err = -EBUSY;
+
+		iterate_mm_list(walk, &mm);
+		if (!mm)
+			break;
+
+		walk->next_addr = FIRST_USER_ADDRESS;
+		do {
+			err = try_walk_mm(mm, walk);
+			cond_resched();
+			flush_cb();
+		} while (err == -EAGAIN);
+	} while (mm);
+}
+
 static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
 			       int swappiness, bool force_scan)
 {
-- 
2.51.0.384.g4c02a37b29-goog

Date: Wed, 10 Sep 2025 16:51:21 -0700
In-Reply-To: <20250910235121.2544928-1-kinseyho@google.com>
References: <20250910235121.2544928-1-kinseyho@google.com>
Message-ID: <20250910235121.2544928-3-kinseyho@google.com>
Subject: [RFC PATCH v2 2/2] mm: klruscand: use mglru scanning for page promotion
From: Kinsey Ho <kinseyho@google.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Jonathan.Cameron@huawei.com, dave.hansen@intel.com, gourry@gourry.net,
    hannes@cmpxchg.org, mgorman@techsingularity.net, mingo@redhat.com,
    peterz@infradead.org, raghavendra.kt@amd.com, riel@surriel.com,
    rientjes@google.com, sj@kernel.org, weixugc@google.com,
    willy@infradead.org, ying.huang@linux.alibaba.com, ziy@nvidia.com,
    dave@stgolabs.net,
    nifan.cxl@gmail.com, xuezhengchu@huawei.com, yiannis@zptcorp.com,
    akpm@linux-foundation.org, david@redhat.com, byungchul@sk.com,
    kinseyho@google.com, joshua.hahnjy@gmail.com, yuanchu@google.com,
    balbirs@nvidia.com, alok.rathore@samsung.com,
    lorenzo.stoakes@oracle.com, axelrasmussen@google.com,
    Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
    surenb@google.com, mhocko@suse.com, zhengqi.arch@bytedance.com,
    shakeel.butt@linux.dev

Introduce a new kernel daemon, klruscand, that periodically invokes
the MGLRU page table walk. It leverages the new callbacks to gather
access information and forwards it to the kpromoted daemon for
promotion decisions.

This reuses the existing MGLRU page table walk infrastructure, which
is optimized with features such as hierarchical scanning and bloom
filters to reduce CPU overhead.

As a future optimization, the scan interval can be tuned per memcg.

Signed-off-by: Kinsey Ho <kinseyho@google.com>
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
---
 mm/Kconfig     |   8 ++++
 mm/Makefile    |   1 +
 mm/klruscand.c | 118 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 127 insertions(+)
 create mode 100644 mm/klruscand.c

diff --git a/mm/Kconfig b/mm/Kconfig
index 8b236eb874cf..6d53c1208729 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1393,6 +1393,14 @@ config PGHOT
 	  by various sources. Asynchronous promotion is done by per-node
 	  kernel threads.
 
+config KLRUSCAND
+	bool "Kernel lower tier access scan daemon"
+	default y
+	depends on PGHOT && LRU_GEN_WALKS_MMU
+	help
+	  Scan for accesses from lower tiers by invoking MGLRU to perform
+	  page table walks.
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index ecdd5241bea8..05a96ec35aa3 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -148,3 +148,4 @@ obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
 obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o
 obj-$(CONFIG_PGHOT) += pghot.o
+obj-$(CONFIG_KLRUSCAND) += klruscand.o
diff --git a/mm/klruscand.c b/mm/klruscand.c
new file mode 100644
index 000000000000..1ee2ac906771
--- /dev/null
+++ b/mm/klruscand.c
@@ -0,0 +1,118 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/jiffies.h>
+#include <linux/kernel.h>
+#include <linux/kthread.h>
+#include <linux/memcontrol.h>
+#include <linux/memory-tiers.h>
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/module.h>
+#include <linux/nodemask.h>
+#include <linux/pghot.h>
+#include <linux/sched/mm.h>
+#include <linux/slab.h>
+#include <linux/swap.h>
+
+#include "internal.h"
+
+#define KLRUSCAND_INTERVAL 2000
+#define BATCH_SIZE (2 << 16)
+
+static struct task_struct *scan_thread;
+static unsigned long pfn_batch[BATCH_SIZE];
+static int batch_index;
+
+static void flush_cb(void)
+{
+	int i;
+
+	for (i = 0; i < batch_index; i++) {
+		unsigned long pfn = pfn_batch[i];
+
+		pghot_record_access(pfn, NUMA_NO_NODE,
+				    PGHOT_PGTABLE_SCAN, jiffies);
+
+		if (i % 16 == 0)
+			cond_resched();
+	}
+	batch_index = 0;
+}
+
+static bool accessed_cb(unsigned long pfn)
+{
+	WARN_ON_ONCE(batch_index == BATCH_SIZE);
+
+	if (batch_index < BATCH_SIZE)
+		pfn_batch[batch_index++] = pfn;
+
+	return batch_index == BATCH_SIZE;
+}
+
+static int klruscand_run(void *unused)
+{
+	struct lru_gen_mm_walk *walk;
+
+	walk = kzalloc(sizeof(*walk),
+		       __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
+	if (!walk)
+		return -ENOMEM;
+
+	while (!kthread_should_stop()) {
+		unsigned long next_wake_time;
+		long sleep_time;
+		struct mem_cgroup *memcg;
+		int flags;
+		int nid;
+
+		next_wake_time = jiffies + msecs_to_jiffies(KLRUSCAND_INTERVAL);
+
+		for_each_node_state(nid, N_MEMORY) {
+			pg_data_t *pgdat = NODE_DATA(nid);
+			struct reclaim_state rs = { 0 };
+
+			if (node_is_toptier(nid))
+				continue;
+
+			rs.mm_walk = walk;
+			set_task_reclaim_state(current, &rs);
+			flags = memalloc_noreclaim_save();
+
+			memcg = mem_cgroup_iter(NULL, NULL, NULL);
+			do {
+				struct lruvec *lruvec =
+					mem_cgroup_lruvec(memcg, pgdat);
+				unsigned long max_seq =
+					READ_ONCE((lruvec)->lrugen.max_seq);
+
+				lru_gen_scan_lruvec(lruvec, max_seq, accessed_cb, flush_cb);
+				cond_resched();
+			} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+			memalloc_noreclaim_restore(flags);
+			set_task_reclaim_state(current, NULL);
+			memset(walk, 0, sizeof(*walk));
+		}
+
+		sleep_time = next_wake_time - jiffies;
+		if (sleep_time > 0 && sleep_time != MAX_SCHEDULE_TIMEOUT)
+			schedule_timeout_idle(sleep_time);
+	}
+	kfree(walk);
+	return 0;
+}
+
+static int __init klruscand_init(void)
+{
+	struct task_struct *task;
+
+	task = kthread_run(klruscand_run, NULL, "klruscand");
+
+	if (IS_ERR(task)) {
+		pr_err("Failed to create klruscand kthread\n");
+		return PTR_ERR(task);
+	}
+
+	scan_thread = task;
+	return 0;
+}
+module_init(klruscand_init);
-- 
2.51.0.384.g4c02a37b29-goog
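
A sizing note on the constants in klruscand.c (simple arithmetic, not
stated in the patch itself): BATCH_SIZE is (2 << 16) = 131072 PFN
slots, so the static pfn_batch[] array occupies 1 MiB on a 64-bit
kernel (131072 * 8 bytes), and a full batch corresponds to up to
512 MiB of 4 KiB base pages between flushes. KLRUSCAND_INTERVAL is in
milliseconds, i.e. each sweep over all lower-tier nodes and memcgs is
scheduled every 2 seconds.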