From: Muchun Song
To: Andrew Morton, David Hildenbrand, Oscar Salvador, Charan Teja Kalla
Cc: Muchun Song, Muchun Song, Kairui Song, Qi Zheng, Shakeel Butt,
    Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Lorenzo Stoakes,
    "Liam R. Howlett", Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
    Michal Hocko, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    linux-cxl@vger.kernel.org
Subject: [PATCH] mm/sparse: Fix race on mem_section->usage in pfn walkers
Date: Wed, 15 Apr 2026 10:23:26 +0800
Message-Id: <20260415022326.53218-1-songmuchun@bytedance.com>

When memory is hot-removed, section_deactivate() can tear down
mem_section->usage while concurrent pfn walkers still inspect the
subsection map via pfn_section_valid() or pfn_section_first_valid().

After commit 5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing
memory_section->usage") converted the teardown to an RCU-based scheme,
the code still relies on SECTION_HAS_MEM_MAP becoming visible to readers
before ms->usage is cleared and queued for freeing.

That ordering is not guaranteed. section_deactivate() can clear
ms->usage and queue kfree_rcu() before another CPU observes the
SECTION_HAS_MEM_MAP clear. A concurrent pfn walker can therefore see
valid_section() return true, enter its sched-RCU read-side critical
section after kfree_rcu() has already been queued, and then dereference
a stale ms->usage pointer.
pfn_to_online_page() has a similar problem: it can call
pfn_section_valid() without its own sched-RCU read-side critical
section.

The race looks like this:

  compact_zone()                      memunmap_pages
  ==============                      ==============
                                      __remove_pages()->
                                        sparse_remove_section()->
                                          section_deactivate():
                                            a) [ Clear SECTION_HAS_MEM_MAP
                                                 is reordered to b) ]
                                            kfree_rcu(ms->usage)
  __pageblock_pfn_to_page
     ......
      pfn_valid():
        rcu_read_lock_sched()
        valid_section() // return true
        pfn_section_valid()
          [Access ms->usage which is UAF]
                                            WRITE_ONCE(ms->usage, NULL)
        rcu_read_unlock_sched()
                                            b) Clear SECTION_HAS_MEM_MAP

Fix this by using rcu_replace_pointer() when clearing ms->usage in
section_deactivate(), so that the teardown no longer relies on the
ordering of the SECTION_HAS_MEM_MAP clear.

Fixes: 5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing memory_section->usage")
Signed-off-by: Muchun Song
---
This patch is focused on the ms->usage lifetime race only.

One open question is the interaction between pfn_to_online_page() and
vmemmap teardown during memory hot-remove. Could pfn_to_online_page()
still hand out a stale struct page here? The new sched-RCU critical
section ends before pfn_to_page(pfn), but section_deactivate() can still
tear the vmemmap down immediately afterwards:

  mm/sparse-vmemmap.c:section_deactivate()
      ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
      usage = rcu_replace_pointer(ms->usage, NULL, true);
      kfree_rcu(usage, rcu);
      depopulate_section_memmap(...);

It looks like a reader can observe valid = true, drop sched-RCU, race
with section_deactivate(), and then execute pfn_to_page(pfn) after the
backing vmemmap was depopulated.

Callers such as mm/compaction.c:__reset_isolation_pfn(),
mm/page_idle.c:page_idle_get_folio(), and
fs/proc/kcore.c:read_kcore_iter() dereference the returned page
immediately, and they do not appear to hold get_online_mems() across
the pfn_to_online_page() call.
I am not fully sure whether that reasoning is correct, or whether
current callers are expected to rely on additional hotplug
serialization instead. Comments on whether this is a real issue, and
how the vmemmap lifetime is expected to be handled here, would be very
helpful.
---
 include/linux/mmzone.h | 6 +++---
 mm/memory_hotplug.c    | 6 +++++-
 mm/sparse-vmemmap.c    | 6 ++++--
 3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 238bf2d35a54..0e850924cbeb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -2014,7 +2014,7 @@ struct mem_section {
 	 */
 	unsigned long section_mem_map;
 
-	struct mem_section_usage *usage;
+	struct mem_section_usage __rcu *usage;
 #ifdef CONFIG_PAGE_EXTENSION
 	/*
 	 * If SPARSEMEM, pgdat doesn't have page_ext pointer. We use
@@ -2178,14 +2178,14 @@ static inline int subsection_map_index(unsigned long pfn)
 static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
 {
 	int idx = subsection_map_index(pfn);
-	struct mem_section_usage *usage = READ_ONCE(ms->usage);
+	struct mem_section_usage *usage = rcu_dereference_sched(ms->usage);
 
 	return usage ? test_bit(idx, usage->subsection_map) : 0;
 }
 
 static inline bool pfn_section_first_valid(struct mem_section *ms, unsigned long *pfn)
 {
-	struct mem_section_usage *usage = READ_ONCE(ms->usage);
+	struct mem_section_usage *usage = rcu_dereference_sched(ms->usage);
 	int idx = subsection_map_index(*pfn);
 	unsigned long bit;
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0c1d3df3a296..335835abe74c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -340,6 +340,7 @@ struct page *pfn_to_online_page(unsigned long pfn)
 	unsigned long nr = pfn_to_section_nr(pfn);
 	struct dev_pagemap *pgmap;
 	struct mem_section *ms;
+	bool valid;
 
 	if (nr >= NR_MEM_SECTIONS)
 		return NULL;
@@ -355,7 +356,10 @@ struct page *pfn_to_online_page(unsigned long pfn)
 	if (IS_ENABLED(CONFIG_HAVE_ARCH_PFN_VALID) && !pfn_valid(pfn))
 		return NULL;
 
-	if (!pfn_section_valid(ms, pfn))
+	rcu_read_lock_sched();
+	valid = pfn_section_valid(ms, pfn);
+	rcu_read_unlock_sched();
+	if (!valid)
 		return NULL;
 
 	if (!online_device_section(ms))
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 717ac953bba2..05f68dcec0f8 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -601,8 +601,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
 		 * was allocated during boot.
 		 */
 		if (!PageReserved(virt_to_page(ms->usage))) {
-			kfree_rcu(ms->usage, rcu);
-			WRITE_ONCE(ms->usage, NULL);
+			struct mem_section_usage *usage;
+
+			usage = rcu_replace_pointer(ms->usage, NULL, true);
+			kfree_rcu(usage, rcu);
 		}
 		memmap = pfn_to_page(SECTION_ALIGN_DOWN(pfn));
 	}
-- 
2.20.1
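
Illustration only, not part of the patch: a minimal userspace sketch of
the unpublish-then-defer-free ordering used in the section_deactivate()
hunk. Every name below (usage_model, section_model, defer_free(),
grace_period_elapsed(), section_bit_valid(), section_deactivate_model())
is a hypothetical stand-in. C11 atomics take the place of
rcu_dereference_sched() and rcu_replace_pointer(), and a single deferred
slot takes the place of kfree_rcu(). No real grace period is modelled,
so this shows the ordering only, not the RCU guarantees.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct mem_section_usage. */
struct usage_model {
	unsigned long subsection_map;	/* one bit per subsection */
};

/* Hypothetical stand-in for struct mem_section. */
struct section_model {
	_Atomic(struct usage_model *) usage;
};

/* Stand-in for kfree_rcu(): hold the object until the "grace period". */
static struct usage_model *deferred_free;

static void defer_free(struct usage_model *u)
{
	deferred_free = u;
}

/* Stand-in for grace-period completion: no reader holds the pointer. */
static void grace_period_elapsed(void)
{
	free(deferred_free);
	deferred_free = NULL;
}

/* Reader side, modelled on pfn_section_valid(): a NULL usage is invalid. */
static bool section_bit_valid(struct section_model *ms, int idx)
{
	struct usage_model *u =
		atomic_load_explicit(&ms->usage, memory_order_acquire);

	return u && ((u->subsection_map >> idx) & 1UL);
}

/* Writer side: unpublish the pointer first, then defer freeing the old one. */
static void section_deactivate_model(struct section_model *ms)
{
	struct usage_model *old =
		atomic_exchange_explicit(&ms->usage,
					 (struct usage_model *)NULL,
					 memory_order_acq_rel);

	if (old)
		defer_free(old);
}
```

In this model, a reader that runs after section_deactivate_model() loads
NULL and reports the bit as invalid, while the old object stays
allocated until grace_period_elapsed() runs.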