Message-ID: <20260525123938427-2LRlh6S2Ew79m61xNh6S@zte.com.cn>
Date: Mon, 25 May 2026 12:39:38 +0800 (CST)
X-Mailing-List: linux-kernel@vger.kernel.org
X-ENVELOPE-SENDER: wang.yaxin@zte.com.cn
Subject: [PATCH stable 5.10] mm: numa: preserve PMD write permissions in migrate_misplaced_transhuge_page

From: Chen Junlin

When a process allocates a transparent huge page in its address space
and then enters a kernel driver via an ioctl system call, the driver
(e.g. ib_uverbs) calls pin_user_pages_fast to pin the process's virtual
addresses to physical pages.
Subsequently, when the process accesses this pinned memory from another
NUMA node, the system's NUMA balancing is triggered: a page fault occurs,
the kernel enters do_huge_pmd_numa_page, and that function calls
migrate_misplaced_transhuge_page to migrate the transparent huge page.
However, because memory within the huge page has been pinned by
pin_user_pages_fast, numamigrate_isolate_page returns 0 and
migrate_misplaced_transhuge_page takes the out_fail path, where
pmd_modify leaves the PMD entry write-protected.

If the process then forks, copy_huge_pmd is invoked. Because the memory
is pinned, __split_huge_pmd is called to split the PMD entry into PTE
entries, and these PTEs are also write-protected.

Finally, when the process writes to this memory region, a copy-on-write
(COW) fault allocates a new physical page. This breaks the binding
between the process's virtual address and the pinned physical memory.

Here is my userspace test code. /dev/test_gup is provided by a simple
kernel module that only calls pin_user_pages_fast on the virtual address
passed to its ioctl, so the module's code is not included.

The test runs on an x86 QEMU-KVM virtual machine with 64 cores and 2 NUMA
nodes (0-31: node0, 32-63: node1).
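The test program relies on crash's vtop to observe the va <-> pa binding.
As a possible userspace alternative (not part of the original test), the
binding can also be read from /proc/self/pagemap. The sketch below is only
an illustration: the helper names pagemap_read, pagemap_present and
pagemap_pfn are made up here, and it assumes the documented pagemap layout
(bit 63 = present, bits 0-54 = PFN). Note that since Linux 4.2 the PFN
field reads as zero without CAP_SYS_ADMIN.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define PM_PFN_MASK ((1ULL << 55) - 1)	/* bits 0-54: PFN if present */
#define PM_PRESENT  (1ULL << 63)	/* bit 63: page present in RAM */

/*
 * Hypothetical helper: read the 64-bit pagemap entry that describes the
 * page containing vaddr. Returns 0 on success, -1 on any error.
 */
static int pagemap_read(const void *vaddr, uint64_t *entry)
{
	long page_size = sysconf(_SC_PAGESIZE);
	off_t offset = (off_t)((uintptr_t)vaddr / (uintptr_t)page_size) *
		       (off_t)sizeof(*entry);
	int fd = open("/proc/self/pagemap", O_RDONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = pread(fd, entry, sizeof(*entry), offset);
	close(fd);
	return n == (ssize_t)sizeof(*entry) ? 0 : -1;
}

/* Non-zero if the entry describes a page that is present in RAM. */
static int pagemap_present(uint64_t entry)
{
	return (entry & PM_PRESENT) != 0;
}

/* PFN stored in the entry, or 0 if the page is not present. */
static uint64_t pagemap_pfn(uint64_t entry)
{
	return pagemap_present(entry) ? (entry & PM_PFN_MASK) : 0;
}
```

Calling pagemap_read on the pinned address at each getchar() checkpoint
and comparing the PFNs would be one way to see whether the write after
fork moved the mapping to a new physical page.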
The kernel is 5.10.256 and the NUMA balancing parameters are:

kernel.numa_balancing = 1
kernel.numa_balancing_scan_delay_ms = 1000
kernel.numa_balancing_scan_period_max_ms = 60000
kernel.numa_balancing_scan_period_min_ms = 1000
kernel.numa_balancing_scan_size_mb = 256

/sys/kernel/mm/transparent_hugepage/enabled is set to "always".

===
#define _GNU_SOURCE
/* Headers required by the code below. */
#include <errno.h>
#include <fcntl.h>
#include <linux/types.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define HUGE_PAGE_SIZE (2 * 1024 * 1024)
#define ALIGNMENT HUGE_PAGE_SIZE
#define MEMORY_SIZE (HUGE_PAGE_SIZE * 16)

#define TEST_GUP_IOC_MAGIC 'G'
#define TEST_GUP_IOCTL_PIN_PAGES \
	_IOWR(TEST_GUP_IOC_MAGIC, 1, struct test_gup_request)
#define TEST_GUP_IOCTL_UNPIN_PAGES \
	_IOW(TEST_GUP_IOC_MAGIC, 2, struct test_gup_request)

struct test_gup_request {
	__u64 user_addr;
	__u64 size;
	__u64 page_count;
	__u32 flags;
};

/* Touch one byte per page: write it if 'write' is set, otherwise read it. */
void touch_memory(void *addr, size_t length, int write)
{
	volatile char *ptr = (volatile char *)addr;
	char tmp = 0;
	size_t page_size = getpagesize();

	for (size_t i = 0; i < length; i += page_size) {
		if (write)
			ptr[i] = (char)(i % 256);
		else
			tmp = ptr[i];
	}
	(void)tmp;
}

/* Ignore SIGCHLD so the forked child does not linger as a zombie. */
void init_sigchld(void)
{
	struct sigaction sa;

	sa.sa_handler = SIG_IGN;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;
	sigaction(SIGCHLD, &sa, NULL);
}

int main(void)
{
	init_sigchld();

	int fd;
	struct test_gup_request req;
	int ret;
	pid_t pid = getpid();
	int cpu = sched_getcpu();
	int cpu_af = 63;
	int i;

	void *memory = mmap(NULL, MEMORY_SIZE, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (memory == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	touch_memory(memory, MEMORY_SIZE, 1);

	/* Move to a CPU on the other NUMA node so later accesses are remote. */
	if (cpu > 31)
		cpu_af = 0;
	cpu_set_t mask;
	CPU_ZERO(&mask);
	CPU_SET(cpu_af, &mask);
	sched_setaffinity(0, sizeof(mask), &mask);

	// /dev/test_gup is provided by a test module
	fd = open("/dev/test_gup", O_RDWR);
	if (fd < 0) {
		perror("open /dev/test_gup");
		return 1;
	}
	memset(&req, 0, sizeof(req));
	req.user_addr = (__u64)(unsigned long)memory;
	req.size = 4096 * 9;
	// the ioctl just calls pin_user_pages_fast in the kernel
	ret = ioctl(fd, TEST_GUP_IOCTL_PIN_PAGES, &req);
	if (ret < 0) {
		printf("IOCTL pin pages failed: %s\n", strerror(errno));
	} else {
		printf("Successfully pinned %llu pages at pid %d %llx\n",
		       (unsigned long long)req.page_count, pid,
		       (unsigned long long)req.user_addr);
	}
	getchar(); // here you can see the original va <-> pa binding in crash by vtop

	i = 0;
	while (i < 100000) {
		touch_memory(memory, MEMORY_SIZE, 0);
		i++;
	}
	printf("numa balance done\n");
	getchar(); // pmd was write-protected

	pid_t t_pid = fork();
	if (t_pid == 0)
		_exit(0);
	sleep(1);
	printf("fork done\n");
	getchar(); // pte was write-protected

	memset((void *)(unsigned long)req.user_addr, 9, 1);
	printf("write pinned mem done\n");
	getchar(); // cow was done, the binding of va <-> pa was broken

	return 0;
}
===

Commit b191f9b106ea ("mm: numa: preserve PTE write permissions across a
NUMA hinting fault") added write permission recovery in
do_huge_pmd_numa_page, but did not add the same recovery in
migrate_misplaced_transhuge_page. Later, commit d042035eaf5f ("mm/thp:
Split huge pmds/puds if they're pinned when fork()") enforced that
transparent huge pages with pinned memory must have their PMD entries
split into PTE entries in copy_huge_pmd. After that, this issue started
to appear.

So the simplest way to fix this issue is to perform the corresponding
write permission recovery in the out_fail code path of
migrate_misplaced_transhuge_page as well.
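To make the intended effect of the out_fail change easier to see, here is
a deliberately simplified userspace model. It is a sketch under stated
assumptions, not kernel code: struct toy_pmd and the toy_* functions are
invented for illustration, and the VMA's default protection is assumed to
lack write permission, as for a private anonymous mapping.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy PMD entry with two flags only; NOT the kernel's pmd_t layout. */
struct toy_pmd {
	bool write;		/* entry currently allows writes */
	bool savedwrite;	/* write permission remembered across the NUMA hint */
};

/*
 * Stand-in for pmd_modify(entry, vma->vm_page_prot): the write bit is
 * taken from the VMA's default protection and the saved state is dropped.
 */
static struct toy_pmd toy_modify(struct toy_pmd e, bool prot_write)
{
	e.write = prot_write;
	e.savedwrite = false;
	return e;
}

/* out_fail path before the patch: the saved write permission is lost. */
static struct toy_pmd toy_out_fail_before(struct toy_pmd e, bool prot_write)
{
	return toy_modify(e, prot_write);
}

/* out_fail path after the patch: write permission is restored if saved. */
static struct toy_pmd toy_out_fail_after(struct toy_pmd e, bool prot_write)
{
	bool was_writable = e.savedwrite;	/* models pmd_savedwrite() */

	e = toy_modify(e, prot_write);
	if (was_writable)
		e.write = true;			/* models pmd_mkwrite() */
	return e;
}
```

In this model, an entry that was writable before the NUMA hint stays
read-only after toy_out_fail_before but becomes writable again after
toy_out_fail_after, while an entry that was never writable is unchanged
by either variant.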

Signed-off-by: Chen Junlin
Reviewed-by: xu xin
---
 mm/migrate.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index bf59b09455ad..126b6ad675ce 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2143,6 +2143,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	struct page *new_page = NULL;
 	int page_lru = page_is_file_lru(page);
 	unsigned long start = address & HPAGE_PMD_MASK;
+	bool was_writable;
 
 	new_page = alloc_pages_node(node,
 		(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
@@ -2247,7 +2248,10 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
 	ptl = pmd_lock(mm, pmd);
 	if (pmd_same(*pmd, entry)) {
+		was_writable = pmd_savedwrite(entry);
 		entry = pmd_modify(entry, vma->vm_page_prot);
+		if (was_writable)
+			entry = pmd_mkwrite(entry);
 		set_pmd_at(mm, start, pmd, entry);
 		update_mmu_cache_pmd(vma, address, &entry);
 	}
-- 
2.27.0