mm: numa: preserve PMD write permissions in migrate_misplaced_transhuge_page

[PATCH stable 5.10] mm: numa: preserve PMD write permissions in migrate_misplaced_transhuge_page

Posted by wang.yaxin@zte.com.cn 2 weeks ago

From: Chen Junlin <chen.junlin@zte.com.cn>

When a process allocates a transparent huge page in its address space, and
then enters the kernel driver via an ioctl system call, a driver (eg.
ib_uverbs) calls the pin_user_pages_fast function to pin the process’s
virtual addresses to physical pages. Subsequently, when the process
accesses this pinned memory across NUMA nodes, triggering the system’s
NUMA balancing capability, a page fault occurs and the kernel enters
do_huge_pmd_numa_page, then it calls migrate_misplaced_transhuge_page to
migrate the transparent huge page. However, because the memory within the
huge page has been pinned by pin_user_pages_fast, numamigrate_isolate_page
returns 0. migrate_misplaced_transhuge_page proceeds to the out_fail path,
where it changes the PMD page table entry to write-protected by pte_modify.
If the process then performs a fork operation, copy_huge_pmd is invoked.
Due to the pinned memory, __split_huge_pmd is called to split the PMD page
table entry into PTE page table entries. These PTEs are also set to
write-protected. Finally, when the process writes to this memory region, a
copy-on-write (COW) operation takes place, allocating a new physical
memory page. This breaks the binding between the process’s virtual
address and the pinned physical memory.

Here is my test code in userspace.The /dev/test_gup is provided by a
simple kerenl mod, it just help us calls pin_user_pages_fast in kernel by
passing va to ioctl, so I do not provides code of kernel mod.

The test code runs on an x86 QEMU-KVM virtual machine with a specification
of 64 cores and 2 NUMA nodes (0-31:node0, 32-63:node1).
Kernel is 5.10.256, numa balancing para is
kernel.numa_balancing = 1
kernel.numa_balancing_scan_delay_ms = 1000
kernel.numa_balancing_scan_period_max_ms = 60000
kernel.numa_balancing_scan_period_min_ms = 1000
kernel.numa_balancing_scan_size_mb = 256
/sys/kernel/mm/transparent_hugepage/enabled is always

===
 #define _GNU_SOURCE
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
 #include <sys/mman.h>
 #include <errno.h>
 #include <sched.h>
 #include <fcntl.h>
 #include <sys/ioctl.h>
 #include <sys/mman.h>
 #include <sys/wait.h>
 #include <signal.h>
 #include <linux/ioctl.h>
 #include <linux/types.h>

 #define HUGE_PAGE_SIZE (2 * 1024 * 1024)
 #define ALIGNMENT HUGE_PAGE_SIZE
 #define MEMORY_SIZE (HUGE_PAGE_SIZE * 16)
 #define TEST_GUP_IOC_MAGIC 'G'
 #define TEST_GUP_IOCTL_PIN_PAGES \ 
         _IOWR(TEST_GUP_IOC_MAGIC, 1, struct test_gup_request)
 #define TEST_GUP_IOCTL_UNPIN_PAGES \
         _IOW(TEST_GUP_IOC_MAGIC, 2, struct test_gup_request)

struct test_gup_request {
        __u64 user_addr;
        __u64 size;
        __u64 page_count;
        __u32 flags;
};

void touch_memory(void *addr, size_t length, int write)
{
        volatile char *ptr = (volatile char *)addr;
        char tmp;
        size_t page_size = getpagesize();
        for (size_t i = 0; i < length; i += page_size) {
                if (write)
                        ptr[i] = (char)(i % 256);
                else
                        tmp = ptr[i];
        }
}

void init_sigchld()
{
        struct sigaction sa;
        sa.sa_handler = SIG_IGN;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGCHLD, &sa, NULL);
}

int main()
{
        init_sigchld();
        int fd;
        struct test_gup_request req;
        int ret;
        pid_t pid = getpid();
        int cpu = sched_getcpu();
        int cpu_af = 63;
        int i;

        void *memory = mmap(NULL, MEMORY_SIZE, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        touch_memory(memory, MEMORY_SIZE, 1);

        if (cpu > 31)
                cpu_af = 0;
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(cpu_af, &mask);
        sched_setaffinity(0, sizeof(mask), &mask);

        // /dev/test_gup here is provide by a test mod
        fd = open("/dev/test_gup", O_RDWR);
        memset(&req, 0, sizeof(req));
        req.user_addr = (__u64) memory;
        req.size = 4096 * 9;
        // ioctl just call pin_user_pages_fast in kenrel
        ret = ioctl(fd, TEST_GUP_IOCTL_PIN_PAGES, &req);
        if (ret < 0) {
                printf("IOCTL pin pages failed: %s\n", strerror(errno));
        } else {
                printf("Successfully pinned %lu pages at pid %d %lx\n",
                       req.page_count, pid, req.user_addr);
        }
        getchar();
        // here you can see the original va <-> pa binding in crash by vtop

        i = 0;
        while (i < 100000) {
                touch_memory(memory, MEMORY_SIZE, 0);
                i++;
        }
        printf("numa balance done\n");
        getchar(); // pmd was write-protected

        pid_t t_pid = fork();
        if (t_pid == 0)
                _exit(0);
        sleep(1);
        printf("fork done\n");
        getchar(); // pte was write-protected

        memset((void *)req.user_addr, 9, 1);
        printf("write pinned mem done\n");
        getchar(); // cow was done, the binding of va <-> pa was broken

        return 0;
}
===

commit b191f9b106ea ("mm: numa: preserve PTE write permissions across a
NUMA hinting fault") added write permission recovery in
do_huge_pmd_numa_page, but did not add the same recovery in
migrate_misplaced_transhuge_page. Later, commit d042035eaf5f ("mm/thp:
Split huge pmds/puds if they're pinned when fork()") enforced that
transparent huge pages with pinned memory must have their PMD page
tables split into PTE page tables in copy_huge_pmd. After that, this
issue started to appear.

So, the simplest way to fix this issue is to also perform the
corresponding write permission recovery in the out_fail code path of
migrate_misplaced_transhuge_page.

Signed-off-by: Chen Junlin <chen.junlin@zte.com.cn>
---
 mm/migrate.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index bf59b09455ad..126b6ad675ce 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2143,6 +2143,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	struct page *new_page = NULL;
 	int page_lru = page_is_file_lru(page);
 	unsigned long start = address & HPAGE_PMD_MASK;
+	bool was_writable;

 	new_page = alloc_pages_node(node,
 		(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
@@ -2247,7 +2248,10 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
 	ptl = pmd_lock(mm, pmd);
 	if (pmd_same(*pmd, entry)) {
+		was_writable = pmd_savedwrite(entry);
 		entry = pmd_modify(entry, vma->vm_page_prot);
+		if (was_writable)
+			entry = pmd_mkwrite(entry);
 		set_pmd_at(mm, start, pmd, entry);
 		update_mmu_cache_pmd(vma, address, &entry);
 	}
--
2.27.0

Re: [PATCH stable 5.10] mm: numa: preserve PMD write permissions in migrate_misplaced_transhuge_page

Posted by xu.xin16@zte.com.cn 2 weeks ago

> When a process allocates a transparent huge page in its address space, and
> then enters the kernel driver via an ioctl system call, a driver (eg.
> ib_uverbs) calls the pin_user_pages_fast function to pin the process’s
> virtual addresses to physical pages. Subsequently, when the process
> accesses this pinned memory across NUMA nodes, triggering the system’s
> NUMA balancing capability, a page fault occurs and the kernel enters
> do_huge_pmd_numa_page, then it calls migrate_misplaced_transhuge_page to
> migrate the transparent huge page. However, because the memory within the
> huge page has been pinned by pin_user_pages_fast, numamigrate_isolate_page
> returns 0. migrate_misplaced_transhuge_page proceeds to the out_fail path,
> where it changes the PMD page table entry to write-protected by pte_modify.
> If the process then performs a fork operation, copy_huge_pmd is invoked.
> Due to the pinned memory, __split_huge_pmd is called to split the PMD page
> table entry into PTE page table entries. These PTEs are also set to
> write-protected. Finally, when the process writes to this memory region, a
> copy-on-write (COW) operation takes place, allocating a new physical
> memory page. This breaks the binding between the process’s virtual
> address and the pinned physical memory.
> 
> commit b191f9b106ea ("mm: numa: preserve PTE write permissions across a
> NUMA hinting fault") added write permission recovery in
> do_huge_pmd_numa_page, but did not add the same recovery in
> migrate_misplaced_transhuge_page. Later, commit d042035eaf5f ("mm/thp:
> Split huge pmds/puds if they're pinned when fork()") enforced that
> transparent huge pages with pinned memory must have their PMD page
> tables split into PTE page tables in copy_huge_pmd. After that, this
> issue started to appear.
> 
> So, the simplest way to fix this issue is to also perform the
> corresponding write permission recovery in the out_fail code path of
> migrate_misplaced_transhuge_page.
> 
> Signed-off-by: Chen Junlin <chen.junlin@zte.com.cn>

It seems the latest kernel version is not affected by the issue you point out
because commit f66e2727ddfc ("mm: huge_memory: use folio_can_map_prot_numa() for pmd folio")
will make pined huge pages skip numa_balancing.

Could you please add two tags?  
1) Add a "fixes" tag (the commit that introduced this issue)
2) Cc: stable@vger.kernel.org

and Could you answer which other long-term stable branches are also affected?


The patch itself seems fine; if the above questions is resolved, feel free to add:
Reviewed-by: xu xin <xu.xin16@zte.com.cn>