From: xu.xin16@zte.com.cn
Date: Fri, 6 Feb 2026 15:14:24 +0800 (CST)
Subject: [Reproducer]: [PATCH 2/2] ksm: Optimize rmap_walk_ksm by passing a suitable address range
Message-ID: <20260206151424734QIyWL_pA-1QeJPbJlUxsO@zte.com.cn>
References: 20260112220143497dgs9w3S7sfdTUNRbflDtb@zte.com.cn,
 ba03780a-fd65-4a03-97de-bc0905106260@kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

Hi,

This is a simple demo reproducer for the high delay of rmap_walk_ksm. It uses
mprotect() to split one large VMA into many small VMAs, all of which share the
same anon_vma.

Reproducing steps:

On a Linux machine with 1GB or 4GB of memory, do the following:

1 Compile:
    gcc test_ksm_rmap.c -o test_ksm_rmap -lpthread

2 Configure swap space. For example, with CONFIG_ZRAM=y:
    echo 300M > /sys/block/zram0/disksize;
    mkswap /dev/zram0;
    swapon /dev/zram0;
    echo 150 > /proc/sys/vm/swappiness;

3 Run the test program:
    ./test_ksm_rmap

4 There are two ways to monitor the rmap_walk_ksm delay:
    1) Before running the test program (./test_ksm_rmap), set up Ftrace's
       function_graph tracer to monitor it.
    2) Apply the sample monitoring patch at the end of this mail, then read
       the data with: "cat /proc/rmap_walk/delay_max"

/*
 * KSM rmap_walk delay reproducer.
 *
 * The main idea is to get KSM pages scanned by kswapd or kcompactd,
 * or swapped out by kswapd. The steps are:
 *
 * 1) Allocate some same-content pages and trigger ksmd to merge them.
 * 2) Create another thread that allocates memory gradually to increase
 *    memory pressure.
 * 3) Wait 1 minute at maximum.
 */
/*
 * The header names of the twelve #include lines were lost in this copy of
 * the mail. The list below is an assumption: it covers what the code uses.
 */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE_SIZE 4096
#define KSM_PAGES 50001
#define TEST_PATTERN 0xAA
#define WAIT_PRESSURE_TIME 60
#define SWAP_THRESHHOLD_KB 100
#define LOW_MEMORY_THRESH_KB (15 * 1024)

#define KSM_PATH "/sys/kernel/mm/ksm/"
#define KSM_RUN KSM_PATH "run"
#define KSM_PAGES_TO_SCAN KSM_PATH "pages_to_scan"
#define KSM_SLEEP_MILLISECONDS KSM_PATH "sleep_millisecs"
#define KSM_MAX_SHARING KSM_PATH "max_page_sharing"
#define KSM_PAGES_SHARED KSM_PATH "pages_shared"
#define KSM_PAGES_SHARING KSM_PATH "pages_sharing"

/* Read one unsigned integer from a sysfs file. */
static int read_sysfs(const char *path, unsigned long *value)
{
	FILE *f = fopen(path, "r");
	if (!f) {
		perror("fopen");
		return -1;
	}

	if (fscanf(f, "%lu", value) != 1) {
		fclose(f);
		return -1;
	}

	fclose(f);
	return 0;
}

/* Write a string to a sysfs file. */
static int write_sysfs(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");
	if (!f) {
		perror("fopen");
		return -1;
	}

	if (fprintf(f, "%s", value) < 0) {
		fclose(f);
		return -1;
	}

	fclose(f);
	return 0;
}

/* Return MemTotal in 4 KB pages, or 0 on error. */
static unsigned long get_system_memory_pages(void)
{
	FILE *f = fopen("/proc/meminfo", "r");
	if (!f) {
		perror("fopen /proc/meminfo");
		return 0;
	}

	unsigned long mem_total_kb = 0;
	char line[256];

	while (fgets(line, sizeof(line), f)) {
		if (strstr(line, "MemTotal:")) {
			sscanf(line, "MemTotal: %lu kB", &mem_total_kb);
			break;
		}
	}

	fclose(f);

	return mem_total_kb / 4;
}

static int configure_ksm(void)
{
	printf("Configuring KSM parameters...\n");

	if (write_sysfs(KSM_RUN, "1") < 0) {
		fprintf(stderr, "Failed to start KSM\n");
		return -1;
	}

	/* A failure here is reported but is not fatal. */
	if (write_sysfs(KSM_MAX_SHARING, "10") < 0) {
		fprintf(stderr, "Failed to set max_page_sharing\n");
	}

	if (write_sysfs(KSM_PAGES_TO_SCAN, "2000") < 0) {
		fprintf(stderr, "Failed to set pages_to_scan\n");
		return -1;
	}

	if (write_sysfs(KSM_SLEEP_MILLISECONDS, "10") < 0) {
		fprintf(stderr, "Failed to set sleep_millisecs\n");
		return -1;
	}

	printf("KSM started, scan speed increased\n");
	return 0;
}

static void **allocate_ksm_pages(size_t ksm_pages_number)
{
	printf("Allocating %zu KSM pages (%.2f MB)...\n",
	       ksm_pages_number, (ksm_pages_number * PAGE_SIZE) / (1024.0 * 1024.0));

	void *ksm_region = mmap(NULL, PAGE_SIZE * ksm_pages_number, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	/* mmap() reports failure with MAP_FAILED, not NULL. */
	if (ksm_region == MAP_FAILED) {
		perror("mmap ksm region pages");
		return NULL;
	}

	if (madvise(ksm_region, PAGE_SIZE * ksm_pages_number, MADV_MERGEABLE) != 0)
		fprintf(stderr, "madvise failed: %s\n", strerror(errno));

	/*
	 * Fill page i with the byte value (i % 256) and set its first byte to
	 * TEST_PATTERN, so pages with the same fill value have the same content.
	 */
	for (size_t i = 0; i < ksm_pages_number; i++) {
		memset((char *)ksm_region + i * PAGE_SIZE, i, PAGE_SIZE);
		((char *)ksm_region)[i * PAGE_SIZE] = TEST_PATTERN;
	}

	/*
	 * Use mprotect() to split the region into one VMA per page: every
	 * even page becomes read-only, so no two neighbours can be merged.
	 */
	for (size_t i = 0; i < ksm_pages_number; i++) {
		if (i % 2 == 0) {
			int ret = mprotect((char *)ksm_region + i * PAGE_SIZE, PAGE_SIZE, PROT_READ);
			if (ret == -1) {
				printf("seq:%zu\n", i);
				perror("mprotect failed");
			}
		}
	}

	return ksm_region;
}

static void free_ksm_pages(void *pages, size_t ksm_pages_number)
{
	if (!pages)
		return;

	munmap(pages, PAGE_SIZE * ksm_pages_number);
}

static unsigned long get_available_memory_kb(void)
{
	FILE *f = fopen("/proc/meminfo", "r");
	if (!f) {
		perror("fopen /proc/meminfo");
		return 0;
	}

	unsigned long mem_available_kb = 0;
	char line[256];

	while (fgets(line, sizeof(line), f)) {
		if (strstr(line, "MemAvailable:")) {
			sscanf(line, "MemAvailable: %lu kB", &mem_available_kb);
			break;
		}
	}

	fclose(f);
	return mem_available_kb;
}

/* Get swap used memory (kB) */
static unsigned long get_swap_used_memory_kb(void)
{
	FILE *f = fopen("/proc/meminfo", "r");
	if (!f) {
		perror("fopen /proc/meminfo when get swap");
		return 0;
	}

	unsigned long swap_free_kb = 0;
	unsigned long swap_total_kb = 0;
	char line[256];

	while (fgets(line, sizeof(line), f)) {
		if (strstr(line, "SwapTotal"))
			sscanf(line, "SwapTotal: %lu kB", &swap_total_kb);
		if (strstr(line, "SwapFree")) {
			sscanf(line, "SwapFree: %lu kB", &swap_free_kb);
			break;
		}
	}

	fclose(f);
	return swap_total_kb - swap_free_kb;
}

typedef struct {
	size_t max_alloc_times;
	void ***pressure_memory_ptr;
	volatile int running;
	size_t *allocated_pages;
} pressure_args_t;

static void *memory_pressure_thread(void *arg)
{
	pressure_args_t *args = (pressure_args_t *)arg;

	void **pressure_memory = malloc(args->max_alloc_times * sizeof(void *));
	if (!pressure_memory) {
		perror("malloc pressure pages array");
		return NULL;
	}

	size_t allocated_times = 0;
	size_t allocated_pages = 0;
	unsigned long available_memory_kb = 0, current_swap_used;
	size_t pages_to_alloc;

	while (allocated_times < args->max_alloc_times && args->running) {
		available_memory_kb = get_available_memory_kb();
		if (available_memory_kb <= LOW_MEMORY_THRESH_KB) {
			pages_to_alloc = available_memory_kb / 4;
			printf("Now available_memory_kb (%lu) is low, allocating %zu pages one by one\n",
			       available_memory_kb, pages_to_alloc);
			for (size_t i = 0; i < pages_to_alloc; i++) {
				/* If swap has been triggered, the task is complete.
				 */
				if ((current_swap_used = get_swap_used_memory_kb()) > SWAP_THRESHHOLD_KB) {
					printf("Swap space %lu kB used, now pressure thread quits\n",
					       current_swap_used);
					args->running = 0;
					break;
				} else if (allocated_times + i >= args->max_alloc_times) {
					printf("\n The index allocated_times:%zu, i:%zu exceeds the limit\n\n",
					       allocated_times, i);
					args->running = 0;
					break;
				} else if (args->running == 0) {
					/* The main thread cleared 'running', most likely on timeout. */
					printf("Maybe timeout, pressure thread should quit\n");
					break;
				}

				void *page = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
						  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
				if (page == MAP_FAILED) {
					perror("mmap pressure page");
					args->running = 0;
					break;
				}
				pressure_memory[allocated_times + i] = page;
				/* Touch the page so the kernel backs it with physical memory. */
				memset(page, (allocated_times + i) % 256, PAGE_SIZE);

				if (i % 100 == 0) {
					printf("Now available_memory_kb:%lu, Swap used kb: %lu\n",
					       get_available_memory_kb(), get_swap_used_memory_kb());
					usleep(200000);
				}
			}
			allocated_times += pages_to_alloc;
			allocated_pages += pages_to_alloc;
		} else {
			/* Memory is enough:
			 * allocate one large area.
			 */
			pages_to_alloc = (available_memory_kb - LOW_MEMORY_THRESH_KB) / 4 + 1;
			void *area = mmap(NULL, pages_to_alloc * PAGE_SIZE, PROT_READ | PROT_WRITE,
					  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			if (area == MAP_FAILED) {
				perror("mmap pressure area");
				args->running = 0;
				break;
			}
			pressure_memory[allocated_times] = area;
			/* force the kernel to alloc physical memory */
			memset(area, allocated_times % 256, pages_to_alloc * PAGE_SIZE);
			allocated_times++;
			allocated_pages += pages_to_alloc;
			printf(" Allocated %zu pressure pages, available memory: %lu KB\n",
			       allocated_pages, available_memory_kb);
			continue;
		}
	}

	printf(" Allocated %zu pressure pages, available memory: %lu KB\n",
	       allocated_pages, available_memory_kb);

	*args->pressure_memory_ptr = pressure_memory;
	*args->allocated_pages = allocated_pages;

	printf("Memory pressure thread completed allocation, actually allocated %zu pages\n",
	       allocated_pages);
	return NULL;
}

/* Poll the KSM counters once per second until pages_shared stops changing. */
static int monitor_ksm_merging(unsigned long *initial_shared)
{
	printf("Waiting for KSM page merging...\n");

	unsigned long pages_shared = 0;
	unsigned long pages_sharing = 0;
	unsigned long last_shared = 0;
	int stable_count = 0;
	int max_wait = 60;

	for (int i = 0; i < max_wait; i++) {
		if (read_sysfs(KSM_PAGES_SHARED, &pages_shared) < 0)
			return -1;

		if (read_sysfs(KSM_PAGES_SHARING, &pages_sharing) < 0)
			return -1;

		printf(" Second %2d: pages_shared = %lu pages_sharing = %lu\n",
		       i, pages_shared, pages_sharing);

		if (pages_shared == last_shared) {
			stable_count++;
			if (stable_count >= 2) {
				break;
			}
		} else {
			stable_count = 0;
			last_shared = pages_shared;
		}

		sleep(1);
	}

	if (initial_shared) {
		*initial_shared = pages_shared;
	}

	printf("KSM merging completed, shared pages: %lu\n", pages_shared);
	return 0;
}

static int test_rmap_walk(void)
{
	void **ksm_pages = allocate_ksm_pages(KSM_PAGES);
	if (!ksm_pages)
		return -1;

	unsigned long shared_before_pressure;
	if (monitor_ksm_merging(&shared_before_pressure) < 0) {
		free_ksm_pages(ksm_pages,
			       KSM_PAGES);
		return -1;
	}

	if (shared_before_pressure == 0) {
		printf("Warning: No KSM merging detected!\n");
		sleep(15);
		free_ksm_pages(ksm_pages, KSM_PAGES);
		return -1;
	}

	printf("\nStarting to create memory pressure to trigger swap or compact...\n");

	void **pressure_memory = NULL;
	size_t allocated_pressure_memory = 0;
	pressure_args_t pressure_args = {
		.max_alloc_times = 10000,
		.pressure_memory_ptr = &pressure_memory,
		.running = 1,
		.allocated_pages = &allocated_pressure_memory
	};

	pthread_t pressure_thread;
	if (pthread_create(&pressure_thread, NULL,
			   memory_pressure_thread, &pressure_args) != 0) {
		perror("pthread_create");
		free_ksm_pages(ksm_pages, KSM_PAGES);
		return -1;
	}

	/* Wait until swap is in use, the pressure thread stops, or we time out. */
	int wait_time = WAIT_PRESSURE_TIME;
	unsigned long swap_used;

	while (wait_time > 0 && pressure_args.running) {
		if ((swap_used = get_swap_used_memory_kb()) > SWAP_THRESHHOLD_KB) {
			printf("Swap space used (%lu) is > %d kb\n", swap_used, SWAP_THRESHHOLD_KB);
			break;
		}
		sleep(1);
		wait_time--;
	}

	if (!wait_time)
		printf("Timeout now quit\n");

	pressure_args.running = 0;
	printf("Wait pressure_thread exit.\n");
	pthread_join(pressure_thread, NULL);

	printf("\nDone. Please check ftrace trace result to see how long rmap_walk_ksm...\n");
	return 0;
}

/* Get system memory information */
static void print_system_memory_info(void)
{
	printf("System memory information:\n");

	FILE *f = fopen("/proc/meminfo", "r");
	if (!f) {
		perror("fopen /proc/meminfo");
		return;
	}

	char line[256];
	while (fgets(line, sizeof(line), f)) {
		if (strstr(line, "MemTotal:") ||
		    strstr(line, "MemFree:") ||
		    strstr(line, "MemAvailable:") ||
		    strstr(line, "SwapTotal:") ||
		    strstr(line, "SwapFree:"))
			printf(" %s", line);
	}

	fclose(f);
}

/* Monitor page reclaim statistics in /proc/vmstat */
static void print_vmstat_info(void)
{
	printf("VM statistics (relevant items):\n");

	FILE *f = fopen("/proc/vmstat", "r");
	if (!f) {
		perror("fopen /proc/vmstat");
		return;
	}

	char line[256];
	while (fgets(line, sizeof(line), f)) {
		if (strstr(line, "pgscan") || strstr(line, "pgsteal") ||
		    strstr(line, "ksm") || strstr(line, "swap"))
			printf(" %s", line);
	}

	fclose(f);
}

int main(int argc, char *argv[])
{
	printf("\n========================================\n");
	printf("KSM rmap_walk Feature Test Program\n");
	printf("========================================\n\n");

	if (geteuid() != 0) {
		fprintf(stderr, "Error: Root privileges required to run this test program\n");
		fprintf(stderr, "Please use: sudo %s\n", argv[0]);
		return 1;
	}

	print_system_memory_info();
	print_vmstat_info();

	if (configure_ksm() < 0)
		return 1;

	if (test_rmap_walk() < 0) {
		fprintf(stderr, "Test 1 failed\n");
		return 1;
	}

	printf("\nRestoring KSM default settings...\n");
	write_sysfs(KSM_PAGES_TO_SCAN, "100");
	write_sysfs(KSM_SLEEP_MILLISECONDS, "20");

	printf("\nTest completed!\n");
	return 0;
}

====================================================================
Subject: [PATCH] Sample monitoring: monitor rmap_walk_ksm() delay

This is a sample patch to monitor rmap_walk_ksm() metrics as shown at
https://lore.kernel.org/all/20260112220143497dgs9w3S7sfdTUNRbflDtb@zte.com.cn/

You can acquire the following data by:
    cat /proc/rmap_walk/delay_max

The counters below are those of the single call that produced the maximum
delay:

1) Time_ms (max_delay_ms): Max time for holding the anon_vma lock in a
   single rmap_walk_ksm.
2) Nr_iteration_total (interval_tree_total): The number of iterations of
   the anon_vma_interval_tree_foreach loop.
3) Skip_addr_out_of_range (skip_addr_out_of_range): The number of
   iterations skipped by the first check (vma->vm_start and vma->vm_end)
   in the anon_vma_interval_tree_foreach loop.
4) Skip_mm_mismatch (skip_mm_mismatch): The number of iterations skipped
   by the second check (rmap_item->mm == vma->vm_mm) in the
   anon_vma_interval_tree_foreach loop.
---
 include/linux/delayacct.h |  26 +++++++++
 kernel/delayacct.c        | 112 ++++++++++++++++++++++++++++++++++++++
 mm/ksm.c                  |  25 ++++++++-
 3 files changed, 160 insertions(+), 3 deletions(-)

diff --git a/include/linux/delayacct.h b/include/linux/delayacct.h
index ecb06f16d22c..398df73dbe75 100644
--- a/include/linux/delayacct.h
+++ b/include/linux/delayacct.h
@@ -107,6 +107,18 @@ extern void __delayacct_compact_end(void);
 extern void __delayacct_wpcopy_start(void);
 extern void __delayacct_wpcopy_end(void);
 extern void __delayacct_irq(struct task_struct *task, u32 delta);
+struct rmap_walk_call_stats {
+	u64 skip_addr_out_of_range;
+	u64 skip_mm_mismatch;
+	u64 skip_invalid_vma;
+	u64 rmap_one_false;
+	u64 done_true;
+	u64 complete_processed;
+	u64 interval_tree_total;
+};
+
+extern void __delayacct_rmap_start(u64 *start_time);
+extern void __delayacct_rmap_end(u64 start_time, struct rmap_walk_call_stats *stats);
 
 static inline void delayacct_tsk_init(struct task_struct *tsk)
 {
@@ -250,6 +262,16 @@ static inline void delayacct_irq(struct task_struct *task, u32 delta)
 		__delayacct_irq(task, delta);
 }
 
+static inline void delayacct_rmap_start(u64 *start_time)
+{
+	__delayacct_rmap_start(start_time);
+}
+
+static inline void delayacct_rmap_end(u64 start_time, struct rmap_walk_call_stats *stats)
+{
+	__delayacct_rmap_end(start_time, stats);
+}
+
 #else
 static inline void delayacct_init(void)
 {}
@@ -290,6 +312,10 @@ static inline void delayacct_wpcopy_end(void)
 {}
 static inline void delayacct_irq(struct task_struct *task, u32 delta)
 {}
+static inline void delayacct_rmap_start(u64 *start_time)
+{}
+static inline void delayacct_rmap_end(u64 start_time, struct rmap_walk_call_stats *stats)
+{}
 
 #endif /* CONFIG_TASK_DELAY_ACCT */
 
diff --git a/kernel/delayacct.c b/kernel/delayacct.c
index 2e55c493c98b..77d0f362d336 100644
--- a/kernel/delayacct.c
+++ b/kernel/delayacct.c
@@ -10,9 +10,14 @@
 #include
 #include
 #include
+#include
+#include
 #include
 #include
 #include
+#include
+#include
+#include
 
 #define UPDATE_DELAY(type) \
 do { \
@@ -29,6 +34,16 @@ DEFINE_STATIC_KEY_FALSE(delayacct_key);
 int delayacct_on __read_mostly;	/* Delay accounting turned on/off */
 struct kmem_cache *delayacct_cache;
 
+/* Global statistics for rmap_walk_ksm lock delay */
+static DEFINE_RAW_SPINLOCK(rmap_stats_lock);
+
+/* Maximum delay statistics */
+static u64 rmap_delay_max __read_mostly = 0;
+static struct timespec64 rmap_delay_max_ts;
+static char rmap_delay_max_comm[TASK_COMM_LEN];
+static struct rmap_walk_call_stats rmap_delay_max_stats;
+
+
 static void set_delayacct(bool enabled)
 {
 	if (enabled) {
@@ -318,3 +333,100 @@ void __delayacct_irq(struct task_struct *task, u32 delta)
 	raw_spin_unlock_irqrestore(&task->delays->lock, flags);
 }
 
+void __delayacct_rmap_start(u64 *start_time)
+{
+	*start_time = ktime_get_ns();
+}
+
+void __delayacct_rmap_end(u64 start_time, struct rmap_walk_call_stats *stats)
+{
+	unsigned long flags;
+	s64 ns;
+	u64 delay_ns;
+
+	if (start_time == 0)
+		return;
+
+	ns = ktime_get_ns() - start_time;
+	if (ns <= 0)
+		return;
+
+	delay_ns = (u64)ns;
+
+	raw_spin_lock_irqsave(&rmap_stats_lock, flags);
+
+	/* Update maximum delay */
+	if (delay_ns > rmap_delay_max) {
+		rmap_delay_max = delay_ns;
+		ktime_get_real_ts64(&rmap_delay_max_ts);
+		memcpy(rmap_delay_max_comm, current->comm, TASK_COMM_LEN);
+		/* Save statistics for this call that produced the max delay */
+		if (stats)
+			rmap_delay_max_stats = *stats;
+	}
+
+	raw_spin_unlock_irqrestore(&rmap_stats_lock, flags);
+}
+
+
+#ifdef CONFIG_PROC_FS
+
+/* Show maximum delay information */
+static int proc_rmap_delay_max_show(struct seq_file *m, void *v)
+{
+	unsigned long flags;
+	u64 max_delay;
+	struct timespec64 ts;
+	char comm[TASK_COMM_LEN];
+	struct rmap_walk_call_stats stats;
+	struct tm tm;
+
+	raw_spin_lock_irqsave(&rmap_stats_lock, flags);
+	max_delay = rmap_delay_max;
+	ts = rmap_delay_max_ts;
+	memcpy(comm, rmap_delay_max_comm, TASK_COMM_LEN);
+	stats = rmap_delay_max_stats;
+	raw_spin_unlock_irqrestore(&rmap_stats_lock, flags);
+
+	/* Convert timestamp to hour:minute:second format */
+	time64_to_tm(ts.tv_sec, 0, &tm);
+
+	seq_printf(m, "max_delay_ns: %llu\n", max_delay);
+	seq_printf(m, "max_delay_ms: %llu\n", max_delay / 1000000ULL);
+	seq_printf(m, "max_delay_ts: %04ld-%02d-%02d %02d:%02d:%02d\n",
+		   (long)(tm.tm_year + 1900), tm.tm_mon + 1, tm.tm_mday,
+		   tm.tm_hour, tm.tm_min, tm.tm_sec);
+	seq_printf(m, "max_delay_comm: %s\n", comm);
+	seq_printf(m, "\n");
+	seq_printf(m, "=== Statistics for the call that produced max_delay ===\n");
+	seq_printf(m, "interval_tree_total: %llu\n", stats.interval_tree_total);
+	seq_printf(m, "skip_addr_out_of_range: %llu\n", stats.skip_addr_out_of_range);
+	seq_printf(m, "skip_mm_mismatch: %llu\n", stats.skip_mm_mismatch);
+	seq_printf(m, "skip_invalid_vma: %llu\n", stats.skip_invalid_vma);
+	seq_printf(m, "rmap_one_false: %llu\n", stats.rmap_one_false);
+	seq_printf(m, "done_true: %llu\n", stats.done_true);
+	seq_printf(m, "complete_processed: %llu\n", stats.complete_processed);
+
+	return 0;
+}
+
+static struct proc_dir_entry *rmap_walk_dir;
+
+static int __init proc_rmap_stats_init(void)
+{
+	/* Create /proc/rmap_walk directory */
+	rmap_walk_dir = proc_mkdir("rmap_walk", NULL);
+	if (!rmap_walk_dir) {
+		pr_err("Failed to create /proc/rmap_walk directory\n");
+		return -ENOMEM;
+	}
+
+	/* Create proc files under /proc/rmap_walk/ */
+	proc_create_single("delay_max", 0444, rmap_walk_dir, proc_rmap_delay_max_show);
+
+	return 0;
+}
+fs_initcall(proc_rmap_stats_init);
+
+#endif /* CONFIG_PROC_FS */
+
diff --git a/mm/ksm.c b/mm/ksm.c
index 031c17e4ada6..0f45a8ea9006 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -39,6 +39,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -3154,6 +3155,7 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
 	struct ksm_stable_node *stable_node;
 	struct ksm_rmap_item *rmap_item;
 	int search_new_forks = 0;
+	u64 lock_start_time = 0;
 
 	VM_BUG_ON_FOLIO(!folio_test_ksm(folio), folio);
 
@@ -3173,6 +3175,7 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
 		struct vm_area_struct *vma;
 		unsigned long addr;
 		pgoff_t pgoff_start, pgoff_end;
+		struct rmap_walk_call_stats call_stats = {0};
 
 		cond_resched();
 		if (!anon_vma_trylock_read(anon_vma)) {
@@ -3189,35 +3192,51 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
 		pgoff_start = rmap_item->address >> PAGE_SHIFT;
 		pgoff_end = pgoff_start + folio_nr_pages(folio) - 1;
 
+		delayacct_rmap_start(&lock_start_time);
 		anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
 					       pgoff_start, pgoff_end) {
+
+			call_stats.interval_tree_total++;
 			cond_resched();
 			vma = vmac->vma;
 
-			if (addr < vma->vm_start || addr >= vma->vm_end)
+			if (addr < vma->vm_start || addr >= vma->vm_end) {
+				call_stats.skip_addr_out_of_range++;
 				continue;
+			}
 			/*
 			 * Initially we examine only the vma which covers this
 			 * rmap_item; but later, if there is still work to do,
 			 * we examine covering vmas in other mms: in case they
 			 * were forked from the original since ksmd passed.
 			 */
-			if ((rmap_item->mm == vma->vm_mm) == search_new_forks)
+			if ((rmap_item->mm == vma->vm_mm) == search_new_forks) {
+				call_stats.skip_mm_mismatch++;
 				continue;
+			}
 
-			if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
+			if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg)) {
+				call_stats.skip_invalid_vma++;
+				delayacct_rmap_end(lock_start_time, &call_stats);
 				continue;
+			}
 
 			if (!rwc->rmap_one(folio, vma, addr, rwc->arg)) {
+				call_stats.rmap_one_false++;
+				delayacct_rmap_end(lock_start_time, &call_stats);
 				anon_vma_unlock_read(anon_vma);
 				return;
 			}
 			if (rwc->done && rwc->done(folio)) {
+				call_stats.done_true++;
+				delayacct_rmap_end(lock_start_time, &call_stats);
 				anon_vma_unlock_read(anon_vma);
 				return;
 			}
+			call_stats.complete_processed++;
 		}
+		delayacct_rmap_end(lock_start_time, &call_stats);
 		anon_vma_unlock_read(anon_vma);
 	}
 	if (!search_new_forks++)
-- 
2.25.1
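
For scripted runs it may be handy to pull single counters out of
/proc/rmap_walk/delay_max instead of reading the whole file by eye. The
sketch below assumes the "key: value" layout printed by
proc_rmap_delay_max_show() above; the helper name parse_stat() is
hypothetical and not part of the patch.

```c
#include <stdio.h>
#include <string.h>

/*
 * Find the line that starts with "<key>:" in 'text' and parse the unsigned
 * number that follows. Returns 0 on success, -1 if the key is missing or
 * its value is not a number.
 */
static int parse_stat(const char *text, const char *key, unsigned long long *out)
{
	size_t klen = strlen(key);
	const char *p = text;

	while (p && *p) {
		/* Require the ':' so "max_delay" does not match "max_delay_ns". */
		if (strncmp(p, key, klen) == 0 && p[klen] == ':')
			return sscanf(p + klen + 1, "%llu", out) == 1 ? 0 : -1;

		/* Move to the start of the next line, if there is one. */
		p = strchr(p, '\n');
		if (p)
			p++;
	}
	return -1;
}
```

A caller would read the proc file into a buffer with fread() and then call,
for example, parse_stat(buf, "max_delay_ms", &ms) and
parse_stat(buf, "skip_addr_out_of_range", &skipped) to compare runs with and
without the optimization.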