[PATCH v3 0/4] mm/page_owner: add filter infrastructure for print_mode and NUMA filtering

Zhen Ni posted 4 patches 1 month, 2 weeks ago
There is a newer version of this series
Posted by Zhen Ni 1 month, 2 weeks ago
This patch series introduces filtering capabilities to the page_owner
feature to address storage and performance challenges in production
environments.

Changes from v2:
- Remove READ_ONCE/WRITE_ONCE for nodemask_t (fixes compilation errors)
  * READ_ONCE/WRITE_ONCE only accept native word-sized types, so the
    128-byte nodemask_t triggers their compile-time asserts
  * Direct assignment is safe for this use case
- Add comment explaining input length calculation formula
  * 6 bytes = ",NNNNN" (comma + 5-digit node number)
- Simplify "-1" check using kstrtoint() instead of dual strcmp()
- Move nodemask_t mask read outside PFN iteration loop for performance
  * Avoids 128-byte structure copy on each iteration
- Add documentation for filter features (patch 4/4)
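
The input length formula above can be sanity-checked in user space. The
sketch below is illustrative only and is not part of the series: the
function names and SKETCH_MAX_NUMNODES are hypothetical, and the node
count of 1024 is an assumption chosen to match the 128-byte nodemask_t
mentioned above. It compares the limit against one plausible long
input, a nodelist that names every node individually.

```c
#include <stddef.h>
#include <stdio.h>

/* Assumption: 1024 nodes, i.e. a 1024-bit (128-byte) node mask. */
#define SKETCH_MAX_NUMNODES 1024

/* Limit from the series: 100 bytes of slack plus 6 bytes per node. */
static size_t nid_input_limit(size_t max_nodes)
{
	return 100 + 6 * max_nodes;
}

/* Length of "0,1,2,...,max_nodes-1": every node listed on its own. */
static size_t longest_flat_nodelist(size_t max_nodes)
{
	size_t len = 0;

	for (size_t nid = 0; nid < max_nodes; nid++) {
		/* snprintf(NULL, 0, ...) returns the number of digits needed */
		len += (size_t)snprintf(NULL, 0, "%zu", nid);
		if (nid + 1 < max_nodes)
			len++;	/* separating comma */
	}
	return len;
}
```

Under these assumptions the flat list for 1024 nodes needs 4009 bytes
against a limit of 6244, so the ",NNNNN" estimate leaves headroom for
that input shape.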

Changes from v1:
- Renamed 'compact' to 'print_mode' with enum type for better clarity
  * PAGE_OWNER_PRINT_FULL_STACK (0): print full stack traces
  * PAGE_OWNER_PRINT_STACK_HANDLE (1): print only stack handles
- Changed NUMA filter from single node to nodelist with bitmask support
  * Uses nodelist_parse() to support "0", "0,2", "0-3", "0,2-4,7" formats
  * Uses nodemask_t internally for efficient multi-node filtering
  * Output uses %*pbl format (e.g., "0-2", "0,2-4,7")
- Improved memory handling in nid_filter_write using dynamic allocation
  * Limit: (100 + 6 * MAX_NUMNODES) to handle worst-case input
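
To illustrate the nodelist formats listed above, here is a minimal
user-space sketch of parsing a list such as "0,2-4,7" into a bitmask
and printing it back as a range list. It is not the kernel's
nodelist_parse() or %*pbl implementation: the sketch_* names are
hypothetical, the mask is a plain 64-bit integer rather than a
nodemask_t, and the error handling is an assumption.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define SKETCH_NODES 64	/* assumption: small fixed node count */

/* Parse "0", "0,2", "0-3", "0,2-4,7" into a bitmask. Returns 0 or -1. */
static int sketch_nodelist_parse(const char *s, uint64_t *mask)
{
	uint64_t m = 0;

	while (*s && *s != '\n') {
		char *end;
		unsigned long lo, hi;

		if (!isdigit((unsigned char)*s))
			return -1;
		lo = hi = strtoul(s, &end, 10);
		if (*end == '-') {		/* range "lo-hi" */
			s = end + 1;
			if (!isdigit((unsigned char)*s))
				return -1;
			hi = strtoul(s, &end, 10);
		}
		if (lo > hi || hi >= SKETCH_NODES)
			return -1;
		for (unsigned long n = lo; n <= hi; n++)
			m |= UINT64_C(1) << n;
		if (*end == ',') {		/* another entry must follow */
			end++;
			if (!isdigit((unsigned char)*end))
				return -1;
		} else if (*end != '\0' && *end != '\n') {
			return -1;
		}
		s = end;
	}
	*mask = m;
	return 0;
}

/* Format a mask as a range list such as "0,2-4,7". Returns 0 or -1. */
static int sketch_nodelist_format(uint64_t mask, char *buf, size_t size)
{
	size_t off = 0;
	int n = 0;

	if (size == 0)
		return -1;
	buf[0] = '\0';
	while (n < SKETCH_NODES) {
		int start, last, w;

		if (!((mask >> n) & 1)) {
			n++;
			continue;
		}
		start = n;
		/* extend the run while the next bit is also set */
		while (n + 1 < SKETCH_NODES && ((mask >> (n + 1)) & 1))
			n++;
		last = n++;
		if (start == last)
			w = snprintf(buf + off, size - off, "%s%d",
				     off ? "," : "", start);
		else
			w = snprintf(buf + off, size - off, "%s%d-%d",
				     off ? "," : "", start, last);
		if (w < 0 || (size_t)w >= size - off)
			return -1;	/* output truncated */
		off += (size_t)w;
	}
	return 0;
}
```

Parsing "0,1-2" and formatting the result gives "0-2", since adjacent
nodes merge into one range once they are stored as bits.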

The v3 changes address feedback from the v2 review:
- AI review tool (sashiko.dev) identified the READ_ONCE/WRITE_ONCE issue with nodemask_t
- Andrew Morton requested documentation for the filter features
- Justify the input length calculation
- Simplify the "-1" check using kstrtoint()
- Read the nodemask once, outside the PFN loop, for performance
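
The kstrtoint() simplification can be pictured with a user-space
stand-in. This is a sketch only: sketch_strtoint() and
is_disable_request() are hypothetical names, and strtol() is used in
place of the kernel helper. The point is that one integer parse that
tolerates a single trailing newline covers both "-1" and "-1\n", which
previously needed two strcmp() calls.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * User-space stand-in for kstrtoint(): the whole string must be one
 * decimal integer, optionally followed by a single newline.
 * Note: strtol() also skips leading whitespace; the sketch does not
 * try to prevent that.
 */
static int sketch_strtoint(const char *s, int *out)
{
	char *end;
	long v;

	errno = 0;
	v = strtol(s, &end, 10);
	if (end == s || errno == ERANGE || v < INT_MIN || v > INT_MAX)
		return -1;
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -1;	/* trailing garbage, e.g. "0,2-3" */
	*out = (int)v;
	return 0;
}

/* Nonzero when the input asks to disable the filter. */
static int is_disable_request(const char *s)
{
	int val;

	return sketch_strtoint(s, &val) == 0 && val == -1;
}
```

A nodelist such as "0,2-3" fails the integer parse because of the text
after the first number, so it falls through to the nodelist path.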

Problem Statement
=================

In production environments with large memory configurations (e.g., 250GB+),
collecting page_owner information often results in files ranging from
several gigabytes to over 10GB. This creates significant challenges:

1. Storage pressure on production systems
2. Difficulty transferring large files from production environments
3. Post-processing overhead with tools/mm/page_owner_sort.c

The primary contributor to file size is redundant stack trace
information. While the kernel already deduplicates stacks via
stackdepot, page_owner looks up and prints the full stack trace for
every page record, and the traces are only deduplicated again during
post-processing.

Additionally, in NUMA-aware environments (e.g., DPDK-based cloud
deployments where QEMU processes are bound to specific NUMA nodes),
OOM events are often node-specific rather than system-wide.
Currently, page_owner cannot filter by NUMA node, forcing users to
collect and analyze data for all nodes.

Solution
========

This patch series introduces a flexible filter infrastructure with
two initial filters:

1. **Print Mode Filter**: Outputs only stack handles instead of
   full stack traces. The handle-to-stack mapping can be retrieved
   from the existing show_stacks_handles interface. This dramatically
   reduces output size while preserving all allocation metadata.

2. **NUMA Node Filter**: Allows filtering pages by specific NUMA node(s)
   using flexible nodelist format, enabling targeted analysis of memory
   issues in NUMA-aware deployments.
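
The NUMA node filter semantics can be sketched in user space as
follows. This is an illustration under assumptions, not the code from
the series: struct sketch_page and the sketch_* functions are
hypothetical, and a 64-bit integer stands in for nodemask_t. An empty
mask means the filter is disabled, and the mask is taken once as a
by-value snapshot before the scan rather than re-read per page.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a page: its PFN and the node it lives on. */
struct sketch_page {
	unsigned long pfn;
	int nid;
};

/* An empty mask disables the filter; otherwise the node bit must be set. */
static bool sketch_nid_match(uint64_t mask, int nid)
{
	if (mask == 0)
		return true;
	if (nid < 0 || nid >= 64)
		return false;
	return (mask >> nid) & 1;
}

/*
 * Count the pages that would be reported. The mask is passed by value,
 * so the whole scan works on one snapshot of the filter.
 */
static size_t sketch_count_reported(const struct sketch_page *pages,
				    size_t n, uint64_t mask)
{
	size_t reported = 0;

	for (size_t i = 0; i < n; i++) {
		if (!sketch_nid_match(mask, pages[i].nid))
			continue;	/* filtered out */
		reported++;
	}
	return reported;
}
```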

Implementation
==============

The series is structured as follows:

- Patch 1: Add filter infrastructure (data structures and
  debugfs directory)
- Patch 2: Implement print_mode filter
- Patch 3: Implement NUMA node filter with nodelist support
- Patch 4: Document filter features

Usage Example
=============

Enable handle-only output (print_mode=1) and restrict the dump to
NUMA nodes 0, 2 and 3:

    # cd /sys/kernel/debug/page_owner_filter/
    # echo 1 > print_mode
    # echo "0,2-3" > nid
    # cat /sys/kernel/debug/page_owner > page_owner.txt

Sample print_mode output (showing handles only):

    Page allocated via order 0, mask 0x0(), pid 0, tgid 0 (swapper),
    ts 0 ns PFN 0x40000 type Unmovable Block 512 type Unmovable
    Flags 0x3fffe0000000000(node=0|zone=0|lastcpupid=0x1ffff)
    handle: 1048577

    Page allocated via order 0, mask 0x252000(__GFP_NOWARN|
    __GFP_NORETRY|__GFP_COMP|__GFP_THISNODE), pid 0, tgid 0 (swapper),
    ts 0 ns PFN 0x40002 type Unmovable Block 512 type Unmovable
    Flags 0x23fffe0000000200(workingset|node=0|zone=0|lastcpupid=0x1ffff)
    handle: 1048577
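
Handle-only output is straightforward to post-process. As a rough
sketch (not part of the series or of page_owner_sort), the function
below tallies how many records reference each handle in a captured
dump. The struct and function names are hypothetical, and it assumes
that the text "handle: " only appears on the handle line, as in the
sample above.

```c
#include <stdlib.h>
#include <string.h>

/* One table slot: a stack handle and the number of records using it. */
struct handle_count {
	unsigned long handle;
	unsigned long records;
};

/*
 * Scan a captured dump for "handle: N" and tally records per handle.
 * Returns the number of distinct handles, or -1 if the table is full.
 */
static int sketch_count_handles(const char *text,
				struct handle_count *tab, int cap)
{
	static const char key[] = "handle: ";
	const char *p = text;
	int used = 0;

	while ((p = strstr(p, key)) != NULL) {
		unsigned long h;
		int i;

		p += strlen(key);
		h = strtoul(p, NULL, 10);

		/* linear search is enough for a sketch */
		for (i = 0; i < used; i++)
			if (tab[i].handle == h)
				break;
		if (i == used) {
			if (used == cap)
				return -1;
			tab[used].handle = h;
			tab[used].records = 0;
			used++;
		}
		tab[i].records++;
	}
	return used;
}
```

Each distinct handle can then be resolved once through
show_stacks_handles instead of carrying a full trace per record.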

Testing
=======

Tested on a system with multiple NUMA nodes. Verified that:
- Filters work independently and in combination
- print_mode=1 output correlates correctly with show_stacks_handles
- Default behavior (filters disabled) remains unchanged
- NUMA filter works with single node, multiple nodes, and ranges
- Code compiles without warnings or errors (allmodconfig tested)

Example test session (run from /sys/kernel/debug/page_owner_filter/):

    # cat print_mode
    0
    # echo "0,1-2" > nid
    # cat nid
    0-2
    # echo "0,2-3" > nid
    # cat nid
    0,2-3
    # echo 1 > print_mode
    # head -n 100 /sys/kernel/debug/page_owner
    [Shows print_mode=1 output with handles only]

Future Enhancements
===================

The filter infrastructure is designed to be extensible. Potential
future filters could include:
- PID/TGID filtering
- Time range filtering (allocation timestamp windows)
- GFP flag filtering
- Migration type filtering

Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
---

Zhen Ni (4):
  mm/page_owner: add filter infrastructure
  mm/page_owner: add print_mode filter
  mm/page_owner: add NUMA node filter with nodelist support
  mm/page_owner: document page_owner filter features

 Documentation/mm/page_owner.rst |  55 +++++++++++++-
 mm/page_owner.c                 | 130 +++++++++++++++++++++++++++++++-
 2 files changed, 182 insertions(+), 3 deletions(-)

--
2.20.1
Re: [PATCH v3 0/4] mm/page_owner: add filter infrastructure for print_mode and NUMA filtering
Posted by Andrew Morton 1 month, 2 weeks ago
On Tue, 28 Apr 2026 15:11:08 +0800 Zhen Ni <zhen.ni@easystack.cn> wrote:

> This patch series introduces filtering capabilities to the page_owner
> feature to address storage and performance challenges in production
> environments.

Thanks, I updated mm.git's mm-new branch to this version.

> Changes from v2:
> - Remove READ_ONCE/WRITE_ONCE for nodemask_t (fixes compilation errors)
>   * nodemask_t is a large structure (128 bytes) that triggers compile-time asserts
>   * Direct assignment is safe for this use case
> - Add comment explaining input length calculation formula
>   * 6 bytes = ",NNNNN" (comma + 5-digit node number)
> - Simplify "-1" check using kstrtoint() instead of dual strcmp()
> - Move nodemask_t mask read outside PFN iteration loop for performance
>   * Avoids 128-byte structure copy on each iteration
> - Add documentation for filter features (patch 4/4)

Here's how v3 altered mm.git:

 Documentation/mm/page_owner.rst |   55 +++++++++++++++++++++++++++++-
 mm/page_owner.c                 |   14 +++++--
 2 files changed, 64 insertions(+), 5 deletions(-)

--- a/Documentation/mm/page_owner.rst~b
+++ a/Documentation/mm/page_owner.rst
@@ -74,7 +74,17 @@ Usage
 
 3) Do the job that you want to debug.
 
-4) Analyze information from page owner::
+4) (Optional) Use filters to focus on specific memory allocations::
+
+    cd /sys/kernel/debug/page_owner_filter
+
+    # Print only stack handles instead of full traces
+    echo 1 > print_mode
+
+    # Filter by NUMA nodes
+    echo "0,2-3" > nid
+
+5) Analyze information from page owner::
 
 	cat /sys/kernel/debug/page_owner_stacks/show_stacks > stacks.txt
 	cat stacks.txt
@@ -238,6 +248,49 @@ Usage
 				./page_owner_sort <input> <output> --tgid=1,2,3
 				./page_owner_sort <input> <output> --name name1,name2
 
+Page Owner Filters
+==================
+
+The page_owner feature provides filtering capabilities to focus on specific
+memory allocations (e.g., by NUMA node). Filters are controlled through debugfs
+files in ``/sys/kernel/debug/page_owner_filter/``.
+
+Print Mode Filter
+-----------------
+
+The ``print_mode`` file controls the level of detail in stack trace output.
+
+Available modes:
+
+- ``0`` (default): Print full stack traces
+- ``1``: Print only stack handles
+
+The ``print_mode=1`` output format::
+
+    Page allocated via order 0, mask 0x42800(GFP_NOWAIT|__GFP_COMP),
+    pid 1, tgid 1 (systemd), ts 349667370 ns
+    PFN 0xa00a2 type Unmovable Block 1280 type Unmovable
+    Flags 0x33fffe0000004124(...)
+    handle: 17432583
+
+To retrieve the full stack trace for a handle, use::
+
+    cat /sys/kernel/debug/page_owner_stacks/show_stacks_handles
+
+NUMA Node Filter
+----------------
+
+The ``nid`` file filters pages by NUMA node. This is useful for NUMA-aware
+environments to analyze node-specific memory allocation.
+
+Supported input formats:
+
+- Single node: ``echo "2" > nid``
+- Multiple nodes: ``echo "0,2,3" > nid``
+- Node range: ``echo "0-3" > nid``
+- Mixed format: ``echo "0,2-4,7" > nid``
+- Disable filter: ``echo "-1" > nid``
+
 STANDARD FORMAT SPECIFIERS
 ==========================
 ::
--- a/mm/page_owner.c~b
+++ a/mm/page_owner.c
@@ -685,6 +685,7 @@ read_page_owner(struct file *file, char
 	struct page_ext *page_ext;
 	struct page_owner *page_owner;
 	depot_stack_handle_t handle;
+	nodemask_t mask;
 
 	if (!static_branch_unlikely(&page_owner_inited))
 		return -EINVAL;
@@ -698,6 +699,8 @@ read_page_owner(struct file *file, char
 	while (!pfn_valid(pfn) && (pfn & (MAX_ORDER_NR_PAGES - 1)) != 0)
 		pfn++;
 
+	mask = owner_filter.nid_mask;
+
 	/* Find an allocated page */
 	for (; pfn < max_pfn; pfn++) {
 		/*
@@ -707,7 +710,6 @@ read_page_owner(struct file *file, char
 		 * user through copy_to_user() or GFP_KERNEL allocations.
 		 */
 		struct page_owner page_owner_tmp;
-		nodemask_t mask;
 
 		/*
 		 * If the new page is in a new MAX_ORDER_NR_PAGES area,
@@ -732,7 +734,6 @@ read_page_owner(struct file *file, char
 			continue;
 
 		/* NUMA node filter using bitmask */
-		mask = owner_filter.nid_mask;
 		if (!nodes_empty(mask)) {
 			int nid = page_to_nid(page);
 
@@ -1026,8 +1027,13 @@ static ssize_t nid_filter_write(struct f
 	char *kbuf;
 	nodemask_t mask;
 	int ret;
+	int val;
 
-	/* Limit input size to handle worst-case nodelist (all nodes) */
+	/*
+	 * Limit input size to handle worst-case nodelist (all nodes).
+	 * Worst case per node: ",NNNNN" (comma + 5-digit node number) = 6 bytes.
+	 * Formula: 100 bytes overhead + 6 * MAX_NUMNODES
+	 */
 	if (count > (100 + 6 * MAX_NUMNODES))
 		return -EINVAL;
 
@@ -1042,7 +1048,7 @@ static ssize_t nid_filter_write(struct f
 	kbuf[count] = '\0';
 
 	/* Support: "-1" to clear, or nodelist format like "0", "0,2", "0-3" */
-	if (strcmp(kbuf, "-1\n") == 0 || strcmp(kbuf, "-1") == 0)
+	if (kstrtoint(kbuf, 10, &val) == 0 && val == -1)
 		nodes_clear(mask);
 	else if (nodelist_parse(kbuf, mask)) {
 		ret = -EINVAL;
_