This patch series introduces filtering capabilities to the page_owner
feature to address storage and performance challenges in production
environments.
Changes from v5:
- Address SeongJae Park's review comments for patch 1/3:
* Remove unnecessary braces in if/else statement
* Use stack array instead of kmalloc for input buffer
- Address SeongJae Park's review comments for patch 2/3:
* Add node validity check using nodes_subset() to reject non-existent nodes
* Separate variable declaration and statement
* Use kmalloc_objs() for consistency with kernel patterns
* Remove 100 bytes overhead
- Add lore links to all previous versions
Changes from v4:
- Optimize nodes_empty() check in page iteration loop
- Add __data_racy qualifier to nid_mask field
Changes from v3:
- Change print_mode from numeric (0/1) to string-based interface
* Use "full_stack"/"stack_handle" strings instead of numbers
* Display current mode with bracket notation: "[full_stack] stack_handle"
- Remove "-1" support from NUMA filter
* Use empty string to clear filter (echo > nid)
- Use strncpy_from_user() instead of copy_from_user()
- Rename nid_filter_fops to page_owner_nid_filter_fops for consistency
- Merge patch 1 (infrastructure) and patch 2 (print_mode) from v3
- Update documentation to match new interface
* String-based examples
* Tab indentation in code blocks
Changes from v2:
- Remove READ_ONCE/WRITE_ONCE for nodemask_t (fixes compilation errors)
* nodemask_t is a large structure (128 bytes) that triggers compile-time asserts
* Direct assignment is safe for this use case
- Add comment explaining input length calculation formula
* 6 bytes = ",NNNNN" (comma + 5-digit node number)
- Simplify "-1" check using kstrtoint() instead of dual strcmp()
- Move nodemask_t mask read outside PFN iteration loop for performance
* Avoids 128-byte structure copy on each iteration
- Add documentation for filter features (patch 3/3)
Changes from v1:
- Renamed 'compact' to 'print_mode' with enum type for better clarity
* PAGE_OWNER_PRINT_FULL_STACK (0): print full stack traces
* PAGE_OWNER_PRINT_STACK_HANDLE (1): print only stack handles
- Changed NUMA filter from single node to nodelist with bitmask support
* Uses nodelist_parse() to support "0", "0,2", "0-3", "0,2-4,7" formats
* Uses nodemask_t internally for efficient multi-node filtering
* Output uses %*pbl format (e.g., "0-2", "0,2-4,7")
- Improved memory handling in nid_filter_write using dynamic allocation
* Limit: (100 + 6 * MAX_NUMNODES) to handle worst-case input
Problem Statement
=================
In production environments with large memory configurations (e.g., 250GB+),
collecting page_owner information often results in files ranging from
several gigabytes to over 10GB. This creates significant challenges:
1. Storage pressure on production systems
2. Difficulty transferring large files from production environments
3. Post-processing overhead with tools/mm/page_owner_sort.c
The primary contributor to file size is redundant stack trace
information. While the kernel already deduplicates stacks via
stackdepot, page_owner retrieves and stores full stack traces for
each page, only to deduplicate them again during post-processing.
Additionally, in NUMA-aware environments (e.g., DPDK-based cloud
deployments where QEMU processes are bound to specific NUMA nodes),
OOM events are often node-specific rather than system-wide.
Currently, page_owner cannot filter by NUMA node, forcing users to
collect and analyze data for all nodes.
Solution
========
This patch series introduces a flexible filter infrastructure with
two initial filters:
1. **Print Mode Filter**: Outputs only stack handles instead of
full stack traces. The handle-to-stack mapping can be retrieved
from the existing show_stacks_handles interface. This dramatically
reduces output size while preserving all allocation metadata.
2. **NUMA Node Filter**: Allows filtering pages by specific NUMA node(s)
using flexible nodelist format, enabling targeted analysis of memory
issues in NUMA-aware deployments.
Implementation
==============
The series is structured as follows:
- Patch 1: Implement print_mode filter with string-based interface
(merges infrastructure + print_mode from v3)
- Patch 2: Implement NUMA node filter with nodelist support
* v6: Add node validity check to reject non-existent nodes
- Patch 3: Document filter features
Usage Example
=============
Enable print_mode and filter for NUMA nodes 0,2-3:
# cd /sys/kernel/debug/page_owner_filter/
# echo stack_handle > print_mode
# echo "0,2-3" > nid
# cat /sys/kernel/debug/page_owner > page_owner.txt
Sample print_mode output (showing handles only):
Page allocated via order 0, mask 0x0(), pid 0, tgid 0 (swapper),
ts 0 ns PFN 0x40000 type Unmovable Block 512 type Unmovable
Flags 0x3fffe0000000000(node=0|zone=0|lastcpupid=0x1ffff)
handle: 1048577
Page allocated via order 0, mask 0x252000(__GFP_NOWARN|
__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE), pid 0, tgid 0 (swapper),
ts 0 ns PFN 0x40002 type Unmovable Block 512 type Unmovable
Flags 0x23fffe0000000200(workingset|node=0|zone=0|lastcpupid=0x1ffff)
handle: 1048577
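To illustrate how little post-processing the handle-only output needs, here is a minimal Python sketch that counts base pages per stack handle. It is not part of the series. The regular expressions are assumptions derived from the sample output above, and `pages_per_handle` is a hypothetical helper name.

```python
import re
from collections import Counter

# Patterns follow the sample output above (assumption: other kernels may differ).
ORDER_RE = re.compile(r"^\s*Page allocated via order (\d+),")
HANDLE_RE = re.compile(r"^\s*handle: (\d+)\s*$")


def pages_per_handle(lines):
    """Return a Counter mapping stack handle -> number of base pages."""
    counts = Counter()
    order = None
    for line in lines:
        m = ORDER_RE.match(line)
        if m:
            # Start of a new record: remember its allocation order.
            order = int(m.group(1))
            continue
        m = HANDLE_RE.match(line)
        if m and order is not None:
            # An order-N allocation covers 2**N base pages.
            counts[int(m.group(1))] += 1 << order
            order = None
    return counts


SAMPLE = """\
Page allocated via order 0, mask 0x0(), pid 0, tgid 0 (swapper),
ts 0 ns PFN 0x40000 type Unmovable Block 512 type Unmovable
Flags 0x3fffe0000000000(node=0|zone=0|lastcpupid=0x1ffff)
handle: 1048577
Page allocated via order 0, mask 0x252000(__GFP_NOWARN|
__GFP_NORETRY|__GFP_COMP|__GFP_THISNODE), pid 0, tgid 0 (swapper),
ts 0 ns PFN 0x40002 type Unmovable Block 512 type Unmovable
Flags 0x23fffe0000000200(workingset|node=0|zone=0|lastcpupid=0x1ffff)
handle: 1048577
"""

print(dict(pages_per_handle(SAMPLE.splitlines())))
```

The resulting handles could then be looked up in show_stacks_handles to recover the full traces only for the stacks of interest.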
Testing
=======
Tested on a system with multiple NUMA nodes. Verified that:
- Filters work independently and in combination
- print_mode output correlates correctly with show_stacks_handles
- Default behavior (filters disabled) remains unchanged
- NUMA filter works with single node, multiple nodes, and ranges
- String-based interface works correctly ("full_stack"/"stack_handle")
- Empty string clears NUMA filter
- Node validity check correctly rejects non-existent nodes
- Code compiles without warnings or errors (allmodconfig tested)
Example test session:
# cat print_mode
[full_stack] stack_handle
# echo stack_handle > print_mode
# cat print_mode
full_stack [stack_handle]
# echo "0,1-2" > nid
# cat nid
0-2
# echo "0,2-3" > nid
# cat nid
0,2-3
# echo "10" > nid
-bash: echo: write error: Invalid argument
# echo > nid
# cat nid
(empty - filter cleared)
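The nodelist behaviour exercised in this session can be approximated in user space with the sketch below. It is an illustration only, not the kernel's nodelist_parse(): it handles plain numbers and ranges, and the function names and the assumption that nodes 0-3 exist are hypothetical.

```python
def parse_nodelist(text, valid_nodes):
    """Parse "0,2-4,7"-style input into a set of node IDs.

    Raises ValueError for malformed input or for nodes that are not in
    valid_nodes, mirroring the "Invalid argument" write error above.
    """
    text = text.strip()
    nodes = set()
    if not text:
        return nodes  # empty input clears the filter
    for part in text.split(","):
        lo, sep, hi = part.partition("-")
        start = int(lo)  # int("") raises ValueError, so "-1" is rejected
        end = int(hi) if sep else start
        if start > end:
            raise ValueError(f"bad range: {part!r}")
        nodes.update(range(start, end + 1))
    if not nodes <= set(valid_nodes):
        raise ValueError(f"non-existent node in {text!r}")
    return nodes


def format_nodelist(nodes):
    """Render node IDs as a compact range list, similar to %*pbl output."""
    runs = []
    for n in sorted(nodes):
        if runs and n == runs[-1][1] + 1:
            runs[-1][1] = n  # extend the current run
        else:
            runs.append([n, n])  # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in runs)


EXISTING = range(4)  # assumption: the test machine has nodes 0-3
print(format_nodelist(parse_nodelist("0,1-2", EXISTING)))
print(format_nodelist(parse_nodelist("0,2-3", EXISTING)))
```

Under these assumptions, "0,1-2" is normalised to a single range while "0,2-3" keeps its gap, matching what `cat nid` shows above.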
Future Enhancements
===================
The filter infrastructure is designed to be extensible. Potential
future filters could include:
- PID/TGID filtering
- Time range filtering (allocation timestamp windows)
- GFP flag filtering
- Migration type filtering
v5: https://lore.kernel.org/linux-mm/20260507064643.179187-1-zhen.ni@easystack.cn/
v4: https://lore.kernel.org/linux-mm/20260430163247.13628-1-zhen.ni@easystack.cn/
v3: https://lore.kernel.org/linux-mm/20260428071112.1420380-1-zhen.ni@easystack.cn/
v2: https://lore.kernel.org/linux-mm/20260419155540.376847-1-zhen.ni@easystack.cn/
v1: https://lore.kernel.org/linux-mm/20260417154638.22370-1-zhen.ni@easystack.cn/
Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
---
Zhen Ni (3):
mm/page_owner: add print_mode filter
mm/page_owner: add NUMA node filter with nodelist support
mm/page_owner: document page_owner filter features
Documentation/mm/page_owner.rst | 61 ++++++++++-
mm/page_owner.c | 174 +++++++++++++++++++++++++++++++-
2 files changed, 232 insertions(+), 3 deletions(-)
--
2.20.1
On Mon 11-05-26 11:30:14, Zhen Ni wrote:
> Solution
> ========
>
> This patch series introduces a flexible filter infrastructure with
> two initial filters:
>
> 1. **Print Mode Filter**: Outputs only stack handles instead of
> full stack traces. The handle-to-stack mapping can be retrieved
> from the existing show_stacks_handles interface. This dramatically
> reduces output size while preserving all allocation metadata.
>
> 2. **NUMA Node Filter**: Allows filtering pages by specific NUMA node(s)
> using flexible nodelist format, enabling targeted analysis of memory
> issues in NUMA-aware deployments.

How does this work when there are multiple consumers of the interface?
E.g per numa tool to watch node lock page_owner information?

--
Michal Hocko
SUSE Labs
On 2026/5/11 20:23, Michal Hocko wrote:
> On Mon 11-05-26 11:30:14, Zhen Ni wrote:
>> Solution
>> ========
>>
>> This patch series introduces a flexible filter infrastructure with
>> two initial filters:
>>
>> 1. **Print Mode Filter**: Outputs only stack handles instead of
>> full stack traces. The handle-to-stack mapping can be retrieved
>> from the existing show_stacks_handles interface. This dramatically
>> reduces output size while preserving all allocation metadata.
>>
>> 2. **NUMA Node Filter**: Allows filtering pages by specific NUMA node(s)
>> using flexible nodelist format, enabling targeted analysis of memory
>> issues in NUMA-aware deployments.
>
> How does this work when there are multiple consumers of the interface?
> E.g per numa tool to watch node lock page_owner information?
>

I understand your concern about concurrent access. Are you asking
about this scenario?

Scenario: Multiple tools monitoring different NUMA nodes
Tool 1: echo "0" > nid && cat page_owner > node0.log
Tool 2: echo "1" > nid && cat page_owner > node1.log

The current global filter implementation would have race conditions
in this case.

Best regards,
Zhen
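The race in this scenario can be made concrete with a toy Python model. It is only an illustration, not kernel code: the page list, the PFN 0x80000 on node 1, and the helper names are all hypothetical, and a single shared set stands in for the global nodemask.

```python
# Toy model: pages are (pfn, node) tuples and the filter is one shared set.
PAGES = [(0x40000, 0), (0x40002, 0), (0x80000, 1)]
nid_filter = set()  # stands in for the single global nodemask


def write_nid(nodes):
    """Model `echo ... > nid`: replace the shared filter."""
    nid_filter.clear()
    nid_filter.update(nodes)


def read_page_owner():
    """Model `cat page_owner`: apply whatever filter is set right now."""
    return [pfn for pfn, node in PAGES if not nid_filter or node in nid_filter]


# One possible interleaving of the two tools in the scenario above.
write_nid({0})                 # Tool 1: echo "0" > nid
write_nid({1})                 # Tool 2: echo "1" > nid
node0_log = read_page_owner()  # Tool 1: cat page_owner > node0.log
node1_log = read_page_owner()  # Tool 2: cat page_owner > node1.log

print([hex(pfn) for pfn in node0_log])
```

With this interleaving, Tool 1's log contains only the node 1 page, because Tool 2 overwrote the shared filter between Tool 1's write and read.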
On Mon 11-05-26 20:40:07, zhen.ni wrote:
>
>
> On 2026/5/11 20:23, Michal Hocko wrote:
> > On Mon 11-05-26 11:30:14, Zhen Ni wrote:
> > > Solution
> > > ========
> > >
> > > This patch series introduces a flexible filter infrastructure with
> > > two initial filters:
> > >
> > > 1. **Print Mode Filter**: Outputs only stack handles instead of
> > > full stack traces. The handle-to-stack mapping can be retrieved
> > > from the existing show_stacks_handles interface. This dramatically
> > > reduces output size while preserving all allocation metadata.
> > >
> > > 2. **NUMA Node Filter**: Allows filtering pages by specific NUMA node(s)
> > > using flexible nodelist format, enabling targeted analysis of memory
> > > issues in NUMA-aware deployments.
> >
> > How does this work when there are multiple consumers of the interface?
> > E.g per numa tool to watch node lock page_owner information?
> >
> I understand your concern about concurrent access. Are you asking
> about this scenario?
>
> Scenario: Multiple tools monitoring different NUMA nodes
> Tool 1: echo "0" > nid && cat page_owner > node0.log
> Tool 2: echo "1" > nid && cat page_owner > node1.log
>
> The current global filter implementation would have race conditions
> in this case.

That makes the interface rather broken in my eyes TBH. Is there any way
to make the filter local to the fd?

--
Michal Hocko
SUSE Labs
On 2026/5/11 20:54, Michal Hocko wrote:
> On Mon 11-05-26 20:40:07, zhen.ni wrote:
>>
>>
>> On 2026/5/11 20:23, Michal Hocko wrote:
>>> On Mon 11-05-26 11:30:14, Zhen Ni wrote:
>>>> Solution
>>>> ========
>>>>
>>>> This patch series introduces a flexible filter infrastructure with
>>>> two initial filters:
>>>>
>>>> 1. **Print Mode Filter**: Outputs only stack handles instead of
>>>> full stack traces. The handle-to-stack mapping can be retrieved
>>>> from the existing show_stacks_handles interface. This dramatically
>>>> reduces output size while preserving all allocation metadata.
>>>>
>>>> 2. **NUMA Node Filter**: Allows filtering pages by specific NUMA node(s)
>>>> using flexible nodelist format, enabling targeted analysis of memory
>>>> issues in NUMA-aware deployments.
>>>
>>> How does this work when there are multiple consumers of the interface?
>>> E.g per numa tool to watch node lock page_owner information?
>>>
>> I understand your concern about concurrent access. Are you asking
>> about this scenario?
>>
>> Scenario: Multiple tools monitoring different NUMA nodes
>> Tool 1: echo "0" > nid && cat page_owner > node0.log
>> Tool 2: echo "1" > nid && cat page_owner > node1.log
>>
>> The current global filter implementation would have race conditions
>> in this case.
>
> That makes the interface rather broken in my eyes TBH. Is there any way
> to make the filter local to the fd?

I agree that the global filter state creates race conditions for
concurrent consumers.

Regarding per-fd filters, I've looked into this approach. The main
challenge is that per-fd filter state would require changing the current
simple usage model:

Current usage:
echo "0" > /sys/kernel/debug/page_owner_filter/nid
cat /sys/kernel/debug/page_owner

Per-fd implementation would require:
- Add ioctl interface and allocate filter state in file->private_data
- Change page_owner_fops to add .open/.unlocked_ioctl callbacks
- Provide user-space tool (e.g., ./page_owner_tool --node 0)
- New UAPI header with ioctl definitions

This would replace the current "echo + cat" interface with a
tool-based approach.

Alternative: Simple mutex protection to serialize
concurrent filter modifications. Though this doesn't fully address
concurrent reads, it could mitigate the most obvious race conditions.

I'm wondering if you have any thoughts on the trade-off here. Since
page_owner is mainly used for debugging (typically not in concurrent
scenarios), would a simpler approach like mutex protection or documenting
this limitation be sufficient?

Thanks,
Zhen
On Tue 12-05-26 11:11:47, zhen.ni wrote:
>
>
> On 2026/5/11 20:54, Michal Hocko wrote:
> > On Mon 11-05-26 20:40:07, zhen.ni wrote:
> > >
> > >
> > > On 2026/5/11 20:23, Michal Hocko wrote:
> > > > On Mon 11-05-26 11:30:14, Zhen Ni wrote:
> > > > > Solution
> > > > > ========
> > > > >
> > > > > This patch series introduces a flexible filter infrastructure with
> > > > > two initial filters:
> > > > >
> > > > > 1. **Print Mode Filter**: Outputs only stack handles instead of
> > > > > full stack traces. The handle-to-stack mapping can be retrieved
> > > > > from the existing show_stacks_handles interface. This dramatically
> > > > > reduces output size while preserving all allocation metadata.
> > > > >
> > > > > 2. **NUMA Node Filter**: Allows filtering pages by specific NUMA node(s)
> > > > > using flexible nodelist format, enabling targeted analysis of memory
> > > > > issues in NUMA-aware deployments.
> > > >
> > > > How does this work when there are multiple consumers of the interface?
> > > > E.g per numa tool to watch node lock page_owner information?
> > > >
> > > I understand your concern about concurrent access. Are you asking
> > > about this scenario?
> > >
> > > Scenario: Multiple tools monitoring different NUMA nodes
> > > Tool 1: echo "0" > nid && cat page_owner > node0.log
> > > Tool 2: echo "1" > nid && cat page_owner > node1.log
> > >
> > > The current global filter implementation would have race conditions
> > > in this case.
> >
> > That makes the interface rather broken in my eyes TBH. Is there any way
> > to make the filter local to the fd?
>
> I agree that the global filter state creates race conditions for
> concurrent consumers.
>
> Regarding per-fd filters, I've looked into this approach. The main
> challenge is that per-fd filter state would require changing the current
> simple usage model:
> Current usage:
> echo "0" > /sys/kernel/debug/page_owner_filter/nid
> cat /sys/kernel/debug/page_owner
> Per-fd implementation would require:
> - Add ioctl interface and allocate filter state in file->private_data
> - Change page_owner_fops to add .open/.unlocked_ioctl callbacks
> - Provide user-space tool (e.g., ./page_owner_tool --node 0)
> - New UAPI header with ioctl definitions

ioctl is one option. Have you considered to write the filter state to
the page_owner fd to create a local state?

> This would replace the current "echo + cat" interface with a
> tool-based approach.

Which doesn't sound all that terrible comparing to a non-deterministic
behavior of this proposal

> Alternative: Simple mutex protection to serialize
> concurrent filter modifications. Though this doesn't fully address
> concurrent reads, it could mitigate the most obvious race conditions.
>
> I'm wondering if you have any thoughts on the trade-off here. Since
> page_owner is mainly used for debugging (typically not in concurrent
> scenarios), would a simpler approach like mutex protection or documenting
> this limitation be sufficient?

The thing is that unless you own the whole machine you never know who
might consider information from page_owner interesting to filter and
read. So you might easily get garbage. Not completely terrible
considering this is debugging interface but I believe we can do better
than that.

--
Michal Hocko
SUSE Labs
On 2026/5/12 15:26, Michal Hocko wrote:
> On Tue 12-05-26 11:11:47, zhen.ni wrote:
>>
>>
>> On 2026/5/11 20:54, Michal Hocko wrote:
>>> On Mon 11-05-26 20:40:07, zhen.ni wrote:
>>>>
>>>>
>>>> On 2026/5/11 20:23, Michal Hocko wrote:
>>>>> On Mon 11-05-26 11:30:14, Zhen Ni wrote:
>>>>>> Solution
>>>>>> ========
>>>>>>
>>>>>> This patch series introduces a flexible filter infrastructure with
>>>>>> two initial filters:
>>>>>>
>>>>>> 1. **Print Mode Filter**: Outputs only stack handles instead of
>>>>>> full stack traces. The handle-to-stack mapping can be retrieved
>>>>>> from the existing show_stacks_handles interface. This dramatically
>>>>>> reduces output size while preserving all allocation metadata.
>>>>>>
>>>>>> 2. **NUMA Node Filter**: Allows filtering pages by specific NUMA node(s)
>>>>>> using flexible nodelist format, enabling targeted analysis of memory
>>>>>> issues in NUMA-aware deployments.
>>>>>
>>>>> How does this work when there are multiple consumers of the interface?
>>>>> E.g per numa tool to watch node lock page_owner information?
>>>>>
>>>> I understand your concern about concurrent access. Are you asking
>>>> about this scenario?
>>>>
>>>> Scenario: Multiple tools monitoring different NUMA nodes
>>>> Tool 1: echo "0" > nid && cat page_owner > node0.log
>>>> Tool 2: echo "1" > nid && cat page_owner > node1.log
>>>>
>>>> The current global filter implementation would have race conditions
>>>> in this case.
>>>
>>> That makes the interface rather broken in my eyes TBH. Is there any way
>>> to make the filter local to the fd?
>>
>> I agree that the global filter state creates race conditions for
>> concurrent consumers.
>>
>> Regarding per-fd filters, I've looked into this approach. The main
>> challenge is that per-fd filter state would require changing the current
>> simple usage model:
>
>> Current usage:
>> echo "0" > /sys/kernel/debug/page_owner_filter/nid
>> cat /sys/kernel/debug/page_owner
>
>> Per-fd implementation would require:
>> - Add ioctl interface and allocate filter state in file->private_data
>> - Change page_owner_fops to add .open/.unlocked_ioctl callbacks
>> - Provide user-space tool (e.g., ./page_owner_tool --node 0)
>> - New UAPI header with ioctl definitions
>
> ioctl is one option. Have you considered to write the filter state to
> the page_owner fd to create a local state?
>
>> This would replace the current "echo + cat" interface with a
>> tool-based approach.
>
> Which doesn't sound all that terrible comparing to a non-deterministic
> behavior of this proposal
>
>> Alternative: Simple mutex protection to serialize
>> concurrent filter modifications. Though this doesn't fully address
>> concurrent reads, it could mitigate the most obvious race conditions.
>>
>> I'm wondering if you have any thoughts on the trade-off here. Since
>> page_owner is mainly used for debugging (typically not in concurrent
>> scenarios), would a simpler approach like mutex protection or documenting
>> this limitation be sufficient?
>
> The thing is that unless you own the whole machine you never know who
> might consider information from page_owner interesting to filter and
> read. So you might easily get garbage. Not completely terrible
> considering this is debugging interface but I believe we can do better
> than that.
Thank you for the feedback.
I've been thinking about the per-fd filtering approach you suggested:
## Implementation Plan
1. Add per-fd filtering to page_owner file
- Add .open/.release/.write callbacks
- Each file descriptor has its own filter state
- Write filter commands: "nid=0", "mode=stack_handle"
2. Provide user-space tool
- Simple CLI: ./page_owner_tool --nid=0
- Handle fd management internally
## User Experience
Direct access (default: no filter):
cat /sys/kernel/debug/page_owner
With filtering:
./page_owner_tool --nid=0
./page_owner_tool --nid=0,2-3
./page_owner_tool --nid=0 --mode=stack_handle
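As a rough idea of what such a tool might do internally, here is a Python sketch under stated assumptions. The per-fd write interface does not exist yet, so the "nid=..." and "mode=..." command strings are taken from the plan above, while one write() per command, the function names, and the 64 KiB read size are hypothetical choices.

```python
import os

PAGE_OWNER = "/sys/kernel/debug/page_owner"


def build_commands(nid=None, mode=None):
    """Translate CLI options into the proposed per-fd filter commands."""
    commands = []
    if nid is not None:
        commands.append(f"nid={nid}")
    if mode is not None:
        commands.append(f"mode={mode}")
    return commands


def dump(path, commands, out):
    """Set filters on a private fd, then stream its contents to out.

    out must be a binary stream such as sys.stdout.buffer. The filter
    state lives only as long as this fd, so concurrent readers using
    their own fds would not see it (assumed semantics of the proposal).
    """
    fd = os.open(path, os.O_RDWR)
    try:
        for command in commands:
            os.write(fd, command.encode())  # assumption: one command per write
        while True:
            chunk = os.read(fd, 1 << 16)
            if not chunk:
                break
            out.write(chunk)
    finally:
        os.close(fd)
```

For `--nid=0,2-3 --mode=stack_handle`, the tool would call `dump(PAGE_OWNER, build_commands("0,2-3", "stack_handle"), sys.stdout.buffer)`.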
## Benefits
- Completely eliminates race condition
- Per-fd isolation for concurrent access
- Correct design for multi-consumer scenarios
Does this approach look good to you?
Please let me know if you have any suggestions or concerns.
Thanks,
Zhen
On Tue 12-05-26 16:16:36, zhen.ni wrote:
> ## Implementation Plan
>
> 1. Add per-fd filtering to page_owner file
> - Add .open/.release/.write callbacks
> - Each file descriptor has its own filter state
> - Write filter commands: "nid=0", "mode=stack_handle"
>
> 2. Provide user-space tool
> - Simple CLI: ./page_owner_tool --nid=0
> - Handle fd management internally
>
> ## User Experience
>
> Direct access (default: no filter):
> cat /sys/kernel/debug/page_owner
>
> With filtering:
> ./page_owner_tool --nid=0
> ./page_owner_tool --nid=0,2-3
> ./page_owner_tool --nid=0 --mode=stack_handle
>
> ## Benefits
>
> - Completely eliminates race condition
> - Per-fd isolation for concurrent access
> - Correct design for multi-consumer scenarios
>
> Does this approach look good to you?
>
> Please let me know if you have any suggestions or concerns.

Yes, this is what I had in mind. Thanks for looking into that.

--
Michal Hocko
SUSE Labs