Hi all,
This is an RFC, not for merge. The series exercises and validates
damon_report_access() -- the consumer API SeongJae introduced in [1]
-- as a substrate for ingesting access reports from hardware-sampling
sources. The series includes one worked-example backend, an AMD IBS
Op module (damon_ibs.ko), that runs on Zen 3+ silicon via the
existing perf event subsystem.
Combined with node_eligible_mem_bp [2], the recently-merged DAMOS goal
metric, the same DAMON interface composes naturally for two
operational regimes from one set of primitives:
1. Traditional tiering -- promote hot pages to DRAM up to a target
cap.
2. System-wide bandwidth interleaving -- split hot pages between
DRAM and CXL at an operator-chosen ratio, for workloads where
placing some hot pages on CXL improves aggregate throughput.
Either regime composes with a separately-configured migrate_cold
scheme to pair bandwidth shaping with capacity expansion: the
hot-page schemes drive placement to meet the bandwidth target while
migrate_cold reclaims DRAM by demoting cold pages.
The demonstration in this RFC exercises different
target ratios of the same PULL+PUSH setup.
Why a hardware-source primitive complements existing primitives
===============================================================
DAMON's existing access-check primitives observe access through
software paths:
- PTE-Accessed bit scanning samples Accessed bits and clears them
periodically. The hardware sets PTE-A on TLB miss, so already-
resident TLB entries do not re-set the bit until they're evicted.
For pages whose translations stay TLB-resident across DAMON's
aggregation interval, nr_accesses reflects fewer accesses than
the page actually serviced. This is correct behaviour for the
primitive -- it observes what the TLB-miss path observes.
- Page-fault sampling (NUMA hint faults) requires unmapping pages
to provoke the fault, then samples access on the fault path.
For closed-loop schemes that drive migrate_hot from the same
observations, the unmap and the migrate action interact.
Both primitives produce a view of hotness that converges to the
true distribution over the aggregation interval. For systems where
the address space is small relative to the aggregation rate, this is
the right tool. On large heterogeneous-memory systems with goal-
driven schemes asking the closed-loop tuner to converge on a target
distribution, a complementary lower-latency view of accesses can
tighten the loop -- reducing the time DAMON's nr_accesses takes to
reflect the workload's actual access distribution, which in turn
reduces ramp duration and oscillation amplitude during convergence
of goal-driven schemes.
A hardware-sampling primitive provides this complementary view:
hardware retirement records each access at its natural event rate,
with a physical address per sample, independent of TLB state and
independent of the unmap/fault path.
This RFC adds the substrate (damon_report_access) so any hardware
sampler -- IBS, PEBS, future CXL hotness monitoring units --
can feed access reports into the kdamond drain path and existing
DAMOS schemes. The substrate is the contribution; the IBS backend
is one worked example proving it on broadly-available silicon today.
Demonstration
=============
The two-scheme PULL+PUSH setup from the node_eligible_mem_bp
introduction holds a target hot-memory ratio across DRAM and CXL.
With damon_ibs.ko feeding damon_report_access, we observe two
operational regimes:
Cold-start convergence -- the workload starts at an even DRAM/CXL
distribution (numactl --interleave=DRAM,CXL), the DAMON context starts
with the target ratio set at kdamond launch, and the schemes converge
from the initial distribution to the target distribution.
+-----------+--------+----------+---------+
| Target | Mean | Offset | Stddev |
+-----------+--------+----------+---------+
| 70% DRAM | 69.73% | -0.27pp | 0.70pp |
| 30% DRAM | 31.00% | +1.00pp | 1.28pp |
+-----------+--------+----------+---------+
Live target changes from a converged state -- the kdamond context runs
continuously, and the target ratio is updated via DAMOS
commit_schemes_quota_goals without kdamond teardown.
+-----------+--------+----------+---------+
| Target | Mean | Offset | Stddev |
+-----------+--------+----------+---------+
| 90% DRAM | 89.74% | -0.26pp | 0.64pp |
| 85% DRAM | 84.61% | -0.39pp | 0.60pp |
+-----------+--------+----------+---------+
In both regimes, convergence to target is quick, and the workload's
measured DRAM share then holds within 1.3 percentage points of
target with standard deviation under 1.3 percentage points, sustained
over runs of 15-30 minutes per target.
Hardware envelope: AMD EPYC dual-socket, CXL.mem on a separate NUMA
node, 32GB hot working set, two migrate_hot schemes with complementary
address filters, temporal quota tuner, 256-entry per-CPU report ring,
512 MiB per-scheme quota, 1s reset interval.
What's in this series
=====================
Patch 1. mm/damon/core: refcount ops owner module to prevent
rmmod UAF
Patch 2. mm/damon/paddr: export damon_pa_* ops for IBS module
Patch 3. mm/damon/core: replace mutex-protected report buffer
with per-CPU lockless ring
Patch 4. mm/damon/core: flat-array snapshot + bsearch in ring-
drain loop
Patch 5. mm/damon: add sysfs binding and dispatch hookup for
paddr_ibs operations
Patch 6. mm/damon/core: accept paddr_ibs in node_eligible_mem_bp
ops check
Patch 7. mm/damon/damon_ibs: add AMD IBS-based access sampling
backend
Patches 1, 3, and 4 are general infrastructure that benefits any
consumer of damon_report_access(). Patches 2, 5, 6, and 7 are the
worked-example backend (paddr_ibs ops, sysfs binding, IBS module).
Patches worth folding into damon/next
=====================================
Patches 1, 3, and 4 are not specific to IBS or to this RFC's
backend. Each is preparatory infrastructure that any consumer of
damon_report_access() will need:
- Patch 1 (refcount ops owner) -- any modular ops set, including
out-of-tree backends, needs clean module unload to avoid UAF
on damon_unregister_ops.
- Patch 3 (per-CPU lockless ring) -- damon_report_access() cannot
be called from NMI context with the current mutex-protected
buffer. Hardware samplers all need NMI-safe submission.
- Patch 4 (flat-array snapshot + bsearch drain) -- the linear-
scan drain is O(reports x regions) and exceeds the sample
interval at high-CPU x large-region products. Bsearch brings
it to O(reports x log regions).
If these belong directly on damon/next as preparatory patches for
damon_report_access() rather than living inside an IBS-specific
track, we are happy to rebase and resend them that way.
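To make the patch 4 complexity argument concrete, here is a minimal
user-space sketch of the lookup change. It is not the code from the
patch. The struct and function names (region_snap, addr_vs_region,
account_report) are hypothetical, and it assumes the snapshot holds
sorted, non-overlapping half-open ranges so that the C library's
bsearch() applies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical flat snapshot entry: one region as a half-open range. */
struct region_snap {
	unsigned long start;		/* inclusive */
	unsigned long end;		/* exclusive */
	unsigned int nr_accesses;	/* reports credited to this region */
};

/* bsearch() comparator: key is a reported address, elem is a region. */
static int addr_vs_region(const void *key, const void *elem)
{
	unsigned long addr = *(const unsigned long *)key;
	const struct region_snap *r = elem;

	if (addr < r->start)
		return -1;
	if (addr >= r->end)
		return 1;
	return 0;	/* addr falls inside [start, end) */
}

/*
 * Credit one report to the region covering addr.
 * Costs O(log nr) per report instead of a linear scan's O(nr).
 * Returns 1 if a region was found, 0 if the address is in a gap.
 */
static int account_report(struct region_snap *snap, size_t nr,
			  unsigned long addr)
{
	struct region_snap *r;

	assert(snap != NULL || nr == 0);
	r = bsearch(&addr, snap, nr, sizeof(*snap), addr_vs_region);
	if (!r)
		return 0;
	r->nr_accesses++;
	return 1;
}
```

Calling account_report() once per drained report gives the
O(reports x log regions) bound; reports that land in a gap between
regions are simply not credited.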
Relation to prior and ongoing work
==================================
The IBS sampling pattern in patch 7 -- attr.config=0 to use IBS Op
default config, dc_phy_addr_valid filter, NMI-safe sample submission
-- is derived from concepts in Bharata B Rao's pghot RFC v5 [3].
The attribution header is in mm/damon/damon_ibs.c and the patch
carries a Suggested-by: trailer.
Bharata's pghot v7 [4] introduces a different IBS driver targeting
the new IBS Memory Profiler (IBS-MProf) facility, which Bharata
describes as a facility "that will be present in future AMD
processors" -- a separate IBS instance from the one this RFC's
backend uses. The driver in this RFC, based on v5 [3], is an
example of how DAMON can benefit from the AMD IBS hardware source,
and it independently validates the importance of IBS information.
It is not meant to be merged in its current form.
@Bharata, if you see a path where IBS samples can be consumed
by DAMON at some point, we will be happy to collaborate.
Akinobu Mita's perf-event-based access-check RFC [5] explores a
configurable perf-event-driven access source for DAMON. IBS has
vendor-specific MSR setup beyond what perf_event_attr alone
expresses (e.g. dc_phy_addr_valid filtering on the produced sample,
not on the perf attr), so the IBS path here appears complementary
to [5] -- operators choose based on whether their hardware sampler
fits stock perf or needs additional kernel-side setup.
Specific asks
=============
To SeongJae:
1. Patches 1, 3, and 4 are infrastructure that benefits any consumer
of damon_report_access(), not just the IBS backend in this RFC.
Would these belong directly on damon/next as preparatory patches
for damon_report_access(), rather than living inside an
IBS-specific track? Happy to rebase and resend them that way if
you'd prefer that shape. Tested-by: tags can come along.
Future work
===========
- Longer-duration stability and broader workload coverage.
Test branch
===========
A single fetch reproduces the cover-letter measurements on top of
both this RFC and the companion DAMOS quota controller and paddr
migration walk fixes posted separately at [6]:
git fetch https://github.com/ravis-opensrc/linux.git \
damon/hw-hotness-rfc-v1-testing
The companion fixes are not required for this RFC to function, but
the closed-loop measurements above were collected on the testing
branch which has both applied. The standalone series-only branches
are also available:
git fetch https://github.com/ravis-opensrc/linux.git \
damon/hw-hotness-rfc-v1
git fetch https://github.com/ravis-opensrc/linux.git \
damon/closed-loop-fixes-v1
Links
=====
[1] [RFC PATCH v3 00/37] mm/damon: introduce per-CPUs/threads/
write/read monitoring (SeongJae Park)
https://lore.kernel.org/linux-mm/20251208062943.68824-1-sj@kernel.org/
Patch 01 introduces damon_report_access(), the consumer API
this RFC builds on.
[2] mm/damon: add node_eligible_mem_bp goal metric
https://lore.kernel.org/linux-mm/20260428030520.701-1-ravis.opensrc@gmail.com/
[3] [RFC PATCH v5 00/10] mm: Hot page tracking and promotion
infrastructure (Bharata B Rao)
https://lore.kernel.org/linux-mm/20260129144043.231636-1-bharata@amd.com/
[4] [PATCH v7 0/7] mm: Hot page tracking and promotion
infrastructure (Bharata B Rao)
https://lore.kernel.org/linux-mm/20260504060924.344313-1-bharata@amd.com/
[5] [RFC PATCH v3 0/4] mm/damon: introduce perf event based access
check (Akinobu Mita)
https://lore.kernel.org/linux-mm/20260423004211.7037-1-akinobu.mita@gmail.com/
[6] [PATCH 0/5] mm/damon: DAMOS quota controller and paddr
migration walk fixes (Ravi Jonnalagadda)
https://lore.kernel.org/linux-mm/20260516210357.2247-1-ravis.opensrc@gmail.com/
Ravi Jonnalagadda (7):
mm/damon/core: refcount ops owner module to prevent rmmod UAF
mm/damon/paddr: export damon_pa_* ops for IBS module
mm/damon/core: replace mutex-protected report buffer with per-CPU
lockless ring
mm/damon/core: flat-array snapshot + bsearch in ring-drain loop
mm/damon: add sysfs binding and dispatch hookup for paddr_ibs
operations
mm/damon/core: accept paddr_ibs in node_eligible_mem_bp ops check
mm/damon/damon_ibs: add AMD IBS-based access sampling backend
include/linux/damon.h | 13 ++
mm/damon/Kconfig | 10 +
mm/damon/Makefile | 1 +
mm/damon/core.c | 341 +++++++++++++++++++++++++++------
mm/damon/damon_ibs.c | 369 ++++++++++++++++++++++++++++++++++++
mm/damon/ops-common.h | 13 ++
mm/damon/paddr.c | 15 +-
mm/damon/sysfs.c | 12 +-
mm/damon/tests/core-kunit.h | 2 +-
9 files changed, 707 insertions(+), 69 deletions(-)
create mode 100644 mm/damon/damon_ibs.c
base-commit: 606bfbf72120df4f406ef46971d48053706f6f75
--
2.43.0
+ Akinobu

Hello Ravi,

On Sat, 16 May 2026 15:34:25 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:

> Hi all,
>
> This is an RFC, not for merge. The series exercises and validates
> damon_report_access() -- the consumer API SeongJae introduced in [1]
> -- as a substrate for ingesting access reports from hardware-sampling
> sources. The series includes one worked-example backend, an AMD IBS
> Op module (damon_ibs.ko), that runs on Zen 3+ silicon via the
> existing perf event subsystem.

Thank you for sharing this great RFC series!

[...]
> Why a hardware-source primitive complements existing primitives
> ===============================================================
[...]
> Both primitives produce a view of hotness that converges to the
> true distribution over the aggregation interval. For systems where
> the address space is small relative to the aggregation rate, this is
> the right tool. On large heterogeneous-memory systems with goal-
> driven schemes asking the closed-loop tuner to converge on a target
> distribution, a complementary lower-latency view of accesses can
> tighten the loop -- reducing the time DAMON's nr_accesses takes to
> reflect the workload's actual access distribution, which in turn
> reduces ramp duration and oscillation amplitude during convergence
> of goal-driven schemes.
>
> A hardware-sampling primitive provides this complementary view:
> hardware retirement records each access at its natural event rate,
> with a physical address per sample, independent of TLB state and
> independent of the unmap/fault path.

Yes, I fully agree. Different multiple access check primitives have different
characteristics.

[...]

> Demonstration
> =============
[...]
> In both regimes, convergence to target is quick, and the workload's
> measured DRAM share then holds within 1.3 percentage points of
> target with standard deviation under 1.3 percentage points, sustained
> over runs of 15-30 minutes per target.

I understand this demonstration shows your AMD IBS-based version of DAMON is
functioning as expected. Thank you for sharing this!

[...]
> What's in this series
> =====================
>
> Patch 1. mm/damon/core: refcount ops owner module to prevent
> rmmod UAF
> Patch 2. mm/damon/paddr: export damon_pa_* ops for IBS module
> Patch 3. mm/damon/core: replace mutex-protected report buffer
> with per-CPU lockless ring
> Patch 4. mm/damon/core: flat-array snapshot + bsearch in ring-
> drain loop
> Patch 5. mm/damon: add sysfs binding and dispatch hookup for
> paddr_ibs operations
> Patch 6. mm/damon/core: accept paddr_ibs in node_eligible_mem_bp
> ops check
> Patch 7. mm/damon/damon_ibs: add AMD IBS-based access sampling
> backend
>
> Patches 1, 3, and 4 are general infrastructure that benefits any
> consumer of damon_report_access(). Patches 2, 5, 6, and 7 are the
> worked-example backend (paddr_ibs ops, sysfs binding, IBS module).

I didn't read the detailed code of each patch. But my high level understanding
is as below.

Patches 1 and 2 are needed for supporting loadable module-based DAMON operation
sets (access sampling backend).

Patch 3 is needed for supporting access check primitives that can provide the
access information in only nmi context. It can also speedup the access
reporting in general, though.

Patch 4 makes DAMON's internal reported access information retrieval faster, so
will help any reporting-based DAMON operation set use case.

Patches 5-7 are required for only the IBS-based DAMON operations set
(paddr_ibs).

So I agree patch 4 is a general infrastructure improvement that benefits
multiple use cases.

Patch 3 is also arguably general infrastructure improvement, as it will make
the reporting faster in general.

Patch 1 is not technically coupled with paddr_ibs, and will be needed for
general loadable module based access check primitives. But, should we support
loadable modules? If so, why?

Patch 2 is also not technically coupled with paddr_ibs, to my understanding, so
should be categorized together with patch 1? In other words, if we agree we
should support loadable modules based DAMON operation sets, this should be
useful for not only paddr_ibs but more general cases.

Correct me if I'm wrong.

>
>
> Patches worth folding into damon/next
> =====================================
>
> Patches 1, 3, and 4 are not specific to IBS or to this RFC's
> backend. Each is preparatory infrastructure that any consumer of
> damon_report_access() will need:
>
> - Patch 1 (refcount ops owner) -- any modular ops set, including
> out-of-tree backends, needs clean module unload to avoid UAF
> on damon_unregister_ops.
> - Patch 3 (per-CPU lockless ring) -- damon_report_access() cannot
> be called from NMI context with the current mutex-protected
> buffer. Hardware samplers all need NMI-safe submission.
> - Patch 4 (flat-array snapshot + bsearch drain) -- the linear-
> scan drain is O(reports x regions) and exceeds the sample
> interval at high-CPU x large-region products. Bsearch brings
> it to O(reports x log regions).
>
> If these belong directly on damon/next as preparatory patches for
> damon_report_access() rather than living inside an IBS-specific
> track, we are happy to rebase and resend them that way.

So I'm a bit unsure about patch 1. If we don't have a plan to support loadable
modules based DAMON operations set, we might not need it for now.

For patches 3 and 4, I agree those will be useful in general. Nonetheless, I'd
slightly prefer to do that optimizations at the later part of the long term
project.

>
>
> Relation to prior and ongoing work
> ==================================
>
> The IBS sampling pattern in patch 7 -- attr.config=0 to use IBS Op
> default config, dc_phy_addr_valid filter, NMI-safe sample submission
> -- is derived from concepts in Bharata B Rao's pghot RFC v5 [3].
> The attribution header is in mm/damon/damon_ibs.c and the patch
> carries a Suggested-by: trailer.
>
> Bharata's pghot v7 [4] introduces a different IBS driver targeting
> the new IBS Memory Profiler (IBS-MProf) facility, which Bharata
> describes as a facility "that will be present in future AMD
> processors" -- a separate IBS instance from the one this RFC's
> backend uses. This version of driver based out of v5 [3] is an
> example of how DAMON can be benefited from AMD IBS Hardware
> source and validates importance of IBS information independently.
> It is not meant to be merged in the current form.
> @Bharata if you see a path where IBS samples can be consumed
> by DAMON at some point, will be happy to collaborate.
>
> Akinobu Mita's perf-event-based access-check RFC [5] explores a
> configurable perf-event-driven access source for DAMON. IBS has
> vendor-specific MSR setup beyond what perf_event_attr alone
> expresses (e.g. dc_phy_addr_valid filtering on the produced sample,
> not on the perf attr), so the IBS path here appears complementary
> to [5] -- operators choose based on whether their hardware sampler
> fits stock perf or needs additional kernel-side setup.

So apparently there are multiple approaches to develop and use h/w-based access
monitoring. Akinobu and you are trying to do that using DAMON as the frontend,
and already made the working prototypes. There were more people who showed
interest and will to contribute to this project other than you, too. I 100%
agree h/w-based access monitoring can be useful, and I of course think using
DAMON as the frontend is the right approach. I'm all for making this
upstreamed.

I was therefore spending time on thinking about in what long-term maintainable
shape this capability can successfully be upstreamed. I suggested
damon_report_access() as the internal interface between DAMON and the h/w-based
access check primitives, and apparently we all (I, Ravi and Akinobu in this
context) agreed. Akinobu thankfully revisioned his implementation based on
damon_report_access() interface. Ravi also implemented this RFC based on the
interface.

After making the consensus with Akinobu, I was taking time on the user space
interface. When I was discussing with Akinobu, my idea was extending the user
interface for the page faults based monitoring v3 [1]. But, recently I decided
to make this more general, so proposed data attributes monitoring extension [2]
at LSFMMBPF. The patch series for the initial change [3] is merged into mm-new
for more testing, today. The cover letter of the patch series is also sharing
how it will be extended for h/w based access monitoring in long term.

I of course want us to go in this direction. I believe you already had chances
to take a look on the long term plan and didn't make some voice because you
don't strongly disagree about the plan. If not, please make a voice.

Assuming you don't have concern on the long term plan yet, I will take time to
write down more formal and detailed plan. It will explain the overall roadmap,
timeline and how we could collaborate. On top of that, we could further
discuss.

>
>
> Specific asks
> =============
>
> To SeongJae:
>
> 1. Patches 1, 3, and 4 are infrastructure that benefits any consumer
> of damon_report_access(), not just the IBS backend in this RFC.
> Would these belong directly on damon/next as preparatory patches
> for damon_report_access(), rather than living inside an
> IBS-specific track? Happy to rebase and resend them that way if
> you'd prefer that shape. Tested-by: tags can come along.

I'm still thinking about how we can collaborate well. The answer for the above
question would be a part of that. In other words, I have no good answer right
now, sorry. Could you please give me more time to think more and share the
plan? I will share the plan as another mail. On the thread, we could further
discuss. Of course, we could have DAMON beer/coffee/tea chats [4] like
additional discussions before/after/during the plan discussion.

So, long story short, we agreed this project (h/w-based data access monitoring)
should be upstreamed. But give me little more time on thinking about how we
will do it and collaborate. It will take some time. Please bear in mind.
Sorry for making you wait, but I'm pretty sure and promise that we will
eventually make it.

[1] https://lore.kernel.org/20251208062943.68824-1-sj@kernel.org
[2] https://lwn.net/Articles/1071256/
[3] https://lore.kernel.org/20260518234119.97569-1-sj@kernel.org
[4] https://docs.google.com/document/d/1v43Kcj3ly4CYqmAkMaZzLiM2GEnWfgdGbZAH3mi2vpM/edit?usp=sharing

Thanks,
SJ

[...]
On Mon, May 18, 2026 at 11:19 PM SeongJae Park <sj@kernel.org> wrote:
>
> + Akinobu
>
> Hello Ravi,
>
> On Sat, 16 May 2026 15:34:25 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:
>
> > Hi all,
> >
> > This is an RFC, not for merge. The series exercises and validates
> > damon_report_access() -- the consumer API SeongJae introduced in [1]
> > -- as a substrate for ingesting access reports from hardware-sampling
> > sources. The series includes one worked-example backend, an AMD IBS
> > Op module (damon_ibs.ko), that runs on Zen 3+ silicon via the
> > existing perf event subsystem.
>
> Thank you for sharing this great RFC series!
>
> [...]
> > Why a hardware-source primitive complements existing primitives
> > ===============================================================
> [...]
> > Both primitives produce a view of hotness that converges to the
> > true distribution over the aggregation interval. For systems where
> > the address space is small relative to the aggregation rate, this is
> > the right tool. On large heterogeneous-memory systems with goal-
> > driven schemes asking the closed-loop tuner to converge on a target
> > distribution, a complementary lower-latency view of accesses can
> > tighten the loop -- reducing the time DAMON's nr_accesses takes to
> > reflect the workload's actual access distribution, which in turn
> > reduces ramp duration and oscillation amplitude during convergence
> > of goal-driven schemes.
> >
> > A hardware-sampling primitive provides this complementary view:
> > hardware retirement records each access at its natural event rate,
> > with a physical address per sample, independent of TLB state and
> > independent of the unmap/fault path.
>
> Yes, I fully agree. Different multiple access check primitives have different
> characteristics.
>
> [...]
>
> > Demonstration
> > =============
> [...]
> > In both regimes, convergence to target is quick, and the workload's
> > measured DRAM share then holds within 1.3 percentage points of
> > target with standard deviation under 1.3 percentage points, sustained
> > over runs of 15-30 minutes per target.
>
> I understand this demonstration shows your AMD IBS-based version of DAMON is
> functioning as expected. Thank you for sharing this!
>
> [...]
> > What's in this series
> > =====================
> >
> > Patch 1. mm/damon/core: refcount ops owner module to prevent
> > rmmod UAF
> > Patch 2. mm/damon/paddr: export damon_pa_* ops for IBS module
> > Patch 3. mm/damon/core: replace mutex-protected report buffer
> > with per-CPU lockless ring
> > Patch 4. mm/damon/core: flat-array snapshot + bsearch in ring-
> > drain loop
> > Patch 5. mm/damon: add sysfs binding and dispatch hookup for
> > paddr_ibs operations
> > Patch 6. mm/damon/core: accept paddr_ibs in node_eligible_mem_bp
> > ops check
> > Patch 7. mm/damon/damon_ibs: add AMD IBS-based access sampling
> > backend
> >
> > Patches 1, 3, and 4 are general infrastructure that benefits any
> > consumer of damon_report_access(). Patches 2, 5, 6, and 7 are the
> > worked-example backend (paddr_ibs ops, sysfs binding, IBS module).
>
> I didn't read the detailed code of each patch. But my high level understanding
> is as below.
>
> Patches 1 and 2 are needed for supporting loadable module-based DAMON operation
> sets (access sampling backend).
>
> Patch 3 is needed for supporting access check primitives that can provide the
> access information in only nmi context. It can also speedup the access
> reporting in general, though.
>
> Patch 4 makes DAMON's internal reported access information retrieval faster, so
> will help any reporting-based DAMON operation set use case.
>
> Patches 5-7 are required for only the IBS-based DAMON operations set
> (paddr_ibs).
>
> So I agree patch 4 is a general infrastructure improvement that benefits
> multiple use cases.
>
> Patch 3 is also arguably general infrastructure improvement, as it will make
> the reporting faster in general.
>
> Patch 1 is not technically coupled with paddr_ibs, and will be needed for
> general loadable module based access check primitives. But, should we support
> loadable modules? If so, why?
>
> Patch 2 is also not technically coupled with paddr_ibs, to my understanding, so
> should be categorized together with patch 1? In other words, if we agree we
> should support loadable modules based DAMON operation sets, this should be
> useful for not only paddr_ibs but more general cases.
>
> Correct me if I'm wrong.
>
> >
> >
> > Patches worth folding into damon/next
> > =====================================
> >
> > Patches 1, 3, and 4 are not specific to IBS or to this RFC's
> > backend. Each is preparatory infrastructure that any consumer of
> > damon_report_access() will need:
> >
> > - Patch 1 (refcount ops owner) -- any modular ops set, including
> > out-of-tree backends, needs clean module unload to avoid UAF
> > on damon_unregister_ops.
> > - Patch 3 (per-CPU lockless ring) -- damon_report_access() cannot
> > be called from NMI context with the current mutex-protected
> > buffer. Hardware samplers all need NMI-safe submission.
> > - Patch 4 (flat-array snapshot + bsearch drain) -- the linear-
> > scan drain is O(reports x regions) and exceeds the sample
> > interval at high-CPU x large-region products. Bsearch brings
> > it to O(reports x log regions).
> >
> > If these belong directly on damon/next as preparatory patches for
> > damon_report_access() rather than living inside an IBS-specific
> > track, we are happy to rebase and resend them that way.
>
> So I'm a bit unsure about patch 1. If we don't have a plan to support loadable
> modules based DAMON operations set, we might not need it for now.
>
> For patches 3 and 4, I agree those will be useful in general. Nonetheless, I'd
> slightly prefer to do that optimizations at the later part of the long term
> project.
>
> >
> >
> > Relation to prior and ongoing work
> > ==================================
> >
> > The IBS sampling pattern in patch 7 -- attr.config=0 to use IBS Op
> > default config, dc_phy_addr_valid filter, NMI-safe sample submission
> > -- is derived from concepts in Bharata B Rao's pghot RFC v5 [3].
> > The attribution header is in mm/damon/damon_ibs.c and the patch
> > carries a Suggested-by: trailer.
> >
> > Bharata's pghot v7 [4] introduces a different IBS driver targeting
> > the new IBS Memory Profiler (IBS-MProf) facility, which Bharata
> > describes as a facility "that will be present in future AMD
> > processors" -- a separate IBS instance from the one this RFC's
> > backend uses. This version of driver based out of v5 [3] is an
> > example of how DAMON can be benefited from AMD IBS Hardware
> > source and validates importance of IBS information independently.
> > It is not meant to be merged in the current form.
> > @Bharata if you see a path where IBS samples can be consumed
> > by DAMON at some point, will be happy to collaborate.
> >
> > Akinobu Mita's perf-event-based access-check RFC [5] explores a
> > configurable perf-event-driven access source for DAMON. IBS has
> > vendor-specific MSR setup beyond what perf_event_attr alone
> > expresses (e.g. dc_phy_addr_valid filtering on the produced sample,
> > not on the perf attr), so the IBS path here appears complementary
> > to [5] -- operators choose based on whether their hardware sampler
> > fits stock perf or needs additional kernel-side setup.
>
> So apparently there are multiple approaches to develop and use h/w-based access
> monitoring. Akinobu and you are trying to do that using DAMON as the frontend,
> and already made the working prototypes. There were more people who showed
> interest and will to contribute to this project other than you, too. I 100%
> agree h/w-based access monitoring can be useful, and I of course think using
> DAMON as the frontend is the right approach. I'm all for making this
> upstreamed.
>
> I was therefore spending time on thinking about in what long-term maintainable
> shape this capability can successfully be upstreamed. I suggested
> damon_report_access() as the internal interface between DAMON and the h/w-based
> access check primitives, and apparently we all (I, Ravi and Akinobu in this
> context) agreed. Akinobu thankfully revisioned his implementation based on
> damon_report_access() interface. Ravi also implemented this RFC based on the
> interface.
>
> After making the consensus with Akinobu, I was taking time on the user space
> interface. When I was discussing with Akinobu, my idea was extending the user
> interface for the page faults based monitoring v3 [1]. But, recently I decided
> to make this more general, so proposed data attributes monitoring extension [2]
> at LSFMMBPF. The patch series for the initial change [3] is merged into mm-new
> for more testing, today. The cover letter of the patch series is also sharing
> how it will be extended for h/w based access monitoring in long term.
>
> I of course want us to go in this direction. I believe you already had chances
> to take a look on the long term plan and didn't make some voice because you
> don't strongly disagree about the plan. If not, please make a voice.
>
Hi SJ,
One layering question I'd like to flag before the plan is written,
since it affects how this RFC's substrate slots in:
In [3], .apply_probes is a periodic per-region classifier driven
from kdamond_fn after .check_accesses, in process context, that
applies a (folio -> bool) predicate to each region's sampling_addr
and accounts the results in r->probe_hits[]. damon_report_access()
on the other hand is a per-event delivery callback into a per-CPU
buffer, called from the access source (NMI for IBS / PEBS / SPE,
process context for page-fault-based sources). These appear to
me to sit at different layers - delivery vs. classification.

The reason I want to confirm this: NMI context for HW samplers
precludes the operations .apply_probes can do today (no mutex, no
kmalloc, no sleep, no folio lookup that touches pte_lock). And
the data shape is inverted - .apply_probes asks "does region R's
sampling_addr have attribute A?", evaluated on the kdamond-chosen
address; an HW sample announces "PA Y was accessed at retirement
time T", arriving asynchronously and needing to find the region
it falls into. If access events end up routed through
.apply_probes in the long-term plan, the IBS / PEBS / SPE
backends would each need a deferral path under it (per-CPU ring
for NMI-safe submission, region mapping at drain time).
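
To make the deferral path concrete, here is a minimal userspace sketch
of a single-producer / single-consumer ring of the kind described
above. It is not the code from patch 3: the names report_ring,
ring_submit() and ring_drain(), the ring size, and the report layout
are all hypothetical, and C11 atomics stand in for whatever the kernel
implementation actually uses. The producer side takes no lock, does
not allocate, and drops the sample when the ring is full.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8u	/* hypothetical size; must be a power of two */
static_assert((RING_SIZE & (RING_SIZE - 1)) == 0, "RING_SIZE must be 2^n");

struct access_report {		/* hypothetical report layout */
	uint64_t paddr;
};

struct report_ring {		/* one instance per CPU in this sketch */
	_Atomic uint32_t head;	/* advanced only by the producer */
	_Atomic uint32_t tail;	/* advanced only by the consumer */
	struct access_report slots[RING_SIZE];
};

/* Producer (sampler side): lock-free, no allocation, never waits. */
static bool ring_submit(struct report_ring *r, uint64_t paddr)
{
	uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head - tail == RING_SIZE)
		return false;	/* full: drop the sample instead of blocking */
	r->slots[head & (RING_SIZE - 1)].paddr = paddr;
	/* Publish the slot only after its contents are written. */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

/* Consumer (drain side): copy out up to max reports, return the count. */
static size_t ring_drain(struct report_ring *r, struct access_report *out,
			 size_t max)
{
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
	size_t n = 0;

	while (tail != head && n < max)
		out[n++] = r->slots[tail++ & (RING_SIZE - 1)];
	/* Hand the consumed slots back to the producer. */
	atomic_store_explicit(&r->tail, tail, memory_order_release);
	return n;
}
```

In this shape the sampler only ever calls ring_submit(), and the
region mapping happens later, in whoever calls ring_drain() from
process context.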

Happy to be wrong here if you see a unified shape that handles
both - just want to surface the constraint before the plan is
written.

On the loadable-module question for patches 1 and 2: agreed it's a
genuinely open architectural call, not just a paddr_ibs convenience.

- paddr_ibs (this RFC) targets the existing IBS Op facility on
  Zen 3+ silicon via the perf event subsystem and uses a
  vendor-specific overflow-handler filter that perf_event_attr
  cannot express (dc_phy_addr_valid in IBS_OP_DATA3). Bharata's
  pghot v7 [pghot-v7] introduces a separate IBS driver targeting
  the new IBS-MProf facility on future AMD silicon via direct MSR
  programming - not perf at all. These are two AMD-specific HW
  samplers with non-overlapping silicon coverage and
  non-overlapping kernel paths. A distro shipping a single kernel
  image to a fleet with mixed silicon needs runtime-selectable
  backends, which obj=y can't do across exclusive `depends on`
  chains.
- Akinobu's perf-event RFC v3 [akinobu-v3] is a useful contrast:
  it stays builtin because it's a generic configurable
  perf_event_attr passthrough, no vendor-specific code in the
  overflow handler. The tristate case is specifically for the
  backends that need vendor logic outside perf_event_attr
  (IBS dc_phy_addr_valid, future ARM SPE record-format
  handling, future Intel PEBS DLA quirks if they need
  kernel-side filtering beyond what perf delivers).
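
As a rough illustration of the kind of vendor logic meant above, a
per-sample filter on dc_phy_addr_valid could look like the userspace
sketch below. This is not the handler from patch 7. The struct, the
function names, and the way the raw register values reach the handler
are hypothetical, and the bit position used for dc_phy_addr_valid
(bit 18 of IBS_OP_DATA3) is an assumption that should be checked
against the AMD documentation for the target silicon.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed position of dc_phy_addr_valid in IBS_OP_DATA3; verify per CPU. */
#define IBS_OP_DATA3_DC_PHY_ADDR_VALID	(1ULL << 18)

struct ibs_op_sample {		/* hypothetical: raw values for one sample */
	uint64_t op_data3;	/* contents of IBS_OP_DATA3 */
	uint64_t dc_phys_addr;	/* physical address reported by the sample */
};

/* A sample is usable only when hardware marked its physical address valid. */
static bool ibs_sample_usable(const struct ibs_op_sample *s)
{
	return (s->op_data3 & IBS_OP_DATA3_DC_PHY_ADDR_VALID) != 0;
}

/*
 * Stand-in for the overflow handler: filter first, then hand the address
 * on. Returns true and fills *paddr_out only for usable samples.
 */
static bool ibs_handle_sample(const struct ibs_op_sample *s,
			      uint64_t *paddr_out)
{
	assert(s != NULL && paddr_out != NULL);
	if (!ibs_sample_usable(s))
		return false;	/* no valid PA: nothing to report */
	*paddr_out = s->dc_phys_addr;
	return true;
}
```

The point of the sketch is only that the decision is taken per
produced sample, after the event fires, rather than through a field
of perf_event_attr at event creation time.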

Bharata, I would value your perspective on two related questions: in
your long-term plan for pghot, do you see the legacy IBS Op path
(this RFC) staying as a DAMON-side backend, while the new IBS-MProf
path lands under pghot? Or do you envision both IBS facilities
eventually feeding through a common HW-sampler primitive (pghot or
DAMON), with frontend selectable by user config? And on existing
Zen 3+ silicon: is the legacy IBS Op driver in this RFC the right
home for those processors going forward?

Thanks,
Ravi

> Assuming you don't have concerns about the long-term plan yet, I will take time
> to write down a more formal and detailed plan. It will explain the overall
> roadmap, timeline and how we could collaborate. On top of that, we could
> discuss further.
>
> >
> >
> > Specific asks
> > =============
> >
> > To SeongJae:
> >
> > 1. Patches 1, 3, and 4 are infrastructure that benefits any consumer
> > of damon_report_access(), not just the IBS backend in this RFC.
> > Would these belong directly on damon/next as preparatory patches
> > for damon_report_access(), rather than living inside an
> > IBS-specific track? Happy to rebase and resend them that way if
> > you'd prefer that shape. Tested-by: tags can come along.
>
> I'm still thinking about how we can collaborate well. The answer to the above
> question would be a part of that. In other words, I have no good answer right
> now, sorry. Could you please give me more time to think more and share the
> plan? I will share the plan in another mail. On that thread, we could discuss
> further. Of course, we could also have additional discussions in DAMON
> beer/coffee/tea chats [4] before/after/during the plan discussion.
>
> So, long story short, we agreed this project (h/w-based data access monitoring)
> should be upstreamed. But give me a little more time to think about how we
> will do it and collaborate. It will take some time. Please bear with me.
> Sorry for making you wait, but I'm pretty sure and promise that we will
> eventually make it.
>
> [1] https://lore.kernel.org/20251208062943.68824-1-sj@kernel.org
> [2] https://lwn.net/Articles/1071256/
> [3] https://lore.kernel.org/20260518234119.97569-1-sj@kernel.org
> [4] https://docs.google.com/document/d/1v43Kcj3ly4CYqmAkMaZzLiM2GEnWfgdGbZAH3mi2vpM/edit?usp=sharing
>
>
> Thanks,
> SJ
>
> [...]
On Wed, 20 May 2026 12:01:43 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:

> On Mon, May 18, 2026 at 11:19 PM SeongJae Park <sj@kernel.org> wrote:
> >
> > + Akinobu
> >
> > Hello Ravi,
> >
> > On Sat, 16 May 2026 15:34:25 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > This is an RFC, not for merge. The series exercises and validates
> > > damon_report_access() -- the consumer API SeongJae introduced in [1]
> > > -- as a substrate for ingesting access reports from hardware-sampling
> > > sources. The series includes one worked-example backend, an AMD IBS
> > > Op module (damon_ibs.ko), that runs on Zen 3+ silicon via the
> > > existing perf event subsystem.
> >
> > Thank you for sharing this great RFC series!
> >
> > [...]
> > > Why a hardware-source primitive complements existing primitives
> > > ===============================================================
> > [...]
> > > Both primitives produce a view of hotness that converges to the
> > > true distribution over the aggregation interval. For systems where
> > > the address space is small relative to the aggregation rate, this is
> > > the right tool. On large heterogeneous-memory systems with goal-
> > > driven schemes asking the closed-loop tuner to converge on a target
> > > distribution, a complementary lower-latency view of accesses can
> > > tighten the loop -- reducing the time DAMON's nr_accesses takes to
> > > reflect the workload's actual access distribution, which in turn
> > > reduces ramp duration and oscillation amplitude during convergence
> > > of goal-driven schemes.
> > >
> > > A hardware-sampling primitive provides this complementary view:
> > > hardware retirement records each access at its natural event rate,
> > > with a physical address per sample, independent of TLB state and
> > > independent of the unmap/fault path.
> >
> > Yes, I fully agree. Different multiple access check primitives have different
> > characteristics.
> >
> > [...]
> >
> > > Demonstration
> > > =============
> > [...]
> > > In both regimes, convergence to target is quick, and the workload's
> > > measured DRAM share then holds within 1.3 percentage points of
> > > target with standard deviation under 1.3 percentage points, sustained
> > > over runs of 15-30 minutes per target.
> >
> > I understand this demonstration shows your AMD IBS-based version of DAMON is
> > functioning as expected. Thank you for sharing this!
> >
> > [...]
> > > What's in this series
> > > =====================
> > >
> > > Patch 1. mm/damon/core: refcount ops owner module to prevent
> > >          rmmod UAF
> > > Patch 2. mm/damon/paddr: export damon_pa_* ops for IBS module
> > > Patch 3. mm/damon/core: replace mutex-protected report buffer
> > >          with per-CPU lockless ring
> > > Patch 4. mm/damon/core: flat-array snapshot + bsearch in ring-
> > >          drain loop
> > > Patch 5. mm/damon: add sysfs binding and dispatch hookup for
> > >          paddr_ibs operations
> > > Patch 6. mm/damon/core: accept paddr_ibs in node_eligible_mem_bp
> > >          ops check
> > > Patch 7. mm/damon/damon_ibs: add AMD IBS-based access sampling
> > >          backend
> > >
> > > Patches 1, 3, and 4 are general infrastructure that benefits any
> > > consumer of damon_report_access(). Patches 2, 5, 6, and 7 are the
> > > worked-example backend (paddr_ibs ops, sysfs binding, IBS module).
> >
> > I didn't read the detailed code of each patch. But my high level understanding
> > is as below.
> >
> > Patches 1 and 2 are needed for supporting loadable module-based DAMON operation
> > sets (access sampling backend).
> >
> > Patch 3 is needed for supporting access check primitives that can provide the
> > access information in only nmi context. It can also speedup the access
> > reporting in general, though.
> >
> > Patch 4 makes DAMON's internal reported access information retrieval faster, so
> > will help any reporting-based DAMON operation set use case.
> >
> > Patches 5-7 are required for only the IBS-based DAMON operations set
> > (paddr_ibs).
> >
> > So I agree patch 4 is a general infrastructure improvement that benefits
> > multiple use cases.
> >
> > Patch 3 is also arguably general infrastructure improvement, as it will make
> > the reporting faster in general.
> >
> > Patch 1 is not technically coupled with paddr_ibs, and will be needed for
> > general loadable module based access check primitives. But, should we support
> > loadable modules? If so, why?
> >
> > Patch 2 is also not technically coupled with paddr_ibs, to my understanding, so
> > should be categorized together with patch 1? In other words, if we agree we
> > should support loadable modules based DAMON operation sets, this should be
> > useful for not only paddr_ibs but more general cases.
> >
> > Correct me if I'm wrong.
> >
> > >
> > >
> > > Patches worth folding into damon/next
> > > =====================================
> > >
> > > Patches 1, 3, and 4 are not specific to IBS or to this RFC's
> > > backend. Each is preparatory infrastructure that any consumer of
> > > damon_report_access() will need:
> > >
> > > - Patch 1 (refcount ops owner) -- any modular ops set, including
> > >   out-of-tree backends, needs clean module unload to avoid UAF
> > >   on damon_unregister_ops.
> > > - Patch 3 (per-CPU lockless ring) -- damon_report_access() cannot
> > >   be called from NMI context with the current mutex-protected
> > >   buffer. Hardware samplers all need NMI-safe submission.
> > > - Patch 4 (flat-array snapshot + bsearch drain) -- the linear-
> > >   scan drain is O(reports x regions) and exceeds the sample
> > >   interval at high-CPU x large-region products. Bsearch brings
> > >   it to O(reports x log regions).
> > >
> > > If these belong directly on damon/next as preparatory patches for
> > > damon_report_access() rather than living inside an IBS-specific
> > > track, we are happy to rebase and resend them that way.
> >
> > So I'm bit unsure about patch 1. If we don't have a plan to support loadable
> > modules based DAMON operations set, we might not need it for now.
> >
> > For patches 3 and 4, I agree those will be useful in general. Nonetheless, I'd
> > slightly prefer to do that optimizations at the later part of the long term
> > project.
> >
> > >
> > >
> > > Relation to prior and ongoing work
> > > ==================================
> > >
> > > The IBS sampling pattern in patch 7 -- attr.config=0 to use IBS Op
> > > default config, dc_phy_addr_valid filter, NMI-safe sample submission
> > > -- is derived from concepts in Bharata B Rao's pghot RFC v5 [3].
> > > The attribution header is in mm/damon/damon_ibs.c and the patch
> > > carries a Suggested-by: trailer.
> > >
> > > Bharata's pghot v7 [4] introduces a different IBS driver targeting
> > > the new IBS Memory Profiler (IBS-MProf) facility, which Bharata
> > > describes as a facility "that will be present in future AMD
> > > processors" -- a separate IBS instance from the one this RFC's
> > > backend uses. This version of driver based out of v5 [3] is an
> > > example of how DAMON can be benefited from AMD IBS Hardware
> > > source and validates importance of IBS information independently.
> > > It is not meant to be merged in the current form.
> > > @Bharata if you see a path where IBS samples can be consumed
> > > by DAMON at some point, will be happy to collaborate.
> > >
> > > Akinobu Mita's perf-event-based access-check RFC [5] explores a
> > > configurable perf-event-driven access source for DAMON. IBS has
> > > vendor-specific MSR setup beyond what perf_event_attr alone
> > > expresses (e.g. dc_phy_addr_valid filtering on the produced sample,
> > > not on the perf attr), so the IBS path here appears complementary
> > > to [5] -- operators choose based on whether their hardware sampler
> > > fits stock perf or needs additional kernel-side setup.
> >
> > So apparently there are multiple approaches to develop and use h/w-based access
> > monitoring. Akinobu and you are trying to do that using DAMON as the frontend,
> > and already made the working prototypes. There were more people who showed
> > interest and will to contribute to this project other than you, too. I 100%
> > agree h/w-based access monitoring can be useful, and I of course thinking using
> > DAMON as the frontend is the right approach. I'm all for making this
> > upstreamed.
> >
> > I was therefore spending time on thinking about in what long-term maintainable
> > shape this capability can successfully be upstreamed. I suggested
> > damon_report_access() as the internal interface between DAMON and the h/w-based
> > access check primitives, and apparently we all (I, Ravi and Akinobu in this
> > context) agreed. Akinobu thankfully revised his implementation based on the
> > damon_report_access() interface. Ravi also implemented this RFC based on the
> > interface.
> >
> > After reaching consensus with Akinobu, I was taking time on the user space
> > interface. When I was discussing with Akinobu, my idea was extending the user
> > interface for the page fault-based monitoring v3 [1]. But recently I decided
> > to make this more general, so I proposed the data attributes monitoring
> > extension [2] at LSFMMBPF. The patch series for the initial change [3] was
> > merged into mm-new for more testing today. The cover letter of the patch
> > series also shares how it will be extended for h/w-based access monitoring in
> > the long term.
> >
> > I of course want us to go in this direction. I believe you already had chances
> > to take a look at the long-term plan and didn't speak up because you
> > don't strongly disagree with the plan. If not, please speak up.
> >
>
> Hi SJ,
>
> One layering question I'd like to flag before the plan is written,
> since it affects how this RFC's substrate slots in:

To my understanding, this RFC reuses the damon_report_access() infrastructure
that is shared with the per-CPUs/threads/writes/reads monitoring series [1].
My plan at the moment is to keep using it. So from a high-level view, I think
the final picture would not be really different from this RFC.

>
> In [3], .apply_probes is a periodic per-region classifier driven
> from kdamond_fn after .check_accesses, in process context, that
> applies a (folio -> bool) predicate to each region's sampling_addr
> and accounts the results in r->probe_hits[]. damon_report_access()
> on the other hand is a per-event delivery callback into a per-CPU
> buffer, called from the access source (NMI for IBS / PEBS / SPE,
> process context for page-fault-based sources). These appear to
> me to sit at different layers - delivery vs. classification.
>
> The reason I want to confirm this: NMI context for HW samplers
> precludes the operations .apply_probes can do today (no mutex, no
> kmalloc, no sleep, no folio lookup that touches pte_lock). And
> the data shape is inverted - .apply_probes asks "does region R's
> sampling_addr have attribute A?", evaluated on the kdamond-chosen
> address; an HW sample announces "PA Y was accessed at retirement
> time T", arriving asynchronously and needing to find the region
> it falls into. If access events end up routed through
> .apply_probes in the long-term plan, the IBS / PEBS / SPE
> backends would each need a deferral path under it (per-CPU ring
> for NMI-safe submission, region mapping at drain time).

It will not be routed through .apply_probes, but work in a way similar to the
damon_report_access() based design. That is, each (sampled) access event will
synchronously call damon_report_access() with the access information. The
information is stored in DAMON's internal data structure. The information will
contain the access destination address, the accessor CPU/thread, whether it
was a read or a write, etc., if available. Then kdamond will read the reports
in the data structure once per sampling interval and assess whether each
region was accessed since the last sampling interval.

So my plan is not to reuse .apply_probes, but in terms of who consumes the
information, it is not very different. The accessor will produce the
information (report), and kdamond will consume it. But this is how the
damon_report_access() based structure works, so my understanding is that your
RFC is also not very different. Am I missing something, or do you have any
concern about this structure?

>
> Happy to be wrong here if you see a unified shape that handles
> both - just want to surface the constraint before the plan is
> written.
>
> On the loadable-module question for patches 1 and 2: agreed it's a
> genuinely open architectural call, not just a paddr_ibs convenience.
>
> - paddr_ibs (this RFC) targets the existing IBS Op facility on
>   Zen 3+ silicon via the perf event subsystem and uses a
>   vendor-specific overflow-handler filter that perf_event_attr
>   cannot express (dc_phy_addr_valid in IBS_OP_DATA3). Bharata's
>   pghot v7 [pghot-v7] introduces a separate IBS driver targeting
>   the new IBS-MProf facility on future AMD silicon via direct MSR
>   programming - not perf at all. These are two AMD-specific HW
>   samplers with non-overlapping silicon coverage and
>   non-overlapping kernel paths. A distro shipping a single kernel
>   image to a fleet with mixed silicon needs runtime-selectable
>   backends, which obj=y can't do across exclusive `depends on`
>   chains.
> - Akinobu's perf-event RFC v3 [akinobu-v3] is a useful contrast:
>   it stays builtin because it's a generic configurable
>   perf_event_attr passthrough, no vendor-specific code in the
>   overflow handler. The tristate case is specifically for the
>   backends that need vendor logic outside perf_event_attr
>   (IBS dc_phy_addr_valid, future ARM SPE record-format
>   handling, future Intel PEBS DLA quirks if they need
>   kernel-side filtering beyond what perf delivers).

I'm still not familiar with IBS and perf events. Please bear with me.

My understanding is that there are vendor-specific knobs for IBS that perf
event is not supporting. So far, that makes sense. And are you saying that
you have to write paddr_ibs as a loadable module if you want to support the
vendor-specific knobs? If I'm understanding you correctly, could you further
share why it cannot be done as a builtin module?

[1] https://lore.kernel.org/20251208062943.68824-1-sj@kernel.org


Thanks,
SJ

[...]
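
For readers following along, the report-then-consume flow described in
this mail (reports carrying the destination address, accessor CPU and
read/write flag; a consumer that checks once per sampling interval
whether each region was accessed) can be sketched in userspace C as
below. This is only an illustration under stated assumptions, not
DAMON's implementation: the struct layouts and the names
consume_reports() and cmp_addr_region() are hypothetical, and the
binary search over a sorted, non-overlapping region array merely
mirrors the idea named in patch 4.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct access_report {		/* hypothetical report fields */
	uint64_t addr;		/* access destination address */
	int cpu;		/* accessor CPU, if available */
	bool is_write;		/* read or write, if available */
};

struct region {			/* hypothetical stand-in for a monitored region */
	uint64_t start;		/* inclusive */
	uint64_t end;		/* exclusive */
	bool accessed;		/* seen in the current sampling interval */
	unsigned int nr_accesses;
};

/* bsearch() comparator: key is an address, element is a region. */
static int cmp_addr_region(const void *key, const void *elem)
{
	uint64_t addr = *(const uint64_t *)key;
	const struct region *r = elem;

	if (addr < r->start)
		return -1;
	if (addr >= r->end)
		return 1;
	return 0;
}

/*
 * Consume one sampling interval's reports. regions must be sorted by
 * start and must not overlap. Returns how many reports hit a region.
 */
static size_t consume_reports(struct region *regions, size_t nr_regions,
			      const struct access_report *reports,
			      size_t nr_reports)
{
	size_t matched = 0;

	for (size_t i = 0; i < nr_reports; i++) {
		struct region *r = bsearch(&reports[i].addr, regions,
					   nr_regions, sizeof(*regions),
					   cmp_addr_region);
		if (!r)
			continue;	/* address outside all regions */
		r->accessed = true;
		matched++;
	}
	/* A region counts at most once per interval, however many reports. */
	for (size_t i = 0; i < nr_regions; i++) {
		assert(regions[i].start < regions[i].end);
		if (regions[i].accessed)
			regions[i].nr_accesses++;
		regions[i].accessed = false;
	}
	return matched;
}
```

Each lookup costs O(log regions), so draining a batch is
O(reports x log regions) rather than O(reports x regions).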