mm/damon: hardware-sampled access reports

[RFC PATCH 0/6] mm/damon: hardware-sampled access reports

Posted by Ravi Jonnalagadda 1 week, 2 days ago

This series introduces a vendor- and PMU-agnostic substrate inside DAMON
that consumes hardware-sampled access reports through the standard
perf-event interface.  Userspace selects the PMU through sysfs (raw
type/config knobs), driving either Intel PEBS L3-miss sampling or AMD
IBS Op sampling.

Why a unified perf-event substrate

An earlier hardware-sampled access-monitoring proposal [1] took the
path of an AMD IBS-specific module backend that owned its own probe
configuration, sysfs knobs, and lifecycle.

SeongJae Park has previously highlighted the advantage of Akinobu
Mita's perf-event proposal [2]: let DAMON register kernel-counter perf
events and consume samples from any sampling PMU that perf core knows
about.  This series builds on that direction with the changes we
needed to run it cross-vendor:

  - a per-CPU lockless ring between the NMI sample handler and the
    kdamond drain,
  - per-CPU events that follow CPU hotplug cleanly,
  - events fire only while the monitor is running -- created disabled,
    armed when kdamond starts, disarmed and drained when it stops,
  - all-or-nothing init across CPUs: a partial-CPU create failure rolls
    the whole event back rather than leaving silent gaps,
  - safe handling of vendor sample-validity flags so a stale or
    unpopulated address is never mistaken for a valid sample.
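
As a rough illustration of the first item above, a single-producer,
single-consumer ring can be written in userspace C11 as below.  This is
a sketch under assumptions, not the code in patch 4: the names
(`struct report`, `struct spsc_ring`, `ring_push`, `ring_drain`), the
ring size, and the drop-when-full policy are hypothetical, and kernel
code would use per-CPU data and kernel memory-ordering primitives
rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8u	/* hypothetical size; must be a power of two */
static_assert((RING_SIZE & (RING_SIZE - 1)) == 0, "RING_SIZE must be 2^n");

struct report {		/* hypothetical sample record */
	uint64_t addr;
};

struct spsc_ring {
	_Atomic uint32_t head;	/* written only by the producer */
	_Atomic uint32_t tail;	/* written only by the consumer */
	struct report slots[RING_SIZE];
};

/* Producer side: never blocks; drops the sample when the ring is full. */
static bool ring_push(struct spsc_ring *r, uint64_t addr)
{
	uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head - tail == RING_SIZE)
		return false;
	r->slots[head & (RING_SIZE - 1)].addr = addr;
	/* Publish the slot only after its contents are written. */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

/* Consumer side: copy up to max reports into out, return the count. */
static size_t ring_drain(struct spsc_ring *r, struct report *out, size_t max)
{
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
	size_t n = 0;

	while (tail != head && n < max) {
		out[n++] = r->slots[tail & (RING_SIZE - 1)];
		tail++;
	}
	/* Hand the consumed slots back to the producer. */
	atomic_store_explicit(&r->tail, tail, memory_order_release);
	return n;
}
```

Because each index has exactly one writer, neither side needs a lock,
which is what makes this shape usable from a context that must not
sleep or spin on the consumer.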

What the series adds

Patch 1 introduces the substrate's data types: a per-event
configuration struct and a per-context list to hang them on.  A
CONFIG_PERF_EVENTS=n build folds to no-op stubs.

Patch 2 exposes those types through sysfs.  Each entry maps to one
perf event and lets userspace pick the PMU and how to sample it: the
raw PMU type/config, addressing flags, and period or frequency.  The
defaults are tuned for Intel PEBS; userspace overrides them for other
PMUs.

Patch 3 wires the sysfs apply path so configured events get attached
to the running monitoring context.

Patch 4 is the core of the series.  It replaces the mutex-protected
report queue with a per-CPU lockless ring fed from NMI by the perf
overflow handler and drained once per sample tick by the kdamond.
Drained reports are matched to monitored regions by binary search
over a per-tick snapshot.  The patch also wires the per-event
lifecycle into kdamond: events arm when the monitor starts, disarm
and drain when it stops, roll back cleanly when per-CPU init fails on
some CPUs, and a second context that asks for the substrate while
it is in use is rejected with -EBUSY.
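
The region-matching step described for patch 4 could look roughly like
the following userspace sketch.  It assumes the per-tick snapshot is an
array of non-overlapping [start, end) ranges sorted by start address;
`struct region_snap`, `region_find`, and `region_account` are
hypothetical names, not identifiers from the series.

```c
#include <stddef.h>
#include <stdint.h>

struct region_snap {		/* hypothetical snapshot entry: [start, end) */
	uint64_t start;
	uint64_t end;
	unsigned int nr_accesses;
};

/*
 * Return the index of the region containing addr, or -1 when addr falls
 * in a gap or outside the snapshot.  snap must be sorted by start and
 * its ranges must not overlap.
 */
static long region_find(const struct region_snap *snap, size_t nr,
			uint64_t addr)
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (addr < snap[mid].start)
			hi = mid;
		else if (addr >= snap[mid].end)
			lo = mid + 1;
		else
			return (long)mid;
	}
	return -1;
}

/* Credit one drained report to its region; returns 1 if it matched. */
static int region_account(struct region_snap *snap, size_t nr, uint64_t addr)
{
	long i = region_find(snap, nr, addr);

	if (i < 0)
		return 0;
	snap[i].nr_accesses++;
	return 1;
}
```

Each lookup costs O(log n) in the number of regions, so the drain cost
per report stays small even when the region count grows.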

Patch 5 is the perf-event backend.  Two stateless overflow handlers
(one vaddr-keyed, one paddr-keyed) are picked at event creation time
and submit samples into the per-CPU ring.  Vendor-specific sample
validity is honored at this layer.
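
One way to picture the handler split in patch 5 is the userspace sketch
below: the handler is chosen once from the event's addressing mode, and
each handler refuses a sample whose address field is not flagged as
valid.  Everything here is hypothetical (the `struct sample` layout,
the `SAMPLE_*_VALID` flags, and the function names); the real handlers
work on perf sample data and vendor-specific validity bits.

```c
#include <stdbool.h>
#include <stdint.h>

#define SAMPLE_VADDR_VALID (1u << 0)	/* hypothetical validity flags */
#define SAMPLE_PADDR_VALID (1u << 1)

struct sample {				/* hypothetical sample layout */
	uint32_t valid;
	uint64_t vaddr;
	uint64_t paddr;
};

typedef bool (*sample_handler_t)(const struct sample *s, uint64_t *key);

/* vaddr-keyed handler: rejects samples without a valid virtual address. */
static bool handle_vaddr(const struct sample *s, uint64_t *key)
{
	if (!(s->valid & SAMPLE_VADDR_VALID))
		return false;
	*key = s->vaddr;
	return true;
}

/* paddr-keyed handler: rejects samples without a valid physical address. */
static bool handle_paddr(const struct sample *s, uint64_t *key)
{
	if (!(s->valid & SAMPLE_PADDR_VALID))
		return false;
	*key = s->paddr;
	return true;
}

/* Chosen once, at event creation time, from the addressing mode. */
static sample_handler_t pick_handler(bool sample_phys_addr)
{
	return sample_phys_addr ? handle_paddr : handle_vaddr;
}
```

Keeping the handlers stateless means the choice made at creation time
is the only per-event decision, and a rejected sample leaves the
caller's key untouched.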

Patch 6 adds a tracepoint at every node_eligible_mem_bp quota-goal
evaluation so userspace can watch goal convergence without polling
sysfs.

Userspace setup model

Userspace selects the sampling PMU by pointing the perf event's
`type` / `config` at it, and chooses the scheme topology that suits
the address space the PMU reports on.  No module load or unload step
is involved; `echo on > state` arms the substrate, `echo off > state`
disarms it.

Two configurations were used for validation.

Configuration A: AMD IBS Op, paddr ops, system-wide PULL+PUSH tiering

  IBS Op stamps samples with physical addresses, so DAMON reasons over
  every backing page in the system regardless of which task or guest
  touched it -- the substrate becomes a system-wide tiering controller.

  Setup (abridged; `D=/sys/kernel/mm/damon/admin/kdamonds/0`):

    echo 1     > /sys/kernel/mm/damon/admin/kdamonds/nr_kdamonds
    echo 1     > $D/contexts/nr_contexts
    echo paddr > $D/contexts/0/operations

    # Two regions, one per NUMA node (DRAM + CXL).  PA ranges
    # are derived per host from /proc/iomem; omitted here.
    echo 1 > $D/contexts/0/targets/nr_targets
    echo 2 > $D/contexts/0/targets/0/regions/nr_regions
    echo <DRAM_LO> > $D/contexts/0/targets/0/regions/0/start
    echo <DRAM_HI> > $D/contexts/0/targets/0/regions/0/end
    echo <CXL_LO>  > $D/contexts/0/targets/0/regions/1/start
    echo <CXL_HI>  > $D/contexts/0/targets/0/regions/1/end

    # IBS Op event, period-based, paddr-stamped:
    PE=$D/contexts/0/monitoring_attrs/sample/perf_events
    echo 1 > $PE/nr_perf_events
    echo $(cat /sys/bus/event_source/devices/ibs_op/type) > $PE/0/type
    echo 0      > $PE/0/config
    echo 1      > $PE/0/sample_phys_addr
    echo 0      > $PE/0/freq
    echo 262144 > $PE/0/sample_period
    echo 0      > $PE/0/exclude_kernel
    echo 0      > $PE/0/exclude_hv

    # PULL scheme: migrate_hot toward DRAM, gated on
    # node_eligible_mem_bp(nid=DRAM) goal target_value=TARGET_BP.
    # addr filter restricts source to the CXL range.
    # PUSH scheme: migrate_hot toward CXL, gated on
    # node_eligible_mem_bp(nid=CXL) target_value=10000-TARGET_BP.
    # addr filter restricts source to the DRAM range.
    # Both schemes are migrate_hot; they converge from opposite
    # directions on the same hot working set.

    echo on > $D/state

  Userspace tunes the steady-state DRAM:CXL split by writing the goal
  `target_value`s; DAMON's quota autotuner drives migration intensity
  to match.
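
  The PULL and PUSH goals above use complementary targets in basis
  points (10000 bp = 100%).  A small helper that derives both values
  from one TARGET_BP might look like this; `struct goal_pair` and
  `goal_pair_from_target` are hypothetical names used only for
  illustration.

```c
#include <stdbool.h>

#define BP_FULL 10000u	/* 10000 basis points == 100% */

struct goal_pair {		/* hypothetical: the two target_value settings */
	unsigned int pull_bp;	/* node_eligible_mem_bp(nid=DRAM) target */
	unsigned int push_bp;	/* node_eligible_mem_bp(nid=CXL) target */
};

/* Fill out with TARGET_BP and its complement; reject values above 100%. */
static bool goal_pair_from_target(unsigned int target_bp,
				  struct goal_pair *out)
{
	if (target_bp > BP_FULL)
		return false;
	out->pull_bp = target_bp;
	out->push_bp = BP_FULL - target_bp;
	return true;
}
```

  For example, a TARGET_BP of 7000 gives target values of 7000 for the
  PULL scheme and 3000 for the PUSH scheme.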

  Workload: a QEMU/KVM guest pinned to one NUMA node, running 32
  multichase multiload threads each touching a 4 GiB working set
  (~128 GiB aggregate) with the memcpy-libc kernel.  The guest sees
  a flat single-NUMA layout and has no direct view of the host's
  tiering topology, yet its hot pages are migrated to DRAM and cold
  pages pushed to CXL by host-side DAMON acting on IBS-stamped
  physical addresses -- the application inside the guest benefits
  from tiering it never had to be aware of.  Validated on AMD Turin
  (132-CPU EPYC).  The configuration converged to its target ratio
  in seconds and remained stable for 7+ hours continuously, with no
  perf core auto-throttle and no measurable drift in the achieved
  interleave ratio.

Configuration B: Intel PEBS L3-miss, vaddr ops, per-PID weighted-dest

  PEBS reports vaddr samples in the context of the running task.
  DAMON's vaddr ops monitors a specific PID.

  Setup (abridged):

    echo 1     > /sys/kernel/mm/damon/admin/kdamonds/nr_kdamonds
    echo 1     > $D/contexts/nr_contexts
    echo vaddr > $D/contexts/0/operations

    echo 1     > $D/contexts/0/targets/nr_targets
    echo $PID  > $D/contexts/0/targets/0/pid_target
    echo 0     > $D/contexts/0/targets/0/regions/nr_regions

    # PEBS MEM_LOAD_RETIRED.L3_MISS, frequency-based, vaddr-stamped:
    echo 1      > $PE/nr_perf_events
    echo 4      > $PE/0/type           # PERF_TYPE_RAW
    echo 0x20d1 > $PE/0/config         # umask=0x20 event=0xd1
    echo 0      > $PE/0/sample_phys_addr
    echo 1      > $PE/0/freq
    echo 5003   > $PE/0/sample_freq
    echo 2      > $PE/0/precise_ip
    echo 1      > $PE/0/wakeup_events

    # Single migrate_hot scheme with two weighted destinations
    # (DRAM + CXL).  Userspace tunes the steady-state interleave by
    # writing dests/{0,1}/weight.

    echo on > $D/state

  Workload: 32 multichase multiload threads with a 4 GiB working set
  each (~128 GiB aggregate) running directly on the host, monitored
  by DAMON via the multiload PID.  Validated on Intel Granite Rapids
  (144-CPU).  Convergence is fast and the system is stable.
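
  The raw `config` value 0x20d1 used above packs the event select into
  bits 0-7 and the unit mask into bits 8-15, which is the layout the
  x86 core PMU advertises under its sysfs `format` directory.  A
  minimal sketch of that packing follows; `x86_raw_config` is a
  hypothetical helper name, and other config fields (edge, inv, cmask
  and so on) are left out.

```c
#include <stdint.h>

/* Pack an x86 core PMU event select and unit mask into a raw config. */
static uint64_t x86_raw_config(uint8_t event, uint8_t umask)
{
	return (uint64_t)event | ((uint64_t)umask << 8);
}
```

  With event=0xd1 and umask=0x20 this yields 0x20d1, matching the
  value written to `$PE/0/config`.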

[1] https://lore.kernel.org/linux-mm/20260516223439.4033-1-ravis.opensrc@gmail.com/
[2] https://lore.kernel.org/20260423004211.7037-1-akinobu.mita@gmail.com

Ravi Jonnalagadda (6):
  mm/damon: add struct damon_perf_event{,_attr} and per-ctx perf_events
    list
  mm/damon/sysfs-sample: expose perf_events configuration via sysfs
  mm/damon/sysfs: install perf_events on apply
  mm/damon/core: per-CPU SPSC ring drain and damon_perf_event lifecycle
  mm/damon/vaddr: implement perf-event access check
  mm/damon: add damos_node_eligible_mem_bp tracepoint

 include/linux/damon.h        |  80 +++++
 include/trace/events/damon.h |  49 +++
 mm/damon/core.c              | 403 ++++++++++++++++++++----
 mm/damon/ops-common.h        |  39 +++
 mm/damon/sysfs-common.h      |   6 +
 mm/damon/sysfs-sample.c      | 579 +++++++++++++++++++++++++++++++++++
 mm/damon/sysfs.c             |   3 +
 mm/damon/vaddr.c             | 267 ++++++++++++++++
 8 files changed, 1370 insertions(+), 56 deletions(-)


base-commit: 4c8ad15abf15eb480d3ad85f902001e35465ef18
-- 
2.43.0

Re: [RFC PATCH 0/6] mm/damon: hardware-sampled access reports

Posted by SeongJae Park 1 week, 2 days ago

On Fri, 29 May 2026 09:56:34 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:

> This series introduces a vendor and PMU-agnostic substrate inside DAMON
> that consumes hardware-sampled access reports through the standard
> perf-event interface.  Userspace selects the PMU through sysfs (raw
> type/config knobs), driving either Intel PEBS L3-miss sampling or AMD
> IBS Op sampling.
> 
> Why a unified perf-event substrate
> 
> Earlier hardware-sampled access-monitoring proposal [1] took an AMD IBS
> specific module path backend, owning its own probe configuration,
> sysfs knobs, and lifecycle.
> 
> SeongJae Park has previously highlighted the advantage of Akinobu
> Mita's perf-event proposal [2]: let DAMON register kernel-counter perf
> events and consume samples from any sampling PMU that perf core knows
> about.  This series builds on that direction

Ah great, so we have no unclear challenge (additional loadable module support
and conflicts with other IBS modules) on our road for now!  That is, we can
reuse the stable perf event interface and achieve all our goals!  As I
previously shared [1], it would take time, but I'm very optimistic about the
success of this project.  I don't like promising too much, but this project
looks like something that we can "consider it done".

We can also say that perf-event-based monitoring is the current candidate for
the first damon_report_access()-based data attributes monitoring (the final
deliverable of milestone 2 [1]).

> with the changes we
> needed to run it cross-vendor:
> 
>   - a per-CPU lockless ring between the NMI sample handler and the
>     kdamond drain,
>   - per-CPU events that follow CPU hotplug cleanly,
>   - events fire only while the monitor is running -- created disabled,
>     armed when kdamond starts, disarmed and drained when it stops,
>   - all-or-nothing init across CPUs: a partial-CPU create failure rolls
>     the whole event back rather than leaving silent gaps,
>   - safe handling of vendor sample-validity flags so a stale or
>     unpopulated address is never mistaken for a valid sample.
> 
> What the series adds
> 
> Patch 1 introduces the substrate's data types: a per-event
> configuration struct and a per-context list to hang them on.  A
> CONFIG_PERF_EVENTS=n build folds to no-op stubs.
> 
> Patch 2 exposes those types through sysfs.  Each entry maps to one
> perf event and lets userspace pick the PMU and how to sample it: the
> raw PMU type/config, addressing flags, and period or frequency.  The
> defaults are tuned for Intel PEBS; userspace overrides them for other
> PMUs.
> 
> Patch 3 wires the sysfs apply path so configured events get attached
> to the running monitoring context.
> 
> Patch 4 is the core of the series.  It replaces the mutex-protected
> report queue with a per-CPU lockless ring fed from NMI by the perf
> overflow handler and drained once per sample tick by the kdamond.
> Drained reports are matched to monitored regions by binary search
> over a per-tick snapshot.  The patch also wires the per-event
> lifecycle into kdamond: events arm when the monitor starts, disarm
> and drain when it stops, roll back cleanly when per-CPU init fails on
> some CPUs, and a second context that asks for the substrate while
> it is in use is rejected with -EBUSY.
> 
> Patch 5 is the perf-event backend.  Two stateless overflow handlers
> (one vaddr-keyed, one paddr-keyed) are picked at event creation time
> and submit samples into the per-CPU ring.  Vendor-specific sample
> validity is honored at this layer.
> 
> Patch 6 adds a tracepoint at every node_eligible_mem_bp quota-goal
> evaluation so userspace can watch goal convergence without polling
> sysfs.
> 
> Userspace setup model
> 
> Userspace selects the sampling PMU by pointing the perf event's
> `type` / `config` at it, and chooses the scheme topology that suits
> the address space the PMU reports on.  No module load or unload step
> is involved; `echo on > state` arms the substrate, `echo off > state`
> disarms it.
> 
> Two configurations were used for validation.
> 
> Configuration A: AMD IBS Op, paddr ops, system-wide PULL+PUSH tiering
> 
>   IBS Op stamps samples with physical addresses, so DAMON reasons over
>   every backing page in the system regardless of which task or guest
>   touched it -- the substrate becomes a system-wide tiering controller.
> 
>   Setup (abridged; `D=/sys/kernel/mm/damon/admin/kdamonds/0`):
> 
>     echo 1     > /sys/kernel/mm/damon/admin/kdamonds/nr_kdamonds
>     echo 1     > $D/contexts/nr_contexts
>     echo paddr > $D/contexts/0/operations
> 
>     # Two regions, one per NUMA node (DRAM + CXL).  PA ranges
>     # are derived per host from /proc/iomem; omitted here.
>     echo 1 > $D/contexts/0/targets/nr_targets
>     echo 2 > $D/contexts/0/targets/0/regions/nr_regions
>     echo <DRAM_LO> > $D/contexts/0/targets/0/regions/0/start
>     echo <DRAM_HI> > $D/contexts/0/targets/0/regions/0/end
>     echo <CXL_LO>  > $D/contexts/0/targets/0/regions/1/start
>     echo <CXL_HI>  > $D/contexts/0/targets/0/regions/1/end
> 
>     # IBS Op event, period-based, paddr-stamped:
>     PE=$D/contexts/0/monitoring_attrs/sample/perf_events
>     echo 1 > $PE/nr_perf_events
>     echo $(cat /sys/bus/event_source/devices/ibs_op/type) > $PE/0/type
>     echo 0      > $PE/0/config
>     echo 1      > $PE/0/sample_phys_addr
>     echo 0      > $PE/0/freq
>     echo 262144 > $PE/0/sample_period
>     echo 0      > $PE/0/exclude_kernel
>     echo 0      > $PE/0/exclude_hv

FYI, and as you may already know, the current plan [1] is to use the attributes
probe interface.  With it, the above IBS Op event setup part would look like,

mon_attr=/sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/monitoring_attrs
echo 1 > $mon_attr/probes/nr_probes
probe=$mon_attr/probes/0
echo 1 > $probe/filters/nr_filters
filter=$probe/filters/0
echo perf_event > $filter/type
echo ibs_op > $filter/perf_event_type
echo Y > $filter/allow

Of course, more details could change later.

> 
>     # PULL scheme: migrate_hot toward DRAM, gated on
>     # node_eligible_mem_bp(nid=DRAM) goal target_value=TARGET_BP.
>     # addr filter restricts source to the CXL range.
>     # PUSH scheme: migrate_hot toward CXL, gated on
>     # node_eligible_mem_bp(nid=CXL) target_value=10000-TARGET_BP.
>     # addr filter restricts source to the DRAM range.
>     # Both schemes are migrate_hot; they converge from opposite
>     # directions on the same hot working set.
> 
>     echo on > $D/state
> 
>   Userspace tunes the steady-state DRAM:CXL split by writing the goal
>   `target_value`s; DAMON's quota autotuner drives migration intensity
>   to match.
> 
>   Workload: a QEMU/KVM guest pinned to one NUMA node, running 32
>   multichase multiload threads each touching a 4 GiB working set
>   (~128 GiB aggregate) with the memcpy-libc kernel.  The guest sees
>   a flat single-NUMA layout and has no direct view of the host's
>   tiering topology, yet its hot pages are migrated to DRAM and cold
>   pages pushed to CXL by host-side DAMON acting on IBS-stamped
>   physical addresses -- the application inside the guest benefits
>   from tiering it never had to be aware of.  Validated on AMD Turin
>   (132-CPU EPYC).  The configuration converged to its target ratio
>   in seconds and remained stable for 7+ hours continuously, with no
>   perf core auto-throttle and no measurable drift in the achieved
>   interleave ratio.
> 
> Configuration B: Intel PEBS L3-miss, vaddr ops, per-PID weighted-dest
> 
>   PEBS reports vaddr samples in the context of the running task.
>   DAMON's vaddr ops monitors a specific PID.
> 
>   Setup (abridged):
> 
>     echo 1     > /sys/kernel/mm/damon/admin/kdamonds/nr_kdamonds
>     echo 1     > $D/contexts/nr_contexts
>     echo vaddr > $D/contexts/0/operations
> 
>     echo 1     > $D/contexts/0/targets/nr_targets
>     echo $PID  > $D/contexts/0/targets/0/pid_target
>     echo 0     > $D/contexts/0/targets/0/regions/nr_regions
> 
>     # PEBS MEM_LOAD_RETIRED.L3_MISS, frequency-based, vaddr-stamped:
>     echo 1      > $PE/nr_perf_events
>     echo 4      > $PE/0/type           # PERF_TYPE_RAW
>     echo 0x20d1 > $PE/0/config         # umask=0x20 event=0xd1
>     echo 0      > $PE/0/sample_phys_addr
>     echo 1      > $PE/0/freq
>     echo 5003   > $PE/0/sample_freq
>     echo 2      > $PE/0/precise_ip
>     echo 1      > $PE/0/wakeup_events
> 
>     # Single migrate_hot scheme with two weighted destinations
>     # (DRAM + CXL).  Userspace tunes the steady-state interleave by
>     # writing dests/{0,1}/weight.
> 
>     echo on > $D/state
> 
>   Workload: 32 multichase multiload threads with a 4 GiB working set
>   each (~128 GiB aggregate) running directly on the host, monitored
>   by DAMON via the multiload PID.  Validated on Intel Granite Rapids
>   (144-CPU).  Convergence is fast and the system is stable.

Thank you so much for sharing the great prototype implementation and test
results!

I will try to make fast progress on milestone 1.  I will hold off on reviewing
the details of this series for now, as there could be more changes.  But at a
high level, this looks promising.

> 
> [1] https://lore.kernel.org/linux-mm/20260516223439.4033-1-ravis.opensrc@gmail.com/
> [2] https://lore.kernel.org/20260423004211.7037-1-akinobu.mita@gmail.com

[1] https://lore.kernel.org/20260525225208.1179-1-sj@kernel.org/


Thanks,
SJ

[...]

Re: [RFC PATCH 0/6] mm/damon: hardware-sampled access reports

Posted by Akinobu Mita 1 week, 2 days ago

Hello Ravi and SeongJae,

On Sat, May 30, 2026 at 9:05 SeongJae Park <sj@kernel.org> wrote:
>
> On Fri, 29 May 2026 09:56:34 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:
>
> > This series introduces a vendor and PMU-agnostic substrate inside DAMON
> > that consumes hardware-sampled access reports through the standard
> > perf-event interface.  Userspace selects the PMU through sysfs (raw
> > type/config knobs), driving either Intel PEBS L3-miss sampling or AMD
> > IBS Op sampling.
> >
> > Why a unified perf-event substrate
> >
> > Earlier hardware-sampled access-monitoring proposal [1] took an AMD IBS
> > specific module path backend, owning its own probe configuration,
> > sysfs knobs, and lifecycle.
> >
> > SeongJae Park has previously highlighted the advantage of Akinobu
> > Mita's perf-event proposal [2]: let DAMON register kernel-counter perf
> > events and consume samples from any sampling PMU that perf core knows
> > about.  This series builds on that direction
>
> Ah great, so we have no unclear challenge (additional loadable module support
> and conflicts with other IBS modules) on our road for now!  That is, we can
> reuse the stable perf event interface and achieve all our goals!  As I
> previously shared [1], it would take time, but I'm very optimistic about the
> success of this project.  I don't like promising too much, but this project
> looks like something that we can "consider it done".
>
> We can also say that the current candidate of the first
> damon_report_access()-based data attributes monitoring (milestone 2 [1] final
> deliverable) is the perf event based monitoring.

That's good!

From a quick look, it seems to have all the features I need, so I'd like to
evaluate it based on Ravi's patch.  If any extensions require changes, I will
let you know as feedback.

Ravi,
You can also add my Co-developed-by and Signed-off-by tags to the appropriate
patches, so please go ahead and post them to the mailing list.

I am currently working on a change to allow selecting perf events from the damo
tool by specifying the event name, similar to the perf record -e option (e.g.,
"cpu/mem-loads,ldlat=30,freq=5000/P" or "cpu/mem-stores,freq=5000/P").

I'll share the progress once it reaches a certain point.  A change to the perf
file, as shown in the attachment, will be necessary, but I believe it can be
handled without changing Ravi's current patch set.

> > with the changes we
> > needed to run it cross-vendor:
> >
> >   - a per-CPU lockless ring between the NMI sample handler and the
> >     kdamond drain,
> >   - per-CPU events that follow CPU hotplug cleanly,
> >   - events fire only while the monitor is running -- created disabled,
> >     armed when kdamond starts, disarmed and drained when it stops,
> >   - all-or-nothing init across CPUs: a partial-CPU create failure rolls
> >     the whole event back rather than leaving silent gaps,
> >   - safe handling of vendor sample-validity flags so a stale or
> >     unpopulated address is never mistaken for a valid sample.
> >
> > What the series adds
> >
> > Patch 1 introduces the substrate's data types: a per-event
> > configuration struct and a per-context list to hang them on.  A
> > CONFIG_PERF_EVENTS=n build folds to no-op stubs.
> >
> > Patch 2 exposes those types through sysfs.  Each entry maps to one
> > perf event and lets userspace pick the PMU and how to sample it: the
> > raw PMU type/config, addressing flags, and period or frequency.  The
> > defaults are tuned for Intel PEBS; userspace overrides them for other
> > PMUs.
> >
> > Patch 3 wires the sysfs apply path so configured events get attached
> > to the running monitoring context.
> >
> > Patch 4 is the core of the series.  It replaces the mutex-protected
> > report queue with a per-CPU lockless ring fed from NMI by the perf
> > overflow handler and drained once per sample tick by the kdamond.
> > Drained reports are matched to monitored regions by binary search
> > over a per-tick snapshot.  The patch also wires the per-event
> > lifecycle into kdamond: events arm when the monitor starts, disarm
> > and drain when it stops, roll back cleanly when per-CPU init fails on
> > some CPUs, and a second context that asks for the substrate while
> > it is in use is rejected with -EBUSY.
> >
> > Patch 5 is the perf-event backend.  Two stateless overflow handlers
> > (one vaddr-keyed, one paddr-keyed) are picked at event creation time
> > and submit samples into the per-CPU ring.  Vendor-specific sample
> > validity is honored at this layer.
> >
> > Patch 6 adds a tracepoint at every node_eligible_mem_bp quota-goal
> > evaluation so userspace can watch goal convergence without polling
> > sysfs.
> >
> > Userspace setup model
> >
> > Userspace selects the sampling PMU by pointing the perf event's
> > `type` / `config` at it, and chooses the scheme topology that suits
> > the address space the PMU reports on.  No module load or unload step
> > is involved; `echo on > state` arms the substrate, `echo off > state`
> > disarms it.
> >
> > Two configurations were used for validation.
> >
> > Configuration A: AMD IBS Op, paddr ops, system-wide PULL+PUSH tiering
> >
> >   IBS Op stamps samples with physical addresses, so DAMON reasons over
> >   every backing page in the system regardless of which task or guest
> >   touched it -- the substrate becomes a system-wide tiering controller.
> >
> >   Setup (abridged; `D=/sys/kernel/mm/damon/admin/kdamonds/0`):
> >
> >     echo 1     > /sys/kernel/mm/damon/admin/kdamonds/nr_kdamonds
> >     echo 1     > $D/contexts/nr_contexts
> >     echo paddr > $D/contexts/0/operations
> >
> >     # Two regions, one per NUMA node (DRAM + CXL).  PA ranges
> >     # are derived per host from /proc/iomem; omitted here.
> >     echo 1 > $D/contexts/0/targets/nr_targets
> >     echo 2 > $D/contexts/0/targets/0/regions/nr_regions
> >     echo <DRAM_LO> > $D/contexts/0/targets/0/regions/0/start
> >     echo <DRAM_HI> > $D/contexts/0/targets/0/regions/0/end
> >     echo <CXL_LO>  > $D/contexts/0/targets/0/regions/1/start
> >     echo <CXL_HI>  > $D/contexts/0/targets/0/regions/1/end
> >
> >     # IBS Op event, period-based, paddr-stamped:
> >     PE=$D/contexts/0/monitoring_attrs/sample/perf_events
> >     echo 1 > $PE/nr_perf_events
> >     echo $(cat /sys/bus/event_source/devices/ibs_op/type) > $PE/0/type
> >     echo 0      > $PE/0/config
> >     echo 1      > $PE/0/sample_phys_addr
> >     echo 0      > $PE/0/freq
> >     echo 262144 > $PE/0/sample_period
> >     echo 0      > $PE/0/exclude_kernel
> >     echo 0      > $PE/0/exclude_hv
>
> FYI, and as you may already know, the current plan [1] is to use the attributes
> probe interface.  With it, the above IBS Op event setup part would look like,
>
> mon_attr=/sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/monitoring_attrs
> echo 1 > $mon_attr/probes/nr_probes
> probe=$mon_attr/probes/0
> echo 1 > $probe/filters/nr_filters
> filter=$probe/filters/0
> echo perf_event > $filter/type
> echo ibs_op > $filter/perf_event_type
> echo Y > $filter/allow
>
> Of course, more details could change later.
>
> >
> >     # PULL scheme: migrate_hot toward DRAM, gated on
> >     # node_eligible_mem_bp(nid=DRAM) goal target_value=TARGET_BP.
> >     # addr filter restricts source to the CXL range.
> >     # PUSH scheme: migrate_hot toward CXL, gated on
> >     # node_eligible_mem_bp(nid=CXL) target_value=10000-TARGET_BP.
> >     # addr filter restricts source to the DRAM range.
> >     # Both schemes are migrate_hot; they converge from opposite
> >     # directions on the same hot working set.
> >
> >     echo on > $D/state
> >
> >   Userspace tunes the steady-state DRAM:CXL split by writing the goal
> >   `target_value`s; DAMON's quota autotuner drives migration intensity
> >   to match.
> >
> >   Workload: a QEMU/KVM guest pinned to one NUMA node, running 32
> >   multichase multiload threads each touching a 4 GiB working set
> >   (~128 GiB aggregate) with the memcpy-libc kernel.  The guest sees
> >   a flat single-NUMA layout and has no direct view of the host's
> >   tiering topology, yet its hot pages are migrated to DRAM and cold
> >   pages pushed to CXL by host-side DAMON acting on IBS-stamped
> >   physical addresses -- the application inside the guest benefits
> >   from tiering it never had to be aware of.  Validated on AMD Turin
> >   (132-CPU EPYC).  The configuration converged to its target ratio
> >   in seconds and remained stable for 7+ hours continuously, with no
> >   perf core auto-throttle and no measurable drift in the achieved
> >   interleave ratio.
> >
> > Configuration B: Intel PEBS L3-miss, vaddr ops, per-PID weighted-dest
> >
> >   PEBS reports vaddr samples in the context of the running task.
> >   DAMON's vaddr ops monitors a specific PID.
> >
> >   Setup (abridged):
> >
> >     echo 1     > /sys/kernel/mm/damon/admin/kdamonds/nr_kdamonds
> >     echo 1     > $D/contexts/nr_contexts
> >     echo vaddr > $D/contexts/0/operations
> >
> >     echo 1     > $D/contexts/0/targets/nr_targets
> >     echo $PID  > $D/contexts/0/targets/0/pid_target
> >     echo 0     > $D/contexts/0/targets/0/regions/nr_regions
> >
> >     # PEBS MEM_LOAD_RETIRED.L3_MISS, frequency-based, vaddr-stamped:
> >     echo 1      > $PE/nr_perf_events
> >     echo 4      > $PE/0/type           # PERF_TYPE_RAW
> >     echo 0x20d1 > $PE/0/config         # umask=0x20 event=0xd1
> >     echo 0      > $PE/0/sample_phys_addr
> >     echo 1      > $PE/0/freq
> >     echo 5003   > $PE/0/sample_freq
> >     echo 2      > $PE/0/precise_ip
> >     echo 1      > $PE/0/wakeup_events
> >
> >     # Single migrate_hot scheme with two weighted destinations
> >     # (DRAM + CXL).  Userspace tunes the steady-state interleave by
> >     # writing dests/{0,1}/weight.
> >
> >     echo on > $D/state
> >
> >   Workload: 32 multichase multiload threads with a 4 GiB working set
> >   each (~128 GiB aggregate) running directly on the host, monitored
> >   by DAMON via the multiload PID.  Validated on Intel Granite Rapids
> >   (144-CPU).  Convergence is fast and the system is stable.
>
> Thank you so much for sharing the great prototype implementation and test
> results!
>
> I will try to make fast progress on milestone 1.  I will hold reviewing details
> of this series for now, as there could be more changes.  But in the high level,
> this looks promising.
>
> >
> > [1] https://lore.kernel.org/linux-mm/20260516223439.4033-1-ravis.opensrc@gmail.com/
> > [2] https://lore.kernel.org/20260423004211.7037-1-akinobu.mita@gmail.com
>
> [1] https://lore.kernel.org/20260525225208.1179-1-sj@kernel.org/
>
>
> Thanks,
> SJ
>
> [...]

Re: [RFC PATCH 0/6] mm/damon: hardware-sampled access reports

Posted by Ravi Jonnalagadda 1 week, 2 days ago

Hi SeongJae and Akinobu,

Thank you both for the warm reception and for the clear direction.

On Fri, May 29, 2026 at 8:02 PM Akinobu Mita <akinobu.mita@gmail.com> wrote:
>
> Hello Ravi and SeongJae,
>
> On Sat, May 30, 2026 at 9:05 SeongJae Park <sj@kernel.org> wrote:
> >
> > On Fri, 29 May 2026 09:56:34 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:
> >
> > > This series introduces a vendor and PMU-agnostic substrate inside DAMON
> > > that consumes hardware-sampled access reports through the standard
> > > perf-event interface.  Userspace selects the PMU through sysfs (raw
> > > type/config knobs), driving either Intel PEBS L3-miss sampling or AMD
> > > IBS Op sampling.
> > >
> > > Why a unified perf-event substrate
> > >
> > > Earlier hardware-sampled access-monitoring proposal [1] took an AMD IBS
> > > specific module path backend, owning its own probe configuration,
> > > sysfs knobs, and lifecycle.
> > >
> > > SeongJae Park has previously highlighted the advantage of Akinobu
> > > Mita's perf-event proposal [2]: let DAMON register kernel-counter perf
> > > events and consume samples from any sampling PMU that perf core knows
> > > about.  This series builds on that direction
> >
> > Ah great, so we have no unclear challenge (additional loadable module support
> > and conflicts with other IBS modules) on our road for now!  That is, we can
> > reuse the stable perf event interface and achieve all our goals!  As I
> > previously shared [1], it would take time, but I'm very optimistic about the
> > success of this project.  I don't like promising too much, but this project
> > looks like something that we can "consider it done".
> >
> > We can also say that the current candidate of the first
> > damon_report_access()-based data attributes monitoring (milestone 2 [1] final
> > deliverable) is the perf event based monitoring.

Glad this aligns with the milestone roadmap.

>
> That's good!
>
> From a quick look, it seems to have all the features I need, so I'd like to
> evaluate it based on Ravi's patch.  If any extensions require changes, I will
> let you know as feedback.

Great, please do.  Happy to fold any feedback into v2.

>
> Ravi,
> You can also add my Co-developed-by and Signed-off-by tags to the appropriate
> patch, so please post to the mailing list.
>

Will do.  In v2 I will add your Co-developed-by and Signed-off-by tags
to patches 1, 4, and 5:

    - Patch 1 (`struct damon_perf_event{,_attr}` + per-ctx list)
    - Patch 4 (per-CPU SPSC ring drain + perf-event lifecycle)
    - Patch 5 (vaddr/paddr perf-event backend)

Patches 2 and 3 are the sysfs surface that will move to the
probes/filters interface; patch 6 is the unrelated
`damos_node_eligible_mem_bp` tracepoint.

> I am currently working on a change to allow selecting perf events from the damo
> tool by specifying the event name, similar to the perf record -e option (e.g.,
> "cpu/mem-loads,ldlat=30,freq=5000/P" or "cpu/mem-stores,freq=5000/P").
>
> I'll share the progress once it reaches a certain point.  A change to the perf
> file, as shown in the attachment, will be necessary, but I believe it can be
> handled without changing Ravi's current patch set.

Nice.  When the damo side is ready, I will rerun the existing AMD IBS
and Intel PEBS configurations through it.
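
For reference, a minimal sketch of how such an event spec could be
split before it is mapped onto the perf event knobs.  The helper name
and its output format are hypothetical; nothing here is taken from damo
or from this series, and a spec without a '/' is not handled:

```shell
#!/bin/bash
# Hypothetical helper (not part of damo or this series): split a
# "perf record -e" style spec into PMU name, modifier and terms.
parse_event_spec() {
	local spec=$1
	local pmu=${spec%%/*}        # text before the first '/'
	local rest=${spec#*/}        # text after the first '/'
	local terms=${rest%/*}       # text before the last '/'
	local modifier=${rest##*/}   # text after the last '/' (may be empty)

	echo "pmu=$pmu"
	echo "modifier=$modifier"
	# One term per line; a bare term (the event name) has no '='.
	local IFS=,
	local term
	for term in $terms; do
		echo "term=$term"
	done
}

parse_event_spec "cpu/mem-loads,ldlat=30,freq=5000/P"
```

Resolving the named event and its terms into a raw type/config pair
would still need the PMU's sysfs `events/` and `format/` entries, which
this sketch leaves out.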

>
> > > with the changes we
> > > needed to run it cross-vendor:
> > >
> > >   - a per-CPU lockless ring between the NMI sample handler and the
> > >     kdamond drain,
> > >   - per-CPU events that follow CPU hotplug cleanly,
> > >   - events fire only while the monitor is running -- created disabled,
> > >     armed when kdamond starts, disarmed and drained when it stops,
> > >   - all-or-nothing init across CPUs: a partial-CPU create failure rolls
> > >     the whole event back rather than leaving silent gaps,
> > >   - safe handling of vendor sample-validity flags so a stale or
> > >     unpopulated address is never mistaken for a valid sample.
> > >
> > > What the series adds
> > >
> > > Patch 1 introduces the substrate's data types: a per-event
> > > configuration struct and a per-context list to hang them on.  A
> > > CONFIG_PERF_EVENTS=n build folds to no-op stubs.
> > >
> > > Patch 2 exposes those types through sysfs.  Each entry maps to one
> > > perf event and lets userspace pick the PMU and how to sample it: the
> > > raw PMU type/config, addressing flags, and period or frequency.  The
> > > defaults are tuned for Intel PEBS; userspace overrides them for other
> > > PMUs.
> > >
> > > Patch 3 wires the sysfs apply path so configured events get attached
> > > to the running monitoring context.
> > >
> > > Patch 4 is the core of the series.  It replaces the mutex-protected
> > > report queue with a per-CPU lockless ring fed from NMI by the perf
> > > overflow handler and drained once per sample tick by the kdamond.
> > > Drained reports are matched to monitored regions by binary search
> > > over a per-tick snapshot.  The patch also wires the per-event
> > > lifecycle into kdamond: events arm when the monitor starts, disarm
> > > and drain when it stops, roll back cleanly when per-CPU init fails on
> > > some CPUs, and a second context that asks for the substrate while
> > > it is in use is rejected with -EBUSY.
> > >
> > > Patch 5 is the perf-event backend.  Two stateless overflow handlers
> > > (one vaddr-keyed, one paddr-keyed) are picked at event creation time
> > > and submit samples into the per-CPU ring.  Vendor-specific sample
> > > validity is honored at this layer.
> > >
> > > Patch 6 adds a tracepoint at every node_eligible_mem_bp quota-goal
> > > evaluation so userspace can watch goal convergence without polling
> > > sysfs.
> > >
> > > Userspace setup model
> > >
> > > Userspace selects the sampling PMU by pointing the perf event's
> > > `type` / `config` at it, and chooses the scheme topology that suits
> > > the address space the PMU reports on.  No module load or unload step
> > > is involved; `echo on > state` arms the substrate, `echo off > state`
> > > disarms it.
> > >
> > > Two configurations were used for validation.
> > >
> > > Configuration A: AMD IBS Op, paddr ops, system-wide PULL+PUSH tiering
> > >
> > >   IBS Op stamps samples with physical addresses, so DAMON reasons over
> > >   every backing page in the system regardless of which task or guest
> > >   touched it -- the substrate becomes a system-wide tiering controller.
> > >
> > >   Setup (abridged; `D=/sys/kernel/mm/damon/admin/kdamonds/0`):
> > >
> > >     echo 1     > /sys/kernel/mm/damon/admin/kdamonds/nr_kdamonds
> > >     echo 1     > $D/contexts/nr_contexts
> > >     echo paddr > $D/contexts/0/operations
> > >
> > >     # Two regions, one per NUMA node (DRAM + CXL).  PA ranges
> > >     # are derived per host from /proc/iomem; omitted here.
> > >     echo 1 > $D/contexts/0/targets/nr_targets
> > >     echo 2 > $D/contexts/0/targets/0/regions/nr_regions
> > >     echo <DRAM_LO> > $D/contexts/0/targets/0/regions/0/start
> > >     echo <DRAM_HI> > $D/contexts/0/targets/0/regions/0/end
> > >     echo <CXL_LO>  > $D/contexts/0/targets/0/regions/1/start
> > >     echo <CXL_HI>  > $D/contexts/0/targets/0/regions/1/end
> > >
> > >     # IBS Op event, period-based, paddr-stamped:
> > >     PE=$D/contexts/0/monitoring_attrs/sample/perf_events
> > >     echo 1 > $PE/nr_perf_events
> > >     echo $(cat /sys/bus/event_source/devices/ibs_op/type) > $PE/0/type
> > >     echo 0      > $PE/0/config
> > >     echo 1      > $PE/0/sample_phys_addr
> > >     echo 0      > $PE/0/freq
> > >     echo 262144 > $PE/0/sample_period
> > >     echo 0      > $PE/0/exclude_kernel
> > >     echo 0      > $PE/0/exclude_hv
> >
> > FYI, and as you may already know, the current plan [1] is to use the attributes
> > probe interface.  With it, the above IBS Op event setup part would look like,
> >
> > mon_attr=/sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/monitoring_attrs
> > echo 1 > $mon_attr/probes/nr_probes
> > probe=$mon_attr/probes/0
> > echo 1 > $probe/filters/nr_filters
> > filter=$probe/filters/0
> > echo perf_event > $filter/type
> > echo ibs_op > $filter/perf_event_type
> > echo Y > $filter/allow
> >
> > Of course, more details could change later.

Understood.  I will hold off until milestone 1.

> >
> > >
> > >     # PULL scheme: migrate_hot toward DRAM, gated on
> > >     # node_eligible_mem_bp(nid=DRAM) goal target_value=TARGET_BP.
> > >     # addr filter restricts source to the CXL range.
> > >     # PUSH scheme: migrate_hot toward CXL, gated on
> > >     # node_eligible_mem_bp(nid=CXL) target_value=10000-TARGET_BP.
> > >     # addr filter restricts source to the DRAM range.
> > >     # Both schemes are migrate_hot; they converge from opposite
> > >     # directions on the same hot working set.
> > >
> > >     echo on > $D/state
> > >
> > >   Userspace tunes the steady-state DRAM:CXL split by writing the goal
> > >   `target_value`s; DAMON's quota autotuner drives migration intensity
> > >   to match.
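
To make the relationship between the two goal values concrete, here is
a small sketch that derives both from one desired DRAM share.  It
assumes `target_value` is in basis points (10000 = 100%), as the
`10000-TARGET_BP` expression above implies; the helper name and output
labels are hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: derive the PULL and PUSH goal target_values
# from a desired DRAM share given in basis points (0..10000).
goal_targets() {
	local dram_bp=$1
	if [ "$dram_bp" -lt 0 ] || [ "$dram_bp" -gt 10000 ]; then
		echo "dram_bp must be within 0..10000" >&2
		return 1
	fi
	# PULL goal: node_eligible_mem_bp(nid=DRAM) target.
	echo "pull_target_bp=$dram_bp"
	# PUSH goal: node_eligible_mem_bp(nid=CXL) target, the complement.
	echo "push_target_bp=$((10000 - dram_bp))"
}

goal_targets 7000
```

With a 70:30 DRAM:CXL split this gives 7000 for the PULL goal and 3000
for the PUSH goal, so the two targets always sum to 10000.
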
> > >
> > >   Workload: a QEMU/KVM guest pinned to one NUMA node, running 32
> > >   multichase multiload threads each touching a 4 GiB working set
> > >   (~128 GiB aggregate) with the memcpy-libc kernel.  The guest sees
> > >   a flat single-NUMA layout and has no direct view of the host's
> > >   tiering topology, yet its hot pages are migrated to DRAM and cold
> > >   pages pushed to CXL by host-side DAMON acting on IBS-stamped
> > >   physical addresses -- the application inside the guest benefits
> > >   from tiering it never had to be aware of.  Validated on AMD Turin
> > >   (132-CPU EPYC).  The configuration converged to its target ratio
> > >   in seconds and remained stable for 7+ hours continuously, with no
> > >   perf core auto-throttle and no measurable drift in the achieved
> > >   interleave ratio.
> > >
> > > Configuration B: Intel PEBS L3-miss, vaddr ops, per-PID weighted-dest
> > >
> > >   PEBS reports vaddr samples in the context of the running task.
> > >   DAMON's vaddr ops monitors a specific PID.
> > >
> > >   Setup (abridged):
> > >
> > >     echo 1     > /sys/kernel/mm/damon/admin/kdamonds/nr_kdamonds
> > >     echo 1     > $D/contexts/nr_contexts
> > >     echo vaddr > $D/contexts/0/operations
> > >
> > >     echo 1     > $D/contexts/0/targets/nr_targets
> > >     echo $PID  > $D/contexts/0/targets/0/pid_target
> > >     echo 0     > $D/contexts/0/targets/0/regions/nr_regions
> > >
> > >     # PEBS MEM_LOAD_RETIRED.L3_MISS, frequency-based, vaddr-stamped:
> > >     echo 1      > $PE/nr_perf_events
> > >     echo 4      > $PE/0/type           # PERF_TYPE_RAW
> > >     echo 0x20d1 > $PE/0/config         # umask=0x20 event=0xd1
> > >     echo 0      > $PE/0/sample_phys_addr
> > >     echo 1      > $PE/0/freq
> > >     echo 5003   > $PE/0/sample_freq
> > >     echo 2      > $PE/0/precise_ip
> > >     echo 1      > $PE/0/wakeup_events
> > >
> > >     # Single migrate_hot scheme with two weighted destinations
> > >     # (DRAM + CXL).  Userspace tunes the steady-state interleave by
> > >     # writing dests/{0,1}/weight.
> > >
> > >     echo on > $D/state
> > >
> > >   Workload: 32 multichase multiload threads with a 4 GiB working set
> > >   each (~128 GiB aggregate) running directly on the host, monitored
> > >   by DAMON via the multiload PID.  Validated on Intel Granite Rapids
> > >   (144-CPU).  Convergence is fast and the system is stable.
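
For readers mapping the `0x20d1` value above back to its fields: on
x86 core PMUs the raw config carries the event select in bits 0-7 and
the unit mask in bits 8-15.  A minimal sketch of that composition (the
helper name is hypothetical, and other config fields are ignored):

```shell
#!/bin/bash
# Hypothetical helper: compose a raw core-PMU config value from an
# event select (bits 0-7) and a unit mask (bits 8-15).
raw_config() {
	local event=$1 unit_mask=$2
	printf '0x%x\n' $(( ($unit_mask << 8) | $event ))
}

raw_config 0xd1 0x20    # prints 0x20d1
```
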
> >
> > Thank you so much for sharing the great prototype implementation and test
> > results!
> >
> > I will try to make fast progress on milestone 1.  I will hold reviewing details
> > of this series for now, as there could be more changes.  But in the high level,
> > this looks promising.
> >
> > >
> > > [1] https://lore.kernel.org/linux-mm/20260516223439.4033-1-ravis.opensrc@gmail.com/
> > > [2] https://lore.kernel.org/20260423004211.7037-1-akinobu.mita@gmail.com
> >
> > [1] https://lore.kernel.org/20260525225208.1179-1-sj@kernel.org/
> >
> >
> > Thanks,
> > SJ
> >
> > [...]

Thanks,
Ravi.