[RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver

Jonathan Cameron posted 4 patches 1 year, 2 months ago
Documentation/trace/cxl-hmu.rst     | 197 +++++++
Documentation/trace/index.rst       |   1 +
drivers/cxl/Kconfig                 |   6 +
drivers/cxl/Makefile                |   3 +
drivers/cxl/core/Makefile           |   1 +
drivers/cxl/core/core.h             |   1 +
drivers/cxl/core/hmu.c              |  64 ++
drivers/cxl/core/port.c             |   2 +
drivers/cxl/core/regs.c             |  14 +
drivers/cxl/cxl.h                   |   5 +
drivers/cxl/cxlpci.h                |   1 +
drivers/cxl/hmu.c                   | 880 ++++++++++++++++++++++++++++
drivers/cxl/hmu.h                   |  23 +
drivers/cxl/pci.c                   |  26 +-
tools/perf/arch/arm/util/auxtrace.c |  58 ++
tools/perf/arch/x86/util/auxtrace.c |  76 +++
tools/perf/util/Build               |   1 +
tools/perf/util/auxtrace.c          |   4 +
tools/perf/util/auxtrace.h          |   1 +
tools/perf/util/cxl-hmu.c           | 367 ++++++++++++
tools/perf/util/cxl-hmu.h           |  18 +
21 files changed, 1748 insertions(+), 1 deletion(-)
create mode 100644 Documentation/trace/cxl-hmu.rst
create mode 100644 drivers/cxl/core/hmu.c
create mode 100644 drivers/cxl/hmu.c
create mode 100644 drivers/cxl/hmu.h
create mode 100644 tools/perf/util/cxl-hmu.c
create mode 100644 tools/perf/util/cxl-hmu.h
[RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver
Posted by Jonathan Cameron 1 year, 2 months ago
The CXL specification release 3.2 is now available under a click through at
https://computeexpresslink.org/cxl-specification/ and it brings new
shiny toys.

RFC reason
- Whilst trace capture with a particular configuration is potentially useful,
  the intent is that CXL HMUs will be used to drive various forms of
  hot page migration for memory tiering setups. This driver doesn't do this
  (yet), but rather provides data capture etc. for experimentation and
  for working out how to put most allocations in the right place to
  start with by tuning applications.

CXL r3.2 introduces a CXL Hotness Monitoring Unit definition. The intent
of this is to provide a way to establish which units of memory (typically
pages or larger) in CXL attached memory are hot. The implementation details
and algorithm are all implementation defined. The specification simply
describes the 'interface', which takes the form of a ring buffer of hotness
records in a PCI BAR and defined capability, configuration and status
registers.

The hardware may have constraints on what it can track, granularity etc
and on how accurately it tracks (e.g. counter exhaustion, inaccurate
trackers). Some of these constraints are discoverable from the hardware
registers, others such as loss of accuracy have no universally accepted
measures as they are typically access pattern dependent. Sadly it is
very unlikely any hardware will implement a truly precise tracker given
the large resource requirements for tracking at a useful granularity.

There are two fundamental operation modes:

* Epoch based. Counters are checked after a period of time (Epoch) and
  if over a threshold added to the hotlist.
* Always on. Counters run until a threshold is reached, after that the
  hot unit is added to the hotlist and the counter released.
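
A toy software model of the two modes above may help when reasoning about
what ends up in the hotlist. This is only a sketch: the function names are
made up for illustration, real hardware has a limited number of counters,
and whether the threshold comparison is inclusive is an assumption here.

```python
from collections import Counter


def epoch_based(epochs, threshold):
    """Model epoch mode. 'epochs' is a list of per-epoch access lists,
    each holding the unit indices touched during that epoch."""
    hotlist = []
    for accesses in epochs:
        # Assumption: counters start from zero at the beginning of each epoch.
        counts = Counter(accesses)
        # At the end of the epoch, report every unit at or over the threshold.
        hotlist.extend(sorted(u for u, c in counts.items() if c >= threshold))
    return hotlist


def always_on(accesses, threshold):
    """Model always-on mode over one continuous stream of unit indices."""
    counts = Counter()
    hotlist = []
    for unit in accesses:
        counts[unit] += 1
        if counts[unit] >= threshold:
            hotlist.append(unit)
            del counts[unit]  # counter is released once the unit is reported
    return hotlist


# Unit 2 is accessed three times, but split across two epochs.
epochs = [[1, 1, 1, 2, 2], [2, 3]]
stream = [u for e in epochs for u in e]
print(epoch_based(epochs, 3))  # unit 2 never reaches 3 within one epoch
print(always_on(stream, 3))    # unit 2 is reported once its count reaches 3
```

In this model the same access stream gives different hotlists in the two
modes, because epoch mode only sees the counts within one epoch.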

Counting can be filtered on:

* Region of CXL DPA space (256MiB per bit in a bitmap).
* Type of access - Trusted and non trusted or non trusted only, R/W/RW
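
As a rough illustration of the DPA range filter, the sketch below computes
which bitmap bits cover a given DPA range at 256MiB per bit. The helper name
is hypothetical, and the mapping of bit 0 to DPA 0 is an assumption rather
than something taken from the specification text.

```python
RANGE_GRANULE = 256 * 1024 * 1024  # 256MiB of DPA space per bitmap bit


def dpa_range_bitmap(dpa_start, length):
    """Return an int with one bit set for each 256MiB block that
    overlaps [dpa_start, dpa_start + length). Assumes bit 0 == DPA 0."""
    if length <= 0:
        return 0
    first = dpa_start // RANGE_GRANULE
    last = (dpa_start + length - 1) // RANGE_GRANULE
    bitmap = 0
    for bit in range(first, last + 1):
        bitmap |= 1 << bit
    return bitmap


# The first 512MiB of DPA space needs the two lowest bits.
print(bin(dpa_range_bitmap(0, 512 * 1024 * 1024)))
```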

Sampling can be modified by:

* Downsampling including potentially randomized downsampling.
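
One way to picture downsampling is as a filter in front of the counters that
passes roughly one access in N. The sketch below is a software model only;
how real hardware selects accesses is implementation defined, and the
function name and seed parameter are illustrative assumptions.

```python
import random


def downsample(accesses, factor, randomized=False, seed=0):
    """Keep roughly one access in 'factor'.

    Fixed mode keeps every factor-th access; randomized mode keeps each
    access with probability 1/factor (seeded here for repeatability)."""
    rng = random.Random(seed)
    kept = []
    for i, unit in enumerate(accesses):
        if randomized:
            keep = rng.randrange(factor) == 0
        else:
            keep = (i % factor) == factor - 1
        if keep:
            kept.append(unit)
    return kept


# With a factor of 32, 64 accesses leave two counted accesses in fixed mode.
print(downsample(list(range(64)), 32))
```

Randomized selection avoids always missing the same accesses when the
workload has a stride that lines up with the downsampling factor.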

The driver presented here is intended to be useful in its own right but
also to act as the first step of a possible path towards hotness-monitoring
based hot page migration. Those steps might look like:

1. Gather data - drivers provide telemetry-like solutions to get that
   data. May be enhanced, for example in this driver by providing the
   HPA rather than the DPA Unit Address. Userspace can access enough
   information to do this so maybe not.
2. Userspace algorithm development, possibly combined with userspace
   triggered migration by PA. Working out how to use different levels
   of constrained hardware resources will be challenging.
3. Move those algorithms into the kernel. Will require generalization across
   different hot page trackers etc.

So far this driver just gives access to the raw data. I will probably kick
off a longer discussion on how to do the adaptive sampling needed to actually
use these units for tiering etc. sometime soon (if no one else beats
me to it).  There is a follow-up topic of how to virtualize this stuff
for memory stranding cases (VM gets a fixed mixture of fast and slow
memory and should do its own tiering).

More details in the Documentation patch but typical commands are:

$perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\
 hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,\
 range_size=1024,randomized_downsampling=0,downsampling_factor=32,\
 hotness_granual=12

$perf report --dump-raw-traces

Example output.  With a counter_width of 16 (0x10) the least significant
16 bits (4 hex digits) are the counter value and the unit index is bits 16-63.
Here all units are over the threshold and the indexes are 0,1,2 etc.

. ... CXL_HMU data: size 33512 bytes
Header 0: units: 29c counter_width 10
Header 1 : deadbeef
0000000000000283
0000000000010364
0000000000020366
000000000003033c
0000000000040343
00000000000502ff
000000000006030d
000000000007031a

Which will produce a list of hotness entries.
Bits[N-1:0] counter value
Bits[63:N] Unit ID (combine with unit size and DPA base + HDM decoder
  config to get to a Host Physical Address)
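
A minimal sketch of decoding these records in userspace, using the layout
above and values from the example dump. The helper names are made up for
illustration. Treating the unit size as a power of two given by its log2
(e.g. 12 for 4KiB) is an assumption, and the further DPA to HPA step via the
HDM decoder configuration is not attempted here.

```python
def decode_record(record, counter_width):
    """Split a 64-bit hotness record into (unit_id, counter)."""
    counter = record & ((1 << counter_width) - 1)  # Bits[N-1:0]
    unit_id = record >> counter_width              # Bits[63:N]
    return unit_id, counter


def unit_to_dpa(unit_id, unit_size_log2, dpa_base=0):
    """Assumed mapping: units are naturally aligned, 2**unit_size_log2
    bytes each, counted upwards from dpa_base."""
    return dpa_base + (unit_id << unit_size_log2)


# A few records from the example dump, counter_width 0x10.
for line in ["0000000000000283", "0000000000010364", "000000000007031a"]:
    unit, count = decode_record(int(line, 16), 0x10)
    print(f"unit {unit} count {count:#x} dpa {unit_to_dpa(unit, 12):#x}")
```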

Specific RFC questions.
- What should be in the header added to the aux buffer?
  Currently just the minimum is provided: the number of records
  and the counter width needed to decode them.
- Should we reset the counters when doing sampling "-F X"?
  If the frequency is higher than the epoch we never see any hot units.
  If so, when should we reset them?

Note testing has been light and on emulation only. As the perf tool is
a pain to build on a stripped-back VM, build testing has all been on
arm64 so far.  The driver loads on both arm64 and x86 though, so
any problems are likely in the perf tool arch-specific code,
which is build tested (on the wrong machine).

The QEMU emulation needs some cleanup, but I should be able to post
that shortly to let people actually play with this.  There are lots
of open questions there on how 'right' we want the emulation to be
and what counting uarch to emulate.

Jonathan Cameron (4):
  cxl: Register devices for CXL Hotness Monitoring Units (CHMU)
  cxl: Hotness Monitoring Unit via a Perf AUX Buffer.
  perf: Add support for CXL Hotness Monitoring Units (CHMU)
  hwtrace: Document CXL Hotness Monitoring Unit driver

 Documentation/trace/cxl-hmu.rst     | 197 +++++++
 Documentation/trace/index.rst       |   1 +
 drivers/cxl/Kconfig                 |   6 +
 drivers/cxl/Makefile                |   3 +
 drivers/cxl/core/Makefile           |   1 +
 drivers/cxl/core/core.h             |   1 +
 drivers/cxl/core/hmu.c              |  64 ++
 drivers/cxl/core/port.c             |   2 +
 drivers/cxl/core/regs.c             |  14 +
 drivers/cxl/cxl.h                   |   5 +
 drivers/cxl/cxlpci.h                |   1 +
 drivers/cxl/hmu.c                   | 880 ++++++++++++++++++++++++++++
 drivers/cxl/hmu.h                   |  23 +
 drivers/cxl/pci.c                   |  26 +-
 tools/perf/arch/arm/util/auxtrace.c |  58 ++
 tools/perf/arch/x86/util/auxtrace.c |  76 +++
 tools/perf/util/Build               |   1 +
 tools/perf/util/auxtrace.c          |   4 +
 tools/perf/util/auxtrace.h          |   1 +
 tools/perf/util/cxl-hmu.c           | 367 ++++++++++++
 tools/perf/util/cxl-hmu.h           |  18 +
 21 files changed, 1748 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/trace/cxl-hmu.rst
 create mode 100644 drivers/cxl/core/hmu.c
 create mode 100644 drivers/cxl/hmu.c
 create mode 100644 drivers/cxl/hmu.h
 create mode 100644 tools/perf/util/cxl-hmu.c
 create mode 100644 tools/perf/util/cxl-hmu.h

-- 
2.43.0
Re: [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver
Posted by Jonathan Cameron 1 year, 2 months ago
On Thu, 21 Nov 2024 10:18:41 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:

> The CXL specification release 3.2 is now available under a click through at
> https://computeexpresslink.org/cxl-specification/ and it brings new
> shiny toys.

If anyone wants to play, basic emulation on my CXL QEMU staging tree
https://gitlab.com/jic23/qemu/-/commit/e89b35d264c1bcc04807e7afab1254f35ffc8cb9

Branch with a few other things on top is:
https://gitlab.com/jic23/qemu/-/commits/cxl-2024-11-27

Note that this currently doesn't produce real data.  I have a plan
/ initial PoC / hack to hook that up via an addition to the QEMU cache
plugin and an external tool to emulate the hotness tracker counting
hardware. It will be a little while before I get that finished, so in
the meantime the above exercises the driver.

Jonathan
 
Re: [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver
Posted by Neeraj Kumar 1 year, 1 month ago
On 27/11/24 04:34PM, Jonathan Cameron wrote:
>On Thu, 21 Nov 2024 10:18:41 +0000
>Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
>
>> The CXL specification release 3.2 is now available under a click through at
>> https://computeexpresslink.org/cxl-specification/ and it brings new
>> shiny toys.
>
>If anyone wants to play, basic emulation on my CXL QEMU staging tree
>https://gitlab.com/jic23/qemu/-/commit/e89b35d264c1bcc04807e7afab1254f35ffc8cb9
>
>Branch with a few other things on top is:
>https://gitlab.com/jic23/qemu/-/commits/cxl-2024-11-27
>
>Note that this currently doesn't produce real data.  I have a plan
>/ initial PoC / hack to hook that up via an addition to the QEMU cache
>plugin and an external tool to emulate the hotness tracker counting
>hardware. Will be a little while before I get that finished, so in
>a meantime the above exercises the driver.
>
>Jonathan
>
>> More details in the Documentation patch but typical commands are:
>>
>> $perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\
>>  hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,\
>>  range_size=1024,randomized_downsampling=0,downsampling_factor=32,\
>>  hotness_granual=12

I am facing an issue when running perf record in an x86 emulation environment, using the following steps:

1. Applied the CHMU patches on branch cxl-for-6.13 using the b4 utility. As
no base commit is specified, a minor change was needed to apply them.
Compiled the kernel with CONFIG_CXL_HMU.

2. Compiled jic23/cxl-2024-11-27 for x86_64-softmmu.

3. Launched QEMU with the following CXL topology and the compiled kernel:
VM="-object memory-backend-ram,id=vmem1,share=on,size=512M \
     -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
     -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
     -device cxl-type3,bus=root_port13,volatile-memdev=vmem1,id=cxl-vmem1 \
     -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=8k"

4. Created a region and onlined this memory. Also ran the top utility on the newly created
NUMA node using numactl -m<node> top.

5. Compiled and installed the perf utility in the QEMU environment; the
cxl_hmu_mem* entries are visible in perf list:

root@QEMUCXL2030mm:~# perf list
<snip>
  cxl_hmu_mem0.0.0/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
  cxl_hmu_mem0.0.1/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
  cxl_hmu_mem0.0.2/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
  cxl_hmu_mem1.0.0/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
  cxl_hmu_mem1.0.1/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
  cxl_hmu_mem1.0.2/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
  cxl_pmu_mem0.0/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
  cxl_pmu_mem0.1/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
  cxl_pmu_mem1.0/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
  cxl_pmu_mem1.1/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
<snip>

6. Tried running the perf command mentioned in Documentation/trace/cxl-hmu.rst:

root@QEMUCXL2030mm:/home/cxl/cxl-linux-mainline/tools/perf# perf -v
perf version 6.12.rc5.gc198a4f4a356
root@QEMUCXL2030mm:/home/cxl/cxl-linux-mainline/tools/perf# perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,range_size=1024,randomized_downsampling=0,downsampling_factor=32,hotness_granual=12
event syntax error: '..ess_granual=12'
                                  \___ Unrecognized input



Are there any steps I am missing?

Regards,
Neeraj	 

Re: [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver
Posted by Jonathan Cameron 1 year ago
On Fri, 3 Jan 2025 10:57:22 +0530
Neeraj Kumar <s.neeraj@samsung.com> wrote:

> On 27/11/24 04:34PM, Jonathan Cameron wrote:
> >On Thu, 21 Nov 2024 10:18:41 +0000
> >Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> >  
> >> The CXL specification release 3.2 is now available under a click through at
> >> https://computeexpresslink.org/cxl-specification/ and it brings new
> >> shiny toys.  
> >
> >If anyone wants to play, basic emulation on my CXL QEMU staging tree
> >https://gitlab.com/jic23/qemu/-/commit/e89b35d264c1bcc04807e7afab1254f35ffc8cb9
> >
> >Branch with a few other things on top is:
> >https://gitlab.com/jic23/qemu/-/commits/cxl-2024-11-27
> >
> >Note that this currently doesn't produce real data.  I have a plan
> >/ initial PoC / hack to hook that up via an addition to the QEMU cache
> >plugin and an external tool to emulate the hotness tracker counting
> >hardware. Will be a little while before I get that finished, so in
> >a meantime the above exercises the driver.
> >
> >Jonathan
> >  
> >> More details in the Documentation patch but typical commands are:
> >>
> >> $perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\
> >>  hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,\
> >>  range_size=1024,randomized_downsampling=0,downsampling_factor=32,\
> >>  hotness_granual=12  
> 
> Facing issue while executing perf record on x86 emulation environment using following steps
> 
> 1. Tried applying CHMU Patch on branch cxl-for-6.13 using b4 utility. As
> base commit is not specified, with minor change able to apply patch.
> Compiled kernel with CONFIG_CXL_HMU
> 
> 2. Compiled jic23/cxl-2024-11-27 for x86_64-softmmu
> 
> 3. Launched Qemu with following CXL topology along with compiled kernel
> VM="-object memory-backend-ram,id=vmem1,share=on,size=512M \
>      -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
>      -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
>      -device cxl-type3,bus=root_port13,volatile-memdev=vmem1,id=cxl-vmem1 \
>      -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=8k"
> 
> 4. Created region and onlined this memory. Also run top utility on the newly created 
> numa node using numactl -m<node> top
> 
> 5. Compiled and installed perf utility in qemu environment, and able to
> see cxl_hmu_mem* entries in perf list
> 
> root@QEMUCXL2030mm:~# perf list
> <snip>
>   cxl_hmu_mem0.0.0/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
>   cxl_hmu_mem0.0.1/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
>   cxl_hmu_mem0.0.2/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
>   cxl_hmu_mem1.0.0/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
>   cxl_hmu_mem1.0.1/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
>   cxl_hmu_mem1.0.2/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
>   cxl_pmu_mem0.0/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
>   cxl_pmu_mem0.1/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
>   cxl_pmu_mem1.0/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
>   cxl_pmu_mem1.1/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
> <snip>
> 
> 6. Tried running perf command mentioned in Documentation/trace/cxl-hmu.rst
> 
> root@QEMUCXL2030mm:/home/cxl/cxl-linux-mainline/tools/perf# perf -v
> perf version 6.12.rc5.gc198a4f4a356
> root@QEMUCXL2030mm:/home/cxl/cxl-linux-mainline/tools/perf# perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,range_size=1024,randomized_downsampling=0,downsampling_factor=32,hotness_granual=12
> event syntax error: '..ess_granual=12'
>                                   \___ Unrecognized input

This is probably my mistake when cutting and pasting the example from a terminal.
Add a trailing / and something to run.

 perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,range_size=1024,randomized_downsampling=0,downsampling_factor=32,hotness_granual=12/ -- sleep 10
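To make the shape of the corrected command explicit, here is a minimal Python sketch that assembles the event string with its closing '/' and appends a workload after '--'. The helper name build_event is hypothetical; the PMU name and parameter values are simply the ones from the command above.

```python
import shlex


def build_event(pmu: str, params: dict) -> str:
    """Format a perf event as pmu/key=value,.../ including the closing slash."""
    terms = ",".join(f"{key}={value}" for key, value in params.items())
    return f"{pmu}/{terms}/"


# Parameter names and values copied from the command in this thread.
params = {
    "epoch_type": 0,
    "access_type": 6,
    "hotness_threshold": 1024,
    "epoch_multiplier": 4,
    "epoch_scale": 4,
    "range_base": 0,
    "range_size": 1024,
    "randomized_downsampling": 0,
    "downsampling_factor": 32,
    "hotness_granual": 12,
}

event = build_event("cxl_hmu_mem0.0.0", params)

# perf record needs something to run; "sleep 10" follows the "--" separator.
argv = ["perf", "record", "-a", "-e", event, "--", "sleep", "10"]
print(shlex.join(argv))
```

The sketch only prints the command line; it does not run perf.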

Jonathan

> 
> 
> 
> Are there any steps i am missing?
> 
> Regards,
> Neeraj	 
> 
> >>
> >> $perf report --dump-raw-traces
> >>
> >> Example output.  With a counter_width of 16 (0x10) the least significant
> >> 2 bytes (4 hex digits) are the counter value and the unit index is bits 16-63.
> >> Here all units are over the threshold and the indexes are 0,1,2 etc.
> >>
> >> . ... CXL_HMU data: size 33512 bytes
> >> Header 0: units: 29c counter_width 10
> >> Header 1 : deadbeef
> >> 0000000000000283
> >> 0000000000010364
> >> 0000000000020366
> >> 000000000003033c
> >> 0000000000040343
> >> 00000000000502ff
> >> 000000000006030d
> >> 000000000007031a
> >>
> >> Which will produce a list of hotness entries.
> >> Bits[N-1:0] counter value
> >> Bits[63:N] Unit ID (combine with unit size and DPA base + HDM decoder
> >>   config to get to a Host Physical Address)
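As an illustration of the record layout described above, the following Python sketch splits each 64-bit record into a unit ID and a counter value, using the counter_width from the header (16 in the example dump). The function names are hypothetical, and the DPA calculation is an assumption based on the "combine with unit size and DPA base" remark; the HDM decoder translation to a Host Physical Address is not attempted.

```python
def decode_record(record: int, counter_width: int) -> tuple:
    """Split a hotness record into (unit_id, counter).

    Bits[counter_width-1:0] hold the counter, Bits[63:counter_width] the unit ID.
    """
    counter = record & ((1 << counter_width) - 1)
    unit_id = record >> counter_width
    return unit_id, counter


def unit_dpa(unit_id: int, unit_size: int, dpa_base: int = 0) -> int:
    """Assumed mapping: units are laid out contiguously from dpa_base."""
    return dpa_base + unit_id * unit_size


# Records taken from the example dump above (hex, counter_width 0x10).
raw_records = [
    "0000000000000283",
    "0000000000010364",
    "0000000000020366",
    "000000000003033c",
]

for text in raw_records:
    unit_id, counter = decode_record(int(text, 16), 16)
    # 4096-byte units are an assumption for illustration only.
    print(f"unit {unit_id} counter {counter} dpa {unit_dpa(unit_id, 4096):#x}")
```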
> >>
> >> Specific RFC questions.
> >> - What should be in the header added to the aux buffer.
> >>   Currently just the minimum is provided. Number of records
> >>   and the counter width needed to decode them.
> >> - Should we reset the counters when doing sampling "-F X"
> >>   If the frequency is higher than the epoch we never see any hot units.
> >>   If so, when should we reset them?
> >>
> >> Note testing has been light and on emulation only. As the perf tool is
> >> a pain to build on a stripped back VM, build testing has all been on
> >> arm64 so far.  The driver loads on both arm64 and x86 though, so
> >> any problems are likely in the perf tool arch specific code
> >> which is build tested (on the wrong machine).
> >>
> >> The QEMU emulation needs some cleanup, but I should be able to post
> >> that shortly to let people actually play with this.  There are lots
> >> of open questions there on how 'right' we want the emulation to be
> >> and what counting uarch to emulate.
> >>
> >> Jonathan Cameron (4):
> >>   cxl: Register devices for CXL Hotness Monitoring Units (CHMU)
> >>   cxl: Hotness Monitoring Unit via a Perf AUX Buffer.
> >>   perf: Add support for CXL Hotness Monitoring Units (CHMU)
> >>   hwtrace: Document CXL Hotness Monitoring Unit driver
> >>
> >>  Documentation/trace/cxl-hmu.rst     | 197 +++++++
> >>  Documentation/trace/index.rst       |   1 +
> >>  drivers/cxl/Kconfig                 |   6 +
> >>  drivers/cxl/Makefile                |   3 +
> >>  drivers/cxl/core/Makefile           |   1 +
> >>  drivers/cxl/core/core.h             |   1 +
> >>  drivers/cxl/core/hmu.c              |  64 ++
> >>  drivers/cxl/core/port.c             |   2 +
> >>  drivers/cxl/core/regs.c             |  14 +
> >>  drivers/cxl/cxl.h                   |   5 +
> >>  drivers/cxl/cxlpci.h                |   1 +
> >>  drivers/cxl/hmu.c                   | 880 ++++++++++++++++++++++++++++
> >>  drivers/cxl/hmu.h                   |  23 +
> >>  drivers/cxl/pci.c                   |  26 +-
> >>  tools/perf/arch/arm/util/auxtrace.c |  58 ++
> >>  tools/perf/arch/x86/util/auxtrace.c |  76 +++
> >>  tools/perf/util/Build               |   1 +
> >>  tools/perf/util/auxtrace.c          |   4 +
> >>  tools/perf/util/auxtrace.h          |   1 +
> >>  tools/perf/util/cxl-hmu.c           | 367 ++++++++++++
> >>  tools/perf/util/cxl-hmu.h           |  18 +
> >>  21 files changed, 1748 insertions(+), 1 deletion(-)
> >>  create mode 100644 Documentation/trace/cxl-hmu.rst
> >>  create mode 100644 drivers/cxl/core/hmu.c
> >>  create mode 100644 drivers/cxl/hmu.c
> >>  create mode 100644 drivers/cxl/hmu.h
> >>  create mode 100644 tools/perf/util/cxl-hmu.c
> >>  create mode 100644 tools/perf/util/cxl-hmu.h
> >>  
> >  
>
Re: [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver
Posted by Yuquan Wang 7 months, 3 weeks ago
On Wed, Jan 15, 2025 at 01:42:07PM +0000, Jonathan Cameron wrote:
> On Fri, 3 Jan 2025 10:57:22 +0530
> Neeraj Kumar <s.neeraj@samsung.com> wrote:
> 
> > On 27/11/24 04:34PM, Jonathan Cameron wrote:
> > >On Thu, 21 Nov 2024 10:18:41 +0000
> > >Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > >  
> > >> The CXL specification release 3.2 is now available under a click through at
> > >> https://computeexpresslink.org/cxl-specification/ and it brings new
> > >> shiny toys.  
> > >
> > >If anyone wants to play, basic emulation on my CXL QEMU staging tree
> > >https://gitlab.com/jic23/qemu/-/commit/e89b35d264c1bcc04807e7afab1254f35ffc8cb9
> > >
> > >Branch with a few other things on top is:
> > >https://gitlab.com/jic23/qemu/-/commits/cxl-2024-11-27
> > >
> > >Note that this currently doesn't produce real data.  I have a plan
> > >/ initial PoC / hack to hook that up via an addition to the QEMU cache
> > >plugin and an external tool to emulate the hotness tracker counting
> > >hardware. Will be a little while before I get that finished, so in
> > >the meantime the above exercises the driver.
> > >
> > >Jonathan
> > >  
> > >>
> > >> RFC reason
> > >> - Whilst trace capture with a particular configuration is potentially useful
> > >>   the intent is that CXL HMU units will be used to drive various forms of
> > >>   hotpage migration for memory tiering setups. This driver doesn't do this
> > >>   (yet), but rather provides data capture etc for experimentation and
> > >>   for working out how to mostly put the allocations in the right place to
> > >>   start with by tuning applications.
> > >>
> > >> CXL r3.2 introduces a CXL Hotness Monitoring Unit definition. The intent
> > >> of this is to provide a way to establish which units of memory (typically
> > >> pages or larger) in CXL attached memory are hot. The implementation details
> > >> and algorithm are all implementation defined. The specification simply
> > >> describes the 'interface' which takes the form of a ring buffer of hotness
> > >> records in a PCI BAR and defined capability, configuration and status
> > >> registers.
> > >>
> > >> The hardware may have constraints on what it can track, granularity etc
> > >> and on how accurately it tracks (e.g. counter exhaustion, inaccurate
> > >> trackers). Some of these constraints are discoverable from the hardware
> > >> registers, others such as loss of accuracy have no universally accepted
> > >> measures as they are typically access pattern dependent. Sadly it is
> > >> very unlikely any hardware will implement a truly precise tracker given
> > >> the large resource requirements for tracking at a useful granularity.
> > >>
> > >> There are two fundamental operation modes:
> > >>
> > >> * Epoch based. Counters are checked after a period of time (Epoch) and
> > >>   if over a threshold added to the hotlist.
> > >> * Always on. Counters run until a threshold is reached, after that the
> > >>   hot unit is added to the hotlist and the counter released.
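The two operation modes above can be sketched as a small Python simulation. This is only an illustration of the described behaviour: the real counting algorithm is implementation defined, the function names are hypothetical, and treating "reaching" the threshold as count >= threshold in both modes is an assumption.

```python
from collections import Counter


def epoch_hotlist(accesses_per_epoch, threshold):
    """Epoch based: count over a whole epoch, then report units at or over
    the threshold. Counters start from zero in each epoch."""
    hotlists = []
    for epoch in accesses_per_epoch:
        counts = Counter(epoch)
        hotlists.append(sorted(u for u, c in counts.items() if c >= threshold))
    return hotlists


def always_on_hotlist(accesses, threshold):
    """Always on: a unit is reported as soon as its counter reaches the
    threshold, and its counter is then released."""
    counts = Counter()
    hot = []
    for unit in accesses:
        counts[unit] += 1
        if counts[unit] >= threshold:
            hot.append(unit)
            del counts[unit]  # release the counter for reuse
    return hot


# Unit IDs of accesses; two epochs of four accesses each.
epochs = [[1, 1, 1, 2], [2, 2, 2, 1]]
print(epoch_hotlist(epochs, threshold=3))
print(always_on_hotlist([1, 1, 2, 1, 1, 1, 1, 2], threshold=3))
```

In the always-on case the same unit can appear more than once, because its counter restarts after each report.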
> > >>
> > >> Counting can be filtered on:
> > >>
> > >> * Region of CXL DPA space (256MiB per bit in a bitmap).
> > >> * Type of access - Trusted and non trusted or non trusted only, R/W/RW
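For the DPA range filter, a minimal sketch of turning a byte range into the bitmap, with one bit per 256MiB of DPA space as stated above. The helper name is hypothetical, and expressing the range in bytes is an assumption; how the driver's range_base and range_size parameters map onto this is not stated here.

```python
GRANULE = 256 * 1024 * 1024  # 256MiB of DPA space per bitmap bit


def dpa_range_to_bitmap(base: int, size: int) -> int:
    """Return a bitmap with a bit set for every 256MiB chunk the range touches."""
    if size <= 0:
        return 0
    first = base // GRANULE
    last = (base + size - 1) // GRANULE
    bitmap = 0
    for bit in range(first, last + 1):
        bitmap |= 1 << bit
    return bitmap


# A 512MiB range starting at 256MiB touches chunks 1 and 2.
print(bin(dpa_range_to_bitmap(GRANULE, 2 * GRANULE)))
```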
> > >>
> > >> Sampling can be modified by:
> > >>
> > >> * Downsampling including potentially randomized downsampling.
> > >>
> > >> The driver presented here is intended to be useful in its own right but
> > >> also to act as the first step of a possible path towards hotness monitoring
> > >> based hot page migration. Those steps might look like.
> > >>
> > >> 1. Gather data - drivers provide telemetry like solutions to get that
> > >>    data. May be enhanced, for example in this driver by providing the
> > >>    HPA address rather than DPA Unit Address. Userspace can access enough
> > >>    information to do this so maybe not.
> > >> 2. Userspace algorithm development, possibly combined with userspace
> > >>    triggered migration by PA. Working out how to use different levels
> > >>    of constrained hardware resources will be challenging.
> > >> 3. Move those algorithms into the kernel. Will require generalization
> > >>    across different hotpage trackers etc.
> > >>
> > >> So far this driver just gives access to the raw data. I will probably kick
> > >> off a longer discussion on how to do the adaptive sampling needed to actually
> > >> use these units for tiering etc. sometime soon (if no one else beats
> > >> me to it).  There is a follow-up topic of how to virtualize this stuff
> > >> for memory stranding cases (VM gets a fixed mixture of fast and slow
> > >> memory and should do its own tiering).
> > >>
> > >> More details in the Documentation patch but typical commands are:
> > >>
> > >> $perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\
> > >>  hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,\
> > >>  range_size=1024,randomized_downsampling=0,downsampling_factor=32,\
> > >>  hotness_granual=12  
> > 
> > Facing issue while executing perf record on x86 emulation environment using following steps
> > 
> > 1. Tried applying CHMU Patch on branch cxl-for-6.13 using b4 utility. As
> > base commit is not specified, with minor change able to apply patch.
> > Compiled kernel with CONFIG_CXL_HMU
> > 
> > 2. Compiled jic23/cxl-2024-11-27 for x86_64-softmmu
> > 
> > 3. Launched Qemu with following CXL topology along with compiled kernel
> > VM="-object memory-backend-ram,id=vmem1,share=on,size=512M \
> >      -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> >      -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> >      -device cxl-type3,bus=root_port13,volatile-memdev=vmem1,id=cxl-vmem1 \
> >      -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=8k"
> > 
> > 4. Created region and onlined this memory. Also run top utility on the newly created 
> > numa node using numactl -m<node> top
> > 
> > 5. Compiled and installed perf utility in qemu environment, and able to
> > see cxl_hmu_mem* entries in perf list
> > 
> > root@QEMUCXL2030mm:~# perf list
> > <snip>
> >   cxl_hmu_mem0.0.0/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
> >   cxl_hmu_mem0.0.1/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
> >   cxl_hmu_mem0.0.2/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
> >   cxl_hmu_mem1.0.0/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
> >   cxl_hmu_mem1.0.1/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
> >   cxl_hmu_mem1.0.2/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
> >   cxl_pmu_mem0.0/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
> >   cxl_pmu_mem0.1/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
> >   cxl_pmu_mem1.0/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
> >   cxl_pmu_mem1.1/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
> > <snip>
> > 
> > 6. Tried running perf command mentioned in Documentation/trace/cxl-hmu.rst
> > 
> > root@QEMUCXL2030mm:/home/cxl/cxl-linux-mainline/tools/perf# perf -v
> > perf version 6.12.rc5.gc198a4f4a356
> > root@QEMUCXL2030mm:/home/cxl/cxl-linux-mainline/tools/perf# perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,range_size=1024,randomized_downsampling=0,downsampling_factor=32,hotness_granual=12
> > event syntax error: '..ess_granual=12'
> >                                   \___ Unrecognized input
> 
> This is probably my mistake when cutting and pasting the example from a terminal.
> Add a trailing / and something to run.
> 
>  perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,range_size=1024,randomized_downsampling=0,downsampling_factor=32,hotness_granual=12/ -- sleep 10

Hi Jonathan,

I tried this new command, but perf reports an error.

With the hmu iomap_block size change [1] applied, my steps were as follows:

step1: Create cxl region and online the numa node

root@ubuntu-jammy-arm64:~/tools# numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0 1
node 0 size: 1972 MB
node 0 free: 1694 MB
node 1 cpus: 2 3
node 1 size: 1942 MB
node 1 free: 1690 MB
node 2 cpus:
node 2 size: 256 MB
node 2 free: 256 MB
node distances:
node   0   1   2
  0:  10  20  20
  1:  20  10  20
  2:  20  20  10

step2: Bind this numa node to run 'ls'

root@ubuntu-jammy-arm64:~/tools# numactl -m 2 ls
build   perf.data  ndctl  perf     

root@ubuntu-jammy-arm64:~/tools# numastat 
                           node0           node1           node2
numa_hit                  109323          141170              77
numa_miss                      0               0               0
numa_foreign                   0               0               0
interleave_hit               519             591               0
local_node                108810          139591               0
other_node                   513            1579              77

step3: Use perf tool

root@ubuntu-jammy-arm64:~/tools# perf -v
perf version 6.15.rc5.g2c3e6f60f5cf

root@ubuntu-jammy-arm64:~/tools# perf list | grep -i hmu
  cxl_hmu_mem0.0.0/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw event descriptor]

root@ubuntu-jammy-arm64:~/tools# perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoperf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,range_size=1024,randomized_downsampling=0,downsampling_factor=32,hotness_granual=12/ -- sleep 10

Error:
cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,range_size=1024,randomized_downsampling=0,downsampling_factor=32,hotness_granual=12/H: PMU Hardware doesn't support sampling/overflow-interrupts. Try 'perf stat'

[1]:https://lore.kernel.org/linux-cxl/aFNsFI5OKrD0CWR3@phytium.com.cn/T/#u

Is something wrong with the CHMU interrupts?

> 
> Jonathan
> 
> > 
> > 
> > 
> > Are there any steps i am missing?
> > 
> > Regards,
> > Neeraj	 
> > 
> > >>
> > >> $perf report --dump-raw-traces
> > >>
> > >> Example output.  With a counter_width of 16 (0x10) the least significant
> > >> 2 bytes (4 hex digits) are the counter value and the unit index is bits 16-63.
> > >> Here all units are over the threshold and the indexes are 0,1,2 etc.
> > >>
> > >> . ... CXL_HMU data: size 33512 bytes
> > >> Header 0: units: 29c counter_width 10
> > >> Header 1 : deadbeef
> > >> 0000000000000283
> > >> 0000000000010364
> > >> 0000000000020366
> > >> 000000000003033c
> > >> 0000000000040343
> > >> 00000000000502ff
> > >> 000000000006030d
> > >> 000000000007031a
> > >>
> > >> Which will produce a list of hotness entries.
> > >> Bits[N-1:0] counter value
> > >> Bits[63:N] Unit ID (combine with unit size and DPA base + HDM decoder
> > >>   config to get to a Host Physical Address)
> > >>
> > >> Specific RFC questions.
> > >> - What should be in the header added to the aux buffer.
> > >>   Currently just the minimum is provided. Number of records
> > >>   and the counter width needed to decode them.
> > >> - Should we reset the counters when doing sampling "-F X"
> > >>   If the frequency is higher than the epoch we never see any hot units.
> > >>   If so, when should we reset them?
> > >>
> > >> Note testing has been light and on emulation only. As the perf tool is
> > >> a pain to build on a stripped back VM, build testing has all been on
> > >> arm64 so far.  The driver loads on both arm64 and x86 though, so
> > >> any problems are likely in the perf tool arch specific code
> > >> which is build tested (on the wrong machine).
> > >>
> > >> The QEMU emulation needs some cleanup, but I should be able to post
> > >> that shortly to let people actually play with this.  There are lots
> > >> of open questions there on how 'right' we want the emulation to be
> > >> and what counting uarch to emulate.
> > >>
> > >> Jonathan Cameron (4):
> > >>   cxl: Register devices for CXL Hotness Monitoring Units (CHMU)
> > >>   cxl: Hotness Monitoring Unit via a Perf AUX Buffer.
> > >>   perf: Add support for CXL Hotness Monitoring Units (CHMU)
> > >>   hwtrace: Document CXL Hotness Monitoring Unit driver
> > >>
> > >>  Documentation/trace/cxl-hmu.rst     | 197 +++++++
> > >>  Documentation/trace/index.rst       |   1 +
> > >>  drivers/cxl/Kconfig                 |   6 +
> > >>  drivers/cxl/Makefile                |   3 +
> > >>  drivers/cxl/core/Makefile           |   1 +
> > >>  drivers/cxl/core/core.h             |   1 +
> > >>  drivers/cxl/core/hmu.c              |  64 ++
> > >>  drivers/cxl/core/port.c             |   2 +
> > >>  drivers/cxl/core/regs.c             |  14 +
> > >>  drivers/cxl/cxl.h                   |   5 +
> > >>  drivers/cxl/cxlpci.h                |   1 +
> > >>  drivers/cxl/hmu.c                   | 880 ++++++++++++++++++++++++++++
> > >>  drivers/cxl/hmu.h                   |  23 +
> > >>  drivers/cxl/pci.c                   |  26 +-
> > >>  tools/perf/arch/arm/util/auxtrace.c |  58 ++
> > >>  tools/perf/arch/x86/util/auxtrace.c |  76 +++
> > >>  tools/perf/util/Build               |   1 +
> > >>  tools/perf/util/auxtrace.c          |   4 +
> > >>  tools/perf/util/auxtrace.h          |   1 +
> > >>  tools/perf/util/cxl-hmu.c           | 367 ++++++++++++
> > >>  tools/perf/util/cxl-hmu.h           |  18 +
> > >>  21 files changed, 1748 insertions(+), 1 deletion(-)
> > >>  create mode 100644 Documentation/trace/cxl-hmu.rst
> > >>  create mode 100644 drivers/cxl/core/hmu.c
> > >>  create mode 100644 drivers/cxl/hmu.c
> > >>  create mode 100644 drivers/cxl/hmu.h
> > >>  create mode 100644 tools/perf/util/cxl-hmu.c
> > >>  create mode 100644 tools/perf/util/cxl-hmu.h
> > >>  
> > >  
> > 
>
Re: [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver
Posted by Jonathan Cameron 7 months, 3 weeks ago
On Thu, 19 Jun 2025 11:59:28 +0800
Yuquan Wang <wangyuquan1236@phytium.com.cn> wrote:

> On Wed, Jan 15, 2025 at 01:42:07PM +0000, Jonathan Cameron wrote:
> > On Fri, 3 Jan 2025 10:57:22 +0530
> > Neeraj Kumar <s.neeraj@samsung.com> wrote:
> >   
> > > On 27/11/24 04:34PM, Jonathan Cameron wrote:  
> > > >On Thu, 21 Nov 2024 10:18:41 +0000
> > > >Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > > >    
> > > >> The CXL specification release 3.2 is now available under a click through at
> > > >> https://computeexpresslink.org/cxl-specification/ and it brings new
> > > >> shiny toys.    
> > > >
> > > >If anyone wants to play, basic emulation on my CXL QEMU staging tree
> > > >https://gitlab.com/jic23/qemu/-/commit/e89b35d264c1bcc04807e7afab1254f35ffc8cb9
> > > >
> > > >Branch with a few other things on top is:
> > > >https://gitlab.com/jic23/qemu/-/commits/cxl-2024-11-27
> > > >
> > > >Note that this currently doesn't produce real data.  I have a plan
> > > >/ initial PoC / hack to hook that up via an addition to the QEMU cache
> > > >plugin and an external tool to emulate the hotness tracker counting
> > > >hardware. Will be a little while before I get that finished, so in
> > > >the meantime the above exercises the driver.
> > > >
> > > >Jonathan
> > > >    
> > > >>
> > > >> RFC reason
> > > >> - Whilst trace capture with a particular configuration is potentially useful
> > > >>   the intent is that CXL HMU units will be used to drive various forms of
> > > >>   hotpage migration for memory tiering setups. This driver doesn't do this
> > > >>   (yet), but rather provides data capture etc for experimentation and
> > > >>   for working out how to mostly put the allocations in the right place to
> > > >>   start with by tuning applications.
> > > >>
> > > >> CXL r3.2 introduces a CXL Hotness Monitoring Unit definition. The intent
> > > >> of this is to provide a way to establish which units of memory (typically
> > > >> pages or larger) in CXL attached memory are hot. The implementation details
> > > >> and algorithm are all implementation defined. The specification simply
> > > >> describes the 'interface' which takes the form of a ring buffer of hotness
> > > >> records in a PCI BAR and defined capability, configuration and status
> > > >> registers.
> > > >>
> > > >> The hardware may have constraints on what it can track, granularity etc
> > > >> and on how accurately it tracks (e.g. counter exhaustion, inaccurate
> > > >> trackers). Some of these constraints are discoverable from the hardware
> > > >> registers, others such as loss of accuracy have no universally accepted
> > > >> measures as they are typically access pattern dependent. Sadly it is
> > > >> very unlikely any hardware will implement a truly precise tracker given
> > > >> the large resource requirements for tracking at a useful granularity.
> > > >>
> > > >> There are two fundamental operation modes:
> > > >>
> > > >> * Epoch based. Counters are checked after a period of time (Epoch) and
> > > >>   if over a threshold added to the hotlist.
> > > >> * Always on. Counters run until a threshold is reached, after that the
> > > >>   hot unit is added to the hotlist and the counter released.
> > > >>
> > > >> Counting can be filtered on:
> > > >>
> > > >> * Region of CXL DPA space (256MiB per bit in a bitmap).
> > > >> * Type of access - Trusted and non trusted or non trusted only, R/W/RW
> > > >>
> > > >> Sampling can be modified by:
> > > >>
> > > >> * Downsampling including potentially randomized downsampling.
> > > >>
> > > >> The driver presented here is intended to be useful in its own right but
> > > >> also to act as the first step of a possible path towards hotness monitoring
> > > >> based hot page migration. Those steps might look like.
> > > >>
> > > >> 1. Gather data - drivers provide telemetry like solutions to get that
> > > >>    data. May be enhanced, for example in this driver by providing the
> > > >>    HPA address rather than DPA Unit Address. Userspace can access enough
> > > >>    information to do this so maybe not.
> > > >> 2. Userspace algorithm development, possibly combined with userspace
> > > >>    triggered migration by PA. Working out how to use different levels
> > > >>    of constrained hardware resources will be challenging.
> > > >> 3. Move those algorithms into the kernel. Will require generalization
> > > >>    across different hotpage trackers etc.
> > > >>
> > > >> So far this driver just gives access to the raw data. I will probably kick
> > > >> off a longer discussion on how to do the adaptive sampling needed to actually
> > > >> use these units for tiering etc. sometime soon (if no one else beats
> > > >> me to it).  There is a follow-up topic of how to virtualize this stuff
> > > >> for memory stranding cases (VM gets a fixed mixture of fast and slow
> > > >> memory and should do its own tiering).
> > > >>
> > > >> More details in the Documentation patch but typical commands are:
> > > >>
> > > >> $perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\
> > > >>  hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,\
> > > >>  range_size=1024,randomized_downsampling=0,downsampling_factor=32,\
> > > >>  hotness_granual=12    
> > > 
> > > Facing issue while executing perf record on x86 emulation environment using following steps
> > > 
> > > 1. Tried applying CHMU Patch on branch cxl-for-6.13 using b4 utility. As
> > > base commit is not specified, with minor change able to apply patch.
> > > Compiled kernel with CONFIG_CXL_HMU
> > > 
> > > 2. Compiled jic23/cxl-2024-11-27 for x86_64-softmmu
> > > 
> > > 3. Launched Qemu with following CXL topology along with compiled kernel
> > > VM="-object memory-backend-ram,id=vmem1,share=on,size=512M \
> > >      -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> > >      -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> > >      -device cxl-type3,bus=root_port13,volatile-memdev=vmem1,id=cxl-vmem1 \
> > >      -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=8k"
> > > 
> > > 4. Created region and onlined this memory. Also run top utility on the newly created 
> > > numa node using numactl -m<node> top
> > > 
> > > 5. Compiled and installed perf utility in qemu environment, and able to
> > > see cxl_hmu_mem* entries in perf list
> > > 
> > > root@QEMUCXL2030mm:~# perf list
> > > <snip>
> > >   cxl_hmu_mem0.0.0/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
> > >   cxl_hmu_mem0.0.1/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
> > >   cxl_hmu_mem0.0.2/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
> > >   cxl_hmu_mem1.0.0/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
> > >   cxl_hmu_mem1.0.1/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
> > >   cxl_hmu_mem1.0.2/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
> > >   cxl_pmu_mem0.0/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
> > >   cxl_pmu_mem0.1/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
> > >   cxl_pmu_mem1.0/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
> > >   cxl_pmu_mem1.1/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
> > > <snip>
> > > 
> > > 6. Tried running perf command mentioned in Documentation/trace/cxl-hmu.rst
> > > 
> > > root@QEMUCXL2030mm:/home/cxl/cxl-linux-mainline/tools/perf# perf -v
> > > perf version 6.12.rc5.gc198a4f4a356
> > > root@QEMUCXL2030mm:/home/cxl/cxl-linux-mainline/tools/perf# perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,range_size=1024,randomized_downsampling=0,downsampling_factor=32,hotness_granual=12
> > > event syntax error: '..ess_granual=12'
> > >                                   \___ Unrecognized input  
> > 
> > This is probably my mistake when cutting and pasting the example from a terminal.
> > Add a trailing / and something to run.
> > 
> >  perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,range_size=1024,randomized_downsampling=0,downsampling_factor=32,hotness_granual=12/ -- sleep 10  
> 
> Hi Jonathan,
> 
> I tried this new command, but perf reports an error.
> 
> With the hmu iomap_block size change [1] applied, my steps were as follows:
> 
> step1: Create cxl region and online the numa node
> 
> root@ubuntu-jammy-arm64:~/tools# numactl -H
> available: 3 nodes (0-2)
> node 0 cpus: 0 1
> node 0 size: 1972 MB
> node 0 free: 1694 MB
> node 1 cpus: 2 3
> node 1 size: 1942 MB
> node 1 free: 1690 MB
> node 2 cpus:
> node 2 size: 256 MB
> node 2 free: 256 MB
> node distances:
> node   0   1   2
>   0:  10  20  20
>   1:  20  10  20
>   2:  20  20  10
> 
> step2: Bind this numa node to run 'ls'
> 
> root@ubuntu-jammy-arm64:~/tools# numactl -m 2 ls
> build   perf.data  ndctl  perf     
> 
> root@ubuntu-jammy-arm64:~/tools# numastat 
>                            node0           node1           node2
> numa_hit                  109323          141170              77
> numa_miss                      0               0               0
> numa_foreign                   0               0               0
> interleave_hit               519             591               0
> local_node                108810          139591               0
> other_node                   513            1579              77
> 
> step3: Use perf tool
> 
> root@ubuntu-jammy-arm64:~/tools# perf -v
> perf version 6.15.rc5.g2c3e6f60f5cf
> 
> root@ubuntu-jammy-arm64:~/tools# perf list | grep -i hmu
>   cxl_hmu_mem0.0.0/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw event descriptor]
> 
> root@ubuntu-jammy-arm64:~/tools# perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoperf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,range_size=1024,randomized_downsampling=0,downsampling_factor=32,hotness_granual=12/ -- sleep 10
> 
This cut and paste seems to have half of two commands.  

I'll just grab the second one as what I'm guessing you were running. 

perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,range_size=1024,randomized_downsampling=0,downsampling_factor=32,hotness_granual=12/ -- sleep 10

> Error:
> cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,range_size=1024,randomized_downsampling=0,downsampling_factor=32,hotness_granual=12/H: PMU Hardware doesn't support sampling/overflow-interrupts. Try 'perf stat'
> 
> [1]:https://lore.kernel.org/linux-cxl/aFNsFI5OKrD0CWR3@phytium.com.cn/T/#u
> 
> Is something wrong with the CHMU interrupts?

That error message is annoyingly misleading. I think the perf core assumes
that any 'not supported' error means sampling/overflow interrupts are the problem,
rather than that something else on the command line might not be supported.

It's failing because the emulation doesn't support that downsampling_factor.  The event
format allows it (hence, I think, the perf list output above) but the emulation doesn't.

So for now just don't specify downsampling_factor on the command line. It seems there is
a bug in the emulation around setting it to 1.

val = FIELD_DP64(val, CXL_CHMU0_CAP1, DOWN_SAMPLING_FACTORS, BIT(1))
which should have set bit 0 to indicate no downsampling (the comment above that line is wrong too).


Looking at the spec, it is rather confusing how this is supposed to work.
We have a bitmap that has 16 bits, each of which represents a power of 2, but
then the Down-sampling Factor field of the CHMU configuration register says
bits 99:96 are 'one of the 16 possible values'.  I assume that means the
bit offset.
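
To make that reading concrete, a minimal sketch follows. The helper names
are made up for illustration and come from neither the spec, QEMU nor the
driver. It assumes bit n of the 16 bit capability bitmap means a factor of
2^n is supported, and that the 4 bit configuration field holds n.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helpers, names are not from the spec, QEMU or the driver.
 * Assumption: bit n of the 16 bit capability bitmap means a down-sampling
 * factor of 2^n is supported, and the 4 bit configuration field holds n.
 */

/* True if factor is a power of 2 whose bit is set in the capability bitmap. */
static bool chmu_ds_factor_supported(uint16_t cap_bitmap, uint32_t factor)
{
	unsigned int n = 0;

	if (factor == 0 || (factor & (factor - 1)) != 0)
		return false;	/* not a power of 2 */
	while ((factor >> n) != 1)
		n++;
	if (n > 15)
		return false;	/* does not fit the 16 bit bitmap */
	return (cap_bitmap >> n) & 1;
}

/* Value for the 4 bit field: the bit offset n, or -1 if unsupported. */
static int chmu_ds_factor_to_field(uint16_t cap_bitmap, uint32_t factor)
{
	int n = 0;

	if (!chmu_ds_factor_supported(cap_bitmap, factor))
		return -1;
	while ((factor >> n) != 1)
		n++;
	return n;
}
```

Under this reading a capability of BIT(0) only allows a factor of 1 (no
downsampling), and a factor of 32 would need bit 5 set in the bitmap.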

I'll reply to the qemu thread to point out this bug in the capability.

Jonathan



> 
> > 
> > Jonathan
> >   
> > > 
> > > 
> > > 
> > > Are there any steps i am missing?
> > > 
> > > Regards,
> > > Neeraj	 
> > >   
> > > >>
> > > >> $perf report --dump-raw-traces
> > > >>
> > > >> Example output.  With a counter_width of 16 (0x10) the least significant
> > > >> 16 bits (4 hex digits) are the counter value and the unit index is bits 16-63.
> > > >> Here all units are over the threshold and the indexes are 0,1,2 etc.
> > > >>
> > > >> . ... CXL_HMU data: size 33512 bytes
> > > >> Header 0: units: 29c counter_width 10
> > > >> Header 1 : deadbeef
> > > >> 0000000000000283
> > > >> 0000000000010364
> > > >> 0000000000020366
> > > >> 000000000003033c
> > > >> 0000000000040343
> > > >> 00000000000502ff
> > > >> 000000000006030d
> > > >> 000000000007031a
> > > >>
> > > >> Which will produce a list of hotness entries.
> > > >> Bits[N-1:0] counter value
> > > >> Bits[63:N] Unit ID (combine with unit size and DPA base + HDM decoder
> > > >>   config to get to a Host Physical Address)
> > > >>
> > > >> Specific RFC questions.
> > > >> - What should be in the header added to the aux buffer.
> > > >>   Currently just the minimum is provided. Number of records
> > > >>   and the counter width needed to decode them.
> > > >> - Should we reset the counters when doing sampling "-F X"
> > > >>   If the frequency is higher than the epoch we never see any hot units.
> > > >>   If so, when should we reset them?
> > > >>
> > > >> Note testing has been light and on emulation only + as perf tool is
> > > >> a pain to build on a stripped back VM, build testing has all been on
> > > >> arm64 so far.  The driver loads though on both arm64 and x86 so
> > > >> any problems are likely in the perf tool arch specific code
> > > >> which is build tested (on the wrong machine)
> > > >>
> > > >> The QEMU emulation needs some cleanup, but I should be able to post
> > > >> that shortly to let people actually play with this.  There are lots
> > > >> of open questions there on how 'right' we want the emulation to be
> > > >> and what counting uarch to emulate.
> > > >>
> > > >> Jonathan Cameron (4):
> > > >>   cxl: Register devices for CXL Hotness Monitoring Units (CHMU)
> > > >>   cxl: Hotness Monitoring Unit via a Perf AUX Buffer.
> > > >>   perf: Add support for CXL Hotness Monitoring Units (CHMU)
> > > >>   hwtrace: Document CXL Hotness Monitoring Unit driver
> > > >>
> > > >>  Documentation/trace/cxl-hmu.rst     | 197 +++++++
> > > >>  Documentation/trace/index.rst       |   1 +
> > > >>  drivers/cxl/Kconfig                 |   6 +
> > > >>  drivers/cxl/Makefile                |   3 +
> > > >>  drivers/cxl/core/Makefile           |   1 +
> > > >>  drivers/cxl/core/core.h             |   1 +
> > > >>  drivers/cxl/core/hmu.c              |  64 ++
> > > >>  drivers/cxl/core/port.c             |   2 +
> > > >>  drivers/cxl/core/regs.c             |  14 +
> > > >>  drivers/cxl/cxl.h                   |   5 +
> > > >>  drivers/cxl/cxlpci.h                |   1 +
> > > >>  drivers/cxl/hmu.c                   | 880 ++++++++++++++++++++++++++++
> > > >>  drivers/cxl/hmu.h                   |  23 +
> > > >>  drivers/cxl/pci.c                   |  26 +-
> > > >>  tools/perf/arch/arm/util/auxtrace.c |  58 ++
> > > >>  tools/perf/arch/x86/util/auxtrace.c |  76 +++
> > > >>  tools/perf/util/Build               |   1 +
> > > >>  tools/perf/util/auxtrace.c          |   4 +
> > > >>  tools/perf/util/auxtrace.h          |   1 +
> > > >>  tools/perf/util/cxl-hmu.c           | 367 ++++++++++++
> > > >>  tools/perf/util/cxl-hmu.h           |  18 +
> > > >>  21 files changed, 1748 insertions(+), 1 deletion(-)
> > > >>  create mode 100644 Documentation/trace/cxl-hmu.rst
> > > >>  create mode 100644 drivers/cxl/core/hmu.c
> > > >>  create mode 100644 drivers/cxl/hmu.c
> > > >>  create mode 100644 drivers/cxl/hmu.h
> > > >>  create mode 100644 tools/perf/util/cxl-hmu.c
> > > >>  create mode 100644 tools/perf/util/cxl-hmu.h
> > > >>    
> > > >    
> > >   
> >   
> 
>
Re: [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver
Posted by Gregory Price 1 year, 2 months ago
On Thu, Nov 21, 2024 at 10:18:41AM +0000, Jonathan Cameron wrote:
> The CXL specification release 3.2 is now available under a click through at
> https://computeexpresslink.org/cxl-specification/ and it brings new
> shiny toys.
> 
> RFC reason
> - Whilst trace capture with a particular configuration is potentially useful
>   the intent is that CXL HMU units will be used to drive various forms of
>   hotpage migration for memory tiering setups. This driver doesn't do this
>   (yet), but rather provides data capture etc for experimentation and
>   for working out how to mostly put the allocations in the right place to
>   start with by tuning applications.
> 
> CXL r3.2 introduces a CXL Hotness Monitoring Unit definition. The intent
> of this is to provide a way to establish which units of memory (typically
> pages or larger) in CXL attached memory are hot. The implementation details
> and algorithm are all implementation defined. The specification simply
> describes the 'interface' which takes the form of ring buffer of hotness
> records in a PCI BAR and defined capability, configuration and status
> registers.
> 
> The hardware may have constraints on what it can track, granularity etc
> and on how accurately it tracks (e.g. counter exhaustion, inaccurate
> trackers). Some of these constraints are discoverable from the hardware
> registers, others such as loss of accuracy have no universally accepted
> measures as they are typically access pattern dependent. Sadly it is
> very unlikely any hardware will implement a truly precise tracker given
> the large resource requirements for tracking at a useful granularity.
> 
> There are two fundamental operation modes:
> 
> * Epoch based. Counters are checked after a period of time (Epoch) and
>   if over a threshold added to the hotlist.
> * Always on. Counters run until a threshold is reached, after that the
>   hot unit is added to the hotlist and the counter released.
> 
> Counting can be filtered on:
> 
> * Region of CXL DPA space (256MiB per bit in a bitmap).
> * Type of access - Trusted and non trusted or non trusted only, R/W/RW
> 
> Sampling can be modified by:
> 
> * Downsampling including potentially randomized downsampling.
> 
> The driver presented here is intended to be useful in its own right but
> also to act as the first step of a possible path towards hotness monitoring
> based hot page migration. Those steps might look like.
> 
> 1. Gather data - drivers provide telemetry like solutions to get that
>    data. May be enhanced, for example in this driver by providing the
>    HPA address rather than DPA Unit Address. Userspace can access enough
>    information to do this so maybe not.
> 2. Userspace algorithm development, possibly combined with userspace
>    triggered migration by PA. Working out how to use different levels
>    of constrained hardware resources will be challenging.

FWIW this is what I was thinking about for this extension:

https://lore.kernel.org/all/20240319172609.332900-1-gregory.price@memverge.com/

At least for testing CHMU stuff. So if anyone is poking at testing such
things, they can feel free to use that for prototyping. However, I think
there is general discomfort around userspace handling HPA/DPA.

So it might look more like

echo nr_pages > /sys/.../tiering/nodeN/promote_pages

rather than handling the raw data from the CHMU to make decisions.


> 3. Move those algorithms in kernel. Will require generalization across
>    different hotpage trackers etc.
> 

In a longer discussion with Dan, we considered something a little more
abstract - like a system that monitors bandwidth and memory access stalls
and decides to promote X pages from Y device.  This carries a pretty tall
generalization cost, but it's pretty exciting to say the least.

Definitely worth a discussion for later.

>
> So far this driver just gives access to the raw data. I will probably kick
> off a longer discussion on how to do the adaptive sampling needed to actually
> use these units for tiering etc, sometime soon (if no one else beats
> me to it).  There is a follow up topic of how to virtualize this stuff
> for memory stranding cases (VM gets a fixed mixture of fast and slow
> memory and should do its own tiering).
>

Without having looked at the patches yet, I would presume this interface
is at least gated to admin/root? (raw data is physical address info)

~Gregory
Re: [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver
Posted by Jonathan Cameron 1 year, 2 months ago
On Thu, 21 Nov 2024 09:24:43 -0500
Gregory Price <gourry@gourry.net> wrote:

> On Thu, Nov 21, 2024 at 10:18:41AM +0000, Jonathan Cameron wrote:
> > The CXL specification release 3.2 is now available under a click through at
> > https://computeexpresslink.org/cxl-specification/ and it brings new
> > shiny toys.
> > 
> > RFC reason
> > - Whilst trace capture with a particular configuration is potentially useful
> >   the intent is that CXL HMU units will be used to drive various forms of
> >   hotpage migration for memory tiering setups. This driver doesn't do this
> >   (yet), but rather provides data capture etc for experimentation and
> >   for working out how to mostly put the allocations in the right place to
> >   start with by tuning applications.
> > 
> > CXL r3.2 introduces a CXL Hotness Monitoring Unit definition. The intent
> > of this is to provide a way to establish which units of memory (typically
> > pages or larger) in CXL attached memory are hot. The implementation details
> > and algorithm are all implementation defined. The specification simply
> > describes the 'interface' which takes the form of ring buffer of hotness
> > records in a PCI BAR and defined capability, configuration and status
> > registers.
> > 
> > The hardware may have constraints on what it can track, granularity etc
> > and on how accurately it tracks (e.g. counter exhaustion, inaccurate
> > trackers). Some of these constraints are discoverable from the hardware
> > registers, others such as loss of accuracy have no universally accepted
> > measures as they are typically access pattern dependent. Sadly it is
> > very unlikely any hardware will implement a truly precise tracker given
> > the large resource requirements for tracking at a useful granularity.
> > 
> > There are two fundamental operation modes:
> > 
> > * Epoch based. Counters are checked after a period of time (Epoch) and
> >   if over a threshold added to the hotlist.
> > * Always on. Counters run until a threshold is reached, after that the
> >   hot unit is added to the hotlist and the counter released.
> > 
> > Counting can be filtered on:
> > 
> > * Region of CXL DPA space (256MiB per bit in a bitmap).
> > * Type of access - Trusted and non trusted or non trusted only, R/W/RW
> > 
> > Sampling can be modified by:
> > 
> > * Downsampling including potentially randomized downsampling.
> > 
> > The driver presented here is intended to be useful in its own right but
> > also to act as the first step of a possible path towards hotness monitoring
> > based hot page migration. Those steps might look like.
> > 
> > 1. Gather data - drivers provide telemetry like solutions to get that
> >    data. May be enhanced, for example in this driver by providing the
> >    HPA address rather than DPA Unit Address. Userspace can access enough
> >    information to do this so maybe not.
> > 2. Userspace algorithm development, possibly combined with userspace
> >    triggered migration by PA. Working out how to use different levels
> >    of constrained hardware resources will be challenging.  
> 
> FWIW this is what I was thinking about for this extension:
> 
> https://lore.kernel.org/all/20240319172609.332900-1-gregory.price@memverge.com/

Yup. I had that in mind. Forgot to actually add a link.

> 
> At least for testing CHMU stuff. So if anyone is poking at testing such
> things, they can feel free to use that for prototyping. However, I think
> there is general discomfort around userspace handling HPA/DPA.
> 
> So it might look more like
> 
> echo nr_pages > /sys/.../tiering/nodeN/promote_pages
> 
> rather than handling the raw data from the CHMU to make decisions.

Agreed, but I think we are far away from a point where we can implement that.

Just working out how to tune the hardware to grab useful data is going
to take a while to figure out, let alone doing anything much with it.

Without care you won't get a meaningful signal for what is actually
hot out of the box. Lots of reasons why including:
a) Exhaustion of tracking resources, due to looking at too large a window
   or for too long.  Will probably need some form of auto updating of
   what is being scanned (coarse to fine might work though I'm doubtful,
   scanning across small regions maybe).
b) Threshold too high, no detections.
c) Threshold too low, everything hot.
d) Wrong timescales. Hot is not a well defined thing.
e) Hardware that won't do tracking at fine enough granularity.

> 
> 
> > 3. Move those algorithms in kernel. Will require generalization across
> >    different hotpage trackers etc.
> >   
> 
> In a longer discussion with Dan, we considered something a little more
> abstract - like a system that monitors bandwidth and memory access stalls
> and decides to promote X pages from Y device.  This carries a pretty tall
> generalization cost, but it's pretty exciting to say the least.

Agreed that ultimately we'll end up somewhere like that.
These units are just a small part of what is needed in total.

> 
> Definitely worth a discussion for later.
> 
> >
> > So far this driver just gives access to the raw data. I will probably kick
> > off a longer discussion on how to do the adaptive sampling needed to actually
> > use these units for tiering etc, sometime soon (if no one else beats
> > me to it).  There is a follow up topic of how to virtualize this stuff
> > for memory stranding cases (VM gets a fixed mixture of fast and slow
> > memory and should do its own tiering).
> >  
> 
> Without having looked at the patches yet, I would presume this interface
> is at least gated to admin/root? (raw data is physical address info)

That's certainly the intent. It's not going upstream in this form so
I haven't actually checked yet :)  Uses similar infrastructure to ARM
SPE which can also give physical address info + a lot more than that.

Jonathan



> 
> ~Gregory
>
Re: [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver
Posted by SeongJae Park 1 year, 2 months ago
On Thu, 21 Nov 2024 14:58:52 +0000 Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:

> On Thu, 21 Nov 2024 09:24:43 -0500
> Gregory Price <gourry@gourry.net> wrote:
> 
> > On Thu, Nov 21, 2024 at 10:18:41AM +0000, Jonathan Cameron wrote:
[...]
> Just working out how to tune the hardware to grab useful data is going
> to take a while to figure out, let alone doing anything much with it.
> 
> Without care you won't get a meaningful signal for what is actually
> hot out of the box. Lots of reasons why including:
> a) Exhaustion of tracking resources, due to looking at too large a window
>    or for too long.  Will probably need some form of auto updating of
>    what is being scanned (coarse to fine might work though I'm doubtful,
>    scanning across small regions maybe).
> b) Threshold too high, no detections.
> c) Threshold too low, everything hot.
> d) Wrong timescales. Hot is not a well defined thing.
> e) Hardware that won't do tracking at fine enough granularity.

Similar questions can be raised about hotness monitoring in general, including
that for DAMON.  I'm trying to summarize[1] rules of thumb for DAMON tuning
based on my humble experience.  Once that is done, I will try to automate
the tuning.

In future, hopefully DAMON can be extended to utilize the CXL hotness
monitoring unit as a low level primitive for access checks.  Then the guidance
and automation for DAMON tuning could simply be applied.

Note that I'm not saying DAMON should be the only way to utilize the CXL
hotness monitoring unit.  I'm saying DAMON could be one of the ways :)

[1] https://lore.kernel.org/20241108232536.73843-1-sj@kernel.org


Thanks,
SJ

[...]
Re: [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver
Posted by Gregory Price 1 year, 2 months ago
On Thu, Nov 21, 2024 at 02:58:52PM +0000, Jonathan Cameron wrote:
> On Thu, 21 Nov 2024 09:24:43 -0500
> Gregory Price <gourry@gourry.net> wrote:
> 
> > On Thu, Nov 21, 2024 at 10:18:41AM +0000, Jonathan Cameron wrote:
> > > The CXL specification release 3.2 is now available under a click through at
> > > https://computeexpresslink.org/cxl-specification/ and it brings new
> > > shiny toys.
> > > 
> > > RFC reason
> > > - Whilst trace capture with a particular configuration is potentially useful
> > >   the intent is that CXL HMU units will be used to drive various forms of
> > >   hotpage migration for memory tiering setups. This driver doesn't do this
> > >   (yet), but rather provides data capture etc for experimentation and
> > >   for working out how to mostly put the allocations in the right place to
> > >   start with by tuning applications.
> > > 
> > > CXL r3.2 introduces a CXL Hotness Monitoring Unit definition. The intent
> > > of this is to provide a way to establish which units of memory (typically
> > > pages or larger) in CXL attached memory are hot. The implementation details
> > > and algorithm are all implementation defined. The specification simply
> > > describes the 'interface' which takes the form of ring buffer of hotness
> > > records in a PCI BAR and defined capability, configuration and status
> > > registers.
> > > 
> > > The hardware may have constraints on what it can track, granularity etc
> > > and on how accurately it tracks (e.g. counter exhaustion, inaccurate
> > > trackers). Some of these constraints are discoverable from the hardware
> > > registers, others such as loss of accuracy have no universally accepted
> > > measures as they are typically access pattern dependent. Sadly it is
> > > very unlikely any hardware will implement a truly precise tracker given
> > > the large resource requirements for tracking at a useful granularity.
> > > 
> > > There are two fundamental operation modes:
> > > 
> > > * Epoch based. Counters are checked after a period of time (Epoch) and
> > >   if over a threshold added to the hotlist.
> > > * Always on. Counters run until a threshold is reached, after that the
> > >   hot unit is added to the hotlist and the counter released.
> > > 
> > > Counting can be filtered on:
> > > 
> > > * Region of CXL DPA space (256MiB per bit in a bitmap).
> > > * Type of access - Trusted and non trusted or non trusted only, R/W/RW
> > > 
> > > Sampling can be modified by:
> > > 
> > > * Downsampling including potentially randomized downsampling.
> > > 
> > > The driver presented here is intended to be useful in its own right but
> > > also to act as the first step of a possible path towards hotness monitoring
> > > based hot page migration. Those steps might look like.
> > > 
> > > 1. Gather data - drivers provide telemetry like solutions to get that
> > >    data. May be enhanced, for example in this driver by providing the
> > >    HPA address rather than DPA Unit Address. Userspace can access enough
> > >    information to do this so maybe not.
> > > 2. Userspace algorithm development, possibly combined with userspace
> > >    triggered migration by PA. Working out how to use different levels
> > >    of constrained hardware resources will be challenging.  
> > 
> > FWIW this is what I was thinking about for this extension:
> > 
> > https://lore.kernel.org/all/20240319172609.332900-1-gregory.price@memverge.com/
> 
> Yup. I had that in mind. Forgot to actually add a link.
> 
> > 
> > At least for testing CHMU stuff. So if anyone is poking at testing such
> > things, they can feel free to use that for prototyping. However, I think
> > there is general discomfort around userspace handling HPA/DPA.
> > 
> > So it might look more like
> > 
> > echo nr_pages > /sys/.../tiering/nodeN/promote_pages
> > 
> > rather than handling the raw data from the CHMU to make decisions.
> 
> Agreed, but I think we are far away from a point where we can implement that.
> 
> Just working out how to tune the hardware to grab useful data is going
> to take a while to figure out, let alone doing anything much with it.
> 
> Without care you won't get a meaningful signal for what is actually
> hot out of the box. Lots of reasons why including:
> a) Exhaustion of tracking resources, due to looking at too large a window
>    or for too long.  Will probably need some form of auto updating of
>    what is being scanned (coarse to fine might work though I'm doubtful,
>    scanning across small regions maybe).
> b) Threshold too high, no detections.
> c) Threshold too low, everything hot.
> d) Wrong timescales. Hot is not a well defined thing.
> e) Hardware that won't do tracking at fine enough granularity.
> 

f) How does this even work with interleaving on larger pools :B
   It's pretend-addressing all the way down :D

Lots of conceptually complex and fun questions here.

~Gregory
Re: [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver
Posted by Jonathan Cameron 1 year, 2 months ago
On Thu, 21 Nov 2024 10:18:41 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:

> The CXL specification release 3.2 is now available under a click through at
> https://computeexpresslink.org/cxl-specification/ and it brings new
> shiny toys.
> 
> RFC reason
> - Whilst trace capture with a particular configuration is potentially useful
>   the intent is that CXL HMU units will be used to drive various forms of
>   hotpage migration for memory tiering setups. This driver doesn't do this
>   (yet), but rather provides data capture etc for experimentation and
>   for working out how to mostly put the allocations in the right place to
>   start with by tuning applications.
> 
> CXL r3.2 introduces a CXL Hotness Monitoring Unit definition. The intent
> of this is to provide a way to establish which units of memory (typically
> pages or larger) in CXL attached memory are hot. The implementation details
> and algorithm are all implementation defined. The specification simply
> describes the 'interface' which takes the form of ring buffer of hotness
> records in a PCI BAR and defined capability, configuration and status
> registers.
> 
> The hardware may have constraints on what it can track, granularity etc
> and on how accurately it tracks (e.g. counter exhaustion, inaccurate
> trackers). Some of these constraints are discoverable from the hardware
> registers, others such as loss of accuracy have no universally accepted
> measures as they are typically access pattern dependent. Sadly it is
> very unlikely any hardware will implement a truly precise tracker given
> the large resource requirements for tracking at a useful granularity.
> 
> There are two fundamental operation modes:
> 
> * Epoch based. Counters are checked after a period of time (Epoch) and
>   if over a threshold added to the hotlist.
> * Always on. Counters run until a threshold is reached, after that the
>   hot unit is added to the hotlist and the counter released.
> 
> Counting can be filtered on:
> 
> * Region of CXL DPA space (256MiB per bit in a bitmap).
> * Type of access - Trusted and non trusted or non trusted only, R/W/RW
> 
> Sampling can be modified by:
> 
> * Downsampling including potentially randomized downsampling.
> 
> The driver presented here is intended to be useful in its own right but
> also to act as the first step of a possible path towards hotness monitoring
> based hot page migration. Those steps might look like.
> 
> 1. Gather data - drivers provide telemetry like solutions to get that
>    data. May be enhanced, for example in this driver by providing the
>    HPA address rather than DPA Unit Address. Userspace can access enough
>    information to do this so maybe not.
> 2. Userspace algorithm development, possibly combined with userspace
>    triggered migration by PA. Working out how to use different levels
>    of constrained hardware resources will be challenging.
> 3. Move those algorithms in kernel. Will require generalization across
>    different hotpage trackers etc.
> 
> So far this driver just gives access to the raw data. I will probably kick
> off a longer discussion on how to do the adaptive sampling needed to actually
> use these units for tiering etc, sometime soon (if no one else beats
> me to it).  There is a follow up topic of how to virtualize this stuff
> for memory stranding cases (VM gets a fixed mixture of fast and slow
> memory and should do its own tiering).
> 
> More details in the Documentation patch but typical commands are:
> 
> $perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\
>  hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,\
>  range_size=1024,randomized_downsampling=0,downsampling_factor=32,\
>  hotness_granual=12
> 
> $perf report --dump-raw-traces
> 
> Example output.  With a counter_width of 16 (0x10) the least significant
> 16 bits (4 hex digits) are the counter value and the unit index is bits 16-63.
> Here all units are over the threshold and the indexes are 0,1,2 etc.
> 
> . ... CXL_HMU data: size 33512 bytes
> Header 0: units: 29c counter_width 10
> Header 1 : deadbeef
> 0000000000000283
> 0000000000010364
> 0000000000020366
> 000000000003033c
> 0000000000040343
> 00000000000502ff
> 000000000006030d
> 000000000007031a
> 
> Which will produce a list of hotness entries.
> Bits[N-1:0] counter value
> Bits[63:N] Unit ID (combine with unit size and DPA base + HDM decoder
>   config to get to a Host Physical Address)
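
For anyone decoding these by hand, a minimal sketch of the split described
above. The struct and function names are invented for illustration and are
not the perf tool code; it assumes 0 < counter_width < 64.

```c
#include <stdint.h>

/* Hypothetical decoded form of one 64 bit hotness record. */
struct hmu_record {
	uint64_t unit_id;	/* Bits[63:N] */
	uint64_t count;		/* Bits[N-1:0] */
};

/*
 * Split a raw record using the counter_width (N) from the header.
 * Assumes 0 < counter_width < 64.
 */
static struct hmu_record hmu_decode_record(uint64_t raw,
					   unsigned int counter_width)
{
	struct hmu_record rec;

	rec.count = raw & ((UINT64_C(1) << counter_width) - 1);
	rec.unit_id = raw >> counter_width;
	return rec;
}
```

With counter_width 0x10, the record 0000000000010364 from the dump decodes
to unit 1 with a count of 0x364.
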
> 
> Specific RFC questions.
> - What should be in the header added to the aux buffer.
>   Currently just the minimum is provided. Number of records
>   and the counter width needed to decode them.
> - Should we reset the counters when doing sampling "-F X"
>   If the frequency is higher than the epoch we never see any hot units.
>   If so, when should we reset them?
> 
> Note testing has been light and on emulation only + as perf tool is
> a pain to build on a stripped back VM, build testing has all been on
> arm64 so far.  The driver loads though on both arm64 and x86 so
> any problems are likely in the perf tool arch specific code
> which is build tested (on the wrong machine)

FWIW, runs on x86. However, it triggers a lockdep warning in
both start and stop due to the spin lock. Something to tidy up for
RFCv2.

J

> 
> The QEMU emulation needs some cleanup, but I should be able to post
> that shortly to let people actually play with this.  There are lots
> of open questions there on how 'right' we want the emulation to be
> and what counting uarch to emulate.
> 
> Jonathan Cameron (4):
>   cxl: Register devices for CXL Hotness Monitoring Units (CHMU)
>   cxl: Hotness Monitoring Unit via a Perf AUX Buffer.
>   perf: Add support for CXL Hotness Monitoring Units (CHMU)
>   hwtrace: Document CXL Hotness Monitoring Unit driver
> 
>  Documentation/trace/cxl-hmu.rst     | 197 +++++++
>  Documentation/trace/index.rst       |   1 +
>  drivers/cxl/Kconfig                 |   6 +
>  drivers/cxl/Makefile                |   3 +
>  drivers/cxl/core/Makefile           |   1 +
>  drivers/cxl/core/core.h             |   1 +
>  drivers/cxl/core/hmu.c              |  64 ++
>  drivers/cxl/core/port.c             |   2 +
>  drivers/cxl/core/regs.c             |  14 +
>  drivers/cxl/cxl.h                   |   5 +
>  drivers/cxl/cxlpci.h                |   1 +
>  drivers/cxl/hmu.c                   | 880 ++++++++++++++++++++++++++++
>  drivers/cxl/hmu.h                   |  23 +
>  drivers/cxl/pci.c                   |  26 +-
>  tools/perf/arch/arm/util/auxtrace.c |  58 ++
>  tools/perf/arch/x86/util/auxtrace.c |  76 +++
>  tools/perf/util/Build               |   1 +
>  tools/perf/util/auxtrace.c          |   4 +
>  tools/perf/util/auxtrace.h          |   1 +
>  tools/perf/util/cxl-hmu.c           | 367 ++++++++++++
>  tools/perf/util/cxl-hmu.h           |  18 +
>  21 files changed, 1748 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/trace/cxl-hmu.rst
>  create mode 100644 drivers/cxl/core/hmu.c
>  create mode 100644 drivers/cxl/hmu.c
>  create mode 100644 drivers/cxl/hmu.h
>  create mode 100644 tools/perf/util/cxl-hmu.c
>  create mode 100644 tools/perf/util/cxl-hmu.h
>
Re: [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver
Posted by Jonathan Cameron 1 year ago
On Thu, 21 Nov 2024 10:18:41 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:

> The CXL specification release 3.2 is now available under a click through at
> https://computeexpresslink.org/cxl-specification/ and it brings new
> shiny toys.

PoC of qemu plugin based approach to getting real data:
https://lore.kernel.org/qemu-devel/20250124172905.84099-1-Jonathan.Cameron@huawei.com/

Also available on gitlab.com/jic23/qemu cxl-2025-01-24

I have a minor revision to this driver to post after a spec clarification on
the units of one parameter. Will do that next week but for now that PoC
ignores the parameter anyway :)

Hopefully a cleaned up version of the above will provide us with a useful
platform for algorithm and framework development. I've posted it early as
there is a fundamental question for the QEMU maintainers on how they
would prefer the plugin interaction to be done - once that's resolved it's
just a question of wiring up more parameters and providing additional
options for the actual counting implementation. For now it's an oracle
with 32 bit counters for every page.  Hardware folk tell me that might be
a 'little too expensive'!

It's not even that slow :)
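
Rough arithmetic on what a counter per page costs, as a sketch only. The
function name is made up, and the 4KiB unit size (hotness_granual=12, as in
the example command) is an assumption rather than something the PoC mandates.

```c
#include <stdint.h>

/*
 * Hypothetical helper: bytes of counter storage needed to give every
 * tracked unit its own counter. unit_shift is log2 of the unit size in
 * bytes (12 for 4KiB pages, an assumption here).
 */
static uint64_t hmu_oracle_counter_bytes(uint64_t mem_bytes,
					 unsigned int unit_shift,
					 unsigned int counter_bytes)
{
	uint64_t units = mem_bytes >> unit_shift;

	return units * counter_bytes;
}
```

For 1GiB of memory, 4KiB units and 4 byte counters that is 262144 counters,
so 1MiB of counter state per GiB tracked.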

Jonathan
> 
> RFC reason
> - Whilst trace capture with a particular configuration is potentially useful
>   the intent is that CXL HMU units will be used to drive various forms of
>   hotpage migration for memory tiering setups. This driver doesn't do this
>   (yet), but rather provides data capture etc for experimentation and
>   for working out how to mostly put the allocations in the right place to
>   start with by tuning applications.
> 
> CXL r3.2 introduces a CXL Hotness Monitoring Unit definition. The intent
> of this is to provide a way to establish which units of memory (typically
> pages or larger) in CXL attached memory are hot. The implementation details
> and algorithm are all implementation defined. The specification simply
> describes the 'interface' which takes the form of ring buffer of hotness
> records in a PCI BAR and defined capability, configuration and status
> registers.
> 
> The hardware may have constraints on what it can track, granularity etc
> and on how accurately it tracks (e.g. counter exhaustion, inaccurate
> trackers). Some of these constraints are discoverable from the hardware
> registers, others such as loss of accuracy have no universally accepted
> measures as they are typically access pattern dependent. Sadly it is
> very unlikely any hardware will implement a truly precise tracker given
> the large resource requirements for tracking at a useful granularity.
> 
> There are two fundamental operation modes:
> 
> * Epoch based. Counters are checked after a period of time (Epoch) and
>   if over a threshold added to the hotlist.
> * Always on. Counters run until a threshold is reached, after that the
>   hot unit is added to the hotlist and the counter released.
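
The two modes above can be sketched as a toy software model. This is only an
illustration of the description, not of any real device: the epoch is
measured in accesses rather than time, the threshold test is assumed to be
">=" in both modes, and the function names are hypothetical.

```python
from collections import Counter


def epoch_based(accesses, epoch_len, threshold):
    """Check all counters at the end of each epoch, then start afresh.

    accesses: unit indexes in time order, one entry per access.
    epoch_len: epoch length in accesses (stand-in for a time period).
    Accesses in a trailing partial epoch are never reported.
    """
    hotlist, counts = [], Counter()
    for i, unit in enumerate(accesses, start=1):
        counts[unit] += 1
        if i % epoch_len == 0:
            hotlist += [(u, c) for u, c in sorted(counts.items())
                        if c >= threshold]
            counts.clear()  # assumption: counters restart every epoch
    return hotlist


def always_on(accesses, threshold):
    """Report a unit as soon as its counter reaches the threshold."""
    hotlist, counts = [], Counter()
    for unit in accesses:
        counts[unit] += 1
        if counts[unit] >= threshold:
            hotlist.append((unit, counts[unit]))
            del counts[unit]  # counter released for reuse
    return hotlist
```

With the same access stream, the always-on model reports a unit whose
accesses straddle an epoch boundary, whereas the epoch model can miss it
because the count is split across two epochs.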
> 
> Counting can be filtered on:
> 
> * Region of CXL DPA space (256MiB per bit in a bitmap).
> * Type of access - Trusted and non trusted or non trusted only, R/W/RW
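
As an illustration of the region filter, a minimal sketch of building the
bitmap for a DPA range follows. It assumes bit i covers DPA
[i * 256MiB, (i + 1) * 256MiB); the bit ordering and the helper name are
assumptions, so check the specification before relying on them.

```python
REGION_SIZE = 256 << 20  # 256 MiB of DPA space per bitmap bit


def range_bitmap(dpa_base: int, size: int) -> int:
    """Return a bitmap with a bit set for every 256 MiB region touched.

    Assumption: bit i covers DPA [i * 256 MiB, (i + 1) * 256 MiB).
    """
    if size <= 0:
        return 0
    first = dpa_base // REGION_SIZE
    last = (dpa_base + size - 1) // REGION_SIZE
    return ((1 << (last - first + 1)) - 1) << first


if __name__ == "__main__":
    # First 1 GiB of DPA space spans four 256 MiB regions.
    print(bin(range_bitmap(0, 1 << 30)))
```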
> 
> Sampling can be modified by:
> 
> * Downsampling including potentially randomized downsampling.
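
A rough model of what downsampling means for the counters: only about one
access in every `factor` is counted, either at a fixed stride or at random.
How the hardware encodes the factor (plain value or a power of two) is not
stated here, so `factor` below is simply the assumed ratio and the function
name is hypothetical.

```python
import random


def downsample(accesses, factor, randomized=False, rng=None):
    """Yield roughly one in `factor` of the accesses.

    randomized=False keeps every factor-th access (fixed stride).
    randomized=True keeps each access with probability 1/factor.
    """
    rng = rng or random.Random()
    for i, unit in enumerate(accesses):
        if randomized:
            keep = rng.randrange(factor) == 0
        else:
            keep = i % factor == 0
        if keep:
            yield unit
```

Fixed-stride sampling can alias with periodic access patterns, which is the
usual argument for offering a randomized variant.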
> 
> The driver presented here is intended to be useful in its own right but
> also to act as the first step of a possible path towards hotness monitoring
> based hot page migration. Those steps might look like:
> 
> 1. Gather data - drivers provide telemetry like solutions to get that
>    data. May be enhanced, for example in this driver by providing the
>    HPA address rather than DPA Unit Address. Userspace can access enough
>    information to do this so maybe not.
> 2. Userspace algorithm development, possibly combined with userspace
>    triggered migration by PA. Working out how to use different levels
>    of constrained hardware resources will be challenging.
> 3. Move those algorithms into the kernel. This will require generalization
>    across different hotpage trackers etc.
> 
> So far this driver just gives access to the raw data. I will probably kick
> off a longer discussion on how to do the adaptive sampling needed to actually
> use these units for tiering etc, sometime soon (if no one else beats
> me to it).  There is a follow up topic of how to virtualize this stuff
> for memory stranding cases (VM gets a fixed mixture of fast and slow
> memory and should do its own tiering).
> 
> More details in the Documentation patch but typical commands are:
> 
> $ perf record -a -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\
> hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,\
> range_size=1024,randomized_downsampling=0,downsampling_factor=32,\
> hotness_granual=12/
> 
> $ perf report --dump-raw-trace
> 
> Example output.  With a counter_width of 16 (0x10) the least significant
> 16 bits (4 hex digits) are the counter value and the unit index is bits 16-63.
> Here all units are over the threshold and the indexes are 0,1,2 etc.
> 
> . ... CXL_HMU data: size 33512 bytes
> Header 0: units: 29c counter_width 10
> Header 1 : deadbeef
> 0000000000000283
> 0000000000010364
> 0000000000020366
> 000000000003033c
> 0000000000040343
> 00000000000502ff
> 000000000006030d
> 000000000007031a
> 
> Which will produce a list of hotness entries.
> Bits[N-1:0] counter value
> Bits[63:N] Unit ID (combine with unit size and DPA base + HDM decoder
>   config to get to a Host Physical Address)
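
A minimal sketch of decoding those records in userspace follows. The bit
split is as described above. The `unit_to_dpa` helper is hypothetical: it
assumes the unit size is 2^hotness_granual bytes (4 KiB for the example
command) and stops at a DPA, since going to an HPA needs the HDM decoder
configuration.

```python
def decode_record(raw: int, counter_width: int):
    """Split a 64-bit hotness record into (unit_id, counter).

    Bits[counter_width-1:0] hold the counter, Bits[63:counter_width] the
    unit ID.
    """
    counter = raw & ((1 << counter_width) - 1)
    unit_id = raw >> counter_width
    return unit_id, counter


def unit_to_dpa(unit_id: int, dpa_base: int = 0, unit_shift: int = 12) -> int:
    """Hypothetical helper: DPA of the first byte of a unit.

    Assumes units are 2**unit_shift bytes and numbered from dpa_base.
    """
    return dpa_base + (unit_id << unit_shift)


if __name__ == "__main__":
    # First three records from the dump above, counter_width 0x10.
    dump = ["0000000000000283", "0000000000010364", "0000000000020366"]
    for line in dump:
        unit, count = decode_record(int(line, 16), 0x10)
        print(f"unit {unit} dpa {unit_to_dpa(unit):#x} count {count}")
```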
> 
> Specific RFC questions:
> - What should be in the header added to the aux buffer?
>   Currently just the minimum is provided: the number of records
>   and the counter width needed to decode them.
> - Should we reset the counters when doing sampling ("-F X")?
>   If the frequency is higher than the epoch we never see any hot units.
>   If so, when should we reset them?
> 
> Note testing has been light and on emulation only + as the perf tool is
> a pain to build on a stripped back VM, build testing has all been on
> arm64 so far.  The driver loads though on both arm64 and x86 so
> any problems are likely in the perf tool arch specific code
> which is build tested (on the wrong machine).
> 
> The QEMU emulation needs some cleanup, but I should be able to post
> that shortly to let people actually play with this.  There are lots
> of open questions there on how 'right' we want the emulation to be
> and what counting uarch to emulate.
> 
> Jonathan Cameron (4):
>   cxl: Register devices for CXL Hotness Monitoring Units (CHMU)
>   cxl: Hotness Monitoring Unit via a Perf AUX Buffer.
>   perf: Add support for CXL Hotness Monitoring Units (CHMU)
>   hwtrace: Document CXL Hotness Monitoring Unit driver