[v13] x86,fs/resctrl telemetry monitoring

[PATCH v13 32/32] x86,fs/resctrl: Update documentation for telemetry events

Posted by Tony Luck 3 months, 1 week ago

Update resctrl filesystem documentation with the details about the
resctrl files that support telemetry events.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 Documentation/filesystems/resctrl.rst | 102 +++++++++++++++++++++++---
 1 file changed, 90 insertions(+), 12 deletions(-)

diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
index b7f35b07876a..ea4995f402c9 100644
--- a/Documentation/filesystems/resctrl.rst
+++ b/Documentation/filesystems/resctrl.rst
@@ -168,13 +168,12 @@ with respect to allocation:
 			bandwidth percentages are directly applied to
 			the threads running on the core
 
-If RDT monitoring is available there will be an "L3_MON" directory
+If L3 monitoring is available there will be an "L3_MON" directory
 with the following files:
 
 "num_rmids":
-		The number of RMIDs available. This is the
-		upper bound for how many "CTRL_MON" + "MON"
-		groups can be created.
+		The number of RMIDs supported by hardware for
+		L3 monitoring events.
 
 "mon_features":
 		Lists the monitoring events if
@@ -400,6 +399,24 @@ with the following files:
 		bytes) at which a previously used LLC_occupancy
 		counter can be considered for re-use.
 
+If telemetry monitoring is available there will be an "PERF_PKG_MON" directory
+with the following files:
+
+"num_rmids":
+		The number of RMIDs for telemetry monitoring events. By default,
+		resctrl will not enable telemetry events of a particular type
+		("perf" or "energy") if the number of RMIDs supported for that
+		type is lower than the number of RMIDs supported by hardware
+		for L3 monitoring events. The user can force-enable each type
+		of telemetry events with the "rdt=" boot command line option,
+		but this may reduce the number of "MON" groups that can be created.
+
+"mon_features":
+		Lists the telemetry monitoring events that are enabled on this system.
+
+The upper bound for how many "CTRL_MON" + "MON" can be created
+is the smaller of the L3_MON and PERF_PKG_MON "num_rmids" values.
+
 Finally, in the top level of the "info" directory there is a file
 named "last_cmd_status". This is reset with every "command" issued
 via the file system (making new directories or writing to any of the
@@ -505,15 +522,40 @@ When control is enabled all CTRL_MON groups will also contain:
 When monitoring is enabled all MON groups will also contain:
 
 "mon_data":
-	This contains a set of files organized by L3 domain and by
-	RDT event. E.g. on a system with two L3 domains there will
-	be subdirectories "mon_L3_00" and "mon_L3_01".	Each of these
-	directories have one file per event (e.g. "llc_occupancy",
-	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
-	files provide a read out of the current value of the event for
-	all tasks in the group. In CTRL_MON groups these files provide
-	the sum for all tasks in the CTRL_MON group and all tasks in
+	This contains directories for each monitor domain.
+
+	If L3 monitoring is enabled, there will be a "mon_L3_XX" directory for
+	each instance of an L3 cache. Each directory contains files for the enabled
+	L3 events (e.g. "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes").
+
+	If telemetry monitoring is enabled, there will be a "mon_PERF_PKG_YY"
+	directory for each physical processor package. Each directory contains
+	files for the enabled telemetry events (e.g. "core_energy". "activity",
+	"uops_retired", etc.)
+
+	The info/`*`/mon_features files provide the full list of enabled
+	event/file names.
+
+	"core energy" reports a floating point number for the energy (in Joules)
+	consumed by cores (registers, arithmetic units, TLB and L1/L2 caches)
+	during execution of instructions summed across all logical CPUs on a
+	package for the current monitoring group.
+
+	"activity" also reports a floating point value (in Farads).  This provides
+	an estimate of work done independent of the frequency that the CPUs used
+	for execution.
+
+	Note that "core energy" and "activity" only measure energy/activity in the
+	"core" of the CPU (arithmetic units, TLB, L1 and L2 caches, etc.). They
+	do not include L3 cache, memory, I/O devices etc.
+
+	All other events report decimal integer values.
+
+	In a MON group these files provide a read out of the current value of
+	the event for all tasks in the group. In CTRL_MON groups these files
+	provide the sum for all tasks in the CTRL_MON group and all tasks in
 	MON groups. Please see example section for more details on usage.
+
 	On systems with Sub-NUMA Cluster (SNC) enabled there are extra
 	directories for each node (located within the "mon_L3_XX" directory
 	for the L3 cache they occupy). These are named "mon_sub_L3_YY"
@@ -1506,6 +1548,42 @@ Example with C::
     resctrl_release_lock(fd);
   }
 
+Debugfs
+=======
+In addition to the use of debugfs for tracing of pseudo-locking performance,
+architecture code may create debugfs directories associated with monitoring
+features for a specific resource.
+
+The full pathname for these is in the form:
+
+    /sys/kernel/debug/resctrl/info/{resource_name}_MON/{arch}/
+
+The presence, names, and format of these files may vary between architectures
+even if the same resource is present.
+
+PERF_PKG_MON/x86_64
+-------------------
+Three files are present per telemetry aggregator instance that show status.
+The prefix of each file name describes the type ("energy" or "perf") which
+processor package it belongs to, and the instance number of the aggregator.
+For example: "energy_pkg1_agg2".
+
+The suffix describes which data is reported in the file and
+is one of:
+
+data_loss_count:
+	This counts the number of times that this aggregator
+	failed to accumulate a counter value supplied by a CPU.
+
+data_loss_timestamp:
+	This is a "timestamp" from a free running 25MHz uncore
+	timer indicating when the most recent data loss occurred.
+
+last_update_timestamp:
+	Another 25MHz timestamp indicating when the
+	most recent counter update was successfully applied.
+
+
 Examples for RDT Monitoring along with allocation usage
 =======================================================
 Reading monitored data
-- 
2.51.0

Re: [PATCH v13 32/32] x86,fs/resctrl: Update documentation for telemetry events

Posted by Reinette Chatre 2 months, 3 weeks ago

Hi Tony,

On 10/29/25 9:21 AM, Tony Luck wrote:
> @@ -400,6 +399,24 @@ with the following files:
>  		bytes) at which a previously used LLC_occupancy
>  		counter can be considered for re-use.
>  
> +If telemetry monitoring is available there will be an "PERF_PKG_MON" directory

"an" -> "a"?

> +with the following files:
> +
> +"num_rmids":
> +		The number of RMIDs for telemetry monitoring events. By default,
> +		resctrl will not enable telemetry events of a particular type
> +		("perf" or "energy") if the number of RMIDs supported for that
> +		type is lower than the number of RMIDs supported by hardware
> +		for L3 monitoring events. The user can force-enable each type

It is not clear to me how the number of L3 monitoring events is relevant here. This
is addressed later with the "The upper bound for how many "CTRL_MON" + "MON" can be
created ...", no?
How about something like: "if the number of RMIDs that can be tracked concurrently
for that type is lower than the total number of RMIDs supported by that type."?
(I am sure it can be improved)


> +		of telemetry events with the "rdt=" boot command line option,
> +		but this may reduce the number of "MON" groups that can be created.

Since this includes "CTRL_MON" and "MON" groups it may be simpler to just say "monitoring
groups".
	> +
> +"mon_features":
> +		Lists the telemetry monitoring events that are enabled on this system.
> +
> +The upper bound for how many "CTRL_MON" + "MON" can be created
> +is the smaller of the L3_MON and PERF_PKG_MON "num_rmids" values.
> +
>  Finally, in the top level of the "info" directory there is a file
>  named "last_cmd_status". This is reset with every "command" issued
>  via the file system (making new directories or writing to any of the

...

> +Debugfs
> +=======
> +In addition to the use of debugfs for tracing of pseudo-locking performance,
> +architecture code may create debugfs directories associated with monitoring
> +features for a specific resource.
> +
> +The full pathname for these is in the form:
> +
> +    /sys/kernel/debug/resctrl/info/{resource_name}_MON/{arch}/
> +
> +The presence, names, and format of these files may vary between architectures
> +even if the same resource is present.
> +
> +PERF_PKG_MON/x86_64
> +-------------------
> +Three files are present per telemetry aggregator instance that show status.
> +The prefix of each file name describes the type ("energy" or "perf") which
> +processor package it belongs to, and the instance number of the aggregator.
> +For example: "energy_pkg1_agg2".
> +
> +The suffix describes which data is reported in the file and
> +is one of:

(nit: unnecessary line break)

> +
> +data_loss_count:
> +	This counts the number of times that this aggregator
> +	failed to accumulate a counter value supplied by a CPU.
> +
> +data_loss_timestamp:
> +	This is a "timestamp" from a free running 25MHz uncore
> +	timer indicating when the most recent data loss occurred.
> +
> +last_update_timestamp:
> +	Another 25MHz timestamp indicating when the
> +	most recent counter update was successfully applied.
> +
> +
>  Examples for RDT Monitoring along with allocation usage
>  =======================================================
>  Reading monitored data

Reinette