[v5] perf stat affinity changes

[PATCH v5 3/3] perf stat: Add no-affinity flag

Posted by Ian Rogers 2 months, 3 weeks ago

Add flag that disables affinity behavior. Using sched_setaffinity to
place a perf thread on a CPU can avoid certain interprocessor
interrupts but may introduce a delay due to the scheduling,
particularly on loaded machines. Add a command line option to disable
the behavior. This behavior is less present in other tools like `perf
record`, as it uses a ring buffer and doesn't make repeated system
calls.

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/Documentation/perf-stat.txt | 4 ++++
 tools/perf/builtin-stat.c              | 6 ++++++
 tools/perf/util/evlist.c               | 6 +-----
 tools/perf/util/evlist.h               | 1 +
 4 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
index 1a766d4a2233..1ffb510606af 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -382,6 +382,10 @@ color the metric's computed value.
 Don't print output, warnings or messages. This is useful with perf stat
 record below to only write data to the perf.data file.
 
+--no-affinity::
+Don't change scheduler affinities when iterating over CPUs. Disables
+an optimization aimed at minimizing interprocessor interrupts.
+
 STAT RECORD
 -----------
 Stores stat data into perf data file.
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index aec93b91fd11..709e4bcea398 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -2415,6 +2415,7 @@ static int parse_tpebs_mode(const struct option *opt, const char *str,
 int cmd_stat(int argc, const char **argv)
 {
 	struct opt_aggr_mode opt_mode = {};
+	bool affinity = true, affinity_set = false;
 	struct option stat_options[] = {
 		OPT_BOOLEAN('T', "transaction", &transaction_run,
 			"hardware transaction statistics"),
@@ -2543,6 +2544,8 @@ int cmd_stat(int argc, const char **argv)
 			"don't print 'summary' for CSV summary output"),
 		OPT_BOOLEAN(0, "quiet", &quiet,
 			"don't print any output, messages or warnings (useful with record)"),
+		OPT_BOOLEAN_SET(0, "affinity", &affinity, &affinity_set,
+			"don't allow affinity optimizations aimed at reducing IPIs"),
 		OPT_CALLBACK(0, "cputype", &evsel_list, "hybrid cpu type",
 			"Only enable events on applying cpu with this type "
 			"for hybrid platform (e.g. core or atom)",
@@ -2600,6 +2603,9 @@ int cmd_stat(int argc, const char **argv)
 	} else
 		stat_config.csv_sep = DEFAULT_SEPARATOR;
 
+	if (affinity_set)
+		evsel_list->no_affinity = !affinity;
+
 	if (argc && strlen(argv[0]) > 2 && strstarts("record", argv[0])) {
 		argc = __cmd_record(stat_options, &opt_mode, argc, argv);
 		if (argc < 0)
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index b6df81b8a236..53c8e974de8b 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -368,11 +368,7 @@ static bool evlist__use_affinity(struct evlist *evlist)
 	struct perf_cpu_map *used_cpus = NULL;
 	bool ret = false;
 
-	/*
-	 * With perf record core.user_requested_cpus is usually NULL.
-	 * Use the old method to handle this for now.
-	 */
-	if (!evlist->core.user_requested_cpus ||
+	if (evlist->no_affinity || !evlist->core.user_requested_cpus ||
 	    cpu_map__is_dummy(evlist->core.user_requested_cpus))
 		return false;
 
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index b4604c3f03d6..c7ba0e0b2219 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -59,6 +59,7 @@ struct event_enable_timer;
 struct evlist {
 	struct perf_evlist core;
 	bool		 enabled;
+	bool		 no_affinity;
 	int		 id_pos;
 	int		 is_pos;
 	int		 nr_br_cntr;
-- 
2.52.0.rc1.455.g30608eb744-goog

Re: [PATCH v5 3/3] perf stat: Add no-affinity flag

Posted by Andi Kleen 2 months, 3 weeks ago

On Tue, Nov 18, 2025 at 01:13:26PM -0800, Ian Rogers wrote:
> Add flag that disables affinity behavior. Using sched_setaffinity to
> place a perf thread on a CPU can avoid certain interprocessor
> interrupts but may introduce a delay due to the scheduling,
> particularly on loaded machines. Add a command line option to disable
> the behavior. This behavior is less present in other tools like `perf
> record`, as it uses a ring buffer and doesn't make repeated system
> calls.

Like I wrote earlier, a much better fix for starvation is to use
real-time priority instead of bringing back the old IPI storms with
this flag.

-Andi
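
For context, the affinity behavior described in the quoted commit message can be sketched in C as follows. This is only an illustration of the general technique (pin the thread to a CPU, do the per-CPU work there, restore the mask), not perf's implementation. The names `sketch_cpu_fn`, `sketch_count_cpu` and `sketch_for_each_cpu_pinned` are hypothetical, and the sketch assumes Linux with glibc.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical callback type: work to run while pinned to `cpu`. */
typedef void (*sketch_cpu_fn)(int cpu, void *arg);

/* Example callback: counts how many CPUs were visited. */
static void sketch_count_cpu(int cpu, void *arg)
{
	(void)cpu;
	++*(int *)arg;
}

/*
 * Pin the calling thread to each CPU of its current affinity mask in turn,
 * run fn while pinned, then restore the original mask. Work done from the
 * target CPU itself does not need an IPI to reach that CPU, but every
 * sched_setaffinity() call may wait on the scheduler.
 *
 * Returns the number of CPUs visited, or -1 if the mask could not be read
 * or restored. CPUs that cannot be pinned are skipped.
 */
static int sketch_for_each_cpu_pinned(sketch_cpu_fn fn, void *arg)
{
	cpu_set_t saved, one;
	int visited = 0;

	assert(fn != NULL);
	if (sched_getaffinity(0, sizeof(saved), &saved) != 0)
		return -1;

	for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
		if (!CPU_ISSET(cpu, &saved))
			continue;
		CPU_ZERO(&one);
		CPU_SET(cpu, &one);
		if (sched_setaffinity(0, sizeof(one), &one) != 0)
			continue;
		fn(cpu, arg);
		visited++;
	}

	if (sched_setaffinity(0, sizeof(saved), &saved) != 0)
		return -1;
	return visited;
}
```

Each loop iteration pays for one migration; on a loaded machine that migration is where the scheduling delay mentioned in the commit message would show up.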

Re: [PATCH v5 3/3] perf stat: Add no-affinity flag

Posted by Ian Rogers 2 months, 3 weeks ago

On Tue, Nov 18, 2025 at 3:19 PM Andi Kleen <ak@linux.intel.com> wrote:
>
> On Tue, Nov 18, 2025 at 01:13:26PM -0800, Ian Rogers wrote:
> > Add flag that disables affinity behavior. Using sched_setaffinity to
> > place a perf thread on a CPU can avoid certain interprocessor
> > interrupts but may introduce a delay due to the scheduling,
> > particularly on loaded machines. Add a command line option to disable
> > the behavior. This behavior is less present in other tools like `perf
> > record`, as it uses a ring buffer and doesn't make repeated system
> > calls.
>
> Like I wrote earlier, a much better fix for starvation is to use
> real-time priority instead of bringing back the old IPI storms with
> this flag.

Ack. This only adds the flag to perf stat; are the storms as much
of an issue there? Patch 2 of 3 changes it so that for a single event
we still use affinities, where a dummy and an event count as >1 event.
We have specific examples of loaded machines where the scheduling
latency causes broken metrics - the flag at least allows investigation
of issues like this. I don't mind reviewing a patch adding real-time
priorities as an option.

Thanks,
Ian

Re: [PATCH v5 3/3] perf stat: Add no-affinity flag

Posted by Andi Kleen 2 months, 3 weeks ago

> Ack. This is only adding the flag to perf stat, are the storms as much
> of an issue there? Patch 2 of 3 changes it so that for a single event
> we still use affinities, where a dummy and an event count as >1 event.

Not sure I follow here. I thought you disabled it completely?

> We have specific examples of loaded machines where the scheduling
> latency causes broken metrics - the flag at least allows investigation
> of issues like this. I don't mind reviewing a patch adding real time
> priorities as an option.

You don't need a new flag. Just run perf with real time priority with
any standard wrapper tool, like chrt. The main obstacle is that you may
need the capability to do that though.

-Andi
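
As a rough illustration of what a wrapper like chrt does before running the command, the C sketch below switches the calling process to the SCHED_FIFO real-time policy with sched_setscheduler(). The helper name `sketch_set_fifo` is hypothetical. On Linux the call normally fails with EPERM unless the process has CAP_SYS_NICE or a suitable RLIMIT_RTPRIO, which is the capability obstacle mentioned above.

```c
#include <errno.h>
#include <sched.h>

/*
 * Hypothetical helper: move the calling process to SCHED_FIFO at the given
 * priority. Returns 0 on success or a negative errno value on failure.
 */
static int sketch_set_fifo(int priority)
{
	struct sched_param param = { .sched_priority = priority };

	/* Linux accepts priorities 1..99 for SCHED_FIFO; reject others early. */
	if (priority < 1 || priority > 99)
		return -EINVAL;

	/* pid 0 means the calling process. */
	if (sched_setscheduler(0, SCHED_FIFO, &param) != 0)
		return -errno;

	return 0;
}
```

A wrapper would call this and then exec the target program, which inherits the scheduling policy across exec.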

Re: [PATCH v5 3/3] perf stat: Add no-affinity flag

Posted by Ian Rogers 2 months, 3 weeks ago

On Wed, Nov 19, 2025 at 7:37 AM Andi Kleen <ak@linux.intel.com> wrote:
>
> > Ack. This is only adding the flag to perf stat, are the storms as much
> > of an issue there? Patch 2 of 3 changes it so that for a single event
> > we still use affinities, where a dummy and an event count as >1 event.
>
> Not sure I follow here. I thought you disabled it completely?

No. With `perf stat` we may have a single event whose
open/enable/disable/read/close needs to run on a particular CPU.
If we have a group of events, for a metric, the group may turn into a
single syscall that reads the group. What I've done is make it so that
in the case of a single syscall, rather than changing the affinity
through all the CPUs, we just take the hit of the IPI. If 2 syscalls
are needed (for open/enable/...) then we use the affinity mechanism.
The key function is evlist__use_affinity, which tries to work out
whether more than 1 IPI would be needed, where >1 IPI means use
affinities. I suspect the true threshold for when we should use
IPIs probably isn't >1, but I'm hoping that this saving is obviously
true and we can change the number later. Maybe we can do some io_uring
thing in the longer term to batch up all these changes and let the
kernel worry about optimizing them.

The previous affinity code wasn't used for events in per-thread mode,
but when trying to use that more widely I found bugs in its iteration.
So I did a bigger re-engineering that is in patch 2 now. The code
tries to spot the grouping case, and to ignore certain kinds of events
like retirement latency and tool events that don't benefit from the
affinity mechanism regardless of what their CPU mask is saying.

> > We have specific examples of loaded machines where the scheduling
> > latency causes broken metrics - the flag at least allows investigation
> > of issues like this. I don't mind reviewing a patch adding real time
> > priorities as an option.
>
> You don't need a new flag. Just run perf with real time priority with
> any standard wrapper tool, like chrt. The main obstacle is that you may
> need the capability to do that though.

Ack. Thanks!

Ian
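
The threshold described in the message above can be modelled in C roughly as follows. This is a simplified sketch, not the code in patch 2. `struct sketch_event`, its fields and `sketch_use_affinity` are hypothetical names invented for this example, and the real evlist__use_affinity works on perf's own evlist and CPU map types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an event; not perf's struct evsel. */
struct sketch_event {
	bool is_leader; /* group leader, or an event that is not in a group */
	bool benefits;  /* false for e.g. tool or retirement latency events */
};

/*
 * Assumed model: every leader that benefits from affinity costs one syscall
 * (and so potentially one IPI) per CPU, while group members are covered by
 * their leader's syscall. Affinity is used only when more than one such
 * syscall is needed and the user has not passed --no-affinity.
 */
static bool sketch_use_affinity(const struct sketch_event *evs, size_t n,
				bool no_affinity)
{
	size_t syscalls = 0;

	assert(evs != NULL || n == 0);
	if (no_affinity)
		return false;

	for (size_t i = 0; i < n; i++) {
		if (!evs[i].benefits || !evs[i].is_leader)
			continue;
		if (++syscalls > 1)
			return true;
	}
	return false;
}
```

Under this model a lone event or a single group yields false (take the IPI), while two independent events, such as a dummy plus an event, yield true.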