[PATCH v7 0/6] perf stat affinity changes

Ian Rogers posted 6 patches 3 days, 20 hours ago
There is a newer version of this series
[PATCH v7 0/6] perf stat affinity changes
Posted by Ian Rogers 3 days, 20 hours ago
Change how affinities work with evlist__for_each_cpu. Move the
affinity code into the iterator to simplify setting it up. Detect when
affinities will and won't be profitable; for example, a tool event and
a regular perf event (or read group) may face less delay from a single
IPI for the event read than from a call to sched_setaffinity. Add a
--no-affinity flag to perf stat to allow affinities to be disabled.

v7: Revert "perf tool_pmu: More accurately set the cpus for tool
    events" that caused issues with user specified CPUs (Andres Freund
    <andres@anarazel.de>). Fix a null test in prepare_metric so that
    missing events can't trigger segfaults (Andres Freund). Make the
    CPU map propagation improve the CPU maps for tool events that only
    read on index 0; this allows later setting when
    evlist__create_maps is called with the correct user CPUs. Rebase
    previous non-merged affinity changes that hadn't been picked up
    yet.

v6: Drop merged tool event change. Move TPEBS fix into its own patch
    1st.
https://lore.kernel.org/lkml/20260108212652.768875-1-irogers@google.com/

v5: Drop merged changes. Move tool event reading to first
    patch. Change --no-affinity flag to --affinity/--no-affinity flag.
https://lore.kernel.org/lkml/20251118211326.1840989-1-irogers@google.com/
    On v5 there was discussion with Andi Kleen, who pointed out that
    affinities will work better with real time priorities, but using
    those requires privileges.

v4: Rebase. Add patch to reduce scope of walltime_nsec_stats now that
    the legacy metric code is no more. Minor tweak to the ru_stats
    clean up.
https://lore.kernel.org/lkml/20251113180517.44096-1-irogers@google.com/

v3: Add affinity clean ups and read tool events last.
https://lore.kernel.org/lkml/20251106071241.141234-1-irogers@google.com/

v2: Fixed an aggregation index issue:
https://lore.kernel.org/lkml/20251104234148.3103176-2-irogers@google.com/

v1:
https://lore.kernel.org/lkml/20251104053449.1208800-1-irogers@google.com/

Ian Rogers (6):
  Revert "perf tool_pmu: More accurately set the cpus for tool events"
  perf stat-shadow: In prepare_metric fix guard on reading NULL
    perf_stat_evsel
  perf evlist: Special map propagation for tool events that read on 1
    CPU
  perf evlist: Missing TPEBS close in evlist__close
  perf evlist: Reduce affinity use and move into iterator, fix no
    affinity
  perf stat: Add no-affinity flag

 tools/lib/perf/evlist.c                 |  36 +++++-
 tools/lib/perf/include/internal/evsel.h |   2 +
 tools/perf/Documentation/perf-stat.txt  |   4 +
 tools/perf/builtin-stat.c               | 114 ++++++++---------
 tools/perf/util/evlist.c                | 156 +++++++++++++++---------
 tools/perf/util/evlist.h                |  27 ++--
 tools/perf/util/parse-events.c          |  10 +-
 tools/perf/util/pmu.c                   |  23 ++++
 tools/perf/util/pmu.h                   |   3 +
 tools/perf/util/stat-shadow.c           |   7 +-
 tools/perf/util/tool_pmu.c              |  19 ---
 tools/perf/util/tool_pmu.h              |   1 -
 12 files changed, 237 insertions(+), 165 deletions(-)

-- 
2.53.0.rc2.204.g2597b5adb4-goog
Re: [PATCH v7 0/6] perf stat affinity changes
Posted by Arnaldo Carvalho de Melo 21 hours ago
On Tue, Feb 03, 2026 at 02:51:23PM -0800, Ian Rogers wrote:
> Change how affinities work with evlist__for_each_cpu. Move the
> affinity code into the iterator to simplify setting it up. Detect when
> affinities will and won't be profitable, for example a tool event and
> a regular perf event (or read group) may face less delay from a single
> IPI for the event read than from a call to sched_setaffinity. Add a
>  --no-affinity flag to perf stat to allow affinities to be disabled.
> 
> v7: Revert "perf tool_pmu: More accurately set the cpus for tool
>     events" that caused issues with user specified CPUs (Andres Freund
>     <andres@anarazel.de>). Fix a null test is prepare_metric so that
>     missing events can't trigger segfaults (Andres Freund). Make the
>     CPU map propagation improve the CPU maps for tool events that only
>     read on index 0, this allows later setting when
>     evlist__create_maps is called with the correct user CPUs. Rebase
>     previous non-merged affinity changes that hadn't been picked up
>     yet.

Not applying to tmp.perf-tools-next, can you please check?

- Arnaldo
 
> v6: Drop merged tool event change. Move TPEBS fix into its own patch
>     1st.
> https://lore.kernel.org/lkml/20260108212652.768875-1-irogers@google.com/
> 
> v5: Drop merged changes. Move tool event reading to first
>     patch. Change --no-affinity flag to --affinity/--no-affinity flag.
> https://lore.kernel.org/lkml/20251118211326.1840989-1-irogers@google.com/
>     On v5 there was discussion with Andi Kleen who points out that
>     affinities will work better with real time priorities but using
>     this requires privileges.
> 
> v4: Rebase. Add patch to reduce scope of walltime_nsec_stats now that
>     the legacy metric code is no more. Minor tweak to the ru_stats
>     clean up.
> https://lore.kernel.org/lkml/20251113180517.44096-1-irogers@google.com/
> 
> v3: Add affinity clean ups and read tool events last.
> https://lore.kernel.org/lkml/20251106071241.141234-1-irogers@google.com/
> 
> v2: Fixed an aggregation index issue:
> https://lore.kernel.org/lkml/20251104234148.3103176-2-irogers@google.com/
> 
> v1:
> https://lore.kernel.org/lkml/20251104053449.1208800-1-irogers@google.com/
> 
> Ian Rogers (6):
>   Revert "perf tool_pmu: More accurately set the cpus for tool events"
>   perf stat-shadow: In prepare_metric fix guard on reading NULL
>     perf_stat_evsel
>   perf evlist: Special map propagation for tool events that read on 1
>     CPU
>   perf evlist: Missing TPEBS close in evlist__close
>   perf evlist: Reduce affinity use and move into iterator, fix no
>     affinity
>   perf stat: Add no-affinity flag
> 
>  tools/lib/perf/evlist.c                 |  36 +++++-
>  tools/lib/perf/include/internal/evsel.h |   2 +
>  tools/perf/Documentation/perf-stat.txt  |   4 +
>  tools/perf/builtin-stat.c               | 114 ++++++++---------
>  tools/perf/util/evlist.c                | 156 +++++++++++++++---------
>  tools/perf/util/evlist.h                |  27 ++--
>  tools/perf/util/parse-events.c          |  10 +-
>  tools/perf/util/pmu.c                   |  23 ++++
>  tools/perf/util/pmu.h                   |   3 +
>  tools/perf/util/stat-shadow.c           |   7 +-
>  tools/perf/util/tool_pmu.c              |  19 ---
>  tools/perf/util/tool_pmu.h              |   1 -
>  12 files changed, 237 insertions(+), 165 deletions(-)
> 
> -- 
> 2.53.0.rc2.204.g2597b5adb4-goog
Re: [PATCH v7 0/6] perf stat affinity changes
Posted by Ian Rogers 20 hours ago
On Fri, Feb 6, 2026 at 1:35 PM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
>
> On Tue, Feb 03, 2026 at 02:51:23PM -0800, Ian Rogers wrote:
> > Change how affinities work with evlist__for_each_cpu. Move the
> > affinity code into the iterator to simplify setting it up. Detect when
> > affinities will and won't be profitable, for example a tool event and
> > a regular perf event (or read group) may face less delay from a single
> > IPI for the event read than from a call to sched_setaffinity. Add a
> >  --no-affinity flag to perf stat to allow affinities to be disabled.
> >
> > v7: Revert "perf tool_pmu: More accurately set the cpus for tool
> >     events" that caused issues with user specified CPUs (Andres Freund
> >     <andres@anarazel.de>). Fix a null test is prepare_metric so that
> >     missing events can't trigger segfaults (Andres Freund). Make the
> >     CPU map propagation improve the CPU maps for tool events that only
> >     read on index 0, this allows later setting when
> >     evlist__create_maps is called with the correct user CPUs. Rebase
> >     previous non-merged affinity changes that hadn't been picked up
> >     yet.
>
> Not applying to tmp.perf-tools-next, can you please check?

Will send a rebase on tmp.perf-tools-next shortly. Could I interest
you in some second servings of build improvements while you wait? :-)
https://lore.kernel.org/lkml/20260203164323.3447826-1-irogers@google.com/

Thanks,
Ian

> - Arnaldo
>
> > v6: Drop merged tool event change. Move TPEBS fix into its own patch
> >     1st.
> > https://lore.kernel.org/lkml/20260108212652.768875-1-irogers@google.com/
> >
> > v5: Drop merged changes. Move tool event reading to first
> >     patch. Change --no-affinity flag to --affinity/--no-affinity flag.
> > https://lore.kernel.org/lkml/20251118211326.1840989-1-irogers@google.com/
> >     On v5 there was discussion with Andi Kleen who points out that
> >     affinities will work better with real time priorities but using
> >     this requires privileges.
> >
> > v4: Rebase. Add patch to reduce scope of walltime_nsec_stats now that
> >     the legacy metric code is no more. Minor tweak to the ru_stats
> >     clean up.
> > https://lore.kernel.org/lkml/20251113180517.44096-1-irogers@google.com/
> >
> > v3: Add affinity clean ups and read tool events last.
> > https://lore.kernel.org/lkml/20251106071241.141234-1-irogers@google.com/
> >
> > v2: Fixed an aggregation index issue:
> > https://lore.kernel.org/lkml/20251104234148.3103176-2-irogers@google.com/
> >
> > v1:
> > https://lore.kernel.org/lkml/20251104053449.1208800-1-irogers@google.com/
> >
> > Ian Rogers (6):
> >   Revert "perf tool_pmu: More accurately set the cpus for tool events"
> >   perf stat-shadow: In prepare_metric fix guard on reading NULL
> >     perf_stat_evsel
> >   perf evlist: Special map propagation for tool events that read on 1
> >     CPU
> >   perf evlist: Missing TPEBS close in evlist__close
> >   perf evlist: Reduce affinity use and move into iterator, fix no
> >     affinity
> >   perf stat: Add no-affinity flag
> >
> >  tools/lib/perf/evlist.c                 |  36 +++++-
> >  tools/lib/perf/include/internal/evsel.h |   2 +
> >  tools/perf/Documentation/perf-stat.txt  |   4 +
> >  tools/perf/builtin-stat.c               | 114 ++++++++---------
> >  tools/perf/util/evlist.c                | 156 +++++++++++++++---------
> >  tools/perf/util/evlist.h                |  27 ++--
> >  tools/perf/util/parse-events.c          |  10 +-
> >  tools/perf/util/pmu.c                   |  23 ++++
> >  tools/perf/util/pmu.h                   |   3 +
> >  tools/perf/util/stat-shadow.c           |   7 +-
> >  tools/perf/util/tool_pmu.c              |  19 ---
> >  tools/perf/util/tool_pmu.h              |   1 -
> >  12 files changed, 237 insertions(+), 165 deletions(-)
> >
> > --
> > 2.53.0.rc2.204.g2597b5adb4-goog
Re: [PATCH v7 0/6] perf stat affinity changes
Posted by Arnaldo Carvalho de Melo 20 hours ago
On Fri, Feb 06, 2026 at 02:01:37PM -0800, Ian Rogers wrote:
> On Fri, Feb 6, 2026 at 1:35 PM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> >
> > On Tue, Feb 03, 2026 at 02:51:23PM -0800, Ian Rogers wrote:
> > > Change how affinities work with evlist__for_each_cpu. Move the
> > > affinity code into the iterator to simplify setting it up. Detect when
> > > affinities will and won't be profitable, for example a tool event and
> > > a regular perf event (or read group) may face less delay from a single
> > > IPI for the event read than from a call to sched_setaffinity. Add a
> > >  --no-affinity flag to perf stat to allow affinities to be disabled.
> > >
> > > v7: Revert "perf tool_pmu: More accurately set the cpus for tool
> > >     events" that caused issues with user specified CPUs (Andres Freund
> > >     <andres@anarazel.de>). Fix a null test is prepare_metric so that
> > >     missing events can't trigger segfaults (Andres Freund). Make the
> > >     CPU map propagation improve the CPU maps for tool events that only
> > >     read on index 0, this allows later setting when
> > >     evlist__create_maps is called with the correct user CPUs. Rebase
> > >     previous non-merged affinity changes that hadn't been picked up
> > >     yet.
> >
> > Not applying to tmp.perf-tools-next, can you please check?
> 
> Will send a rebase on tmp.perf-tools-next shortly.

Thanks!

> Could I interest you in some second servings of build improvements
> while you wait? :-)
> https://lore.kernel.org/lkml/20260203164323.3447826-1-irogers@google.com/

Applied it locally for testing,

- Arnaldo
 
> Thanks,
> Ian
> 
> > - Arnaldo
> >
> > > v6: Drop merged tool event change. Move TPEBS fix into its own patch
> > >     1st.
> > > https://lore.kernel.org/lkml/20260108212652.768875-1-irogers@google.com/
> > >
> > > v5: Drop merged changes. Move tool event reading to first
> > >     patch. Change --no-affinity flag to --affinity/--no-affinity flag.
> > > https://lore.kernel.org/lkml/20251118211326.1840989-1-irogers@google.com/
> > >     On v5 there was discussion with Andi Kleen who points out that
> > >     affinities will work better with real time priorities but using
> > >     this requires privileges.
> > >
> > > v4: Rebase. Add patch to reduce scope of walltime_nsec_stats now that
> > >     the legacy metric code is no more. Minor tweak to the ru_stats
> > >     clean up.
> > > https://lore.kernel.org/lkml/20251113180517.44096-1-irogers@google.com/
> > >
> > > v3: Add affinity clean ups and read tool events last.
> > > https://lore.kernel.org/lkml/20251106071241.141234-1-irogers@google.com/
> > >
> > > v2: Fixed an aggregation index issue:
> > > https://lore.kernel.org/lkml/20251104234148.3103176-2-irogers@google.com/
> > >
> > > v1:
> > > https://lore.kernel.org/lkml/20251104053449.1208800-1-irogers@google.com/
> > >
> > > Ian Rogers (6):
> > >   Revert "perf tool_pmu: More accurately set the cpus for tool events"
> > >   perf stat-shadow: In prepare_metric fix guard on reading NULL
> > >     perf_stat_evsel
> > >   perf evlist: Special map propagation for tool events that read on 1
> > >     CPU
> > >   perf evlist: Missing TPEBS close in evlist__close
> > >   perf evlist: Reduce affinity use and move into iterator, fix no
> > >     affinity
> > >   perf stat: Add no-affinity flag
> > >
> > >  tools/lib/perf/evlist.c                 |  36 +++++-
> > >  tools/lib/perf/include/internal/evsel.h |   2 +
> > >  tools/perf/Documentation/perf-stat.txt  |   4 +
> > >  tools/perf/builtin-stat.c               | 114 ++++++++---------
> > >  tools/perf/util/evlist.c                | 156 +++++++++++++++---------
> > >  tools/perf/util/evlist.h                |  27 ++--
> > >  tools/perf/util/parse-events.c          |  10 +-
> > >  tools/perf/util/pmu.c                   |  23 ++++
> > >  tools/perf/util/pmu.h                   |   3 +
> > >  tools/perf/util/stat-shadow.c           |   7 +-
> > >  tools/perf/util/tool_pmu.c              |  19 ---
> > >  tools/perf/util/tool_pmu.h              |   1 -
> > >  12 files changed, 237 insertions(+), 165 deletions(-)
> > >
> > > --
> > > 2.53.0.rc2.204.g2597b5adb4-goog
[PATCH v8 0/6] perf stat affinity changes
Posted by Ian Rogers 20 hours ago
Change how affinities work with evlist__for_each_cpu. Move the
affinity code into the iterator to simplify setting it up. Detect when
affinities will and won't be profitable; for example, a tool event and
a regular perf event (or read group) may face less delay from a single
IPI for the event read than from a call to sched_setaffinity. Add a
--no-affinity flag to perf stat to allow affinities to be disabled.

v8: Rebase, due to minor conflict with:
https://lore.kernel.org/lkml/20260203230733.1474840-1-ctshao@google.com/

v7: Revert "perf tool_pmu: More accurately set the cpus for tool
    events" that caused issues with user specified CPUs (Andres Freund
    <andres@anarazel.de>). Fix a null test in prepare_metric so that
    missing events can't trigger segfaults (Andres Freund). Make the
    CPU map propagation improve the CPU maps for tool events that only
    read on index 0; this allows later setting when
    evlist__create_maps is called with the correct user CPUs. Rebase
    previous non-merged affinity changes that hadn't been picked up
    yet.
https://lore.kernel.org/lkml/20260203225129.4077140-1-irogers@google.com/

v6: Drop merged tool event change. Move TPEBS fix into its own patch
    1st.
https://lore.kernel.org/lkml/20260108212652.768875-1-irogers@google.com/

v5: Drop merged changes. Move tool event reading to first
    patch. Change --no-affinity flag to --affinity/--no-affinity flag.
https://lore.kernel.org/lkml/20251118211326.1840989-1-irogers@google.com/
    On v5 there was discussion with Andi Kleen, who pointed out that
    affinities will work better with real time priorities, but using
    those requires privileges.

v4: Rebase. Add patch to reduce scope of walltime_nsec_stats now that
    the legacy metric code is no more. Minor tweak to the ru_stats
    clean up.
https://lore.kernel.org/lkml/20251113180517.44096-1-irogers@google.com/

v3: Add affinity clean ups and read tool events last.
https://lore.kernel.org/lkml/20251106071241.141234-1-irogers@google.com/

v2: Fixed an aggregation index issue:
https://lore.kernel.org/lkml/20251104234148.3103176-2-irogers@google.com/

v1:
https://lore.kernel.org/lkml/20251104053449.1208800-1-irogers@google.com/

Ian Rogers (6):
  Revert "perf tool_pmu: More accurately set the cpus for tool events"
  perf stat-shadow: In prepare_metric fix guard on reading NULL
    perf_stat_evsel
  perf evlist: Special map propagation for tool events that read on 1
    CPU
  perf evlist: Missing TPEBS close in evlist__close
  perf evlist: Reduce affinity use and move into iterator, fix no
    affinity
  perf stat: Add no-affinity flag

 tools/lib/perf/evlist.c                 |  36 +++++-
 tools/lib/perf/include/internal/evsel.h |   2 +
 tools/perf/Documentation/perf-stat.txt  |   4 +
 tools/perf/builtin-stat.c               | 114 ++++++++---------
 tools/perf/util/evlist.c                | 156 +++++++++++++++---------
 tools/perf/util/evlist.h                |  27 ++--
 tools/perf/util/parse-events.c          |  10 +-
 tools/perf/util/pmu.c                   |  23 ++++
 tools/perf/util/pmu.h                   |   3 +
 tools/perf/util/stat-shadow.c           |  24 ++--
 tools/perf/util/tool_pmu.c              |  19 ---
 tools/perf/util/tool_pmu.h              |   1 -
 12 files changed, 249 insertions(+), 170 deletions(-)

-- 
2.53.0.rc2.204.g2597b5adb4-goog
[PATCH v8 1/6] Revert "perf tool_pmu: More accurately set the cpus for tool events"
Posted by Ian Rogers 20 hours ago
This reverts commit d8d8a0b3603a9a8fa207cf9e4f292e81dc5d1008.

The setting of a user CPU map can cause an empty intersection when
combined with CPU 0, and the event is then removed. This later
triggers a segv in the stat-shadow logic. Let's put back a full online
CPU map for now by reverting this patch.
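
As a rough illustration of the failure mode (not code from this patch;
the CPU numbers are made up):

	/* Tool event pinned to CPU 0 vs. a user requested CPU (-C 1): */
	struct perf_cpu_map *tool_cpus = perf_cpu_map__new_int(0);
	struct perf_cpu_map *user_cpus = perf_cpu_map__new_int(1);
	struct perf_cpu_map *res = perf_cpu_map__intersect(tool_cpus, user_cpus);

	/* res is empty, so the event ends up being dropped and stat-shadow
	 * later reads a missing perf_stat_evsel. */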

Reported-by: Andres Freund <andres@anarazel.de>
Closes: https://lore.kernel.org/linux-perf-users/cgja46br2smmznxs7kbeabs6zgv3b4olfqgh2fdp5mxk2yom4v@w6jjgov6hdi6/
Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/util/parse-events.c |  9 ++-------
 tools/perf/util/tool_pmu.c     | 19 -------------------
 tools/perf/util/tool_pmu.h     |  1 -
 3 files changed, 2 insertions(+), 27 deletions(-)

diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index d4647ded340f..f631bf7a919f 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -30,7 +30,6 @@
 #include "util/event.h"
 #include "util/bpf-filter.h"
 #include "util/stat.h"
-#include "util/tool_pmu.h"
 #include "util/util.h"
 #include "tracepoint.h"
 #include <api/fs/tracing_path.h>
@@ -230,12 +229,8 @@ __add_event(struct list_head *list, int *idx,
 	if (pmu) {
 		is_pmu_core = pmu->is_core;
 		pmu_cpus = perf_cpu_map__get(pmu->cpus);
-		if (perf_cpu_map__is_empty(pmu_cpus)) {
-			if (perf_pmu__is_tool(pmu))
-				pmu_cpus = tool_pmu__cpus(attr);
-			else
-				pmu_cpus = cpu_map__online();
-		}
+		if (perf_cpu_map__is_empty(pmu_cpus))
+			pmu_cpus = cpu_map__online();
 	} else {
 		is_pmu_core = (attr->type == PERF_TYPE_HARDWARE ||
 			       attr->type == PERF_TYPE_HW_CACHE);
diff --git a/tools/perf/util/tool_pmu.c b/tools/perf/util/tool_pmu.c
index 37c4eae0bef1..6a9df3dc0e07 100644
--- a/tools/perf/util/tool_pmu.c
+++ b/tools/perf/util/tool_pmu.c
@@ -2,7 +2,6 @@
 #include "cgroup.h"
 #include "counts.h"
 #include "cputopo.h"
-#include "debug.h"
 #include "evsel.h"
 #include "pmu.h"
 #include "print-events.h"
@@ -14,7 +13,6 @@
 #include <api/fs/fs.h>
 #include <api/io.h>
 #include <internal/threadmap.h>
-#include <perf/cpumap.h>
 #include <perf/threadmap.h>
 #include <fcntl.h>
 #include <strings.h>
@@ -111,23 +109,6 @@ const char *evsel__tool_pmu_event_name(const struct evsel *evsel)
 	return tool_pmu__event_to_str(evsel->core.attr.config);
 }
 
-struct perf_cpu_map *tool_pmu__cpus(struct perf_event_attr *attr)
-{
-	static struct perf_cpu_map *cpu0_map;
-	enum tool_pmu_event event = (enum tool_pmu_event)attr->config;
-
-	if (event <= TOOL_PMU__EVENT_NONE || event >= TOOL_PMU__EVENT_MAX) {
-		pr_err("Invalid tool PMU event config %llx\n", attr->config);
-		return NULL;
-	}
-	if (event == TOOL_PMU__EVENT_USER_TIME || event == TOOL_PMU__EVENT_SYSTEM_TIME)
-		return cpu_map__online();
-
-	if (!cpu0_map)
-		cpu0_map = perf_cpu_map__new_int(0);
-	return perf_cpu_map__get(cpu0_map);
-}
-
 static bool read_until_char(struct io *io, char e)
 {
 	int c;
diff --git a/tools/perf/util/tool_pmu.h b/tools/perf/util/tool_pmu.h
index ea343d1983d3..f1714001bc1d 100644
--- a/tools/perf/util/tool_pmu.h
+++ b/tools/perf/util/tool_pmu.h
@@ -46,7 +46,6 @@ bool tool_pmu__read_event(enum tool_pmu_event ev,
 u64 tool_pmu__cpu_slots_per_cycle(void);
 
 bool perf_pmu__is_tool(const struct perf_pmu *pmu);
-struct perf_cpu_map *tool_pmu__cpus(struct perf_event_attr *attr);
 
 bool evsel__is_tool(const struct evsel *evsel);
 enum tool_pmu_event evsel__tool_event(const struct evsel *evsel);
-- 
2.53.0.rc2.204.g2597b5adb4-goog
[PATCH v8 2/6] perf stat-shadow: In prepare_metric fix guard on reading NULL perf_stat_evsel
Posted by Ian Rogers 20 hours ago
The aggr value is set up to always be non-null, making the guard on
reading from it redundant. Switch to checking the perf_stat_evsel (ps)
and narrow the scope of aggr so that it is known to be valid when used.

Reported-by: Andres Freund <andres@anarazel.de>
Closes: https://lore.kernel.org/linux-perf-users/cgja46br2smmznxs7kbeabs6zgv3b4olfqgh2fdp5mxk2yom4v@w6jjgov6hdi6/
Fixes: 3d65f6445fd9 ("perf stat-shadow: Read tool events directly")
Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/util/stat-shadow.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
index 5d8d09e0e6ae..59d2cd4f2188 100644
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -57,7 +57,6 @@ static int prepare_metric(struct perf_stat_config *config,
 		bool is_tool_time =
 			tool_pmu__is_time_event(config, metric_events[i], &tool_aggr_idx);
 		struct perf_stat_evsel *ps = metric_events[i]->stats;
-		struct perf_stat_aggr *aggr;
 		char *n;
 		double val;
 
@@ -82,8 +81,7 @@ static int prepare_metric(struct perf_stat_config *config,
 			}
 		}
 		/* Time events are always on CPU0, the first aggregation index. */
-		aggr = &ps->aggr[is_tool_time ? tool_aggr_idx : aggr_idx];
-		if (!aggr || !metric_events[i]->supported || aggr->counts.run == 0) {
+		if (!ps || !metric_events[i]->supported) {
 			/*
 			 * Not supported events will have a count of 0, which
 			 * can be confusing in a metric. Explicitly set the
@@ -93,11 +91,21 @@ static int prepare_metric(struct perf_stat_config *config,
 			val = NAN;
 			source_count = 0;
 		} else {
-			val = aggr->counts.val;
-			if (is_tool_time)
-				val *= 1e-9; /* Convert time event nanoseconds to seconds. */
-			if (!source_count)
-				source_count = evsel__source_count(metric_events[i]);
+			struct perf_stat_aggr *aggr =
+				&ps->aggr[is_tool_time ? tool_aggr_idx : aggr_idx];
+
+			if (aggr->counts.run == 0) {
+				val = NAN;
+				source_count = 0;
+			} else {
+				val = aggr->counts.val;
+				if (is_tool_time) {
+					/* Convert time event nanoseconds to seconds. */
+					val *= 1e-9;
+				}
+				if (!source_count)
+					source_count = evsel__source_count(metric_events[i]);
+			}
 		}
 		n = strdup(evsel__metric_id(metric_events[i]));
 		if (!n)
-- 
2.53.0.rc2.204.g2597b5adb4-goog
[PATCH v8 3/6] perf evlist: Special map propagation for tool events that read on 1 CPU
Posted by Ian Rogers 20 hours ago
Tool events like duration_time don't need a perf_cpu_map that contains
all online CPUs. Having such a perf_cpu_map causes overheads when
iterating between events for CPU affinity. During parsing, mark events
that only read on a single CPU map index as such, then during map
propagation set up the evsel's CPUs, and thereby the evlist's,
accordingly. The setting cannot be done earlier in parsing as user CPUs
are only fully known when evlist__create_maps is called.
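
As an illustrative example (command and CPUs made up), with something
like 'perf stat -C 2,3 -e duration_time,cycles -- sleep 1',
duration_time now propagates a single-entry CPU map (the first CPU of
the merged/user requested maps) rather than every online CPU, so the
affinity iterator no longer visits extra CPUs just for the tool event.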

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/lib/perf/evlist.c                 | 36 ++++++++++++++++++++++---
 tools/lib/perf/include/internal/evsel.h |  2 ++
 tools/perf/util/parse-events.c          |  1 +
 tools/perf/util/pmu.c                   | 11 ++++++++
 tools/perf/util/pmu.h                   |  2 ++
 5 files changed, 48 insertions(+), 4 deletions(-)

diff --git a/tools/lib/perf/evlist.c b/tools/lib/perf/evlist.c
index 3ed023f4b190..1f210dadd666 100644
--- a/tools/lib/perf/evlist.c
+++ b/tools/lib/perf/evlist.c
@@ -101,6 +101,28 @@ static void __perf_evlist__propagate_maps(struct perf_evlist *evlist,
 		evsel->cpus = perf_cpu_map__get(evlist->user_requested_cpus);
 	}
 
+	/*
+	 * Tool events may only read on the first CPU index to avoid double
+	 * counting things like duration_time. Make the evsel->cpus contain just
+	 * that single entry otherwise we may spend time changing affinity to
+	 * CPUs that just have tool events, etc.
+	 */
+	if (evsel->reads_only_on_cpu_idx0 && perf_cpu_map__nr(evsel->cpus) > 0) {
+		struct perf_cpu_map *srcs[3] = {
+			evlist->all_cpus,
+			evlist->user_requested_cpus,
+			evsel->pmu_cpus,
+		};
+		for (size_t i = 0; i < ARRAY_SIZE(srcs); i++) {
+			if (!srcs[i])
+				continue;
+
+			perf_cpu_map__put(evsel->cpus);
+			evsel->cpus = perf_cpu_map__new_int(perf_cpu_map__cpu(srcs[i], 0).cpu);
+			break;
+		}
+	}
+
 	/* Sanity check assert before the evsel is potentially removed. */
 	assert(!evsel->requires_cpu || !perf_cpu_map__has_any_cpu(evsel->cpus));
 
@@ -133,16 +155,22 @@ static void __perf_evlist__propagate_maps(struct perf_evlist *evlist,
 
 static void perf_evlist__propagate_maps(struct perf_evlist *evlist)
 {
-	struct perf_evsel *evsel, *n;
-
 	evlist->needs_map_propagation = true;
 
 	/* Clear the all_cpus set which will be merged into during propagation. */
 	perf_cpu_map__put(evlist->all_cpus);
 	evlist->all_cpus = NULL;
 
-	list_for_each_entry_safe(evsel, n, &evlist->entries, node)
-		__perf_evlist__propagate_maps(evlist, evsel);
+	/* 2 rounds so that reads_only_on_cpu_idx0 events benefit from knowing the other CPU maps. */
+	for (int round = 0; round < 2; round++) {
+		struct perf_evsel *evsel, *n;
+
+		list_for_each_entry_safe(evsel, n, &evlist->entries, node) {
+			if ((!evsel->reads_only_on_cpu_idx0 && round == 0) ||
+			    (evsel->reads_only_on_cpu_idx0 && round == 1))
+				__perf_evlist__propagate_maps(evlist, evsel);
+		}
+	}
 }
 
 void perf_evlist__add(struct perf_evlist *evlist,
diff --git a/tools/lib/perf/include/internal/evsel.h b/tools/lib/perf/include/internal/evsel.h
index fefe64ba5e26..b988034f1371 100644
--- a/tools/lib/perf/include/internal/evsel.h
+++ b/tools/lib/perf/include/internal/evsel.h
@@ -128,6 +128,8 @@ struct perf_evsel {
 	bool			 requires_cpu;
 	/** Is the PMU for the event a core one? Effects the handling of own_cpus. */
 	bool			 is_pmu_core;
+	/** Does the evsel only read on the first CPU index, such as tool time events? */
+	bool			 reads_only_on_cpu_idx0;
 	int			 idx;
 };
 
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index f631bf7a919f..b9efb296bba5 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -269,6 +269,7 @@ __add_event(struct list_head *list, int *idx,
 	evsel->core.pmu_cpus = pmu_cpus;
 	evsel->core.requires_cpu = pmu ? pmu->is_uncore : false;
 	evsel->core.is_pmu_core = is_pmu_core;
+	evsel->core.reads_only_on_cpu_idx0 = perf_pmu__reads_only_on_cpu_idx0(attr);
 	evsel->pmu = pmu;
 	evsel->alternate_hw_config = alternate_hw_config;
 	evsel->first_wildcard_match = first_wildcard_match;
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index bb399a47d2b4..81ab74681c9b 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -2718,3 +2718,14 @@ const char *perf_pmu__name_from_config(struct perf_pmu *pmu, u64 config)
 	}
 	return NULL;
 }
+
+bool perf_pmu__reads_only_on_cpu_idx0(const struct perf_event_attr *attr)
+{
+	enum tool_pmu_event event;
+
+	if (attr->type != PERF_PMU_TYPE_TOOL)
+		return false;
+
+	event = (enum tool_pmu_event)attr->config;
+	return event != TOOL_PMU__EVENT_USER_TIME && event != TOOL_PMU__EVENT_SYSTEM_TIME;
+}
diff --git a/tools/perf/util/pmu.h b/tools/perf/util/pmu.h
index 7ef90b54a149..41c21389f393 100644
--- a/tools/perf/util/pmu.h
+++ b/tools/perf/util/pmu.h
@@ -350,6 +350,8 @@ void perf_pmu__delete(struct perf_pmu *pmu);
 const char *perf_pmu__name_from_config(struct perf_pmu *pmu, u64 config);
 bool perf_pmu__is_fake(const struct perf_pmu *pmu);
 
+bool perf_pmu__reads_only_on_cpu_idx0(const struct perf_event_attr *attr);
+
 static inline enum pmu_kind perf_pmu__kind(const struct perf_pmu *pmu)
 {
 	__u32 type;
-- 
2.53.0.rc2.204.g2597b5adb4-goog
[PATCH v8 4/6] perf evlist: Missing TPEBS close in evlist__close
Posted by Ian Rogers 20 hours ago
The libperf evsel close won't close TPEBS events properly, and the
libperf close routine is used in evlist__close for affinity
reasons. Add a check there so that retirement latency (TPEBS) events
are closed explicitly.

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/util/evlist.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 3b0d837e3046..3abc2215e790 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1356,6 +1356,8 @@ void evlist__close(struct evlist *evlist)
 		return;
 
 	evlist__for_each_cpu(evlist_cpu_itr, evlist, &affinity) {
+		if (evlist_cpu_itr.cpu_map_idx == 0 && evsel__is_retire_lat(evlist_cpu_itr.evsel))
+			evsel__tpebs_close(evlist_cpu_itr.evsel);
 		perf_evsel__close_cpu(&evlist_cpu_itr.evsel->core,
 				      evlist_cpu_itr.cpu_map_idx);
 	}
-- 
2.53.0.rc2.204.g2597b5adb4-goog
[PATCH v8 5/6] perf evlist: Reduce affinity use and move into iterator, fix no affinity
Posted by Ian Rogers 20 hours ago
The evlist__for_each_cpu iterator will call sched_setaffinity when
moving between CPUs to avoid IPIs. If only 1 IPI is saved then this
may be unprofitable, as the delay to get scheduled may be
considerable. This may be particularly true when reading an event
group in `perf stat` in interval mode.

Move the affinity handling completely into the iterator so that a
single evlist__use_affinity can determine whether CPU affinities will
be used. For `perf record` the change is minimal as the dummy event
and the real event always make using affinities worthwhile. In `perf
stat`, tool events are ignored and affinities are only used if more
than one event occurs on the same CPU. Determining whether affinities
are useful is done by evlist__use_affinity, which tests per event
whether the event's PMU benefits from affinity use; it is assumed that
only PMUs backed by perf events do.

Fix a bug where, when there are no affinities, the CPU map iterator
may reference a CPU not present in the initial evsel. Fix this by
making the iterator and non-iterator code common.
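
Callers now look roughly like this (a sketch only; do_something() is a
placeholder, not a real function):

	struct evlist_cpu_iterator itr;

	evlist__for_each_cpu(itr, evlist) {
		if (do_something(itr.evsel, itr.cpu_map_idx) < 0) {
			/* Leaving the loop early: release the affinity state. */
			evlist_cpu_iterator__exit(&itr);
			break;
		}
	}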

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/builtin-stat.c | 108 +++++++++++---------------
 tools/perf/util/evlist.c  | 158 +++++++++++++++++++++++---------------
 tools/perf/util/evlist.h  |  26 +++++--
 tools/perf/util/pmu.c     |  12 +++
 tools/perf/util/pmu.h     |   1 +
 5 files changed, 174 insertions(+), 131 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 2895b809607f..c1bb40b99176 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -369,19 +369,11 @@ static int read_counter_cpu(struct evsel *counter, int cpu_map_idx)
 static int read_counters_with_affinity(void)
 {
 	struct evlist_cpu_iterator evlist_cpu_itr;
-	struct affinity saved_affinity, *affinity;
 
 	if (all_counters_use_bpf)
 		return 0;
 
-	if (!target__has_cpu(&target) || target__has_per_thread(&target))
-		affinity = NULL;
-	else if (affinity__setup(&saved_affinity) < 0)
-		return -1;
-	else
-		affinity = &saved_affinity;
-
-	evlist__for_each_cpu(evlist_cpu_itr, evsel_list, affinity) {
+	evlist__for_each_cpu(evlist_cpu_itr, evsel_list) {
 		struct evsel *counter = evlist_cpu_itr.evsel;
 
 		if (evsel__is_bpf(counter))
@@ -393,8 +385,6 @@ static int read_counters_with_affinity(void)
 		if (!counter->err)
 			counter->err = read_counter_cpu(counter, evlist_cpu_itr.cpu_map_idx);
 	}
-	if (affinity)
-		affinity__cleanup(&saved_affinity);
 
 	return 0;
 }
@@ -793,7 +783,6 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 	const bool forks = (argc > 0);
 	bool is_pipe = STAT_RECORD ? perf_stat.data.is_pipe : false;
 	struct evlist_cpu_iterator evlist_cpu_itr;
-	struct affinity saved_affinity, *affinity = NULL;
 	int err, open_err = 0;
 	bool second_pass = false, has_supported_counters;
 
@@ -805,14 +794,6 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 		child_pid = evsel_list->workload.pid;
 	}
 
-	if (!cpu_map__is_dummy(evsel_list->core.user_requested_cpus)) {
-		if (affinity__setup(&saved_affinity) < 0) {
-			err = -1;
-			goto err_out;
-		}
-		affinity = &saved_affinity;
-	}
-
 	evlist__for_each_entry(evsel_list, counter) {
 		counter->reset_group = false;
 		if (bpf_counter__load(counter, &target)) {
@@ -825,49 +806,48 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 
 	evlist__reset_aggr_stats(evsel_list);
 
-	evlist__for_each_cpu(evlist_cpu_itr, evsel_list, affinity) {
-		counter = evlist_cpu_itr.evsel;
+	/*
+	 * bperf calls evsel__open_per_cpu() in bperf__load(), so
+	 * no need to call it again here.
+	 */
+	if (!target.use_bpf) {
+		evlist__for_each_cpu(evlist_cpu_itr, evsel_list) {
+			counter = evlist_cpu_itr.evsel;
 
-		/*
-		 * bperf calls evsel__open_per_cpu() in bperf__load(), so
-		 * no need to call it again here.
-		 */
-		if (target.use_bpf)
-			break;
+			if (counter->reset_group || !counter->supported)
+				continue;
+			if (evsel__is_bperf(counter))
+				continue;
 
-		if (counter->reset_group || !counter->supported)
-			continue;
-		if (evsel__is_bperf(counter))
-			continue;
+			while (true) {
+				if (create_perf_stat_counter(counter, &stat_config,
+							      evlist_cpu_itr.cpu_map_idx) == 0)
+					break;
 
-		while (true) {
-			if (create_perf_stat_counter(counter, &stat_config,
-						     evlist_cpu_itr.cpu_map_idx) == 0)
-				break;
+				open_err = errno;
+				/*
+				 * Weak group failed. We cannot just undo this
+				 * here because earlier CPUs might be in group
+				 * mode, and the kernel doesn't support mixing
+				 * group and non group reads. Defer it to later.
+				 * Don't close here because we're in the wrong
+				 * affinity.
+				 */
+				if ((open_err == EINVAL || open_err == EBADF) &&
+					evsel__leader(counter) != counter &&
+					counter->weak_group) {
+					evlist__reset_weak_group(evsel_list, counter, false);
+					assert(counter->reset_group);
+					counter->supported = true;
+					second_pass = true;
+					break;
+				}
 
-			open_err = errno;
-			/*
-			 * Weak group failed. We cannot just undo this here
-			 * because earlier CPUs might be in group mode, and the kernel
-			 * doesn't support mixing group and non group reads. Defer
-			 * it to later.
-			 * Don't close here because we're in the wrong affinity.
-			 */
-			if ((open_err == EINVAL || open_err == EBADF) &&
-				evsel__leader(counter) != counter &&
-				counter->weak_group) {
-				evlist__reset_weak_group(evsel_list, counter, false);
-				assert(counter->reset_group);
-				counter->supported = true;
-				second_pass = true;
-				break;
+				if (stat_handle_error(counter, open_err) != COUNTER_RETRY)
+					break;
 			}
-
-			if (stat_handle_error(counter, open_err) != COUNTER_RETRY)
-				break;
 		}
 	}
-
 	if (second_pass) {
 		/*
 		 * Now redo all the weak group after closing them,
@@ -875,7 +855,7 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 		 */
 
 		/* First close errored or weak retry */
-		evlist__for_each_cpu(evlist_cpu_itr, evsel_list, affinity) {
+		evlist__for_each_cpu(evlist_cpu_itr, evsel_list) {
 			counter = evlist_cpu_itr.evsel;
 
 			if (!counter->reset_group && counter->supported)
@@ -884,7 +864,7 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 			perf_evsel__close_cpu(&counter->core, evlist_cpu_itr.cpu_map_idx);
 		}
 		/* Now reopen weak */
-		evlist__for_each_cpu(evlist_cpu_itr, evsel_list, affinity) {
+		evlist__for_each_cpu(evlist_cpu_itr, evsel_list) {
 			counter = evlist_cpu_itr.evsel;
 
 			if (!counter->reset_group)
@@ -893,17 +873,18 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 			while (true) {
 				pr_debug2("reopening weak %s\n", evsel__name(counter));
 				if (create_perf_stat_counter(counter, &stat_config,
-							     evlist_cpu_itr.cpu_map_idx) == 0)
+							     evlist_cpu_itr.cpu_map_idx) == 0) {
+					evlist_cpu_iterator__exit(&evlist_cpu_itr);
 					break;
-
+				}
 				open_err = errno;
-				if (stat_handle_error(counter, open_err) != COUNTER_RETRY)
+				if (stat_handle_error(counter, open_err) != COUNTER_RETRY) {
+					evlist_cpu_iterator__exit(&evlist_cpu_itr);
 					break;
+				}
 			}
 		}
 	}
-	affinity__cleanup(affinity);
-	affinity = NULL;
 
 	has_supported_counters = false;
 	evlist__for_each_entry(evsel_list, counter) {
@@ -1065,7 +1046,6 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 	if (forks)
 		evlist__cancel_workload(evsel_list);
 
-	affinity__cleanup(affinity);
 	return err;
 }
 
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 3abc2215e790..45833244daf3 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -359,36 +359,111 @@ int evlist__add_newtp(struct evlist *evlist, const char *sys, const char *name,
 }
 #endif
 
-struct evlist_cpu_iterator evlist__cpu_begin(struct evlist *evlist, struct affinity *affinity)
+/*
+ * Should sched_setaffinity be used with evlist__for_each_cpu? Determine if
+ * migrating the thread will avoid possibly numerous IPIs.
+ */
+static bool evlist__use_affinity(struct evlist *evlist)
+{
+	struct evsel *pos;
+	struct perf_cpu_map *used_cpus = NULL;
+	bool ret = false;
+
+	/*
+	 * With perf record core.user_requested_cpus is usually NULL.
+	 * Use the old method to handle this for now.
+	 */
+	if (!evlist->core.user_requested_cpus ||
+	    cpu_map__is_dummy(evlist->core.user_requested_cpus))
+		return false;
+
+	evlist__for_each_entry(evlist, pos) {
+		struct perf_cpu_map *intersect;
+
+		if (!perf_pmu__benefits_from_affinity(pos->pmu))
+			continue;
+
+		if (evsel__is_dummy_event(pos)) {
+			/*
+			 * The dummy event is opened on all CPUs so assume >1
+			 * event with shared CPUs.
+			 */
+			ret = true;
+			break;
+		}
+		if (evsel__is_retire_lat(pos)) {
+			/*
+			 * Retirement latency events are similar to tool ones in
+			 * their implementation, and so don't require affinity.
+			 */
+			continue;
+		}
+		if (perf_cpu_map__is_empty(used_cpus)) {
+			/* First benefitting event, we want >1 on a common CPU. */
+			used_cpus = perf_cpu_map__get(pos->core.cpus);
+			continue;
+		}
+		if ((pos->core.attr.read_format & PERF_FORMAT_GROUP) &&
+		    evsel__leader(pos) != pos) {
+			/* Skip members of the same sample group. */
+			continue;
+		}
+		intersect = perf_cpu_map__intersect(used_cpus, pos->core.cpus);
+		if (!perf_cpu_map__is_empty(intersect)) {
+			/* >1 event with shared CPUs. */
+			perf_cpu_map__put(intersect);
+			ret = true;
+			break;
+		}
+		perf_cpu_map__put(intersect);
+		perf_cpu_map__merge(&used_cpus, pos->core.cpus);
+	}
+	perf_cpu_map__put(used_cpus);
+	return ret;
+}
+
+void evlist_cpu_iterator__init(struct evlist_cpu_iterator *itr, struct evlist *evlist)
 {
-	struct evlist_cpu_iterator itr = {
+	*itr = (struct evlist_cpu_iterator){
 		.container = evlist,
 		.evsel = NULL,
 		.cpu_map_idx = 0,
 		.evlist_cpu_map_idx = 0,
 		.evlist_cpu_map_nr = perf_cpu_map__nr(evlist->core.all_cpus),
 		.cpu = (struct perf_cpu){ .cpu = -1},
-		.affinity = affinity,
+		.affinity = NULL,
 	};
 
 	if (evlist__empty(evlist)) {
 		/* Ensure the empty list doesn't iterate. */
-		itr.evlist_cpu_map_idx = itr.evlist_cpu_map_nr;
-	} else {
-		itr.evsel = evlist__first(evlist);
-		if (itr.affinity) {
-			itr.cpu = perf_cpu_map__cpu(evlist->core.all_cpus, 0);
-			affinity__set(itr.affinity, itr.cpu.cpu);
-			itr.cpu_map_idx = perf_cpu_map__idx(itr.evsel->core.cpus, itr.cpu);
-			/*
-			 * If this CPU isn't in the evsel's cpu map then advance
-			 * through the list.
-			 */
-			if (itr.cpu_map_idx == -1)
-				evlist_cpu_iterator__next(&itr);
-		}
+		itr->evlist_cpu_map_idx = itr->evlist_cpu_map_nr;
+		return;
 	}
-	return itr;
+
+	if (evlist__use_affinity(evlist)) {
+		if (affinity__setup(&itr->saved_affinity) == 0)
+			itr->affinity = &itr->saved_affinity;
+	}
+	itr->evsel = evlist__first(evlist);
+	itr->cpu = perf_cpu_map__cpu(evlist->core.all_cpus, 0);
+	if (itr->affinity)
+		affinity__set(itr->affinity, itr->cpu.cpu);
+	itr->cpu_map_idx = perf_cpu_map__idx(itr->evsel->core.cpus, itr->cpu);
+	/*
+	 * If this CPU isn't in the evsel's cpu map then advance
+	 * through the list.
+	 */
+	if (itr->cpu_map_idx == -1)
+		evlist_cpu_iterator__next(itr);
+}
+
+void evlist_cpu_iterator__exit(struct evlist_cpu_iterator *itr)
+{
+	if (!itr->affinity)
+		return;
+
+	affinity__cleanup(itr->affinity);
+	itr->affinity = NULL;
 }
 
 void evlist_cpu_iterator__next(struct evlist_cpu_iterator *evlist_cpu_itr)
@@ -418,14 +493,11 @@ void evlist_cpu_iterator__next(struct evlist_cpu_iterator *evlist_cpu_itr)
 		 */
 		if (evlist_cpu_itr->cpu_map_idx == -1)
 			evlist_cpu_iterator__next(evlist_cpu_itr);
+	} else {
+		evlist_cpu_iterator__exit(evlist_cpu_itr);
 	}
 }
 
-bool evlist_cpu_iterator__end(const struct evlist_cpu_iterator *evlist_cpu_itr)
-{
-	return evlist_cpu_itr->evlist_cpu_map_idx >= evlist_cpu_itr->evlist_cpu_map_nr;
-}
-
 static int evsel__strcmp(struct evsel *pos, char *evsel_name)
 {
 	if (!evsel_name)
@@ -453,19 +525,11 @@ static void __evlist__disable(struct evlist *evlist, char *evsel_name, bool excl
 {
 	struct evsel *pos;
 	struct evlist_cpu_iterator evlist_cpu_itr;
-	struct affinity saved_affinity, *affinity = NULL;
 	bool has_imm = false;
 
-	// See explanation in evlist__close()
-	if (!cpu_map__is_dummy(evlist->core.user_requested_cpus)) {
-		if (affinity__setup(&saved_affinity) < 0)
-			return;
-		affinity = &saved_affinity;
-	}
-
 	/* Disable 'immediate' events last */
 	for (int imm = 0; imm <= 1; imm++) {
-		evlist__for_each_cpu(evlist_cpu_itr, evlist, affinity) {
+		evlist__for_each_cpu(evlist_cpu_itr, evlist) {
 			pos = evlist_cpu_itr.evsel;
 			if (evsel__strcmp(pos, evsel_name))
 				continue;
@@ -483,7 +547,6 @@ static void __evlist__disable(struct evlist *evlist, char *evsel_name, bool excl
 			break;
 	}
 
-	affinity__cleanup(affinity);
 	evlist__for_each_entry(evlist, pos) {
 		if (evsel__strcmp(pos, evsel_name))
 			continue;
@@ -523,16 +586,8 @@ static void __evlist__enable(struct evlist *evlist, char *evsel_name, bool excl_
 {
 	struct evsel *pos;
 	struct evlist_cpu_iterator evlist_cpu_itr;
-	struct affinity saved_affinity, *affinity = NULL;
 
-	// See explanation in evlist__close()
-	if (!cpu_map__is_dummy(evlist->core.user_requested_cpus)) {
-		if (affinity__setup(&saved_affinity) < 0)
-			return;
-		affinity = &saved_affinity;
-	}
-
-	evlist__for_each_cpu(evlist_cpu_itr, evlist, affinity) {
+	evlist__for_each_cpu(evlist_cpu_itr, evlist) {
 		pos = evlist_cpu_itr.evsel;
 		if (evsel__strcmp(pos, evsel_name))
 			continue;
@@ -542,7 +597,6 @@ static void __evlist__enable(struct evlist *evlist, char *evsel_name, bool excl_
 			continue;
 		evsel__enable_cpu(pos, evlist_cpu_itr.cpu_map_idx);
 	}
-	affinity__cleanup(affinity);
 	evlist__for_each_entry(evlist, pos) {
 		if (evsel__strcmp(pos, evsel_name))
 			continue;
@@ -1339,30 +1393,14 @@ void evlist__close(struct evlist *evlist)
 {
 	struct evsel *evsel;
 	struct evlist_cpu_iterator evlist_cpu_itr;
-	struct affinity affinity;
-
-	/*
-	 * With perf record core.user_requested_cpus is usually NULL.
-	 * Use the old method to handle this for now.
-	 */
-	if (!evlist->core.user_requested_cpus ||
-	    cpu_map__is_dummy(evlist->core.user_requested_cpus)) {
-		evlist__for_each_entry_reverse(evlist, evsel)
-			evsel__close(evsel);
-		return;
-	}
-
-	if (affinity__setup(&affinity) < 0)
-		return;
 
-	evlist__for_each_cpu(evlist_cpu_itr, evlist, &affinity) {
+	evlist__for_each_cpu(evlist_cpu_itr, evlist) {
 		if (evlist_cpu_itr.cpu_map_idx == 0 && evsel__is_retire_lat(evlist_cpu_itr.evsel))
 			evsel__tpebs_close(evlist_cpu_itr.evsel);
 		perf_evsel__close_cpu(&evlist_cpu_itr.evsel->core,
 				      evlist_cpu_itr.cpu_map_idx);
 	}
 
-	affinity__cleanup(&affinity);
 	evlist__for_each_entry_reverse(evlist, evsel) {
 		perf_evsel__free_fd(&evsel->core);
 		perf_evsel__free_id(&evsel->core);
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 911834ae7c2a..30dff7484d3c 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -10,6 +10,7 @@
 #include <internal/evlist.h>
 #include <internal/evsel.h>
 #include <perf/evlist.h>
+#include "affinity.h"
 #include "events_stats.h"
 #include "evsel.h"
 #include "rblist.h"
@@ -363,6 +364,8 @@ struct evlist_cpu_iterator {
 	struct perf_cpu cpu;
 	/** If present, used to set the affinity when switching between CPUs. */
 	struct affinity *affinity;
+	/** May be used to hold affinity state prior to iterating. */
+	struct affinity saved_affinity;
 };
 
 /**
@@ -370,22 +373,31 @@ struct evlist_cpu_iterator {
  *                        affinity, iterate over all CPUs and then the evlist
  *                        for each evsel on that CPU. When switching between
  *                        CPUs the affinity is set to the CPU to avoid IPIs
- *                        during syscalls.
+ *                        during syscalls. The affinity is set up and removed
+ *                        automatically; if the loop is broken, a call to
+ *                        evlist_cpu_iterator__exit is necessary.
  * @evlist_cpu_itr: the iterator instance.
  * @evlist: evlist instance to iterate.
- * @affinity: NULL or used to set the affinity to the current CPU.
  */
-#define evlist__for_each_cpu(evlist_cpu_itr, evlist, affinity)		\
-	for ((evlist_cpu_itr) = evlist__cpu_begin(evlist, affinity);	\
+#define evlist__for_each_cpu(evlist_cpu_itr, evlist)			\
+	for (evlist_cpu_iterator__init(&(evlist_cpu_itr), evlist);	\
 	     !evlist_cpu_iterator__end(&evlist_cpu_itr);		\
 	     evlist_cpu_iterator__next(&evlist_cpu_itr))
 
-/** Returns an iterator set to the first CPU/evsel of evlist. */
-struct evlist_cpu_iterator evlist__cpu_begin(struct evlist *evlist, struct affinity *affinity);
+/** Set up an iterator pointing at the first CPU/evsel of evlist. */
+void evlist_cpu_iterator__init(struct evlist_cpu_iterator *itr, struct evlist *evlist);
+/**
+ * Cleans up the iterator, automatically done by evlist_cpu_iterator__next when
+ * the end of the list is reached. Multiple calls are safe.
+ */
+void evlist_cpu_iterator__exit(struct evlist_cpu_iterator *itr);
 /** Move to next element in iterator, updating CPU, evsel and the affinity. */
 void evlist_cpu_iterator__next(struct evlist_cpu_iterator *evlist_cpu_itr);
 /** Returns true when iterator is at the end of the CPUs and evlist. */
-bool evlist_cpu_iterator__end(const struct evlist_cpu_iterator *evlist_cpu_itr);
+static inline bool evlist_cpu_iterator__end(const struct evlist_cpu_iterator *evlist_cpu_itr)
+{
+	return evlist_cpu_itr->evlist_cpu_map_idx >= evlist_cpu_itr->evlist_cpu_map_nr;
+}
 
 struct evsel *evlist__get_tracking_event(struct evlist *evlist);
 void evlist__set_tracking_event(struct evlist *evlist, struct evsel *tracking_evsel);
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 81ab74681c9b..5cdd350e8885 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -2375,6 +2375,18 @@ bool perf_pmu__is_software(const struct perf_pmu *pmu)
 	return false;
 }
 
+bool perf_pmu__benefits_from_affinity(struct perf_pmu *pmu)
+{
+	if (!pmu)
+		return true; /* Assume is core. */
+
+	/*
+	 * All perf event PMUs should benefit from accessing the perf event
+	 * contexts on the local CPU.
+	 */
+	return pmu->type <= PERF_PMU_TYPE_PE_END;
+}
+
 FILE *perf_pmu__open_file(const struct perf_pmu *pmu, const char *name)
 {
 	char path[PATH_MAX];
diff --git a/tools/perf/util/pmu.h b/tools/perf/util/pmu.h
index 41c21389f393..0d9f3c57e8e8 100644
--- a/tools/perf/util/pmu.h
+++ b/tools/perf/util/pmu.h
@@ -303,6 +303,7 @@ bool perf_pmu__name_no_suffix_match(const struct perf_pmu *pmu, const char *to_m
  *                        perf_sw_context in the kernel?
  */
 bool perf_pmu__is_software(const struct perf_pmu *pmu);
+bool perf_pmu__benefits_from_affinity(struct perf_pmu *pmu);
 
 FILE *perf_pmu__open_file(const struct perf_pmu *pmu, const char *name);
 FILE *perf_pmu__open_file_at(const struct perf_pmu *pmu, int dirfd, const char *name);
-- 
2.53.0.rc2.204.g2597b5adb4-goog
[PATCH v8 6/6] perf stat: Add no-affinity flag
Posted by Ian Rogers 20 hours ago
Add a flag that disables the affinity behavior. Using
sched_setaffinity to place a perf thread on a CPU can avoid certain
interprocessor interrupts but may introduce a delay due to the
scheduling, particularly on loaded machines, so add a command line
option to disable the behavior. This concern is less present in other
tools like `perf record`, as it uses a ring buffer and doesn't make
repeated system calls.
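
As an illustrative example (command made up), an interval mode run such
as 'perf stat --no-affinity -I 1000 -e cycles,instructions -a sleep 10'
skips the sched_setaffinity calls when iterating over CPUs.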

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/Documentation/perf-stat.txt | 4 ++++
 tools/perf/builtin-stat.c              | 6 ++++++
 tools/perf/util/evlist.c               | 6 +-----
 tools/perf/util/evlist.h               | 1 +
 4 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
index 1a766d4a2233..1ffb510606af 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -382,6 +382,10 @@ color the metric's computed value.
 Don't print output, warnings or messages. This is useful with perf stat
 record below to only write data to the perf.data file.
 
+--no-affinity::
+Don't change scheduler affinities when iterating over CPUs. Disables
+an optimization aimed at minimizing interprocessor interrupts.
+
 STAT RECORD
 -----------
 Stores stat data into perf data file.
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index c1bb40b99176..8bbdea44c3ba 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -2426,6 +2426,7 @@ static int parse_tpebs_mode(const struct option *opt, const char *str,
 int cmd_stat(int argc, const char **argv)
 {
 	struct opt_aggr_mode opt_mode = {};
+	bool affinity = true, affinity_set = false;
 	struct option stat_options[] = {
 		OPT_BOOLEAN('T', "transaction", &transaction_run,
 			"hardware transaction statistics"),
@@ -2554,6 +2555,8 @@ int cmd_stat(int argc, const char **argv)
 			"don't print 'summary' for CSV summary output"),
 		OPT_BOOLEAN(0, "quiet", &quiet,
 			"don't print any output, messages or warnings (useful with record)"),
+		OPT_BOOLEAN_SET(0, "affinity", &affinity, &affinity_set,
+			"don't allow affinity optimizations aimed at reducing IPIs"),
 		OPT_CALLBACK(0, "cputype", &evsel_list, "hybrid cpu type",
 			"Only enable events on applying cpu with this type "
 			"for hybrid platform (e.g. core or atom)",
@@ -2611,6 +2614,9 @@ int cmd_stat(int argc, const char **argv)
 	} else
 		stat_config.csv_sep = DEFAULT_SEPARATOR;
 
+	if (affinity_set)
+		evsel_list->no_affinity = !affinity;
+
 	if (argc && strlen(argv[0]) > 2 && strstarts("record", argv[0])) {
 		argc = __cmd_record(stat_options, &opt_mode, argc, argv);
 		if (argc < 0)
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 45833244daf3..591bdf0b3e2a 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -369,11 +369,7 @@ static bool evlist__use_affinity(struct evlist *evlist)
 	struct perf_cpu_map *used_cpus = NULL;
 	bool ret = false;
 
-	/*
-	 * With perf record core.user_requested_cpus is usually NULL.
-	 * Use the old method to handle this for now.
-	 */
-	if (!evlist->core.user_requested_cpus ||
+	if (evlist->no_affinity || !evlist->core.user_requested_cpus ||
 	    cpu_map__is_dummy(evlist->core.user_requested_cpus))
 		return false;
 
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 30dff7484d3c..d17c3b57a409 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -59,6 +59,7 @@ struct event_enable_timer;
 struct evlist {
 	struct perf_evlist core;
 	bool		 enabled;
+	bool		 no_affinity;
 	int		 id_pos;
 	int		 is_pos;
 	int		 nr_br_cntr;
-- 
2.53.0.rc2.204.g2597b5adb4-goog
Re: [PATCH v8 6/6] perf stat: Add no-affinity flag
Posted by Arnaldo Carvalho de Melo 6 hours ago
On Fri, Feb 06, 2026 at 02:25:09PM -0800, Ian Rogers wrote:
> Add flag that disables affinity behavior. Using sched_setaffinity to
> place a perf thread on a CPU can avoid certain interprocessor
> interrupts but may introduce a delay due to the scheduling,
> particularly on loaded machines. Add a command line option to disable
> the behavior. This behavior is less present in other tools like `perf
> record`, as it uses a ring buffer and doesn't make repeated system
> calls.

This is confusing:

⬢ [acme@toolbx perf-tools-next]$ perf stat -h affinity

 Usage: perf stat [<options>] [<command>]

        --affinity        don't allow affinity optimizations aimed at reducing IPIs

⬢ [acme@toolbx perf-tools-next]$

The way it is presented in the -h output it looks as if one has to use:

	perf stat --affinity

To disable affinity setting, when used that way it looks as if the user
is asking for affinity to be used.

We have things like:

⬢ [acme@toolbx perf-tools-next]$ grep -A2 OPT_.*no- tools/perf/builtin-record.c
	OPT_BOOLEAN(0, "no-buffering", &record.opts.no_buffering,
		    "collect data without buffering"),
	OPT_BOOLEAN('R', "raw-samples", &record.opts.raw_samples,
--
	OPT_BOOLEAN_SET('i', "no-inherit", &record.opts.no_inherit,
			&record.opts.no_inherit_set,
			"child tasks do not inherit counters"),
--
	OPT_BOOLEAN(0, "no-bpf-event", &record.opts.no_bpf_event, "do not record bpf events"),
	OPT_BOOLEAN(0, "strict-freq", &record.opts.strict_freq,
		    "Fail if the specified frequency can't be used"),
--
	OPT_BOOLEAN('n', "no-samples", &record.opts.no_samples,
		    "don't sample"),
	OPT_BOOLEAN_SET('N', "no-buildid-cache", &record.no_buildid_cache,
			&record.no_buildid_cache_set,
			"do not update the buildid cache"),
	OPT_BOOLEAN_SET('B', "no-buildid", &record.no_buildid,
			&record.no_buildid_set,
			"do not collect buildids in perf.data"),
⬢ [acme@toolbx perf-tools-next]$

Probably this needs to be that way?
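
I.e. something like this (just a sketch, the variable name is made up):

	OPT_BOOLEAN(0, "no-affinity", &no_affinity,
		    "don't use sched_setaffinity when iterating over CPUs"),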

- Arnaldo
 
> Signed-off-by: Ian Rogers <irogers@google.com>
> ---
>  tools/perf/Documentation/perf-stat.txt | 4 ++++
>  tools/perf/builtin-stat.c              | 6 ++++++
>  tools/perf/util/evlist.c               | 6 +-----
>  tools/perf/util/evlist.h               | 1 +
>  4 files changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
> index 1a766d4a2233..1ffb510606af 100644
> --- a/tools/perf/Documentation/perf-stat.txt
> +++ b/tools/perf/Documentation/perf-stat.txt
> @@ -382,6 +382,10 @@ color the metric's computed value.
>  Don't print output, warnings or messages. This is useful with perf stat
>  record below to only write data to the perf.data file.
>  
> +--no-affinity::
> +Don't change scheduler affinities when iterating over CPUs. Disables
> +an optimization aimed at minimizing interprocessor interrupts.
> +
>  STAT RECORD
>  -----------
>  Stores stat data into perf data file.
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index c1bb40b99176..8bbdea44c3ba 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -2426,6 +2426,7 @@ static int parse_tpebs_mode(const struct option *opt, const char *str,
>  int cmd_stat(int argc, const char **argv)
>  {
>  	struct opt_aggr_mode opt_mode = {};
> +	bool affinity = true, affinity_set = false;
>  	struct option stat_options[] = {
>  		OPT_BOOLEAN('T', "transaction", &transaction_run,
>  			"hardware transaction statistics"),
> @@ -2554,6 +2555,8 @@ int cmd_stat(int argc, const char **argv)
>  			"don't print 'summary' for CSV summary output"),
>  		OPT_BOOLEAN(0, "quiet", &quiet,
>  			"don't print any output, messages or warnings (useful with record)"),
> +		OPT_BOOLEAN_SET(0, "affinity", &affinity, &affinity_set,
> +			"don't allow affinity optimizations aimed at reducing IPIs"),
>  		OPT_CALLBACK(0, "cputype", &evsel_list, "hybrid cpu type",
>  			"Only enable events on applying cpu with this type "
>  			"for hybrid platform (e.g. core or atom)",
> @@ -2611,6 +2614,9 @@ int cmd_stat(int argc, const char **argv)
>  	} else
>  		stat_config.csv_sep = DEFAULT_SEPARATOR;
>  
> +	if (affinity_set)
> +		evsel_list->no_affinity = !affinity;
> +
>  	if (argc && strlen(argv[0]) > 2 && strstarts("record", argv[0])) {
>  		argc = __cmd_record(stat_options, &opt_mode, argc, argv);
>  		if (argc < 0)
> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> index 45833244daf3..591bdf0b3e2a 100644
> --- a/tools/perf/util/evlist.c
> +++ b/tools/perf/util/evlist.c
> @@ -369,11 +369,7 @@ static bool evlist__use_affinity(struct evlist *evlist)
>  	struct perf_cpu_map *used_cpus = NULL;
>  	bool ret = false;
>  
> -	/*
> -	 * With perf record core.user_requested_cpus is usually NULL.
> -	 * Use the old method to handle this for now.
> -	 */
> -	if (!evlist->core.user_requested_cpus ||
> +	if (evlist->no_affinity || !evlist->core.user_requested_cpus ||
>  	    cpu_map__is_dummy(evlist->core.user_requested_cpus))
>  		return false;
>  
> diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
> index 30dff7484d3c..d17c3b57a409 100644
> --- a/tools/perf/util/evlist.h
> +++ b/tools/perf/util/evlist.h
> @@ -59,6 +59,7 @@ struct event_enable_timer;
>  struct evlist {
>  	struct perf_evlist core;
>  	bool		 enabled;
> +	bool		 no_affinity;
>  	int		 id_pos;
>  	int		 is_pos;
>  	int		 nr_br_cntr;
> -- 
> 2.53.0.rc2.204.g2597b5adb4-goog
> 
Re: [PATCH v8 6/6] perf stat: Add no-affinity flag
Posted by Ian Rogers 2 hours ago
On Sat, Feb 7, 2026 at 4:51 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
>
> On Fri, Feb 06, 2026 at 02:25:09PM -0800, Ian Rogers wrote:
> > Add a flag that disables the affinity behavior. Using
> > sched_setaffinity to place a perf thread on a CPU can avoid certain
> > interprocessor interrupts, but it may introduce a delay due to the
> > scheduling, particularly on loaded machines. Add a command line
> > option to disable the behavior. This matters less for other tools
> > like `perf record`, which uses a ring buffer and doesn't make
> > repeated system calls.
>
> This is confusing:
>
> ⬢ [acme@toolbx perf-tools-next]$ perf stat -h affinity
>
>  Usage: perf stat [<options>] [<command>]
>
>         --affinity        don't allow affinity optimizations aimed at reducing IPIs
>
> ⬢ [acme@toolbx perf-tools-next]$
>
> The way it is presented in the -h output it looks as if one has to use:
>
>         perf stat --affinity
>
> To disable the affinity setting, but written that way it reads as if the
> user is asking for affinity to be used.
>
> We have things like:
>
> ⬢ [acme@toolbx perf-tools-next]$ grep -A2 OPT_.*no- tools/perf/builtin-record.c
>         OPT_BOOLEAN(0, "no-buffering", &record.opts.no_buffering,
>                     "collect data without buffering"),
>         OPT_BOOLEAN('R', "raw-samples", &record.opts.raw_samples,
> --
>         OPT_BOOLEAN_SET('i', "no-inherit", &record.opts.no_inherit,
>                         &record.opts.no_inherit_set,
>                         "child tasks do not inherit counters"),
> --
>         OPT_BOOLEAN(0, "no-bpf-event", &record.opts.no_bpf_event, "do not record bpf events"),
>         OPT_BOOLEAN(0, "strict-freq", &record.opts.strict_freq,
>                     "Fail if the specified frequency can't be used"),
> --
>         OPT_BOOLEAN('n', "no-samples", &record.opts.no_samples,
>                     "don't sample"),
>         OPT_BOOLEAN_SET('N', "no-buildid-cache", &record.no_buildid_cache,
>                         &record.no_buildid_cache_set,
>                         "do not update the buildid cache"),
>         OPT_BOOLEAN_SET('B', "no-buildid", &record.no_buildid,
>                         &record.no_buildid_set,
>                         "do not collect buildids in perf.data"),
> ⬢ [acme@toolbx perf-tools-next]$
>
> Probably this needs to be that way?

So it was that way in the v4 patch set, but Namhyung asked for it to be
the other way around:
https://lore.kernel.org/lkml/aRvcuMfbDRSBU87k@google.com/
The option's help text didn't get changed to match, and that's a
mistake. Let me know what's preferred and I can send a patch.
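
One way to fix just the description while keeping the positive option,
sketched here rather than quoted from a posted patch, would be:

		OPT_BOOLEAN_SET(0, "affinity", &affinity, &affinity_set,
			"use affinity optimizations aimed at reducing IPIs (--no-affinity disables)"),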

Thanks,
Ian
