Apologies for the long delay; I was sidetracked by some other items and
was not able to focus on this.
MOTIVATION
----------
Existing `perf sched` is quite exhaustive and provides a lot of insight
into scheduler behavior, but it quickly becomes impractical to use for
long-running or scheduler-intensive workloads. For example, `perf sched record`
has ~7.77% overhead on hackbench (with 25 groups, each running 700K loops,
on a 2-socket 128-core 256-thread 3rd Generation EPYC server), and it
generates a huge 56G perf.data file which perf takes ~137 mins to prepare
and write to disk [1].
Unlike `perf sched record`, which hooks onto a set of scheduler tracepoints
and generates a sample on every tracepoint hit, `perf sched stats record`
takes a snapshot of /proc/schedstat before and after the workload, i.e.
there is almost zero interference with the workload run. It also takes very
little time to parse /proc/schedstat, convert it into perf samples and save
those samples into a perf.data file, and the resulting perf.data file is
much smaller. Overall, `perf sched stats record` is much more lightweight
compared to `perf sched record`.
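To illustrate the approach, below is a minimal, self-contained sketch (not
the actual perf implementation) that snapshots /proc/schedstat before and
after a workload passed on the command line; the real `perf sched stats
record` additionally converts the snapshots into synthetic events and
writes them to perf.data:

```c
/*
 * Minimal sketch of the snapshot approach, not the actual perf code:
 * read /proc/schedstat once before and once after the workload, so the
 * workload itself runs without any tracing overhead.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char *snapshot_schedstat(void)
{
	FILE *fp = fopen("/proc/schedstat", "r");
	char *buf = NULL;
	size_t len = 0, cap = 64 * 1024;

	if (!fp)
		return NULL;
	buf = malloc(cap);
	if (buf) {
		len = fread(buf, 1, cap - 1, fp);
		buf[len] = '\0';
	}
	fclose(fp);
	return buf;
}

int main(int argc, char **argv)
{
	char *before, *after;

	before = snapshot_schedstat();	/* snapshot taken before the workload */
	if (argc > 1)
		system(argv[1]);	/* run the workload command */
	after = snapshot_schedstat();	/* snapshot taken after the workload */

	if (!before || !after) {
		fprintf(stderr, "failed to read /proc/schedstat (schedstats enabled?)\n");
		return 1;
	}
	/* A real report would parse both snapshots and print per-field diffs. */
	printf("before: %zu bytes, after: %zu bytes\n",
	       strlen(before), strlen(after));
	free(before);
	free(after);
	return 0;
}
```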
Internally at AMD, we have been using this (a variant of it, known as
"sched-scoreboard"[2]) and have found it very useful for analysing the
impact of scheduler code changes[3][4]. Prateek used v2[5] of this patch
series to report such an analysis[6][7].
Please note that this is not a replacement for perf sched record/report.
The intended users of the new tool are scheduler developers, not regular
users.
USAGE
-----
# perf sched stats record
# perf sched stats report
# perf sched stats diff
Note: Although the `perf sched stats` tool supports the workload profiling
syntax (i.e. -- <workload>), the recorded profile is still system-wide since
/proc/schedstat is a system-wide file.
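For example (the workload below is just an illustration; any command works
with the documented `-- <workload>` syntax):
# perf sched stats record -- perf bench sched messaging
# perf sched stats report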
HOW TO INTERPRET THE REPORT
---------------------------
`perf sched stats report` starts with a description of the columns
present in the report. These column names are printed before the cpu and
domain stats to improve the readability of the report.
----------------------------------------------------------------------------------------------------
DESC -> Description of the field
COUNT -> Value of the field
PCT_CHANGE -> Percent change with corresponding base value
AVG_JIFFIES -> Avg time in jiffies between two consecutive occurrence of event
----------------------------------------------------------------------------------------------------
Next is the total profiling time in terms of jiffies:
----------------------------------------------------------------------------------------------------
Time elapsed (in jiffies) : 24537
----------------------------------------------------------------------------------------------------
Next are CPU scheduling statistics. These are simple diffs of the
/proc/schedstat CPU lines along with descriptions. The report also
prints percentages relative to a base stat.
In the example below, schedule() left CPU0 idle 36.58% of the time,
0.45% of all try_to_wake_up() calls were wakeups of the local CPU, and
the total wait time of tasks on CPU0 is 48.70% of the total runtime of
tasks on the same CPU.
----------------------------------------------------------------------------------------------------
CPU 0
----------------------------------------------------------------------------------------------------
DESC COUNT PCT_CHANGE
----------------------------------------------------------------------------------------------------
yld_count : 0
array_exp : 0
sched_count : 402267
sched_goidle : 147161 ( 36.58% )
ttwu_count : 236309
ttwu_local : 1062 ( 0.45% )
rq_cpu_time : 7083791148
run_delay : 3449973971 ( 48.70% )
pcount : 255035
----------------------------------------------------------------------------------------------------
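As a rough sketch of how these percentages relate to the raw counters
(inferred from the example values above; the actual report code may
compute them differently):

```c
/*
 * Sketch, inferred from the example report above, of the CPU-level ratios:
 *   sched_goidle / sched_count -> % of schedule() calls that left the CPU idle
 *   ttwu_local   / ttwu_count  -> % of wakeups targeting the local CPU
 *   run_delay    / rq_cpu_time -> wait time as % of run time on the CPU
 */
#include <stdio.h>

int main(void)
{
	unsigned long long sched_count = 402267, sched_goidle = 147161;
	unsigned long long ttwu_count = 236309, ttwu_local = 1062;
	unsigned long long rq_cpu_time = 7083791148ULL, run_delay = 3449973971ULL;

	printf("sched_goidle: %.2f%%\n", 100.0 * sched_goidle / sched_count); /* 36.58 */
	printf("ttwu_local:   %.2f%%\n", 100.0 * ttwu_local / ttwu_count);    /* 0.45 */
	printf("run_delay:    %.2f%%\n", 100.0 * run_delay / rq_cpu_time);    /* 48.70 */
	return 0;
}
```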
Next are load-balancing statistics. For each of the sched domains
(e.g. `SMT`, `MC`, `DIE`, ...), the scheduler computes statistics under
the following three categories:
1) Idle Load Balance: Load balancing performed on behalf of a long
idling CPU by some other CPU.
2) Busy Load Balance: Load balancing performed when the CPU was busy.
3) New Idle Balance : Load balancing performed when a CPU just became
idle.
Under each of these three categories, `perf sched stats report` provides
different load-balancing statistics. Along with the direct stats, the
report also contains derived metrics prefixed with *. Example:
----------------------------------------------------------------------------------------------------
CPU 0, DOMAIN SMT CPUS 0,64
----------------------------------------------------------------------------------------------------
DESC COUNT AVG_JIFFIES
----------------------------------------- <Category busy> ------------------------------------------
busy_lb_count : 136 $ 17.08 $
busy_lb_balanced : 131 $ 17.73 $
busy_lb_failed : 0 $ 0.00 $
busy_lb_imbalance_load : 58
busy_lb_imbalance_util : 0
busy_lb_imbalance_task : 0
busy_lb_imbalance_misfit : 0
busy_lb_gained : 7
busy_lb_hot_gained : 0
busy_lb_nobusyq : 2 $ 1161.50 $
busy_lb_nobusyg : 129 $ 18.01 $
*busy_lb_success_count : 5
*busy_lb_avg_pulled : 1.40
----------------------------------------- <Category idle> ------------------------------------------
idle_lb_count : 449 $ 5.17 $
idle_lb_balanced : 382 $ 6.08 $
idle_lb_failed : 3 $ 774.33 $
idle_lb_imbalance_load : 0
idle_lb_imbalance_util : 0
idle_lb_imbalance_task : 71
idle_lb_imbalance_misfit : 0
idle_lb_gained : 67
idle_lb_hot_gained : 0
idle_lb_nobusyq : 0 $ 0.00 $
idle_lb_nobusyg : 382 $ 6.08 $
*idle_lb_success_count : 64
*idle_lb_avg_pulled : 1.05
---------------------------------------- <Category newidle> ----------------------------------------
newidle_lb_count : 30471 $ 0.08 $
newidle_lb_balanced : 28490 $ 0.08 $
newidle_lb_failed : 633 $ 3.67 $
newidle_lb_imbalance_load : 0
newidle_lb_imbalance_util : 0
newidle_lb_imbalance_task : 2040
newidle_lb_imbalance_misfit : 0
newidle_lb_gained : 1348
newidle_lb_hot_gained : 0
newidle_lb_nobusyq : 6 $ 387.17 $
newidle_lb_nobusyg : 26634 $ 0.09 $
*newidle_lb_success_count : 1348
*newidle_lb_avg_pulled : 1.00
----------------------------------------------------------------------------------------------------
Consider the following line:
newidle_lb_balanced : 28490 $ 0.08 $
While profiling was active, the load balancer found 28490 times that the
load needed to be balanced on the newly idle CPU 0. The value enclosed in
$ is the average number of jiffies between two such events, i.e. the
elapsed time in jiffies divided by the event count.
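The derived metrics prefixed with * appear to follow directly from the raw
counters; below is a small sketch with the relationships inferred from the
example values above (the real report code may compute them differently):

```c
/*
 * Per-domain derived values, inferred from the example report:
 *   AVG_JIFFIES          = elapsed_jiffies / event_count
 *   *<cat>_success_count = count - balanced - failed
 *   *<cat>_avg_pulled    = gained / success_count
 */
#include <stdio.h>

int main(void)
{
	/* <Category newidle> values from the example report */
	unsigned long long count = 30471, balanced = 28490, failed = 633, gained = 1348;
	unsigned long long success = count - balanced - failed;	/* 1348 */

	printf("*newidle_lb_success_count : %llu\n", success);
	printf("*newidle_lb_avg_pulled    : %.2f\n",
	       success ? (double)gained / success : 0.0);		/* 1.00 */
	return 0;
}
```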
Next are active_load_balance() stats. Active load balancing (alb) did not
trigger while profiling was active, hence all the values are 0.
--------------------------------- <Category active_load_balance()> ---------------------------------
alb_count : 0
alb_failed : 0
alb_pushed : 0
----------------------------------------------------------------------------------------------------
Next are sched_balance_exec() and sched_balance_fork() stats. They are
not used, but we have kept them since the RFC just for legacy purposes.
Unless opposed, we plan to remove them in the next revision.
Next are wakeup statistics. For every domain, the report shows
task-wakeup statistics. Example:
------------------------------------------ <Wakeup Info> -------------------------------------------
ttwu_wake_remote : 1590
ttwu_move_affine : 84
ttwu_move_balance : 0
----------------------------------------------------------------------------------------------------
The same set of stats is reported for each CPU and each domain level.
HOW TO INTERPRET THE DIFF
-------------------------
`perf sched stats diff` also starts by explaining the columns present
in the diff. It then shows the difference in elapsed time in terms of
jiffies. The order of the values depends on the order of the input data
files. Example:
----------------------------------------------------------------------------------------------------
Time elapsed (in jiffies) : 2763, 2763
----------------------------------------------------------------------------------------------------
Below is a sample showing the difference in cpu and domain stats between
two runs. The values enclosed in `|...|` show the percent change between
the two runs, while the two preceding columns show the side-by-side values
of the corresponding fields from `perf sched stats report`.
----------------------------------------------------------------------------------------------------
CPU <ALL CPUS SUMMARY>
----------------------------------------------------------------------------------------------------
DESC COUNT1 COUNT2 PCT_CHANG>
----------------------------------------------------------------------------------------------------
yld_count : 0, 0 | 0.00>
array_exp : 0, 0 | 0.00>
sched_count : 528533, 412573 | -21.94>
sched_goidle : 193426, 146082 | -24.48>
ttwu_count : 313134, 385975 | 23.26>
ttwu_local : 1126, 1282 | 13.85>
rq_cpu_time : 8257200244, 8301250047 | 0.53>
run_delay : 4728347053, 3997100703 | -15.47>
pcount : 335031, 266396 | -20.49>
----------------------------------------------------------------------------------------------------
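A minimal sketch of how the PCT_CHANGE column appears to be computed,
inferred from the example values above (note the example output prints
0.00 when the base value is zero, e.g. alb_count going from 0 to 1):

```c
/*
 * Percent change of the second run relative to the first, as inferred
 * from the example diff output (e.g. sched_count 528533 -> 412573
 * gives -21.94%); zero base values are reported as 0.00.
 */
#include <stdio.h>

static double pct_change(double base, double new_val)
{
	return base ? (new_val - base) * 100.0 / base : 0.0;
}

int main(void)
{
	printf("sched_count : %.2f%%\n", pct_change(528533, 412573));		/* -21.94 */
	printf("run_delay   : %.2f%%\n", pct_change(4728347053.0, 3997100703.0)); /* -15.47 */
	printf("alb_count   : %.2f%%\n", pct_change(0, 1));			/* 0.00 */
	return 0;
}
```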
Below is the sample of domain stats diff:
----------------------------------------------------------------------------------------------------
CPU <ALL CPUS SUMMARY>, DOMAIN SMT
----------------------------------------------------------------------------------------------------
DESC COUNT1 COUNT2 PCT_CHANG>
----------------------------------------- <Category busy> ------------------------------------------
busy_lb_count : 122, 80 | -34.43>
busy_lb_balanced : 115, 76 | -33.91>
busy_lb_failed : 1, 3 | 200.00>
busy_lb_imbalance_load : 35, 49 | 40.00>
busy_lb_imbalance_util : 0, 0 | 0.00>
busy_lb_imbalance_task : 0, 0 | 0.00>
busy_lb_imbalance_misfit : 0, 0 | 0.00>
busy_lb_gained : 7, 2 | -71.43>
busy_lb_hot_gained : 0, 0 | 0.00>
busy_lb_nobusyq : 0, 0 | 0.00>
busy_lb_nobusyg : 115, 76 | -33.91>
*busy_lb_success_count : 6, 1 | -83.33>
*busy_lb_avg_pulled : 1.17, 2.00 | 71.43>
----------------------------------------- <Category idle> ------------------------------------------
idle_lb_count : 568, 620 | 9.15>
idle_lb_balanced : 462, 449 | -2.81>
idle_lb_failed : 11, 21 | 90.91>
idle_lb_imbalance_load : 0, 0 | 0.00>
idle_lb_imbalance_util : 0, 0 | 0.00>
idle_lb_imbalance_task : 115, 189 | 64.35>
idle_lb_imbalance_misfit : 0, 0 | 0.00>
idle_lb_gained : 103, 169 | 64.08>
idle_lb_hot_gained : 0, 0 | 0.00>
idle_lb_nobusyq : 0, 0 | 0.00>
idle_lb_nobusyg : 462, 449 | -2.81>
*idle_lb_success_count : 95, 150 | 57.89>
*idle_lb_avg_pulled : 1.08, 1.13 | 3.92>
---------------------------------------- <Category newidle> ----------------------------------------
newidle_lb_count : 16961, 3155 | -81.40>
newidle_lb_balanced : 15646, 2556 | -83.66>
newidle_lb_failed : 397, 142 | -64.23>
newidle_lb_imbalance_load : 0, 0 | 0.00>
newidle_lb_imbalance_util : 0, 0 | 0.00>
newidle_lb_imbalance_task : 1376, 655 | -52.40>
newidle_lb_imbalance_misfit : 0, 0 | 0.00>
newidle_lb_gained : 917, 457 | -50.16>
newidle_lb_hot_gained : 0, 0 | 0.00>
newidle_lb_nobusyq : 3, 1 | -66.67>
newidle_lb_nobusyg : 14480, 2103 | -85.48>
*newidle_lb_success_count : 918, 457 | -50.22>
*newidle_lb_avg_pulled : 1.00, 1.00 | 0.11>
--------------------------------- <Category active_load_balance()> ---------------------------------
alb_count : 0, 1 | 0.00>
alb_failed : 0, 0 | 0.00>
alb_pushed : 0, 1 | 0.00>
--------------------------------- <Category sched_balance_exec()> ----------------------------------
sbe_count : 0, 0 | 0.00>
sbe_balanced : 0, 0 | 0.00>
sbe_pushed : 0, 0 | 0.00>
--------------------------------- <Category sched_balance_fork()> ----------------------------------
sbf_count : 0, 0 | 0.00>
sbf_balanced : 0, 0 | 0.00>
sbf_pushed : 0, 0 | 0.00>
------------------------------------------ <Wakeup Info> -------------------------------------------
ttwu_wake_remote : 2031, 2914 | 43.48>
ttwu_move_affine : 73, 124 | 69.86>
ttwu_move_balance : 0, 0 | 0.00>
----------------------------------------------------------------------------------------------------
v3: https://lore.kernel.org/all/20250311120230.61774-1-swapnil.sapkal@amd.com/
v3->v4:
- All the review comments from v3 are addressed [Namhyung Kim].
- Print short names instead of field descriptions in the report [Peter Zijlstra]
- Fix the double free issue [Cristian Prundeanu]
- Documentation update related to `perf sched stats diff` [Chen Yu]
- Bail out `perf sched stats diff` if perf.data files have different schedstat
versions [Peter Zijlstra]
v2: https://lore.kernel.org/all/20241122084452.1064968-1-swapnil.sapkal@amd.com/
v2->v3:
- Add perf unit test for basic sched stats functionalities
- Describe the new tool, its usage and interpretation of report data in the
perf-sched man page.
- Add /proc/schedstat version 17 support.
v1: https://lore.kernel.org/lkml/20240916164722.1838-1-ravi.bangoria@amd.com
v1->v2:
- Add the support for `perf sched stats diff`
- Add column header in report for better readability. Use
procfs__mountpoint for consistency. Add hint for enabling
CONFIG_SCHEDSTAT if disabled. [James Clark]
- Use a single header file for both cpu and domain fields. Change
the layout of structs to minimise the padding. I tried changing
`v15` to `15` in the header files but it was not giving any
benefits, so I dropped the idea. [Namhyung Kim]
- Add tested-by.
RFC: https://lore.kernel.org/r/20240508060427.417-1-ravi.bangoria@amd.com
RFC->v1:
- [Kernel] Print domain name along with domain number in /proc/schedstat
file.
- s/schedstat/stats/ for the subcommand.
- Record domain name and cpumask details, also show them in report.
- Add CPU filtering capability at record and report time.
- Add /proc/schedstat v16 support.
- Live mode support. Similar to perf stat command, live mode prints the
sched stats on the stdout.
- Add pager support in `perf sched stats report` for better scrolling.
- Some minor cosmetic changes in report output to improve readability.
- Rebase to latest perf-tools-next/perf-tools-next (1de5b5dcb835).
TODO:
- perf sched stats records /proc/schedstat, which contains CPU- and
domain-level scheduler statistics. We are planning to add a taskstat
tool which reads task stats from procfs and generates a scheduler
statistics report at task granularity. This will probably be a
standalone tool, something like `perf sched taskstat record/report`.
- Except for pre-processor related checkpatch warnings, we have addressed
most of the other possible warnings.
- This version supports diff for two perf.data files captured with the
same schedstat version, but the target is to show a diff across multiple
perf.data files. The plan is to also support diff when the provided
perf.data files have different schedstat versions.
Patches are prepared on v6.17-rc3 (1b237f190eb3).
[1] https://youtu.be/lg-9aG2ajA0?t=283
[2] https://github.com/AMDESE/sched-scoreboard
[3] https://lore.kernel.org/lkml/c50bdbfe-02ce-c1bc-c761-c95f8e216ca0@amd.com/
[4] https://lore.kernel.org/lkml/3e32bec6-5e59-c66a-7676-7d15df2c961c@amd.com/
[5] https://lore.kernel.org/all/20241122084452.1064968-1-swapnil.sapkal@amd.com/
[6] https://lore.kernel.org/lkml/3170d16e-eb67-4db8-a327-eb8188397fdb@amd.com/
[7] https://lore.kernel.org/lkml/feb31b6e-6457-454c-a4f3-ce8ad96bf8de@amd.com/
Swapnil Sapkal (11):
perf: Add print_separator to util
tools/lib: Add list_is_first()
perf header: Support CPU DOMAIN relation info
perf sched stats: Add record and rawdump support
perf sched stats: Add schedstat v16 support
perf sched stats: Add schedstat v17 support
perf sched stats: Add support for report subcommand
perf sched stats: Add support for live mode
perf sched stats: Add support for diff subcommand
perf sched stats: Add basic perf sched stats test
perf sched stats: Add details in man page
tools/include/linux/list.h | 10 +
tools/lib/perf/Documentation/libperf.txt | 2 +
tools/lib/perf/Makefile | 1 +
tools/lib/perf/include/perf/event.h | 69 ++
tools/lib/perf/include/perf/schedstat-v15.h | 146 +++
tools/lib/perf/include/perf/schedstat-v16.h | 146 +++
tools/lib/perf/include/perf/schedstat-v17.h | 164 +++
tools/perf/Documentation/perf-sched.txt | 261 ++++-
.../Documentation/perf.data-file-format.txt | 17 +
tools/perf/builtin-inject.c | 3 +
tools/perf/builtin-kwork.c | 13 +-
tools/perf/builtin-sched.c | 1027 ++++++++++++++++-
tools/perf/tests/shell/perf_sched_stats.sh | 64 +
tools/perf/util/env.h | 16 +
tools/perf/util/event.c | 52 +
tools/perf/util/event.h | 2 +
tools/perf/util/header.c | 304 +++++
tools/perf/util/header.h | 6 +
tools/perf/util/session.c | 22 +
tools/perf/util/synthetic-events.c | 196 ++++
tools/perf/util/synthetic-events.h | 3 +
tools/perf/util/tool.c | 18 +
tools/perf/util/tool.h | 4 +-
tools/perf/util/util.c | 48 +
tools/perf/util/util.h | 5 +
25 files changed, 2587 insertions(+), 12 deletions(-)
create mode 100644 tools/lib/perf/include/perf/schedstat-v15.h
create mode 100644 tools/lib/perf/include/perf/schedstat-v16.h
create mode 100644 tools/lib/perf/include/perf/schedstat-v17.h
create mode 100755 tools/perf/tests/shell/perf_sched_stats.sh
--
2.43.0
Hello all,

Missed to add perf folks to the list. Adding them here. Sorry about that.

--
Thanks and Regards,
Swapnil

On 8/26/2025 10:40 AM, Swapnil Sapkal wrote:
> [...]
On Wed, Aug 27, 2025 at 9:43 PM Sapkal, Swapnil <swapnil.sapkal@amd.com> wrote:
>
> Hello all,
>
> Missed to add perf folks to the list. Adding them here. Sorry about that.
Hi Swapnil,
I was wondering if this patch series was active? The kernel test robot
mentioned an issue.
> --
> Thanks and Regards,
> Swapnil
>
> On 8/26/2025 10:40 AM, Swapnil Sapkal wrote:
> > Apologies for long delay, I was side tracked on some other items. I was
> > not able to focus on this.
> >
> > MOTIVATION
> > ----------
> >
> > Existing `perf sched` is quite exhaustive and provides lot of insights
> > into scheduler behavior but it quickly becomes impractical to use for
> > long running or scheduler intensive workload. For ex, `perf sched record`
> > has ~7.77% overhead on hackbench (with 25 groups each running 700K loops
> > on a 2-socket 128 Cores 256 Threads 3rd Generation EPYC Server), and it
> > generates huge 56G perf.data for which perf takes ~137 mins to prepare
> > and write it to disk [1].
> >
> > Unlike `perf sched record`, which hooks onto set of scheduler tracepoints
> > and generates samples on a tracepoint hit, `perf sched stats record` takes
> > snapshot of the /proc/schedstat file before and after the workload, i.e.
> > there is almost zero interference on workload run. Also, it takes very
> > minimal time to parse /proc/schedstat, convert it into perf samples and
> > save those samples into perf.data file. Result perf.data file is much
> > smaller. So, overall `perf sched stats record` is much more light weight
> > compare to `perf sched record`.
> >
> > We, internally at AMD, have been using this (a variant of this, known as
> > "sched-scoreboard"[2]) and found it to be very useful to analyse impact
> > of any scheduler code changes[3][4]. Prateek used v2[5] of this patch
> > series to report the analysis[6][7].
> >
> > Please note that, this is not a replacement of perf sched record/report.
> > The intended users of the new tool are scheduler developers, not regular
> > users.
> >
> > USAGE
> > -----
> >
> > # perf sched stats record
> > # perf sched stats report
> > # perf sched stats diff
> >
> > Note: Although `perf sched stats` tool supports workload profiling syntax
> > (i.e. -- <workload> ), the recorded profile is still systemwide since the
> > /proc/schedstat is a systemwide file.
> >
> > HOW TO INTERPRET THE REPORT
> > ---------------------------
> >
> > The `perf sched stats report` starts with description of the columns
> > present in the report. These column names are given before cpu and
> > domain stats to improve the readability of the report.
> >
> > ----------------------------------------------------------------------------------------------------
> > DESC -> Description of the field
> > COUNT -> Value of the field
> > PCT_CHANGE -> Percent change with corresponding base value
> > AVG_JIFFIES -> Avg time in jiffies between two consecutive occurrence of event
> > ----------------------------------------------------------------------------------------------------
> >
> > Next is the total profiling time in terms of jiffies:
> >
> > ----------------------------------------------------------------------------------------------------
> > Time elapsed (in jiffies) : 24537
> > ----------------------------------------------------------------------------------------------------
> >
> > Next is CPU scheduling statistics. These are simple diffs of
> > /proc/schedstat CPU lines along with description. The report also
> > prints % relative to base stat.
I wonder if this is similar to user_time and system_time:
```
$ perf list
...
tool:
...
system_time
[System/kernel time in nanoseconds. Unit: tool]
...
user_time
[User (non-kernel) time in nanoseconds. Unit: tool]
...
```
These events are implemented by reading /proc/stat and /proc/pid/stat:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/tool_pmu.c?h=perf-tools-next#n267
As they are events then they can appear in perf stat output and also
within metrics.
Thanks,
Ian
Hi Ian,

On 10-12-2025 02:33, Ian Rogers wrote:
> I was wondering if this patch series was active? The kernel test robot
> mentioned an issue.

The series is active. I have a fix for the kernel test robot issue. I will be
posting the next version in a week. I did not see any more review comments on
the series; hopefully it will be the final version.

--
Thanks and regards,
Swapnil
Hello,

On Tue, Dec 16, 2025 at 03:39:21PM +0530, Swapnil Sapkal wrote:
> The series is active. I have a fix for the kernel test robot issue. I will be
> posting the next version in a week. I did not see any more review comments on
> the series; hopefully it will be the final version.

Sorry for the delay. I'll try to review the series next week. ;-)

Thanks,
Namhyung
Hi Namhyung,

On 17-12-2025 21:07, Namhyung Kim wrote:
> Sorry for the delay. I'll try to review the series next week. ;-)

No problem. I will be taking off next week. I will incorporate your review
comments after coming back and post the new series.

--
Thanks and Regards,
Swapnil
Hi Ian,

>>> Next is CPU scheduling statistics. These are simple diffs of
>>> /proc/schedstat CPU lines along with description. The report also
>>> prints % relative to base stat.
>
> I wonder if this is similar to user_time and system_time:
> ```
> $ perf list
> ...
> tool:
> ...
>   system_time
>     [System/kernel time in nanoseconds. Unit: tool]
> ...
>   user_time
>     [User (non-kernel) time in nanoseconds. Unit: tool]
> ...
> ```
> These events are implemented by reading /proc/stat and /proc/pid/stat:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/tool_pmu.c?h=perf-tools-next#n267
>
> As they are events then they can appear in perf stat output and also
> within metrics.

Create synthesized events for each field of /proc/schedstat?

Your idea is interesting and, I suppose, will work best when we care
about individual counters. However, for the "perf sched stats" tool,
I see at least two challenges:

1. One of the design goals of "perf sched stats" was to keep the
   overhead low. Currently, it reads /proc/schedstat once at the
   beginning and once at the end. Switching to per-counter events
   would require opening, reading and closing a large number of
   events, which would incur significant overhead.

2. Taking a snapshot in one go allows us to correlate counts easily.
   Using synthetic events would force us to read each counter
   individually, making cross-counter correlation impossible.

Thanks,
Ravi
On Thu, Dec 11, 2025 at 7:43 PM Ravi Bangoria <ravi.bangoria@amd.com> wrote:
> Your idea is interesting and, I suppose, will work best when we care
> about individual counters. However, for the "perf sched stats" tool,
> I see at least two challenges:
>
> 1. One of the design goals of "perf sched stats" was to keep the
>    overhead low. Currently, it reads /proc/schedstat once at the
>    beginning and once at the end. Switching to per-counter events
>    would require opening, reading and closing a large number of
>    events, which would incur significant overhead.
>
> 2. Taking a snapshot in one go allows us to correlate counts easily.
>    Using synthetic events would force us to read each counter
>    individually, making cross-counter correlation impossible.

Thanks Ravi, those are interesting problems. There are similar problems
with just reading regular counters. For example, with the problem in this
series:
https://lore.kernel.org/lkml/20251113180517.44096-1-irogers@google.com/
that was reduced to just the remaining:
https://lore.kernel.org/lkml/20251118211326.1840989-1-irogers@google.com/
we could do a better bandwidth calculation if duration_time were read
along with the uncore counters. Perhaps we can have say a "wall-clock"
software counter (ie like cpu-clock and task-clock) to allow that and
allow the group of events to be read in one go as optimized here:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/evsel.c?h=perf-tools-next#n1910

So maybe there is potential for a read group type optimization of tool
like counters, to do something similar to what you are doing here.
Anyway, that's a different set of things to do and shouldn't inhibit
trying to get this series to land.

Thanks,
Ian