MOTIVATION
----------

The existing `perf sched` is quite exhaustive and provides a lot of insight
into scheduler behavior, but it quickly becomes impractical to use for
long-running or scheduler-intensive workloads. For example, `perf sched record`
has ~7.77% overhead on hackbench (with 25 groups, each running 700K loops, on a
2-socket 128-core 256-thread 3rd Generation EPYC server), and it generates a
huge 56G perf.data which perf takes ~137 mins to prepare and write to disk [1].

Unlike `perf sched record`, which hooks onto a set of scheduler tracepoints
and generates samples on each tracepoint hit, `perf sched stats record` takes
snapshots of the /proc/schedstat file before and after the workload, i.e.
there is almost zero interference with the workload run. It also takes very
little time to parse /proc/schedstat, convert it into perf samples and save
those samples into the perf.data file. The resulting perf.data file is much
smaller. So, overall, `perf sched stats record` is much more lightweight
compared to `perf sched record`.

We, internally at AMD, have been using this (a variant of it, known as
"sched-scoreboard" [2]) and found it very useful for analysing the impact of
scheduler code changes [3][4].

Please note that this is not a replacement for perf sched record/report.
The intended users of the new tool are scheduler developers, not regular
users.

USAGE
-----

  # perf sched stats record
  # perf sched stats report

Note: Although the `perf sched stats` tool supports the workload profiling
syntax (i.e. -- <workload>), the recorded profile is still system-wide, since
/proc/schedstat is a system-wide file.

HOW TO INTERPRET THE REPORT
---------------------------

The `perf sched stats report` starts with the total time that profiling was
active, in jiffies:

----------------------------------------------------------------------------------------------------
Time elapsed (in jiffies)                                      :      24537
----------------------------------------------------------------------------------------------------

Next come the CPU scheduling statistics. These are simple diffs of the
/proc/schedstat CPU lines, along with a description. The report also prints
percentages relative to a base stat.

In the example below, schedule() left CPU0 idle 98.19% of the time, 16.54% of
the try_to_wake_up() calls were to wake up the local CPU, and the total
waittime of tasks on CPU0 is 0.49% of their total runtime on the same CPU.

----------------------------------------------------------------------------------------------------
CPU 0
----------------------------------------------------------------------------------------------------
sched_yield() count                                            :          0
Legacy counter can be ignored                                  :          0
schedule() called                                              :      17138
schedule() left the processor idle                             :      16827  (  98.19% )
try_to_wake_up() was called                                    :        508
try_to_wake_up() was called to wake up the local cpu           :         84  (  16.54% )
total runtime by tasks on this processor (in jiffies)          : 2408959243
total waittime by tasks on this processor (in jiffies)         :   11731825  (   0.49% )
total timeslices run on this cpu                               :        311
----------------------------------------------------------------------------------------------------
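As a rough illustration of the "diff of /proc/schedstat CPU lines" idea, the
sketch below reproduces the per-CPU counters and the idle percentage from two
snapshots of /proc/schedstat taken by hand around a workload. This is only a
sketch of the computation, not the tool's implementation; the snapshot file
names and the field labels are hypothetical, and the counter order is simply
the order shown in the report above:

  # Illustrative only: diff two hand-taken /proc/schedstat snapshots.
  # Field labels are ours; their order follows the report rows above.
  FIELDS = [
      "sched_yield_count", "legacy_counter", "schedule_called",
      "schedule_left_idle", "ttwu_count", "ttwu_local",
      "runtime_jiffies", "waittime_jiffies", "timeslices",
  ]

  def parse_cpu_lines(path):
      """Return {cpu_name: [counter, ...]} for every 'cpuN' line in a snapshot."""
      cpus = {}
      with open(path) as f:
          for line in f:
              tok = line.split()
              if tok and tok[0].startswith("cpu"):
                  cpus[tok[0]] = [int(x) for x in tok[1:len(FIELDS) + 1]]
      return cpus

  before = parse_cpu_lines("schedstat.before")  # hypothetical pre-workload snapshot
  after = parse_cpu_lines("schedstat.after")    # hypothetical post-workload snapshot

  for cpu in sorted(before):
      # Per-CPU diff of the counters over the profiling window
      diff = dict(zip(FIELDS, (a - b for a, b in zip(after[cpu], before[cpu]))))
      # e.g. 16827 / 17138 = 98.19% idle for CPU0 in the example report
      if diff["schedule_called"]:
          idle_pct = 100.0 * diff["schedule_left_idle"] / diff["schedule_called"]
      else:
          idle_pct = 0.0
      print("%s: schedule() called %d, left idle %d (%.2f%%)"
            % (cpu, diff["schedule_called"], diff["schedule_left_idle"], idle_pct))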
Next come the load balancing statistics. For each of the sched domains
(e.g. `SMT`, `MC`, `DIE`...), the scheduler computes statistics under the
following three categories:

1) Idle Load Balance: Load balancing performed on behalf of a long
   idling CPU by some other CPU.
2) Busy Load Balance: Load balancing performed when the CPU was busy.
3) New Idle Balance: Load balancing performed when a CPU just became
   idle.

Under each of these three categories, the sched stats report provides
different load balancing statistics. Along with the direct stats, the report
also contains derived metrics, prefixed with *. Example:

----------------------------------------------------------------------------------------------------
CPU 0 DOMAIN SMT CPUS <0, 64>
----------------------------------------- <Category idle> ------------------------------------------
load_balance() count on cpu idle                               :         50   $    490.74 $
load_balance() found balanced on cpu idle                      :         42   $    584.21 $
load_balance() move task failed on cpu idle                    :          8   $   3067.12 $
imbalance sum on cpu idle                                      :          8
pull_task() count on cpu idle                                  :          0
pull_task() when target task was cache-hot on cpu idle         :          0
load_balance() failed to find busier queue on cpu idle         :          0   $      0.00 $
load_balance() failed to find busier group on cpu idle         :         42   $    584.21 $
*load_balance() success count on cpu idle                      :          0
*avg task pulled per successful lb attempt (cpu idle)          :       0.00
----------------------------------------- <Category busy> ------------------------------------------
load_balance() count on cpu busy                               :          2   $  12268.50 $
load_balance() found balanced on cpu busy                      :          2   $  12268.50 $
load_balance() move task failed on cpu busy                    :          0   $      0.00 $
imbalance sum on cpu busy                                      :          0
pull_task() count on cpu busy                                  :          0
pull_task() when target task was cache-hot on cpu busy         :          0
load_balance() failed to find busier queue on cpu busy         :          0   $      0.00 $
load_balance() failed to find busier group on cpu busy         :          1   $  24537.00 $
*load_balance() success count on cpu busy                      :          0
*avg task pulled per successful lb attempt (cpu busy)          :       0.00
---------------------------------------- <Category newidle> ----------------------------------------
load_balance() count on cpu newly idle                         :        427   $     57.46 $
load_balance() found balanced on cpu newly idle                :        382   $     64.23 $
load_balance() move task failed on cpu newly idle              :         45   $    545.27 $
imbalance sum on cpu newly idle                                :         48
pull_task() count on cpu newly idle                            :          0
pull_task() when target task was cache-hot on cpu newly idle   :          0
load_balance() failed to find busier queue on cpu newly idle   :          0   $      0.00 $
load_balance() failed to find busier group on cpu newly idle   :        382   $     64.23 $
*load_balance() success count on cpu newly idle                :          0
*avg task pulled per successful lb attempt (cpu newly idle)    :       0.00
----------------------------------------------------------------------------------------------------

Consider the following line:

load_balance() found balanced on cpu newly idle                :        382   $     64.23 $

While profiling was active, the load balancer found 382 times that the load
needed to be balanced on the newly idle CPU 0. The value enclosed in $ signs
is the average number of jiffies between two such events (24537 / 382 = 64.23).
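To spell out the $ metric: it is just the elapsed jiffies divided by the number
of times the event occurred, i.e. the mean interval between two occurrences of
that event. A minimal sketch, using the numbers from the line above (the helper
name is ours, not part of the tool):

  def avg_jiffies_between(elapsed_jiffies, event_count):
      """Mean jiffies between two occurrences of an event, as printed
      between the $ signs in the report (0.0 if the event never fired)."""
      return elapsed_jiffies / event_count if event_count else 0.0

  # load_balance() found balanced on cpu newly idle : 382  ->  $ 64.23 $
  print("%.2f" % avg_jiffies_between(24537, 382))   # prints 64.23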
Next are the active_load_balance() stats. alb did not trigger while profiling
was active, hence they are all 0s.

--------------------------------- <Category active_load_balance()> ---------------------------------
active_load_balance() count                                    :          0
active_load_balance() move task failed                         :          0
active_load_balance() successfully moved a task                :          0
----------------------------------------------------------------------------------------------------

Next are the sched_balance_exec() and sched_balance_fork() stats. They are not
used, but we kept them in the RFC just for legacy purposes. Unless opposed, we
plan to remove them in the next revision.

Next are the wakeup statistics. For every domain, the report also shows
task-wakeup statistics. Example:

------------------------------------------- <Wakeup Info> ------------------------------------------
try_to_wake_up() awoke a task that last ran on a diff cpu      :      12070
try_to_wake_up() moved task because cache-cold on own cpu      :       3324
try_to_wake_up() started passive balancing                     :          0
----------------------------------------------------------------------------------------------------

The same set of stats is reported for each CPU and each domain level.

RFC: https://lore.kernel.org/r/20240508060427.417-1-ravi.bangoria@amd.com

RFC->v1:
- [Kernel] Print the domain name along with the domain number in the
  /proc/schedstat file.
- s/schedstat/stats/ for the subcommand.
- Record domain name and cpumask details, and also show them in the report.
- Add CPU filtering capability at record and report time.
- Add /proc/schedstat v16 support.
- Live mode support. Similar to the perf stat command, live mode prints the
  sched stats on stdout.
- Add pager support in `perf sched stats report` for better scrolling.
- Some minor cosmetic changes in the report output to improve readability.
- Rebase to the latest perf-tools-next/perf-tools-next (1de5b5dcb835).

TODO:
- Add perf unit tests to test basic sched stats functionality.
- Describe the new tool, its usage and the interpretation of the report data
  in the perf-sched man page.
- Currently the sched stats tool provides statistics of only one run, but we
  are planning to add `perf sched stats diff`, which can compare the data of
  two different runs (possibly good and bad) and highlight where scheduler
  decisions are impacting workload performance.
- perf sched stats records /proc/schedstat, which is a CPU- and domain-level
  scheduler statistic. We are planning to add a taskstat tool which reads task
  stats from procfs and generates a scheduler statistics report at task
  granularity. This will probably be a standalone tool, something like
  `perf sched taskstat record/report`.
- Except for pre-processor related checkpatch warnings, we have addressed most
  of the other possible warnings.

Patches are prepared on perf-tools-next/perf-tools-next (1de5b5dcb835).

Apologies for the long delay in respinning. sched-ext was proposed while we
were working on the next revision, so we held off for a moment to let the dust
settle and get a clear idea of whether the new tool would be useful.
[1] https://youtu.be/lg-9aG2ajA0?t=283
[2] https://github.com/AMDESE/sched-scoreboard
[3] https://lore.kernel.org/lkml/c50bdbfe-02ce-c1bc-c761-c95f8e216ca0@amd.com/
[4] https://lore.kernel.org/lkml/3e32bec6-5e59-c66a-7676-7d15df2c961c@amd.com/

K Prateek Nayak (1):
  sched/stats: Print domain name in /proc/schedstat

Swapnil Sapkal (4):
  perf sched stats: Add record and rawdump support
  perf sched stats: Add schedstat v16 support
  perf sched stats: Add support for report subcommand
  perf sched stats: Add support for live mode

 Documentation/scheduler/sched-stats.rst       |   8 +-
 kernel/sched/stats.c                          |   6 +-
 tools/lib/perf/Documentation/libperf.txt      |   2 +
 tools/lib/perf/Makefile                       |   2 +-
 tools/lib/perf/include/perf/event.h           |  56 ++
 .../lib/perf/include/perf/schedstat-cpu-v15.h |  22 +
 .../lib/perf/include/perf/schedstat-cpu-v16.h |  22 +
 .../perf/include/perf/schedstat-domain-v15.h  | 121 +++
 .../perf/include/perf/schedstat-domain-v16.h  | 121 +++
 tools/perf/builtin-inject.c                   |   2 +
 tools/perf/builtin-sched.c                    | 778 +++++++++++++++++-
 tools/perf/util/event.c                       | 104 +++
 tools/perf/util/event.h                       |   2 +
 tools/perf/util/session.c                     |  20 +
 tools/perf/util/synthetic-events.c            | 255 ++++++
 tools/perf/util/synthetic-events.h            |   3 +
 tools/perf/util/tool.c                        |  20 +
 tools/perf/util/tool.h                        |   4 +-
 18 files changed, 1542 insertions(+), 6 deletions(-)
 create mode 100644 tools/lib/perf/include/perf/schedstat-cpu-v15.h
 create mode 100644 tools/lib/perf/include/perf/schedstat-cpu-v16.h
 create mode 100644 tools/lib/perf/include/perf/schedstat-domain-v15.h
 create mode 100644 tools/lib/perf/include/perf/schedstat-domain-v16.h

-- 
2.46.0
On 16/09/2024 17:47, Ravi Bangoria wrote:
> Consider the following line:
>
> load_balance() found balanced on cpu newly idle                :        382   $     64.23 $
>
> While profiling was active, the load balancer found 382 times that the load
> needed to be balanced on the newly idle CPU 0. The value enclosed in $ signs
> is the average number of jiffies between two such events (24537 / 382 = 64.23).

This explanation of the $ fields is quite buried. Is there a way of
making it clearer with a column header in the report? I think even if it
was documented in the man pages it might not be that useful.

There are also other jiffies fields that don't use $. Maybe if it was
like this it could be semi self documenting:

----------------------------------------------------------------------
Time elapsed (in jiffies)                          : $  24537 $
----------------------------------------------------------------------

------------------ <Category newidle> ---------------------------------
load_balance() count on cpu newly idle             : 427   $ 57.46 avg $
----------------------------------------------------------------------

Other than that:

Tested-by: James Clark <james.clark@linaro.org>
Hi James,

On 9/17/2024 4:05 PM, James Clark wrote:
> On 16/09/2024 17:47, Ravi Bangoria wrote:
>> Consider the following line:
>>
>> load_balance() found balanced on cpu newly idle                :        382   $     64.23 $
>>
>> While profiling was active, the load balancer found 382 times that the load
>> needed to be balanced on the newly idle CPU 0. The value enclosed in $ signs
>> is the average number of jiffies between two such events (24537 / 382 = 64.23).
>
> This explanation of the $ fields is quite buried. Is there a way of
> making it clearer with a column header in the report? I think even if it
> was documented in the man pages it might not be that useful.

Thank you for the suggestion. I will add a header to the report to explain
what each column value represents.

> There are also other jiffies fields that don't use $. Maybe if it was
> like this it could be semi self documenting:
>
> ----------------------------------------------------------------------
> Time elapsed (in jiffies)                          : $  24537 $
> ----------------------------------------------------------------------
>
> ------------------ <Category newidle> ---------------------------------
> load_balance() count on cpu newly idle             : 427   $ 57.46 avg $
> ----------------------------------------------------------------------
>
> Other than that:
>
> Tested-by: James Clark <james.clark@linaro.org>

Ack.

-- 
Thanks and Regards,
Swapnil
Hi Ravi,

On 16/09/24 22:17, Ravi Bangoria wrote:
> Unlike `perf sched record`, which hooks onto a set of scheduler tracepoints
> and generates samples on each tracepoint hit, `perf sched stats record` takes
> snapshots of the /proc/schedstat file before and after the workload, i.e.
> there is almost zero interference with the workload run. It also takes very
> little time to parse /proc/schedstat, convert it into perf samples and save
> those samples into the perf.data file. The resulting perf.data file is much
> smaller.

The perf.data file is empty after the record.

Error:
The perf.data data has no samples!

Thanks and Regards
Madadi Vineeth Reddy
Hi Vineeth,

Thank you for testing the series.

On 9/17/2024 4:27 PM, Madadi Vineeth Reddy wrote:
> The perf.data file is empty after the record.
>
> Error:
> The perf.data data has no samples!

I am not able to reproduce this error on my system. Can you please share
the `/proc/schedstat` file from your system? What was the base kernel you
applied this on?

-- 
Thanks And Regards,
Swapnil