[PATCH] perf/test: Skip leader sampling for s390

Thomas Richter posted 1 patch 9 months, 2 weeks ago
tools/perf/tests/shell/record.sh | 6 ++++++
1 file changed, 6 insertions(+)
[PATCH] perf/test: Skip leader sampling for s390
Posted by Thomas Richter 9 months, 2 weeks ago
In the linux-next tree, perf test case 114 'perf record tests' has a subtest
named 'Basic leader sampling test' which always fails on s390.
The root cause is this invocation:

 # perf record -vv -e '{cycles,cycles}:Su' -- perf test -w brstack

 ...
 In the debug output the following 2 events are installed:

 ------------------------------------------------------------
 perf_event_attr:
  type                             0 (PERF_TYPE_HARDWARE)
  size                             136
  config                           0 (PERF_COUNT_HW_CPU_CYCLES)
  { sample_period, sample_freq }   4000
  sample_type                      IP|TID|TIME|READ|CPU|PERIOD|IDENTIFIER
  read_format                      ID|GROUP|LOST
  disabled                         1
  exclude_kernel                   1
  exclude_hv                       1
  freq                             1
  sample_id_all                    1
 ------------------------------------------------------------
 sys_perf_event_open: pid -1  cpu 0  group_fd -1  flags 0x8 = 5
 ------------------------------------------------------------
 perf_event_attr:
  type                             0 (PERF_TYPE_HARDWARE)
  size                             136
  config                           0 (PERF_COUNT_HW_CPU_CYCLES)
  sample_type                      IP|TID|TIME|READ|CPU|PERIOD|IDENTIFIER
  read_format                      ID|GROUP|LOST
  exclude_kernel                   1
  exclude_hv                       1
  sample_id_all                    1
 ------------------------------------------------------------
 sys_perf_event_open: pid -1  cpu 0  group_fd 5  flags 0x8 = 6
 ...

The first event is the group leader and is installed as a sampling event.
The second one is a group member and is installed as a counting event.

Namhyung Kim confirms this observation:
> Yep, the syntax '{event1,event2}:S' is for group leader sampling which
> reduces the overhead of PMU interrupts.  The idea is that those events
> are scheduled together so sampling is enabled only for the leader
> (usually the first) event and it reads counts from the member events
> using PERF_SAMPLE_READ.
>
> So they should have the same counts if it uses the same events in a
> group.

However this does not work on s390. s390 has one dedicated sampling PMU
which supports only one event. A different PMU is used for counting.
Both run concurrently using different setups and frequencies.

On s390x a sampling event is set up using a preset trigger and a large
buffer. The hardware
 - writes a sample (64 bytes) into this buffer
   when a given number of CPU instructions has been executed,
 - and triggers an interrupt when the buffer gets full.
The trigger has just a few possible values.

On s390x the counting event cycles is used to read out the number of
CPU cycles executed.

On s390 the above invocation creates 2 events executed on 2 different
PMUs, and the results are different values from two independently running
PMUs, which do not match as consistently and reliably as on Intel:

 # ./perf record  -e '{cycles,cycles}:Su' -- perf test -w brstack
   ...
 # ./perf script
   perf 2799437 92568.845118:  5508000 cycles:  3ffbcb898b6 do_lookup_x+0x196
   perf 2799437 92568.845119:  1377000 cycles:  3ffbcb898b6 do_lookup_x+0x196
   perf 2799437 92568.845120:  4131000 cycles:  3ffbcb897e8 do_lookup_x+0xc8
   perf 2799437 92568.845121:  1377000 cycles:  3ffbcb8a37c _dl_lookup_symbol
   perf 2799437 92568.845122:  1377000 cycles:  3ffbcb89558 check_match+0x18
   perf 2799437 92568.845123:  2754000 cycles:  3ffbcb89b2a do_lookup_x+0x40a
   perf 2799437 92568.845124:  1377000 cycles:  3ffbcb89b1e do_lookup_x+0x3fe

As can be seen, the results match very often but not all the time,
which makes this test fail very, very often on s390.
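
For illustration, the property this subtest relies on (leader and member
report the same count in every sample) can be sketched as a small shell
helper. This is a rough sketch, not the code in record.sh: the function name
count_mismatched_pairs is hypothetical, it assumes the periods have already
been extracted one per line with each leader value followed by its member
value, and the input values below are made up.

```shell
# Hypothetical helper: reads one sample period per line from stdin,
# treats consecutive lines as (leader, member) pairs and prints the
# number of pairs whose two values differ.
count_mismatched_pairs() {
  local leader member mismatches=0
  while read -r leader && read -r member
  do
    if [ "$leader" != "$member" ]
    then
      mismatches=$((mismatches + 1))
    fi
  done
  echo "$mismatches"
}

# Made-up periods: two matching pairs and one mismatching pair.
printf '%s\n' 100 100 250 250 300 120 | count_mismatched_pairs
```

A strict test in this style fails as soon as the helper reports anything
other than 0, which is why occasional mismatches are enough to break it.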

This patch bypasses this test on s390.

Output before:
 # ./perf test 114
 114: perf record tests                       : FAILED!
 #

Output after:
 # ./perf test 114
 114: perf record tests                       : Ok
 #

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
---
 tools/perf/tests/shell/record.sh | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/tools/perf/tests/shell/record.sh b/tools/perf/tests/shell/record.sh
index ba8d873d3ca7..98b69820bc5f 100755
--- a/tools/perf/tests/shell/record.sh
+++ b/tools/perf/tests/shell/record.sh
@@ -231,6 +231,12 @@ test_cgroup() {
 
 test_leader_sampling() {
   echo "Basic leader sampling test"
+  if [ "$(uname -m)" = s390x ]
+  then
+    echo "Leader sampling skipped"
+    ((skipped+=1))
+    return
+  fi
   if ! perf record -o "${perfdata}" -e "{cycles,cycles}:Su" -- \
     perf test -w brstack 2> /dev/null
   then
-- 
2.45.2
Re: [PATCH] perf/test: Skip leader sampling for s390
Posted by Namhyung Kim 9 months, 2 weeks ago
Hello,

On Fri, Feb 28, 2025 at 07:22:41AM +0100, Thomas Richter wrote:
> This patch bypasses this test on s390.
> 
> Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
> Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>

Thanks for the fix.  I think Ian saw the same problem on other archs
too.  Maybe we need to enable it on supported archs only.
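
One possible shape for "supported archs only" is an allowlist rather than a
per-architecture skip. A minimal sketch follows; the function name
leader_sampling_supported is hypothetical, and listing only x86_64 is an
assumption for illustration, not a statement about which architectures
actually behave correctly.

```shell
# Hypothetical helper: succeed only for architectures on the allowlist.
# The allowlist content (x86_64 only) is an assumption for illustration.
leader_sampling_supported() {
  case "$1" in
    x86_64) return 0 ;;
    *)      return 1 ;;
  esac
}

if leader_sampling_supported "$(uname -m)"
then
  echo "Leader sampling test would run"
else
  echo "Leader sampling skipped"
fi
```

Taking the machine name as an argument keeps the check easy to exercise
without depending on the host it runs on.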

Thanks,
Namhyung

Re: [PATCH] perf/test: Skip leader sampling for s390
Posted by Ian Rogers 9 months, 2 weeks ago
On Fri, Feb 28, 2025 at 4:13 PM Namhyung Kim <namhyung@kernel.org> wrote:
> On Fri, Feb 28, 2025 at 07:22:41AM +0100, Thomas Richter wrote:
> > On s390 the above invocation creates 2 events executed on 2 different
> > PMUs, and the results are different values from two independently running
> > PMUs, which do not match as consistently and reliably as on Intel:
> >
> >  # ./perf record  -e '{cycles,cycles}:Su' -- perf test -w brstack

Hi Thomas,

Thanks for reporting this! Could you try adding --count=100000 so that
we're not using frequency mode? We would then expect the counts to be
close to 100,000. For example, on my x86 laptop:
```
$ perf record --count=100000 -e '{cycles,cycles}:Su' -- perf test -w brstack
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.047 MB perf.data (712 samples) ]
$ perf script
            perf  635952 290271.436115:     100007 cycles:  ffffffffada00080 [unknown] ([unknown])
            perf  635952 290271.436115:     100007 cycles:  ffffffffada00080 [unknown] ([unknown])
            perf  635952 290271.436650:     100525 cycles:  7f86352b01b3 _dl_map_object_from_fd+0x553 (/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
            perf  635952 290271.436650:     100525 cycles:  7f86352b01b3 _dl_map_object_from_fd+0x553 (/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
            perf  635952 290271.437088:      99866 cycles:  7f86352cb827 strchr+0x27 (/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
            perf  635952 290271.437088:      99866 cycles:  7f86352cb827 strchr+0x27 (/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
            perf  635952 290271.437376:      99912 cycles:  7f86352cba74 strcmp+0x54 (/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
            perf  635952 290271.437376:      99912 cycles:  7f86352cba74 strcmp+0x54 (/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
            perf  635952 290271.437509:     100279 cycles:  7f86352cba3a strcmp+0x1a (/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
            perf  635952 290271.437509:     100279 cycles:  7f86352cba3a strcmp+0x1a (/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
            perf  635952 290271.437559:      99760 cycles:  7f86352bc39f _dl_check_map_versions+0x50f (/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
            perf  635952 290271.437559:      99760 cycles:  7f86352bc39f _dl_check_map_versions+0x50f (/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
...
```
I'm particularly concerned if we see cycle counts that deviate strongly
from 100000.

> >    ...
> >  # ./perf script
> >    perf 2799437 92568.845118:  5508000 cycles:  3ffbcb898b6 do_lookup_x+0x196
> >    perf 2799437 92568.845119:  1377000 cycles:  3ffbcb898b6 do_lookup_x+0x196
> >    perf 2799437 92568.845120:  4131000 cycles:  3ffbcb897e8 do_lookup_x+0xc8
> >    perf 2799437 92568.845121:  1377000 cycles:  3ffbcb8a37c _dl_lookup_symbol
> >    perf 2799437 92568.845122:  1377000 cycles:  3ffbcb89558 check_match+0x18
> >    perf 2799437 92568.845123:  2754000 cycles:  3ffbcb89b2a do_lookup_x+0x40a
> >    perf 2799437 92568.845124:  1377000 cycles:  3ffbcb89b1e do_lookup_x+0x3fe
> >
> > As can be seen, the results match very often but not all the time,
> > which makes this test fail very, very often on s390.

Actually this is much more deviation than I'd expect. If we use the
software, timer-based task-clock event I see:
```
$ perf record --count=100000 -e '{task-clock,task-clock}:Su' -- perf test -w brstack
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.047 MB perf.data (712 samples) ]
$ perf script
            perf  636643 290571.807049:     801858 task-clock:  7fdf48643439 _dl_map_object_from_fd+0x7d9 (/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
            perf  636643 290571.807049:     804012 task-clock:  7fdf48643439 _dl_map_object_from_fd+0x7d9 (/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
            perf  636643 290571.807549:     499833 task-clock:  7fdf4863eb9b _dl_map_object_deps+0x3eb (/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
            perf  636643 290571.807549:     498236 task-clock:  7fdf4863eb9b _dl_map_object_deps+0x3eb (/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
```
So the paired counts differ by a couple of thousand at most, but your
output seems to deviate by about 4 million.

So I think the test needs to be more tolerant, which should help your
case. As Namhyung mentions, I think there may be another bug lurking.
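
A more tolerant comparison could look roughly like the sketch below. The
helper name counts_within_tolerance and the 1% threshold are assumptions
chosen for illustration; the numbers fed to it are taken from the outputs
quoted in this thread.

```shell
# Hypothetical helper: succeed if two counts differ by at most the given
# percentage of the larger one (integer arithmetic only).
counts_within_tolerance() {
  local a=$1 b=$2 percent=$3
  local diff=$((a > b ? a - b : b - a))
  local max=$((a > b ? a : b))
  [ $((diff * 100)) -le $((max * percent)) ]
}

# x86 task-clock pair from above: the two values differ by 2154.
counts_within_tolerance 801858 804012 1 && echo "task-clock pair: within 1%"
# Adjacent s390 cycles values from the report: they differ by 4131000.
counts_within_tolerance 5508000 1377000 1 || echo "s390 cycles values: outside 1%"
```

A relative threshold like this would accept the small jitter seen with
task-clock while still flagging the large s390 deviations.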

Thanks,
Ian

Re: [PATCH] perf/test: Skip leader sampling for s390
Posted by Thomas Richter 9 months, 2 weeks ago
On 3/1/25 01:36, Ian Rogers wrote:
> perf record --count=100000 -e '{cycles,cycles}:Su' -- perf test -w brstack

Ian, Namhyung,

here is my output using this command:
# ./perf record --count=100000 -e '{cycles,cycles}:Su' -- perf test -w brstack
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.106 MB perf.data (1080 samples) ]
# ./perf script
            perf   22194 484835.185113:     100000 cycles:       3ff9e407c8c _dl_map_object_from_fd+0xa3c (/usr/lib/ld64.so.1)
            perf   22194 484835.185114:     100000 cycles:       3ff9e408940 _dl_map_object+0x110 (/usr/lib/ld64.so.1)
            perf   22194 484835.185116:     400000 cycles:       3ff9e40890e _dl_map_object+0xde (/usr/lib/ld64.so.1)
            perf   22194 484835.185117:     900000 cycles:       3ff9e40b572 _dl_name_match_p+0x42 (/usr/lib/ld64.so.1)
            perf   22194 484835.185118:     500000 cycles:       3ff9e407c8c _dl_map_object_from_fd+0xa3c (/usr/lib/ld64.so.1)
            perf   22194 484835.185119:     100000 cycles:       3ff9e40b53e _dl_name_match_p+0xe (/usr/lib/ld64.so.1)
            perf   22194 484835.185120:     100000 cycles:       3ff9e40890e _dl_map_object+0xde (/usr/lib/ld64.so.1)
            perf   22194 484835.185121:     100000 cycles:       3ff9e408904 _dl_map_object+0xd4 (/usr/lib/ld64.so.1)
            perf   22194 484835.185122:     100000 cycles:       3ff9e40369a _dl_map_object_deps+0xbba (/usr/lib/ld64.so.1)
            perf   22194 484835.185123:     100000 cycles:       3ff9e413460 _dl_check_map_versions+0x100 (/usr/lib/ld64.so.1)
            perf   22194 484835.185124:     500000 cycles:       3ff9e40b53e _dl_name_match_p+0xe (/usr/lib/ld64.so.1)
            perf   22194 484835.185125:     100000 cycles:       3ff9e40e7e0 _dl_relocate_object+0x550 (/usr/lib/ld64.so.1)
            perf   22194 484835.185126:     200000 cycles:       3ff9e40e7e0 _dl_relocate_object+0x550 (/usr/lib/ld64.so.1)
            perf   22194 484835.185127:     200000 cycles:       3ff9e409558 check_match+0x18 (/usr/lib/ld64.so.1)
            perf   22194 484835.185128:     200000 cycles:       3ff9e409894 do_lookup_x+0x174 (/usr/lib/ld64.so.1)
            perf   22194 484835.185129:     100000 cycles:       3ff9e409910 do_lookup_x+0x1f0 (/usr/lib/ld64.so.1)
            perf   22194 484835.185130:     100000 cycles:       3ff9e409b1e do_lookup_x+0x3fe (/usr/lib/ld64.so.1)
            perf   22194 484835.185131:     100000 cycles:       3ff9e409894 do_lookup_x+0x174 (/usr/lib/ld64.so.1)
            perf   22194 484835.185132:     100000 cycles:       3ff9e409558 check_match+0x18 (/usr/lib/ld64.so.1)
            perf   22194 484835.187445:     100000 cycles:       3ff9e409ad4 do_lookup_x+0x3b4 (/usr/lib/ld64.so.1)

The result when using a fixed count instead of a frequency is similar. Most of the time the numbers are identical,
but sometimes they do not match.

Using task-clock as the event, I get similar results. The counts vary a bit, but the numbers are pretty close.
They differ by a few thousand at most, and often by much less:

# perf record --count=100000 -e '{task-clock,task-clock}:Su' -- perf test -w brstack
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.016 MB perf.data (246 samples) ]
# ./perf script
            perf   22223 485235.378380:     402070 task-clock:       3ffbed874c6 _dl_map_object_from_fd+0x276 (/usr/lib/ld64.so.1)
            perf   22223 485235.378380:     404640 task-clock:       3ffbed874c6 _dl_map_object_from_fd+0x276 (/usr/lib/ld64.so.1)
            perf   22223 485235.378779:     399960 task-clock:       3ffbed888de _dl_map_object+0xae (/usr/lib/ld64.so.1)
            perf   22223 485235.378779:     397689 task-clock:       3ffbed888de _dl_map_object+0xae (/usr/lib/ld64.so.1)
            perf   22223 485235.378879:     100055 task-clock:       3ffbed8e7e0 _dl_relocate_object+0x550 (/usr/lib/ld64.so.1)
            perf   22223 485235.378879:     100100 task-clock:       3ffbed8e7e0 _dl_relocate_object+0x550 (/usr/lib/ld64.so.1)
            perf   22223 485235.378979:      99981 task-clock:       3ffbed895ae check_match+0x6e (/usr/lib/ld64.so.1)
            perf   22223 485235.378979:      99876 task-clock:       3ffbed895ae check_match+0x6e (/usr/lib/ld64.so.1)
            perf   22223 485235.379079:      99950 task-clock:       3ffbed8974c do_lookup_x+0x2c (/usr/lib/ld64.so.1)
            perf   22223 485235.379079:      99957 task-clock:       3ffbed8974c do_lookup_x+0x2c (/usr/lib/ld64.so.1)
            perf   22223 485235.379179:     100051 task-clock:       3ffbed8e7f0 _dl_relocate_object+0x560 (/usr/lib/ld64.so.1)
            perf   22223 485235.379179:     100004 task-clock:       3ffbed8e7f0 _dl_relocate_object+0x560 (/usr/lib/ld64.so.1)
            perf   22223 485235.379279:      99933 task-clock:       3ffbed8e7ea _dl_relocate_object+0x55a (/usr/lib/ld64.so.1)
            perf   22223 485235.379279:      99952 task-clock:       3ffbed8e7ea _dl_relocate_object+0x55a (/usr/lib/ld64.so.1)

Thanks for your help
-- 
Thomas Richter, Dept 3303, IBM s390 Linux Development, Boeblingen, Germany
Re: [PATCH] perf/test: Skip leader sampling for s390
Posted by Chun-Tse Shao 8 months, 3 weeks ago
We believe we know the problem; we appreciate Stephane Eranian's investigation.
It comes from throttling. When the sampling rate is too high, the generic code
does not modify event scheduling. `perf_event_overflow()` simply returns 1,
and subsequently, `pmu_stop()` only stops the leader event, not the slave
events, because the arch layer does not consider groups. Also, the
`event_stop()` callback only operates on a single event, not the siblings.

This would impact all architectures. Perhaps we can extend the
`event_stop()` callback to include a new argument to also stop the siblings.
We welcome all suggestions and are open to discussing any potential solutions.
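
To illustrate the effect described above, here is a toy model in shell. It
is not kernel code and makes simplifying assumptions: both events count one
unit per tick, a sample is emitted every `period` leader counts, and during
a hypothetical throttle window only the leader is stopped while the sibling
keeps counting. The function name simulate_group and all parameters are
invented for this sketch.

```shell
# Toy model: prints "<leader delta> <sibling delta>" for every sample.
# Arguments: period, number of ticks, throttle window [start, end).
simulate_group() {
  local period=$1 ticks=$2 throttle_start=$3 throttle_end=$4
  local leader=0 sibling=0 last_leader=0 last_sibling=0 t=0
  while [ "$t" -lt "$ticks" ]
  do
    # The sibling is never stopped in this model.
    sibling=$((sibling + 1))
    # Only the leader is stopped while throttled.
    if [ "$t" -lt "$throttle_start" ] || [ "$t" -ge "$throttle_end" ]
    then
      leader=$((leader + 1))
      if [ $((leader - last_leader)) -eq "$period" ]
      then
        echo "$((leader - last_leader)) $((sibling - last_sibling))"
        last_leader=$leader
        last_sibling=$sibling
      fi
    fi
    t=$((t + 1))
  done
}

# Period 4, 16 ticks, leader throttled during ticks 6..9.
simulate_group 4 16 6 10
```

In this model the sample that spans the throttle window reports a larger
sibling delta than leader delta, while the samples before and after it
match. Stopping the whole group during the window would keep them equal.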

Thanks,
CT

Cc: Stephane Eranian <eranian@google.com>
Re: [PATCH] perf/test: Skip leader sampling for s390
Posted by Stephane Eranian 8 months, 3 weeks ago
Hi,

Thanks CT for the post. Indeed this is a long-standing bug impacting (most
likely) all architectures. The rate throttling code does not consider event
grouping. It stops the sampling event in place (on x86) at the hardware level,
not at the generic scheduling layer. But if the event is in a group, it may make
sense to also stop all the other events in the group, i.e., stop the group.
Otherwise you may get discrepancies between samples of the "slave events".
Similarly, the time_running and time_enabled logic is not modified during
throttling.
I'm interested in hearing potential ways of solving this in a portable manner.
