Hi,
Arnaldo reported that "perf test sigtrap" fails on PREEMPT_RT. Sending
the signal gets delayed until event_sched_out() which then uses
task_work_add() for its delivery. This breaks on PREEMPT_RT because the
signal is delivered with disabled preemption.
While looking at this, I also stumbled upon __perf_pending_irq() which
requires disabled interrupts but this is not the case on PREEMPT_RT.
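(For reference, a minimal userspace sketch of the feature behind "perf
test sigtrap": an event created with attr.sigtrap=1 delivers SIGTRAP to
the owning task on every overflow, carrying attr.sig_data in si_perf_data.
The attribute names are from the perf_event_open() uapi; the breakpoint
trigger below is only an illustration, not the actual test code, and it
assumes a uapi header new enough to have sigtrap/remove_on_exec/sig_data.)

#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static volatile sig_atomic_t hits;
static volatile int watched;

static void on_sigtrap(int sig, siginfo_t *si, void *uc)
{
	hits++;		/* si->si_perf_data would carry attr.sig_data */
}

int main(void)
{
	struct sigaction sa;
	struct perf_event_attr attr;
	int fd;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_sigtrap;
	sa.sa_flags = SA_SIGINFO | SA_NODEFER;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGTRAP, &sa, NULL);

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_BREAKPOINT;
	attr.bp_type = HW_BREAKPOINT_W;
	attr.bp_addr = (unsigned long)&watched;
	attr.bp_len = HW_BREAKPOINT_LEN_4;
	attr.sample_period = 1;		/* signal on every event */
	attr.sigtrap = 1;		/* deliver SIGTRAP on overflow */
	attr.remove_on_exec = 1;	/* required together with sigtrap */
	attr.sig_data = 42;

	/* Self-monitoring event on the current task, any CPU. */
	fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0)
		return 1;

	watched = 1;			/* this write hits the breakpoint */
	printf("SIGTRAPs received: %d\n", (int)hits);
	close(fd);
	return 0;
}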
This series aims to address both issues while not introducing a new issue
at the same time ;)
Any testing is appreciated.
v1…v2: https://lore.kernel.org/all/20240308175810.2894694-1-bigeasy@linutronix.de/
- Marco pointed me to the testsuite that showed two problems:
- Delayed task_work from NMI / missing events.
Fixed by triggering a dummy irq_work to enforce an interrupt for
the exit-to-userland path, which checks task_work (a rough sketch
of this pattern follows below).
- Increased ref-count on cleanup / during exec.
Mostly addressed by the former change. There is still a window
if the NMI occurs during execve(). This is addressed by removing
the task_work before free_event().
The testsuite (remove_on_exec) fails sometimes if the event/
SIGTRAP is sent before the sighandler is installed.
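A rough sketch of the dummy-irq_work idea mentioned above (the general
pattern only, not the actual patch; the helper names here are made up
for illustration):

#include <linux/irq_work.h>
#include <linux/task_work.h>
#include <linux/sched.h>

static void kick_exit_to_user(struct irq_work *w)
{
	/*
	 * Intentionally empty: queueing the irq_work raises a
	 * self-interrupt, and the interrupt's return-to-user path is
	 * where pending task_work for current gets noticed and run.
	 * An NMI returning straight to userland does not take that
	 * path, which is why the extra interrupt is needed.
	 */
}

static DEFINE_IRQ_WORK(dummy_kick, kick_exit_to_user);

/* Illustrative only: queue deferred signal delivery for current. */
static void queue_sigtrap_task_work(struct callback_head *twork)
{
	if (task_work_add(current, twork, TWA_RESUME))
		return;		/* task is already exiting */

	/* Make sure we actually pass through an interrupt return soon. */
	irq_work_queue(&dummy_kick);
}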
Sebastian
On Tue, 12 Mar 2024 at 19:08, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
>
> Hi,
>
> Arnaldo reported that "perf test sigtrap" fails on PREEMPT_RT. Sending
> the signal gets delayed until event_sched_out() which then uses
> task_work_add() for its delivery. This breaks on PREEMPT_RT because the
> signal is delivered with disabled preemption.
<SNIP>
> The testsuite (remove_on_exec) fails sometimes if the event/
> SIGTRAP is sent before the sighandler is installed.

Tested-by: Marco Elver <elver@google.com>

It does pass the tests in tools/testing/selftests/perf_events (non-RT
kernel, lockdep enabled). But I do recall this being a particularly
sharp corner of perf, so any additional testing and review here is
useful.

Thanks,
-- Marco
On Wed, Mar 13, 2024 at 03:35:27PM +0100, Marco Elver wrote:
> On Tue, 12 Mar 2024 at 19:08, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> > Arnaldo reported that "perf test sigtrap" fails on PREEMPT_RT. Sending
> > the signal gets delayed until event_sched_out() which then uses
> > task_work_add() for its delivery. This breaks on PREEMPT_RT because the
> > signal is delivered with disabled preemption.
<SNIP>
> Tested-by: Marco Elver <elver@google.com>
>
> It does pass the tests in tools/testing/selftests/perf_events (non-RT
> kernel, lockdep enabled). But I do recall this being a particularly
> sharp corner of perf, so any additional testing and review here is
> useful.

Right, I'm testing with the full 'perf test' suite now.

- Arnaldo
On Wed, Mar 13, 2024 at 12:23:32PM -0300, Arnaldo Carvalho de Melo wrote:
> On Wed, Mar 13, 2024 at 03:35:27PM +0100, Marco Elver wrote:
> > On Tue, 12 Mar 2024 at 19:08, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> > > Arnaldo reported that "perf test sigtrap" fails on PREEMPT_RT. Sending
> > > the signal gets delayed until event_sched_out() which then uses
> > > task_work_add() for its delivery. This breaks on PREEMPT_RT because the
> > > signal is delivered with disabled preemption.
<SNIP>
> > Tested-by: Marco Elver <elver@google.com>
> >
> > It does pass the tests in tools/testing/selftests/perf_events (non-RT
> > kernel, lockdep enabled). But I do recall this being a particularly
> > sharp corner of perf, so any additional testing and review here is
> > useful.
>
> Right, I'm testing with the full 'perf test' suite now.

'perf test' doesn't show any regression, now I'm running Vince Weaver's
https://github.com/deater/perf_event_tests, storing the results with
this patchset and then without, to do a diff, lets see...

- Arnaldo
On Wed, Mar 13, 2024 at 03:14:28PM -0300, Arnaldo Carvalho de Melo wrote:
> On Wed, Mar 13, 2024 at 12:23:32PM -0300, Arnaldo Carvalho de Melo wrote:
> > On Wed, Mar 13, 2024 at 03:35:27PM +0100, Marco Elver wrote:
> > > On Tue, 12 Mar 2024 at 19:08, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
<SNIP>
> > > Tested-by: Marco Elver <elver@google.com>
> > >
> > > It does pass the tests in tools/testing/selftests/perf_events (non-RT
> > > kernel, lockdep enabled). But I do recall this being a particularly
> > > sharp corner of perf, so any additional testing and review here is
> > > useful.
>
> > Right, I'm testing with the full 'perf test' suite now.
>
> 'perf test' doesn't show any regression, now I'm running Vince Weaver's
> https://github.com/deater/perf_event_tests, storing the results with
> this patchset and then without, to do a diff, lets see...

[root@nine perf_event_tests]# diff -u results.6.8.0-rc7-rt6 results.6.8.0-rc7.sebastian-rt6+ | grep ^[+-]
--- results.6.8.0-rc7-rt6 2024-03-13 15:26:37.923323518 -0300
+++ results.6.8.0-rc7.sebastian-rt6+ 2024-03-13 15:14:11.505333801 -0300
-Linux nine 6.8.0-rc7-rt6 #1 SMP PREEMPT_RT Fri Mar 8 17:36:48 -03 2024 x86_64 x86_64 x86_64 GNU/Linux
+Linux nine 6.8.0-rc7.sebastian-rt6+ #2 SMP PREEMPT_RT Tue Mar 12 18:01:31 -03 2024 x86_64 x86_64 x86_64 GNU/Linux
- Testing "branch-misses" generalized event... FAILED
+ Testing "branch-misses" generalized event... PASSED
- Testing uncore events... SKIPPED
+ Testing uncore events... PASSED
- We are running release 6.8.0-rc7-rt6
+ We are running release 6.8.0-rc7.sebastian-rt6+
- Running on CPU 4
+ Running on CPU 2
- Running on CPU 6
+ Running on CPU 2
- Measuring on CPU 5
-Running on CPU 6
-Measuring on CPU 5
+ Measuring on CPU 6
+Running on CPU 2
+Measuring on CPU 6
[root@nine perf_event_tests]#

So basically:

- Testing "branch-misses" generalized event... FAILED
+ Testing "branch-misses" generalized event... PASSED
- Testing uncore events... SKIPPED
+ Testing uncore events... PASSED

So things improved! I'll re-run to see if these results are stable...

- Arnaldo
On Wed, Mar 13, 2024 at 03:30:52PM -0300, Arnaldo Carvalho de Melo wrote:
> On Wed, Mar 13, 2024 at 03:14:28PM -0300, Arnaldo Carvalho de Melo wrote:
> > 'perf test' doesn't show any regression, now I'm running Vince Weaver's
> > https://github.com/deater/perf_event_tests, storing the results with
> > this patchset and then without, to do a diff, lets see...
>
> So things improved! I'll re-run to see if these results are stable...
tldr; No dmesg activity, no kernel splats, most tests passed, nothing
noticeable when running with/without the patch with Vince's regression
tests. So:
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
- Arnaldo
Further details:
Without the patch:
[root@nine perf_event_tests]# ./run_tests.sh | tee results.$(uname -r).new ; diff -u results.$(uname -r) results.$(uname -r).new
--- results.6.8.0-rc7-rt6 2024-03-13 15:26:37.923323518 -0300
+++ results.6.8.0-rc7-rt6.new 2024-03-13 15:32:43.983245095 -0300
@@ -296,7 +296,7 @@
+ tests/rdpmc/rdpmc_validation
Testing if userspace rdpmc reads give expected results... PASSED
+ tests/rdpmc/rdpmc_multiplexing
- Testing if userspace rdpmc multiplexing works... PASSED
+ Testing if userspace rdpmc multiplexing works... FAILED
+ tests/rdpmc/rdpmc_reset
Testing if resetting while using rdpmc works... PASSED
+ tests/rdpmc/rdpmc_group
@@ -304,15 +304,15 @@
+ tests/rdpmc/rdpmc_attach
Testing if rdpmc attach works... PASSED
+ tests/rdpmc/rdpmc_attach_cpu
- Running on CPU 4
+ Running on CPU 0
Testing if rdpmc behavior on attach CPU... PASSED
+ tests/rdpmc/rdpmc_attach_global_cpu
- Running on CPU 6
+ Running on CPU 3
Testing if rdpmc behavior on attach all procs on other CPU... FAILED
+ tests/rdpmc/rdpmc_attach_other_cpu
- Measuring on CPU 5
-Running on CPU 6
-Measuring on CPU 5
+ Measuring on CPU 0
+Running on CPU 3
+Measuring on CPU 0
Testing if rdpmc behavior on attach other CPU... FAILED
+ tests/rdpmc/rdpmc_multiattach
Testing if rdpmc multi-attach works... PASSED
A test flipped results.
Trying again with a more compact output:
[root@nine perf_event_tests]# ./run_tests.sh | tee results.$(uname -r).new ; diff -u results.$(uname -r) results.$(uname -r).new | grep ^[+-]
--- results.6.8.0-rc7-rt6 2024-03-13 15:26:37.923323518 -0300
+++ results.6.8.0-rc7-rt6.new 2024-03-13 17:06:34.944149451 -0300
- Running on CPU 4
-Testing if rdpmc behavior on attach CPU... PASSED
- + tests/rdpmc/rdpmc_attach_global_cpu
+Testing if rdpmc behavior on attach CPU... FAILED
+ + tests/rdpmc/rdpmc_attach_global_cpu
+ Running on CPU 0
- Measuring on CPU 5
-Running on CPU 6
-Measuring on CPU 5
+ Measuring on CPU 7
+Running on CPU 1
+Measuring on CPU 7
[root@nine perf_event_tests]#
Since it's that rdpmc that is now always failing without this patch
series, let's try using that .new as the new baseline:
[root@nine perf_event_tests]# ./run_tests.sh | tee results.$(uname -r).new2 ; diff -u results.$(uname -r).new results.$(uname -r).new2 | grep ^[+-]
--- results.6.8.0-rc7-rt6.new 2024-03-13 17:06:34.944149451 -0300
+++ results.6.8.0-rc7-rt6.new2 2024-03-13 17:08:41.438282558 -0300
- Testing "branch-misses" generalized event... FAILED
+ Testing "branch-misses" generalized event... PASSED
- Testing if userspace rdpmc multiplexing works... PASSED
+ Testing if userspace rdpmc multiplexing works... FAILED
- Running on CPU 6
-Testing if rdpmc behavior on attach CPU... FAILED
+ Running on CPU 2
+Testing if rdpmc behavior on attach CPU... PASSED
- Running on CPU 0
+ Running on CPU 2
- Measuring on CPU 7
-Running on CPU 1
-Measuring on CPU 7
+ Measuring on CPU 2
+Running on CPU 0
+Measuring on CPU 2
[root@nine perf_event_tests]#
On Wed, Mar 13, 2024 at 05:12:25PM -0300, Arnaldo Carvalho de Melo wrote:
> On Wed, Mar 13, 2024 at 03:30:52PM -0300, Arnaldo Carvalho de Melo wrote:
> > On Wed, Mar 13, 2024 at 03:14:28PM -0300, Arnaldo Carvalho de Melo wrote:
> > > 'perf test' doesn't show any regression, now I'm running Vince Weaver's
> > > https://github.com/deater/perf_event_tests, storing the results with
> > > this patchset and then without, to do a diff, lets see...
>
> > So things improved! I'll re-run to see if these results are stable...
>
> tldr; No dmesg activity, no kernel splats, most tests passed, nothing
> noticeable when running with/without the patch with Vince's regression
> tests. So:
>
> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>

Too quick, now I'm testing it on top of torvalds/master, no PREEMPT_RT.

- Arnaldo

> - Arnaldo
>
> Further details:
>
> Without the patch:
>
> [root@nine perf_event_tests]# ./run_tests.sh | tee results.$(uname -r).new ; diff -u results.$(uname -r) results.$(uname -r).new
<SNIP>
> [root@nine perf_event_tests]#
On 2024-03-13 17:14:25 [-0300], Arnaldo Carvalho de Melo wrote:
> > tldr; No dmesg activity, no kernel splats, most tests passed, nothing
> > noticeable when running with/without the patch with Vince's regression
> > tests. So:
>
> > Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> > Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
>
> Too quick, now I'm testing it on top of torvalds/master, no PREEMPT_RT.

Just to be clear: You revert your Tested-by because now you test this on
torvalds/master but not because you reported a regression which I
missed.

> - Arnaldo

Sebastian
On Thu, Mar 14, 2024 at 10:10:33AM +0100, Sebastian Andrzej Siewior wrote:
> On 2024-03-13 17:14:25 [-0300], Arnaldo Carvalho de Melo wrote:
> > > tldr; No dmesg activity, no kernel splats, most tests passed, nothing
> > > noticeable when running with/without the patch with Vince's regression
> > > tests. So:
> >
> > > Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> > > Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> >
> > Too quick, now I'm testing it on top of torvalds/master, no PREEMPT_RT.
>
> Just to be clear: You revert your Tested-by because now you test this on
> torvalds/master but not because you reported a regression which I
> missed.

You got it right. No regressions, the code is good, I just need to test
it a bit further, with torvalds/master, without PREEMPT_RT.

- Arnaldo
On Thu, Mar 14, 2024 at 11:34:39AM -0300, Arnaldo Carvalho de Melo wrote:
> On Thu, Mar 14, 2024 at 10:10:33AM +0100, Sebastian Andrzej Siewior wrote:
> > On 2024-03-13 17:14:25 [-0300], Arnaldo Carvalho de Melo wrote:
> > > > tldr; No dmesg activity, no kernel splats, most tests passed, nothing
> > > > noticeable when running with/without the patch with Vince's regression
> > > > tests. So:
> > > > Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> > > > Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> > > Too quick, now I'm testing it on top of torvalds/master, no PREEMPT_RT.
> > Just to be clear: You revert your Tested-by because now you test this on
> > torvalds/master but not because you reported a regression which I
> > missed.
> You got it right. No regressions, the code is good, I just need to test
> it a bit further, with torvalds/master, without PREEMPT_RT.
Tests performed, no regressions detected, same behaviour when killing
the remove_on_exec selftests midway:
[acme@nine perf_events]$ perf annotate --stdio2 exec_child | tail
mov $0x1,%edi
→ callq _exit@plt
b5: nop
100.00 b6: mov signal_count,%eax
test %eax,%eax
↑ je b6
nop
nop
leaveq
← retq
[acme@nine perf_events]$ pidof exec_child
28256 28249 28241 28240 28239 28236 28228 28226 28224 28223 28219 28215 28208 28207 28206 28205 28200 28188 28187 28186 28185 28169 28168 28167 28166 28155 28154 28153 28152 28140 28139 28138 28137 28124 28123 28122 28121 28111 28110 28109 28108 28094 28093 28092 28091 28080 28079 28078 28077 28064 28062 28061 28060 28048 28047 28046 28045 28030 28029 28028 28027 28012 28011 28010 28009 27998 27996 27994 27993 27982 27981 27979 27978 27966 27965 27962 27961 27952 27951 27949 27948 27934 27933 27932 27931 27920 27919 27918 27917 27906 27905 27903 27902 27888 27885 27883 27882
[acme@nine perf_events]$
[acme@nine linux]$ uname -a
Linux nine 6.8.0-rc7.sebastian-rt6+ #2 SMP PREEMPT_RT Tue Mar 12 18:01:31 -03 2024 x86_64 x86_64 x86_64 GNU/Linux
[acme@nine linux]$ perf probe -L perf_pending_disable
<perf_pending_disable@/home/acme/git/linux/kernel/events/core.c:0>
0 static void perf_pending_disable(struct irq_work *entry)
{
2 struct perf_event *event = container_of(entry, struct perf_event, pending_disable_irq);
int rctx;
/*
* If we 'fail' here, that's OK, it means recursion is already disabled
* and we won't recurse 'further'.
*/
rctx = perf_swevent_get_recursion_context();
10 __perf_pending_disable(event);
11 if (rctx >= 0)
12 perf_swevent_put_recursion_context(rctx);
}
static void perf_pending_irq(struct irq_work *entry)
[acme@nine linux]$
So I keep my:
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
- Arnaldo
On Thu, Mar 14, 2024 at 06:22:38PM -0300, Arnaldo Carvalho de Melo wrote:
> On Thu, Mar 14, 2024 at 11:34:39AM -0300, Arnaldo Carvalho de Melo wrote:
> > On Thu, Mar 14, 2024 at 10:10:33AM +0100, Sebastian Andrzej Siewior wrote:
> > > On 2024-03-13 17:14:25 [-0300], Arnaldo Carvalho de Melo wrote:
> > > > > tldr; No dmesg activity, no kernel splats, most tests passed, nothing
> > > > > noticeable when running with/without the patch with Vince's regression
> > > > > tests. So:
>
> > > > > Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> > > > > Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
>
> > > > Too quick, now I'm testing it on top of torvalds/master, no PREEMPT_RT.
>
> > > Just to be clear: You revert your Tested-by because now you test this on
> > > torvalds/master but not because you reported a regression which I
> > > missed.
>
> > You got it right. No regressions, the code is good, I just need to test
> > it a bit further, with torvalds/master, without PREEMPT_RT.
>
> Tests performed, no regressions detected, same behaviour when killing
> the remove_on_exec selftests midway:
> [acme@nine linux]$ uname -a
> Linux nine 6.8.0-rc7.sebastian-rt6+ #2 SMP PREEMPT_RT Tue Mar 12 18:01:31 -03 2024 x86_64 x86_64 x86_64 GNU/Linux
Re-reading this I noticed I really retested with the rt kernel, d0h, so
here it goes again:
[acme@nine ~]$ uname -r
6.8.0.sigtrapfix+
[acme@nine ~]$ set -o vi
[acme@nine ~]$ perf test sigtrap
68: Sigtrap : Ok
[acme@nine ~]$ cd ~acme/git/linux
[acme@nine linux]$ cd tools/testing/selftests/perf_events && make
make: Nothing to be done for 'all'.
[acme@nine perf_events]$ for x in {0..100}; do (./remove_on_exec &); done
<snip>
ok 4 remove_on_exec.exec_stress
# PASSED: 4 / 4 tests passed.
# Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0
# OK remove_on_exec.exec_stress
ok 4 remove_on_exec.exec_stress
# PASSED: 4 / 4 tests passed.
# Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0
# OK remove_on_exec.exec_stress
ok 4 remove_on_exec.exec_stress
# PASSED: 4 / 4 tests passed.
# Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0
[acme@nine perf_events]$
So despite this mistake all is well, torvalds/master + your patchkit
seems ok.
Sorry for the noise :-\
- Arnaldo
On 2024-03-14 18:46:21 [-0300], Arnaldo Carvalho de Melo wrote:
> So despite this mistake all is well, torvalds/master + your patchkit
> seems ok.
>
> Sorry for the noise :-\

No worries, thank you.

> - Arnaldo

Sebastian
On Tue, Mar 12, 2024 at 07:01:48PM +0100, Sebastian Andrzej Siewior wrote:
> Hi,
>
> Arnaldo reported that "perf test sigtrap" fails on PREEMPT_RT. Sending
> the signal gets delayed until event_sched_out() which then uses
> task_work_add() for its delivery. This breaks on PREEMPT_RT because the
> signal is delivered with disabled preemption.
>
> While looking at this, I also stumbled upon __perf_pending_irq() which
> requires disabled interrupts but this is not the case on PREEMPT_RT.
>
> This series aim to address both issues while not introducing a new issue
> at the same time ;)
> Any testing is appreciated.
>
> v1…v2: https://lore.kernel.org/all/20240308175810.2894694-1-bigeasy@linutronix.de/
> - Marco pointed me to the testsuite that showed two problems:
> - Delayed task_work from NMI / missing events.
> Fixed by triggering dummy irq_work to enforce an interrupt for
> the exit-to-userland path which checks task_work
> - Increased ref-count on clean up/ during exec.
> Mostly addressed by the former change. There is still a window
> if the NMI occurs during execve(). This is addressed by removing
> the task_work before free_event().
> The testsuite (remove_on_exec) fails sometimes if the event/
> SIGTRAP is sent before the sighandler is installed.
Survives:
[acme@nine perf_events]$ for x in {0..1000}; do (perf test sigtrap &); done
And:
[acme@nine perf_events]$ pwd
/home/acme/git/linux/tools/testing/selftests/perf_events
[acme@nine perf_events]$ for x in {0..100}; do (./remove_on_exec &); done
Nothing on dmesg.
But:
[acme@nine ~]$ pidof exec_child
24273 24271 24270 24269 24268 24267 24266 24265 24264 24263 24262 24261 24260 24259 24258 24257 24256 24255 24254 24253 24252 24251 24250 24249 24248 24247 24246 24245 24244 24243 24242 24241 24240 24239 24238 24237 24236 24233 24232 24231 24230 24229 24228 24227 24226 24225 24224 24223 24220 24219 24218 24217 24216 24215 24214 24213 24212 24211 24210 24209 24208 24207 24206 24204 24203 24202 24201 24200 24199 24198 24197 24196 24195 24194 24192 24191 24190 24189 24188 24187 24186 24185 24184 24183 24182 24181 24180 24179 24178 24177 24176 24175 24174 24173 24172 24171 24170 24169 24168 24167 24166 24165 24164 24163 24162 24161 24160 24159 24158 24157 24154 24153 24152 24151 24150 24149 24148 24147 24146 24145 24144 24143 24142 24141 24140 24139 24138 24137 24136 24135 24134 24133 24132 24131 24130 24129 24127 24126 24125 24124 24123 24122 24121 24120 24119 24118 24117 24115 24114 24113 24112 24111 24110 24109 24108 24107 24106 24105 24104 24103 24102 24101 24100 24099 24098 24097 24096 24095 24094 24093 24092 24091 24090 24089 24088 24087 24086 24085 24084 24083 24082 24081 24080 24076 24075 24074 24073 24072 24071 24070 24069 24068 24067 24066 24065 24064 24063 24062 24061 24060 24059 24058 24057 24056 24055 24054 24053 24052 24051 24049 24048 24047 24046 24045 24044 24043 24042 24041 24040 24039 24037 24036 24035 24034 24033 24032 24031 24030 24029 24028 24027 24026 24025 24024 24023 24022 24021 24020 24019 24018 24017 24016 24015 24014 24013 24012 24011 24010 24009 24008 24007 24006 24005 24004 24003 24002 23998 23997 23996 23995 23994 23993 23992 23991 23990 23989 23988 23986 23985 23984 23983 23982 23981 23980 23979 23978 23977 23976 23975 23974 23973 23972 23970 23969 23968 23967 23966 23965 23964 23963 23962 23961 23960 23958 23957 23956 23955 23954 23953 23952 23951 23950 23949 23948 23947 23946 23945 23944 23943 23942 23941 23940 23939 23938 23937 23936 23935 23934 23933 23932 23931 23930 23929 23928 23927 23926 23925 23924 23923 23919 23918 23917 23916 23915 23914 23913 23912 23911 23910 23909 23907 23906 23905 23904 23903 23902 23901 23900 23899 23898 23897 23896 23895 23894 23893 23891 23890 23889 23888 23887 23886 23885 23884 23883 23882 23881 23879 23878 23877 23876 23875 23874 23873 23872 23871 23870 23869 23867 23866 23865 23864 23863 23862 23858 23857 23856 23855 23854 23853 23852 23851 23850 23849 23848 23847 23846 23845 23844 23843 23842 23841 23840 23833 23832 23831 23830 23829 23828 23827 23826 23825 23824 23823 23820 23819 23818 23817 23816 23815 23814 23813 23812 23811 23810 23809 23808 23807 23806 23804 23803 23802 23801 23800 23799 23798 23797 23793 23792 23791 23787 23786 23774 23773 23772 23771 23770 23769 23768 23767 23756 23754 23751 23750 23749 23748 23747 23746 23743 23734 23733 23732 23731 23730 23729 23728 23727 23726 23725 23724 23723 23722 23719 23716 23715 23684 23668 23667 23666 23589 23574 23573 23526 23461 23396 23395 23375 23271 23270 23266 23242 23241 23231 23214 23213 23210 23197 23196 23182 23181 23129 23128 23046 22964
[acme@nine ~]$
[root@nine ~]# cat /proc/24263/stack
[<0>] irqentry_exit_to_user_mode+0x1c9/0x1e0
[<0>] asm_sysvec_apic_timer_interrupt+0x16/0x20
[root@nine ~]#
[root@nine ~]# for a in `pidof exec_child` ; do cat /proc/$a/stack ; done | sort | uniq -c
1 [<0>] asm_common_interrupt+0x22/0x40
495 [<0>] asm_sysvec_apic_timer_interrupt+0x16/0x20
2 [<0>] asm_sysvec_reschedule_ipi+0x16/0x20
498 [<0>] irqentry_exit_to_user_mode+0x1c9/0x1e0
[root@nine ~]#
[acme@nine ~]$ ps ax|grep exec_child| wc -l
504
[acme@nine ~]$ ps ax|grep exec_child| tail
24264 pts/0 R 0:04 exec_child
24265 pts/0 R 0:04 exec_child
24266 pts/0 R 0:04 exec_child
24267 pts/0 R 0:04 exec_child
24268 pts/0 R 0:04 exec_child
24269 pts/0 R 0:04 exec_child
24270 pts/0 R 0:04 exec_child
24271 pts/0 R 0:04 exec_child
24273 pts/0 R 0:04 exec_child
26704 pts/1 S+ 0:00 grep --color=auto exec_child
[acme@nine ~]$
All in 'R' state.
[root@nine ~]# killall exec_child
exec_child: no process found
[root@nine ~]# ps ax | grep exec_child | head -5
22964 pts/0 R 0:06 exec_child
23046 pts/0 R 0:05 exec_child
23128 pts/0 R 0:05 exec_child
23129 pts/0 R 0:05 exec_child
23181 pts/0 R 0:05 exec_child
[root@nine ~]# kill 22964 23046 23128 23129 23181
[root@nine ~]# ps ax | grep exec_child | head -5
23182 pts/0 R 0:06 exec_child
23196 pts/0 R 0:06 exec_child
23197 pts/0 R 0:06 exec_child
23210 pts/0 R 0:06 exec_child
23213 pts/0 R 0:06 exec_child
[root@nine ~]#
[root@nine ~]# kill `pidof exec_child`
[root@nine ~]# ps ax | grep exec_child | head -5
26806 pts/0 S+ 0:00 grep --color=auto exec_child
[root@nine ~]#
[acme@nine perf_events]$ ps ax|grep exec_child
26816 pts/0 S+ 0:00 grep --color=auto exec_child
[acme@nine perf_events]$ ./remove_on_exec
TAP version 13
1..4
# Starting 4 tests from 1 test cases.
# RUN remove_on_exec.fork_only ...
# OK remove_on_exec.fork_only
ok 1 remove_on_exec.fork_only
# RUN remove_on_exec.fork_exec_then_enable ...
# OK remove_on_exec.fork_exec_then_enable
ok 2 remove_on_exec.fork_exec_then_enable
# RUN remove_on_exec.enable_then_fork_exec ...
*# OK remove_on_exec.enable_then_fork_exec
ok 3 remove_on_exec.enable_then_fork_exec
# RUN remove_on_exec.exec_stress ...
*************************# OK remove_on_exec.exec_stress
ok 4 remove_on_exec.exec_stress
# PASSED: 4 / 4 tests passed.
# Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0
[acme@nine perf_events]$ ps ax|grep exec_child
26858 pts/0 S+ 0:00 grep --color=auto exec_child
[acme@nine perf_events]$ ps ax|grep exec_child
26860 pts/0 S+ 0:00 grep --color=auto exec_child
[acme@nine perf_events]$
[acme@nine perf_events]$ for x in {0..100}; do (./remove_on_exec &); done
<SNIP>
While it runs I can see:
[acme@nine ~]$ ps ax|grep exec_child
28627 pts/0 R 0:00 exec_child
28668 pts/0 R 0:00 exec_child
28744 pts/0 R 0:00 exec_child
28813 pts/0 R 0:00 exec_child
28912 pts/0 R 0:00 exec_child
29666 pts/1 S+ 0:00 grep --color=auto exec_child
[acme@nine ~]$
at the end they disappeared, on this last run.
But if I do a 'killall remove_on_exec' and stop that loop (control+C/Z)
we get all those exec_child running a seemingly eternal loop:
[acme@nine ~]$ ps ax|grep exec_child|wc -l
479
[acme@nine ~]$ ps ax|grep exec_child|head
30143 pts/0 R 0:01 exec_child
30144 pts/0 R 0:01 exec_child
30145 pts/0 R 0:01 exec_child
30147 pts/0 R 0:01 exec_child
30150 pts/0 R 0:01 exec_child
30151 pts/0 R 0:01 exec_child
30153 pts/0 R 0:01 exec_child
30154 pts/0 R 0:01 exec_child
30155 pts/0 R 0:01 exec_child
30156 pts/0 R 0:01 exec_child
[acme@nine ~]$ sudo cat /proc/30143/stack
[<0>] irqentry_exit_to_user_mode+0x1c9/0x1e0
[<0>] asm_sysvec_apic_timer_interrupt+0x16/0x20
[acme@nine ~]$
[acme@nine perf_events]$ uptime
18:36:17 up 16 min, 2 users, load average: 460.01, 375.77, 240.59
[acme@nine perf_events]$ uptime
18:36:21 up 16 min, 2 users, load average: 461.53, 377.49, 241.87
[acme@nine perf_events]$ uptime
18:36:24 up 16 min, 2 users, load average: 462.85, 379.16, 243.14
[acme@nine perf_events]$ uptime
18:36:27 up 16 min, 2 users, load average: 462.85, 379.16, 243.14
[acme@nine perf_events]$
- Arnaldo
On 2024-03-12 18:42:38 [-0300], Arnaldo Carvalho de Melo wrote:
…
> But:
>
> [acme@nine ~]$ pidof exec_child
> 24273 24271 24270 24269 24268 24267 24266 24265 24264 24263 24262 24261 24260 24259 …
>
> [root@nine ~]# cat /proc/24263/stack
> [<0>] irqentry_exit_to_user_mode+0x1c9/0x1e0
> [<0>] asm_sysvec_apic_timer_interrupt+0x16/0x20
> [root@nine ~]#
…
> [acme@nine ~]$ ps ax|grep exec_child| wc -l
> 504
> [acme@nine ~]$ ps ax|grep exec_child| tail
> 24264 pts/0 R 0:04 exec_child
> 24265 pts/0 R 0:04 exec_child
> 24266 pts/0 R 0:04 exec_child
> 24267 pts/0 R 0:04 exec_child
> 24268 pts/0 R 0:04 exec_child
> 24269 pts/0 R 0:04 exec_child
> 24270 pts/0 R 0:04 exec_child
> 24271 pts/0 R 0:04 exec_child
> 24273 pts/0 R 0:04 exec_child
> 26704 pts/1 S+ 0:00 grep --color=auto exec_child
> [acme@nine ~]$
>
> All in 'R' state.
>
> [root@nine ~]# killall exec_child
> exec_child: no process found
> [root@nine ~]# ps ax | grep exec_child | head -5
> 22964 pts/0 R 0:06 exec_child
> 23046 pts/0 R 0:05 exec_child
> 23128 pts/0 R 0:05 exec_child
> 23129 pts/0 R 0:05 exec_child
> 23181 pts/0 R 0:05 exec_child
> [root@nine ~]# kill 22964 23046 23128 23129 23181
> [root@nine ~]# ps ax | grep exec_child | head -5
> 23182 pts/0 R 0:06 exec_child
> 23196 pts/0 R 0:06 exec_child
> 23197 pts/0 R 0:06 exec_child
> 23210 pts/0 R 0:06 exec_child
> 23213 pts/0 R 0:06 exec_child
> [root@nine ~]#

You can't kill them?

> at the end they disappeared, on this last run.
>
> But if I do a 'killall remove_on_exec' and stop that loop (control+C/Z)
> we get all those exec_child running a seemingly eternal loop:

Is this new or was it there? Is this VM or bare metal?

One part I don't get: did you let it run or did you kill it?
`exec_child' spins until a signal is received or the parent kills it. So
it shouldn't remain there for ever. And my guess, that it is in spinning
in userland and not in kernel.

I tried it on bare metal and VM and couldn't reproduce this.

…
> - Arnaldo

Sebastian
On Wed, Mar 13, 2024 at 09:13:03AM +0100, Sebastian Andrzej Siewior wrote:
> On 2024-03-12 18:42:38 [-0300], Arnaldo Carvalho de Melo wrote:
> …
> > But:
> >
> > [acme@nine ~]$ pidof exec_child
> > 24273 24271 24270 24269 24268 24267 24266 24265 24264 24263 24262 24261 24260 24259 …
<SNIP>
> > [root@nine ~]#
>
> You can't kill them?

I can, they remain in R state and I can kill them with
'kill `pidof exec_child`'

> > at the end they disappeared, on this last run.
>
> > But if I do a 'killall remove_on_exec' and stop that loop (control+C/Z)
> > we get all those exec_child running a seemingly eternal loop:
>
> Is this new or was it there? Is this VM or bare metal?

bare metal.

> One part I don't get: did you let it run or did you kill it?

If I let them run they will finish and exit, no exec_child remains.

If I instead try to stop the loop that goes on forking the 100 of them,
then the exec_child remain spinning.

> `exec_child' spins until a signal is received or the parent kills it. So
> it shouldn't remain there for ever. And my guess, that it is in spinning
> in userland and not in kernel.

Checking that now, the stack is the one I posted:

> > [root@nine ~]# cat /proc/24263/stack
> > [<0>] irqentry_exit_to_user_mode+0x1c9/0x1e0
> > [<0>] asm_sysvec_apic_timer_interrupt+0x16/0x20
> > [root@nine ~]#

> I tried it on bare metal and VM and couldn't reproduce this.

All my tests are in bare metal.

- Arnaldo
On Wed, Mar 13, 2024 at 10:28:44AM -0300, Arnaldo Carvalho de Melo wrote:
> On Wed, Mar 13, 2024 at 09:13:03AM +0100, Sebastian Andrzej Siewior wrote:
> > One part I don't get: did you let it run or did you kill it?
> If I let them run they will finish and exit, no exec_child remains.
> If I instead try to stop the loop that goes on forking the 100 of them,
> then the exec_child remain spinning.
> > `exec_child' spins until a signal is received or the parent kills it. So
> > it shouldn't remain there for ever. And my guess, that it is in spinning
> > in userland and not in kernel.
> Checking that now:
tldr; the tight loop, full details at the end.
100.00 b6: mov signal_count,%eax
test %eax,%eax
↑ je b6
remove_on_exec.c
/* For exec'd child. */
static void exec_child(void)
{
struct sigaction action = {};
const int val = 42;
/* Set up sigtrap handler in case we erroneously receive a trap. */
action.sa_flags = SA_SIGINFO | SA_NODEFER;
action.sa_sigaction = sigtrap_handler;
sigemptyset(&action.sa_mask);
if (sigaction(SIGTRAP, &action, NULL))
_exit((perror("sigaction failed"), 1));
/* Signal parent that we're starting to spin. */
if (write(STDOUT_FILENO, &val, sizeof(int)) == -1)
_exit((perror("write failed"), 1));
/* Should hang here until killed. */
while (!signal_count);
}
So probably just a test needing to be a bit more polished?
Seems like it, on a newer machine, faster, I managed to reproduce it on
a non-RT kernel, with one exec_child remaining:
1.44 b6: mov signal_count,%eax
test %eax,%eax
98.56 ↑ je b6
same tight loop:
acme@x1:~/git/perf-tools-next/tools/testing/selftests/perf_events$ pidof exec_child
722300
acme@x1:~/git/perf-tools-next/tools/testing/selftests/perf_events$ ps ax|grep exec_child
722300 pts/2 R 4:08 exec_child
722502 pts/2 S+ 0:00 grep --color=auto exec_child
acme@x1:~/git/perf-tools-next/tools/testing/selftests/perf_events$
- Arnaldo
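(Aside on the "more polished" remark above: one conceivable tweak,
purely illustrative and assuming the test keeps its signal_count
counter, would be to bound the spin with a deadline so an orphaned
child exits on its own instead of burning a CPU forever.)

#include <time.h>
#include <unistd.h>

static volatile int signal_count;	/* incremented by sigtrap_handler() */

static void spin_until_signal_or_deadline(void)
{
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	while (!signal_count) {
		clock_gettime(CLOCK_MONOTONIC, &now);
		if (now.tv_sec - start.tv_sec > 60)
			_exit(1);	/* nobody is coming; give up */
	}
}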
[root@nine ~]# perf record --call-graph dwarf -p 35785
^C[ perf record: Woken up 48 times to write data ]
[ perf record: Captured and wrote 12.120 MB perf.data (1503 samples) ]
[root@nine ~]# ls -la perf.data
-rw-------. 1 root root 12720152 Mar 13 10:32 perf.data
[root@nine ~]#
[root@nine ~]# perf report --no-child --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 1K of event 'cycles:P'
# Event count (approx.): 926018718
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. ......................................
#
98.48% exe remove_on_exec [.] exec_child
|
---exec_child
main
__libc_start_call_main
__libc_start_main@@GLIBC_2.34
_start
0.33% exe [kernel.kallsyms] [k] arch_scale_freq_tick
0.13% exe [kernel.kallsyms] [k] debug_smp_processor_id
0.13% exe [kernel.kallsyms] [k] check_cpu_stall
0.13% exe [kernel.kallsyms] [k] acct_account_cputime
0.13% exe [kernel.kallsyms] [k] cpuacct_account_field
0.07% exe [kernel.kallsyms] [k] preempt_count_add
0.07% exe [kernel.kallsyms] [k] update_irq_load_avg
0.07% exe [kernel.kallsyms] [k] cgroup_rstat_updated
0.07% exe [kernel.kallsyms] [k] rcu_sched_clock_irq
0.07% exe [kernel.kallsyms] [k] account_user_time
0.07% exe [kernel.kallsyms] [k] __hrtimer_run_queues
0.07% exe [kernel.kallsyms] [k] tick_nohz_highres_handler
0.07% exe [kernel.kallsyms] [k] ktime_get_update_offsets_now
0.06% exe [kernel.kallsyms] [k] __enqueue_entity
0.06% exe [kernel.kallsyms] [k] tick_sched_handle
0.00% exe [kernel.kallsyms] [k] __intel_pmu_enable_all.constprop.0
#
# (Tip: To show assembler sample contexts use perf record -b / perf script -F +brstackinsn --xed)
#
[root@nine ~]#
[root@nine ~]# perf annotate --stdio2 exec_child
Samples: 1K of event 'cycles:P', 4000 Hz, Event count (approx.): 911943256, [percent: local period]
exec_child() /home/acme/git/linux/tools/testing/selftests/perf_events/remove_on_exec
Percent
Disassembly of section .text:
00000000004045cf <exec_child>:
push %rbp
mov %rsp,%rbp
sub $0xb0,%rsp
lea -0xa0(%rbp),%rdx
mov $0x0,%eax
mov $0x13,%ecx
mov %rdx,%rdi
rep stos %rax,%es:(%rdi)
movl $0x2a,-0xa4(%rbp)
movl $0x40000004,-0x18(%rbp)
movq $0x402a2e,-0xa0(%rbp)
lea -0xa0(%rbp),%rax
add $0x8,%rax
mov %rax,%rdi
→ callq sigemptyset@plt
lea -0xa0(%rbp),%rax
mov $0x0,%edx
mov %rax,%rsi
mov $0x5,%edi
→ callq sigaction@plt
test %eax,%eax
↓ je 82
mov $0x4058af,%edi
→ callq perror@plt
mov $0x1,%edi
→ callq _exit@plt
82: lea -0xa4(%rbp),%rax
mov $0x4,%edx
mov %rax,%rsi
mov $0x1,%edi
→ callq write@plt
cmp $0xffffffffffffffff,%rax
↓ jne b5
mov $0x4058c0,%edi
→ callq perror@plt
mov $0x1,%edi
→ callq _exit@plt
b5: nop
100.00 b6: mov signal_count,%eax
test %eax,%eax
↑ je b6
nop
nop
leaveq
← retq
[root@nine ~]#
On Wed, 13 Mar 2024 at 14:47, Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
>
> On Wed, Mar 13, 2024 at 10:28:44AM -0300, Arnaldo Carvalho de Melo wrote:
> > On Wed, Mar 13, 2024 at 09:13:03AM +0100, Sebastian Andrzej Siewior wrote:
> > > One part I don't get: did you let it run or did you kill it?
>
> > If I let them run they will finish and exit, no exec_child remains.
>
> > If I instead try to stop the loop that goes on forking the 100 of them,
> > then the exec_child remain spinning.
>
> > > `exec_child' spins until a signal is received or the parent kills it. So
>
> > > it shouldn't remain there for ever. And my guess, that it is in spinning
> > > in userland and not in kernel.
>
> > Checking that now:
>
> tldr; the tight loop, full details at the end.
>
> 100.00 b6: mov signal_count,%eax
> test %eax,%eax
> ↑ je b6
>
> remove_on_exec.c
>
> /* For exec'd child. */
> static void exec_child(void)
> {
> struct sigaction action = {};
> const int val = 42;
>
> /* Set up sigtrap handler in case we erroneously receive a trap. */
> action.sa_flags = SA_SIGINFO | SA_NODEFER;
> action.sa_sigaction = sigtrap_handler;
> sigemptyset(&action.sa_mask);
> if (sigaction(SIGTRAP, &action, NULL))
> _exit((perror("sigaction failed"), 1));
>
> /* Signal parent that we're starting to spin. */
> if (write(STDOUT_FILENO, &val, sizeof(int)) == -1)
> _exit((perror("write failed"), 1));
>
> /* Should hang here until killed. */
> while (!signal_count);
> }
>
> So probably just a test needing to be a bit more polished?
Yes, possible.
> Seems like it, on a newer machine, faster, I managed to reproduce it on
> a non-RT kernel, with one exec_child remaining:
>
> 1.44 b6: mov signal_count,%eax
> test %eax,%eax
> 98.56 ↑ je b6
It's unclear to me why that happens. But I do recall seeing it before,
and my explanation was that with too many concurrent copies of the
test the system was simply overloaded (or maybe even ran out of
memory?), because the stress test also spawns 30 parallel copies of the
"exec_child" subprocess. So with the 100 parallel copies we end up with
30 * 100 = 3000 processes. Maybe that's too much?
In any case, if the kernel didn't fall over during that kind of stress
testing, and the test itself passes when run as a single copy, then
I'd conclude all looks good.
This particular feature of perf along with testing it once before
melted Peter's and my brain [1]. I hope your experience didn't result
in complete brain-melt. ;-)
[1] https://lore.kernel.org/all/Y0VofNVMBXPOJJr7@elver.google.com/
Thanks,
-- Marco
On 2024-03-13 10:47:33 [-0300], Arnaldo Carvalho de Melo wrote:
> /* For exec'd child. */
> static void exec_child(void)
> {
> struct sigaction action = {};
> const int val = 42;
>
> /* Set up sigtrap handler in case we erroneously receive a trap. */
> action.sa_flags = SA_SIGINFO | SA_NODEFER;
> action.sa_sigaction = sigtrap_handler;
> sigemptyset(&action.sa_mask);
> if (sigaction(SIGTRAP, &action, NULL))
> _exit((perror("sigaction failed"), 1));
>
> /* Signal parent that we're starting to spin. */
> if (write(STDOUT_FILENO, &val, sizeof(int)) == -1)
> _exit((perror("write failed"), 1));
>
> /* Should hang here until killed. */
> while (!signal_count);
> }
>
> So probably just a test needing to be a bit more polished?
Maybe. I'm not sure where this is coming from. Either someone should
kill it or the signal should be delivered but, hmm… If the signal isn't
coming then it might be one of those without a perf counter.
> Seems like it, on a newer machine, faster, I managed to reproduce it on
> a non-RT kernel, with one exec_child remaining:
Okay, so no regression. That is something ;)
> - Arnaldo
Sebastian
On 2024-03-13 10:28:41 [-0300], Arnaldo Carvalho de Melo wrote:
> > One part I don't get: did you let it run or did you kill it?
>
> If I let them run they will finish and exit, no exec_child remains.
>
> If I instead try to stop the loop that goes on forking the 100 of them,
> then the exec_child remain spinning.

Okay. So that problem only exists if you intervene. And you can
reproduce this odd behaviour with my patches but not without them,
right?

> > it shouldn't remain there for ever. And my guess, that it is in spinning
> > in userland and not in kernel.
>
> Checking that now, the stack is the one I posted:
>
> > > [root@nine ~]# cat /proc/24263/stack
> > > [<0>] irqentry_exit_to_user_mode+0x1c9/0x1e0
> > > [<0>] asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > [root@nine ~]#

could you resolve irqentry_exit_to_user_mode+0x1c9/0x1e0, please?

> > I tried it on bare metal and VM and couldn't reproduce this.
>
> All my tests are in bare metal.

Would you mind sending me your .config? The shell is bash I guess. I
will try to reproduce your setup on another box…

> - Arnaldo

Sebastian
On Wed, Mar 13, 2024 at 02:46:45PM +0100, Sebastian Andrzej Siewior wrote:
> On 2024-03-13 10:28:41 [-0300], Arnaldo Carvalho de Melo wrote:
> > > One part I don't get: did you let it run or did you kill it?
> >
> > If I let them run they will finish and exit, no exec_child remains.
> >
> > If I instead try to stop the loop that goes on forking the 100 of them,
> > then the exec_child remain spinning.
>
> Okay. So that problem only exists if you intervene. And you can
> reproduce this odd behaviour with my patches but not without them,
> right?

See the next message, I managed to reproduce that behaviour in a non-RT
kernel as well.

- Arnaldo