The perf core code fails to account for the total_enable_time of an event
when its state is inactive.
Here is an error case showing the missing total_enable_time accounting for
a core event:
sudo ./perf stat -vvv -e armv8_pmuv3_0/event=0x08/ -e armv8_pmuv3_1/event=0x08/ -- stress-ng --pthread=2 -t 2s
...
armv8_pmuv3_0/event=0x08/: 1138698008 2289429840 2174835740
armv8_pmuv3_1/event=0x08/: 1826791390 1950025700 847648440
Performance counter stats for 'stress-ng --pthread=2 -t 2s':
1,138,698,008 armv8_pmuv3_0/event=0x08/ (94.99%)
1,826,791,390 armv8_pmuv3_1/event=0x08/ (43.47%)
Since the above two events belong to the same task context and are
mutually exclusive per CPU (they cannot be active at the same time on the
same CPU), their total_enable_time should be the same (the 2nd of the 3
numbers) and the two ratios should sum to 100%.
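(For reference -- assuming the usual value/enabled/running ordering of the
-vvv output, i.e. what a read() with PERF_FORMAT_TOTAL_TIME_ENABLED |
PERF_FORMAT_TOTAL_TIME_RUNNING returns -- the percentages above are just
running/enabled:
	2174835740 / 2289429840 = 94.99%   <- armv8_pmuv3_0
	 847648440 / 1950025700 = 43.47%   <- armv8_pmuv3_1, computed against the
	                                      too-small enabled time 1950025700
	                                      instead of 2289429840)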
This accounting failure of total_enable_time can happen in the following
sequence.
1. two events are opened:
- the first event (e0) is opened on pmu0 (p0), which can be added only on CPU0.
- the second event (e1) is opened on pmu1 (p1), which can be added only on CPU1.
- these two events belong to the same task_ctx.
at this point:
task_ctx->time = 0
e0->total_enable_time = 0
e0->total_running_time = 0
e1->total_enable_time = 0
e1->total_running_time = 0
2. the task_ctx is sched in on CPU0.
- In ctx_sched_in(), task_ctx->time isn't updated.
- In event_sched_in(), e0 is activated, so its state becomes ACTIVE.
- In event_sched_in(), e1 is activated but soon becomes !ACTIVE,
because pmu1 doesn't support CPU0 and so pmu1->add() fails.
at this point:
task_ctx->time = 0
e0->total_enable_time = 0
e0->total_running_time = 0
e1->total_enable_time = 0
e1->total_running_time = 0
3. the task_ctx is sched out from CPU0; suppose the scheduled time was t1.
- In ctx_sched_out(), task_ctx->time is updated to t1.
- e0 was ACTIVE, so in event_sched_out()
e0->total_enable_time and e0->total_running_time are updated to t1.
- e1 was !ACTIVE, so
e1->total_enable_time and e1->total_running_time are not updated.
at this point:
task_ctx->time = t1
e0->total_enable_time = t1
e0->total_running_time = t1
e1->total_enable_time = 0
e1->total_running_time = 0
4. the task_ctx is sched in on CPU1.
- task_ctx->time doesn't change.
- In event_sched_in(), e0 is activated but soon becomes !ACTIVE,
because pmu0 doesn't support CPU1 and so pmu0->add() fails.
- In event_sched_in(), e1 is activated, and when its state changes
to ACTIVE, e1->total_enable_time is updated to task_ctx->time (t1).
at this point:
task_ctx->time = t1
e0->total_enable_time = t1
e0->total_running_time = t1
e1->total_enable_time = t1
e1->total_running_time = 0
5. the task_ctx is sched out from CPU1; suppose the scheduled time was t2.
- In ctx_sched_out(), task_ctx->time is updated to t1 + t2.
- e0 was !ACTIVE, so
e0->total_enable_time and e0->total_running_time are not updated.
- e1 was ACTIVE, so in event_sched_out()
e1->total_enable_time and e1->total_running_time are updated by t2.
at this point:
task_ctx->time = t1 + t2
e0->total_enable_time = t1
e0->total_running_time = t1
e1->total_enable_time = t1 + t2
e1->total_running_time = t2
6. the task_ctx is sched in on CPU0 again.
- task_ctx->time doesn't change.
- In event_sched_in(), e0 is activated, and when its state changes
to ACTIVE, e0->total_enable_time is updated to task_ctx->time (t1 + t2).
- In event_sched_in(), e1 is activated but soon becomes !ACTIVE,
because pmu1 doesn't support CPU0 and so pmu1->add() fails.
at this point:
task_ctx->time = t1 + t2
e0->total_enable_time = t1 + t2
e0->total_running_time = t1
e1->total_enable_time = t1 + t2
e1->total_running_time = t2
7. the task exits and the task_ctx is sched out; suppose the scheduled time was t3.
- In ctx_sched_out(), task_ctx->time is updated to t1 + t2 + t3.
- e0 was ACTIVE, so in event_sched_out()
e0->total_enable_time and e0->total_running_time are updated by t3.
- e1 was !ACTIVE, so
e1->total_enable_time and e1->total_running_time are not updated.
at this point:
task_ctx->time = t1 + t2 + t3
e0->total_enable_time = t1 + t2 + t3
e0->total_running_time = t1 + t3
e1->total_enable_time = t1 + t2
e1->total_running_time = t2
In step 7, the t3 chunk of enable time is missing from e1, so perf prints
a wrong total_running_time/total_enable_time ratio for e1.
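(In other words, perf ends up computing e1's ratio as
t2 / (t1 + t2) instead of the expected t2 / (t1 + t2 + t3), which inflates
e1's percentage.)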
Here is a summary of how the times change along the timeline:
            |      CPU0      |        CPU1         |       CPU0 (again)        |
            |  in  |   out   |  in  |     out      |    in    |  out (exit)    |
 -----------+------+---------+------+--------------+----------+----------------+
  ctx->time |  0   |   t1    |  t1  |   t1 + t2    | t1 + t2  | t1 + t2 + t3   |
  e0->ena   |  0   |   t1    |  t1  |     t1       | t1 + t2  | t1 + t2 + t3   |
  e0->run   |  0   |   t1    |  t1  |     t1       |    t1    |    t1 + t3     |
  e1->ena   |  0   |   0     |  t1  |   t1 + t2    | t1 + t2  |    t1 + t2     |
  e1->run   |  0   |   0     |  0   |     t2       |    t2    |      t2        |
 -----------+------+---------+------+--------------+----------+----------------+
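To make the mechanism concrete, here is a tiny self-contained model (not
kernel code; the struct, enum and set_state() names below are made up for
illustration) of the rule that time deltas are folded into the totals only
on a state change. It reproduces how the '+ t3' chunk goes missing for e1
when its state is overwritten directly at exit:

#include <stdio.h>

enum state { STATE_OFF, STATE_INACTIVE, STATE_ACTIVE };

struct ev {
	enum state state;
	unsigned long long tstamp;        /* ctx time at last state change      */
	unsigned long long total_enabled; /* accumulated while >= INACTIVE      */
	unsigned long long total_running; /* accumulated while ACTIVE           */
};

/* fold in the delta since the last state change, then record the new state */
static void set_state(struct ev *e, enum state s, unsigned long long ctx_time)
{
	unsigned long long delta = ctx_time - e->tstamp;

	if (e->state >= STATE_INACTIVE)
		e->total_enabled += delta;
	if (e->state == STATE_ACTIVE)
		e->total_running += delta;
	e->tstamp = ctx_time;
	e->state = s;
}

int main(void)
{
	/* e1's life from the table: INACTIVE on CPU0, ACTIVE on CPU1, INACTIVE again */
	struct ev buggy = { .state = STATE_INACTIVE };
	struct ev fixed = { .state = STATE_INACTIVE };
	unsigned long long t1 = 100, t2 = 200, t3 = 50;

	set_state(&buggy, STATE_ACTIVE,   t1);
	set_state(&buggy, STATE_INACTIVE, t1 + t2);
	buggy.state = STATE_OFF;                   /* direct write at ctx time t1+t2+t3:
	                                            * the final t3 is never folded in   */

	set_state(&fixed, STATE_ACTIVE,   t1);
	set_state(&fixed, STATE_INACTIVE, t1 + t2);
	set_state(&fixed, STATE_OFF,      t1 + t2 + t3); /* proper state change: t3 kept */

	printf("buggy: ena=%llu run=%llu\n", buggy.total_enabled, buggy.total_running);
	printf("fixed: ena=%llu run=%llu\n", fixed.total_enabled, fixed.total_running);
	return 0;
}

With t1=100, t2=200, t3=50 this prints ena=300 (t1 + t2) for the buggy path
and ena=350 (t1 + t2 + t3) for the fixed one, matching the table above.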
To address this, update total_enable_time in event_sched_out() when the
event's state is PERF_EVENT_STATE_INACTIVE.
After this patch, the problem is gone:
sudo ./perf stat -vvv -e armv8_pmuv3_0/event=0x08/ -e armv8_pmuv3_1/event=0x08/ -- stress-ng --pthread=2 -t 10s
...
armv8_pmuv3_0/event=0x08/: 15396770398 32157963940 21898169000
armv8_pmuv3_1/event=0x08/: 22428964974 32157963940 10259794940
Performance counter stats for 'stress-ng --pthread=2 -t 10s':
15,396,770,398 armv8_pmuv3_0/event=0x08/ (68.10%)
22,428,964,974 armv8_pmuv3_1/event=0x08/ (31.90%)
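(Quick check with the same running/enabled ratio as above:
	21898169000 / 32157963940 = 68.10%
	10259794940 / 32157963940 = 31.90%
Both events now report the same enabled time, and the ratios sum to 100%.)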
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
---
kernel/events/core.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 065f9188b44a..308850449a84 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2311,6 +2311,11 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
WARN_ON_ONCE(event->ctx != ctx);
lockdep_assert_held(&ctx->lock);
+ if (event->state == PERF_EVENT_STATE_INACTIVE) {
+ perf_event_update_time(event);
+ return;
+ }
+
if (event->state != PERF_EVENT_STATE_ACTIVE)
return;
@@ -2351,7 +2356,7 @@ group_sched_out(struct perf_event *group_event, struct perf_event_context *ctx)
{
struct perf_event *event;
- if (group_event->state != PERF_EVENT_STATE_ACTIVE)
+ if (group_event->state < PERF_EVENT_STATE_INACTIVE)
return;
perf_assert_pmu_disabled(group_event->pmu_ctx->pmu);
--
On Fri, Dec 20, 2024 at 10:02:02AM +0000, Yeoreum Yun wrote:
> The perf core code fails to account for total_enable_time of event
> when its state is inactive.
>
> Here is the error case for failure to account for total_enable_time for
> core event:
>
> sudo ./perf stat -vvv -e armv8_pmuv3_0/event=0x08/ -e armv8_pmuv3_1/event=0x08/ -- stress-ng --pthread=2 -t 2s
> ...
> armv8_pmuv3_0/event=0x08/: 1138698008 2289429840 2174835740
> armv8_pmuv3_1/event=0x08/: 1826791390 1950025700 847648440
^^^^^
So *that* number is wrong and should have been equal to the number above
it.
>
> Performance counter stats for 'stress-ng --pthread=2 -t 2s':
>
> 1,138,698,008 armv8_pmuv3_0/event=0x08/ (94.99%)
> 1,826,791,390 armv8_pmuv3_1/event=0x08/ (43.47%)
>
> Since above two events are belong to the same task context and
> mutually-exclusive per cpu (they couldn't be active at the same time on the same cpu),
> the total_enable_time should be the same (the 2nd number among 3 numbers)
> and the summation of ratio should be 100%.
So the enable_time for the second event is too low, meaning the ratio
should have been ~15% rather than ~43%.
/me goes try and digest the wall of text below for more clues :/
> This account failure of total_enable_time for event could happen in below sequence.
>
> 1. two event opened with:
> - first event (e0) is opened with pmu0(p0) which could be added on CPU0.
> - second event (e1) is opened with pmu1(p1) which could be added on CPU1.
> - these two event belongs to the same task_ctx.
>
> at this point:
> task_ctx->time = 0
> e0->total_enable_time = 0
> e0->total_running_time = 0
> e1->total_enable_time = 0
> e1->total_running_time = 0
>
> 2. the task_ctx is sched in CPU0.
> - In ctx_sched_in(), the task_ctx->time doesn't updated.
> - In event_sched_in() e0 is activated so, its state becomes ACTIVE.
> - In event_sched_in() e1 is activated, but soon becomes !ACTIVE
> because pmu1 doesn't support CPU1 so it failed in pmu1->add().
This doesn't make sense; e1 should never reach event_sched_in() and it
should already be INACTIVE.
Notably, events are created INACTIVE when !attr->disabled.
Also, scheduling should not get beyond merge_sched_in()'s
event_filter_match(), which will find the CPU is a mismatch and stop
right there.
This also means the event (e1) does not get to go on flexible_active
(see below).
>
> at this point:
> task_ctx->time = 0
> e0->total_enable_time = 0
> e0->total_running_time = 0
> e1->total_enable_time = 0
> e1->total_running_time = 0
Both have the same state, surely that could've been compressed somewhat.
/me deletes giant wall of text
> 7. the task exit and task_ctx is sched out. and suppose scheulded time as t3.
> - In ctx_sched_out(), the task_ctx->time updated with t1 + t2 + t3.
> - e0 was ACTIVE, so In event_sched_out(),
> e0->total_enable_time and e1->total_running_time are updated with t3.
> - e1 was !ACTIVE, so it doesn't update
> e1->total_enable_time and e1->total_running_time.
>
> at this point:
> task_ctx->time = t1 + t2 + t3
> e0->total_enable_time = t1 + t2 +t3
> e0->total_running_time = t1 + t3
> e1->total_enable_time = t1 + t2
> e1->total_running_time = t2
>
> In step 7, amount of t3 enable time is missing in e1.
> So, the perf prints error ratio for total_running_time/total_enable_time
> for e1.
>
> Here is the summary of change of time as timeline:
>
> CPU0 CPU1 CPU0
> | in | out | | in | out | | in | out (exit) |
> ------------------------------------------------------------------------------
> ctx->time | 0 | t1 | | t1 | t1 + t2 | | t1 + t2 | t1 + t2 + t3 |
> ------------------------------------------------------------------------------
> e0->ena | 0 | t1 | | t1 | t1 *| | t1 + t2 | t1 + t2 + t3 |
> ------------------------------------------------------------------------------
> e0->run | 0 | t1 | | t1 | t1 | | t1 + t2 | t1 + t3 |
> ------------------------------------------------------------------------------
> e1->ena | 0 | 0 *| | t1 | t1 + t2 | | t1 + t2 | t1 + t2 XXX |
> ------------------------------------------------------------------------------
> e1->run | 0 | 0 | | 0 | t2 | | t1 + t2 | t2 |
> ------------------------------------------------------------------------------
So just the table is fine (better even than this giant wall of text),
but you have to explain what's what. Eg, there is no clear progression,
and having two columns both labeled CPU0 is highly confusing.
Also, a visual indication of where the error happens is useful. So the
cell marked with XXX is the failure case, right, that's missing '+ t3'.
Except then you should also mark those I stuck a * on.
> To address this, update total_enable_time in event_sched_out() when event state
> is PERF_EVENT_STATE_INACTIVE.
This is a giant jump that I'm not following. Notably ctx_sched_out()
will only iterate pmu_ctx->{pinned,flexible}_active and that list should
only include ACTIVE events.
So how does handling INACTIVE in event_sched_out() even begin to help?
Additionally, where does perf-stat get those numbers from; because
afaict perf_event_read() will do perf_event_update_time() and you should
be getting the correct numbers, no?
I'm utterly lost in the perf-stat code :-( Also it appears I can't even
build tools/perf anymore :-(
> After this patch, this problem is gone like:
Yes, but why? Something smells.
Looking at group_sched_in(), I'm thinking that that STATE_OFF thing is
totally wrong, should that not be something like:
if (group_event->state <= PERF_EVENT_STATE_OFF)
return -EBUSY;
or something !0 at any rate.
So what path leads to event_sched_out() with !ACTIVE that's relevant
here?
It can't have been that group_sched_out(), because that wasn't in your
first patch.
How is this a fix and why is it the right fix?
Hi, Peter.
On Fri, Dec 20, 2024 at 02:33:59PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 20, 2024 at 10:02:02AM +0000, Yeoreum Yun wrote:
> > The perf core code fails to account for total_enable_time of event
> > when its state is inactive.
> >
> > Here is the error case for failure to account for total_enable_time for
> > core event:
> >
> > sudo ./perf stat -vvv -e armv8_pmuv3_0/event=0x08/ -e armv8_pmuv3_1/event=0x08/ -- stress-ng --pthread=2 -t 2s
> > ...
> > armv8_pmuv3_0/event=0x08/: 1138698008 2289429840 2174835740
> > armv8_pmuv3_1/event=0x08/: 1826791390 1950025700 847648440
>
> So *that* number is wrong and should have been equal to the number above
> it.
Yes it is.
> >
> > Performance counter stats for 'stress-ng --pthread=2 -t 2s':
> >
> > 1,138,698,008 armv8_pmuv3_0/event=0x08/ (94.99%)
> > 1,826,791,390 armv8_pmuv3_1/event=0x08/ (43.47%)
> >
> > Since above two events are belong to the same task context and
> > mutually-exclusive per cpu (they couldn't be active at the same time on the same cpu),
> > the total_enable_time should be the same (the 2nd number among 3 numbers)
> > and the summation of ratio should be 100%.
>
> So the enable_time for the second event is too low, meaning the ratio
> hsould have been ~15% rather than ~43%.
>
> /me goes try and congest the wall of text below for more clues :/
>
> > This account failure of total_enable_time for event could happen in below sequence.
> >
> > 1. two event opened with:
> > - first event (e0) is opened with pmu0(p0) which could be added on CPU0.
> > - second event (e1) is opened with pmu1(p1) which could be added on CPU1.
> > - these two event belongs to the same task_ctx.
> >
> > at this point:
> > task_ctx->time = 0
> > e0->total_enable_time = 0
> > e0->total_running_time = 0
> > e1->total_enable_time = 0
> > e1->total_running_time = 0
> >
> > 2. the task_ctx is sched in CPU0.
> > - In ctx_sched_in(), the task_ctx->time doesn't updated.
> > - In event_sched_in() e0 is activated so, its state becomes ACTIVE.
> > - In event_sched_in() e1 is activated, but soon becomes !ACTIVE
> > because pmu1 doesn't support CPU1 so it failed in pmu1->add().
>
> This doesn't make sense; e1 should never reach event_sched_in() and it
> should already be INACTIVE.
>
> Notable events are created INACTIVE when !attr->disabled.
But in the perf stat code it goes via enable_counter(), so it's set to
INACTIVE.
> Also, scheduling should not get beyond merge_sched_in()'s
> event_filter_match(), which will find the CPU is a mismatch and stop
> right there.
>
> This also means the event (e1) does not get to go on flexible_active
> (see below).
No, with the perf stat command above the cpu is set to -1,
so the event isn't filtered out in event_filter_match(); it enters
merge_sched_in() and gets into event_sched_in().
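Roughly, what that perf stat invocation boils down to is something like the
sketch below (not the actual tools/perf code; the type/config values are
just illustrative stand-ins for the armv8_pmuv3 events): a per-task counter
with cpu == -1, created disabled and enabled later, so it sits in INACTIVE
rather than being filtered out.

#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static int open_task_counter(pid_t child)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_RAW;	/* stand-in for armv8_pmuv3_N/event=0x08/ */
	attr.config = 0x08;
	attr.disabled = 1;		/* created OFF */
	attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
			   PERF_FORMAT_TOTAL_TIME_RUNNING;

	/* pid = child, cpu = -1: follow the task across all CPUs */
	return syscall(__NR_perf_event_open, &attr, child, -1, -1, 0);
}

/* later, once the workload is started:
 *	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);	// event becomes INACTIVE, then scheduled
 */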
> > at this point:
> > task_ctx->time = 0
> > e0->total_enable_time = 0
> > e0->total_running_time = 0
> > e1->total_enable_time = 0
> > e1->total_running_time = 0
>
> Both have the same state, surely that could've been compressed somewhat.
>
> /me deletes giant wall of text
>
> > 7. the task exit and task_ctx is sched out. and suppose scheulded time as t3.
> > - In ctx_sched_out(), the task_ctx->time updated with t1 + t2 + t3.
> > - e0 was ACTIVE, so In event_sched_out(),
> > e0->total_enable_time and e1->total_running_time are updated with t3.
> > - e1 was !ACTIVE, so it doesn't update
> > e1->total_enable_time and e1->total_running_time.
> >
> > at this point:
> > task_ctx->time = t1 + t2 + t3
> > e0->total_enable_time = t1 + t2 +t3
> > e0->total_running_time = t1 + t3
> > e1->total_enable_time = t1 + t2
> > e1->total_running_time = t2
> >
> > In step 7, amount of t3 enable time is missing in e1.
> > So, the perf prints error ratio for total_running_time/total_enable_time
> > for e1.
> >
> > Here is the summary of change of time as timeline:
> >
> > CPU0 CPU1 CPU0
> > | in | out | | in | out | | in | out (exit) |
> > ------------------------------------------------------------------------------
> > ctx->time | 0 | t1 | | t1 | t1 + t2 | | t1 + t2 | t1 + t2 + t3 |
> > ------------------------------------------------------------------------------
> > e0->ena | 0 | t1 | | t1 | t1 *| | t1 + t2 | t1 + t2 + t3 |
> > ------------------------------------------------------------------------------
> > e0->run | 0 | t1 | | t1 | t1 | | t1 + t2 | t1 + t3 |
> > ------------------------------------------------------------------------------
> > e1->ena | 0 | 0 *| | t1 | t1 + t2 | | t1 + t2 | t1 + t2 XXX |
> > ------------------------------------------------------------------------------
> > e1->run | 0 | 0 | | 0 | t2 | | t1 + t2 | t2 |
> > ------------------------------------------------------------------------------
>
> So just the table is fine (better even than this giant wall of text),
> but you have to explain what's what. Eg, there is no clear progression,
> and having two columns both labeled CPU0 is highly confusing.
>
> Also, a visual indication of where the error happens is useful. So the
> cell marked with XXX is the failure case, right, that's missing '+ t3'.
>
> Except then you should also mark those I stuck a * on.
>
Thanks. I'll rewrite it again.
> > To address this, update total_enable_time in event_sched_out() when event state
> > is PERF_EVENT_STATE_INACTIVE.
>
> This is a giant jump that I'm not following. Notably ctx_sched_out()
> will only iterate pmu_ctx->{pinned,flexible}_active and that list should
> only include ACTIVE events.
> So how does handling INACTIVE in event_sched_out() even begin to help?
The answer is in perf_event_exit_event()'s perf_remove_from_context():
there, event_sched_out() is called via __perf_remove_from_context(),
so in the case above the enable time is fixed up at that point.
But the group_sched_out() change was my mistake. Sorry.
>
> Additionally, where does perf-stat get those numbers from; because
> afaict perf_event_read() will do perf_event_update_time() and you should
> be getting the correct numbers, no?
>
> I'm utterly lost in the perf-stat code :-( Also it appears I can't even
> build tools/perf anymore :-(
>
After the task exits, ctx->is_active no longer has EVENT_TIME, so
ctx->time and the event times aren't updated there.
So the child task_ctx is collected with the missing total_enable_time.
> > After this patch, this problem is gone like:
>
> Yes, but why? Something smells.
> Looking at group_sched_in(), I'm thinking that that STATE_OFF thing is
> totally wrong, should that not be something like:
>
> if (group_event->state <= PERF_EVENT_STATE_OFF)
> return -EBUSY;
>
> or something !0 at any rate.
>
>
> So what path leads to event_sched_out() with !ACTIVE that's relevant
> here?
>
> It can't have geen that group_sched_out(), because that wasn't in your
> first patch.
>
> How is this a fix and why is it the right fix?
>
>
As I said above, at exit time __perf_remove_from_context() calls
event_sched_out(), so that is where the missing enabled time gets fixed
up at exit.
In the middle of the run, the event's total_enable_time is updated in
ctx_sched_in() (as I said above, the event isn't filtered out because it
runs with cpu == -1).
So it's enough to add the missing enabled time at exit, and that is done
via event_sched_out() in __perf_remove_from_context().
The group_sched_out() change was my mistake, but the update in
event_sched_out() alone is enough, and I think it's the right fix.
On Fri, Dec 20, 2024 at 02:05:39PM +0000, Yeoreum Yun wrote:
> > > This account failure of total_enable_time for event could happen in below sequence.
> > >
> > > 1. two event opened with:
> > > - first event (e0) is opened with pmu0(p0) which could be added on CPU0.
> > > - second event (e1) is opened with pmu1(p1) which could be added on CPU1.
> > > - these two event belongs to the same task_ctx.
> > >
> > > at this point:
> > > task_ctx->time = 0
> > > e0->total_enable_time = 0
> > > e0->total_running_time = 0
> > > e1->total_enable_time = 0
> > > e1->total_running_time = 0
> > >
> > > 2. the task_ctx is sched in CPU0.
> > > - In ctx_sched_in(), the task_ctx->time doesn't updated.
> > > - In event_sched_in() e0 is activated so, its state becomes ACTIVE.
> > > - In event_sched_in() e1 is activated, but soon becomes !ACTIVE
> > > because pmu1 doesn't support CPU1 so it failed in pmu1->add().
> >
> > This doesn't make sense; e1 should never reach event_sched_in() and it
> > should already be INACTIVE.
> >
> > Notable events are created INACTIVE when !attr->disabled.
>
> But in perf stat code, it via enable_counter(), so it's set with
> INACTIVE.
your text above references ctx_sched_in(), what you're now saying is
__perf_event_enable(); *that* will indeed set INACTIVE, but it will then
also fail event_filter_match() and never even reschedule.
> > Also, scheduling should not get beyond merge_sched_in()'s
> > event_filter_match(), which will find the CPU is a mismatch and stop
> > right there.
> >
> > This also means the event (e1) does not get to go on flexible_active
> > (see below).
>
> No, when perf stat command with above, the cpu sets as == -1,
> So, It doesn't filter out in event_filter_match(). so it enter into
> merge_sched_in() and get into event_sched_in().
Hurmph, I thought the hybrid stuff used to set CPU.
Let me try and remember how the hybrid stuff works again. Ah
pmu::filter(), that's called in visit_groups_merge() and should stop
right there if the PMU doesn't work on that CPU.
Is your hybrid PMU not set up right?
> > > To address this, update total_enable_time in event_sched_out() when event state
> > > is PERF_EVENT_STATE_INACTIVE.
> >
> > This is a giant jump that I'm not following. Notably ctx_sched_out()
> > will only iterate pmu_ctx->{pinned,flexible}_active and that list should
> > only include ACTIVE events.
> > So how does handling INACTIVE in event_sched_out() even begin to help?
>
> the answer is in the perf_event_exit_event()'s
> perf_remove_from_context(). in here
> event_sched_out() is called via __perf_remove_from_context()
> So above case, the enable time is fixed in here.
OK, how's this then?
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 065f9188b44a..d12b402f9751 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2422,6 +2422,7 @@ __perf_remove_from_context(struct perf_event *event,
{
struct perf_event_pmu_context *pmu_ctx = event->pmu_ctx;
unsigned long flags = (unsigned long)info;
+ enum perf_event_state state = PERF_EVENT_STATE_OFF;
ctx_time_update(cpuctx, ctx);
@@ -2438,7 +2439,9 @@ __perf_remove_from_context(struct perf_event *event,
perf_child_detach(event);
list_del_event(event, ctx);
if (flags & DETACH_DEAD)
- event->state = PERF_EVENT_STATE_DEAD;
+ state = PERF_EVENT_STATE_DEAD;
+
+ perf_event_set_state(event, state);
if (!pmu_ctx->nr_events) {
pmu_ctx->rotate_necessary = 0;
On Fri, Dec 20, 2024 at 04:14:14PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 20, 2024 at 02:05:39PM +0000, Yeoreum Yun wrote:
> > the answer is in the perf_event_exit_event()'s
> > perf_remove_from_context(). in here
> > event_sched_out() is called via __perf_remove_from_context()
> > So above case, the enable time is fixed in here.
>
> OK, how's this then?
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 065f9188b44a..96a00034b26f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2406,7 +2406,8 @@ ctx_time_update_event(struct perf_event_context *ctx, struct perf_event *event)
#define DETACH_GROUP 0x01UL
#define DETACH_CHILD 0x02UL
-#define DETACH_DEAD 0x04UL
+#define DETACH_EXIT 0x04UL
+#define DETACH_DEAD 0x08UL
/*
* Cross CPU call to remove a performance event
@@ -2421,6 +2422,7 @@ __perf_remove_from_context(struct perf_event *event,
void *info)
{
struct perf_event_pmu_context *pmu_ctx = event->pmu_ctx;
+ enum perf_event_state state = PERF_EVENT_STATE_OFF;
unsigned long flags = (unsigned long)info;
ctx_time_update(cpuctx, ctx);
@@ -2429,16 +2431,20 @@ __perf_remove_from_context(struct perf_event *event,
* Ensure event_sched_out() switches to OFF, at the very least
* this avoids raising perf_pending_task() at this time.
*/
- if (flags & DETACH_DEAD)
+ if (flags & DETACH_EXIT)
+ state = PERF_EVENT_STATE_EXIT;
+ if (flags & DETACH_DEAD) {
event->pending_disable = 1;
+ state = PERF_EVENT_STATE_DEAD;
+ }
event_sched_out(event, ctx);
if (flags & DETACH_GROUP)
perf_group_detach(event);
if (flags & DETACH_CHILD)
perf_child_detach(event);
list_del_event(event, ctx);
- if (flags & DETACH_DEAD)
- event->state = PERF_EVENT_STATE_DEAD;
+
+ perf_event_set_state(event, min(event->state, state));
if (!pmu_ctx->nr_events) {
pmu_ctx->rotate_necessary = 0;
@@ -13371,12 +13377,7 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
mutex_lock(&parent_event->child_mutex);
}
- perf_remove_from_context(event, detach_flags);
-
- raw_spin_lock_irq(&ctx->lock);
- if (event->state > PERF_EVENT_STATE_EXIT)
- perf_event_set_state(event, PERF_EVENT_STATE_EXIT);
- raw_spin_unlock_irq(&ctx->lock);
+ perf_remove_from_context(event, detach_flags | DETACH_EXIT);
/*
* Child events can be freed.
On Fri, Dec 20, 2024 at 04:28:10PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 20, 2024 at 04:14:14PM +0100, Peter Zijlstra wrote:
> > On Fri, Dec 20, 2024 at 02:05:39PM +0000, Yeoreum Yun wrote:
>
> > > the answer is in the perf_event_exit_event()'s
> > > perf_remove_from_context(). in here
> > > event_sched_out() is called via __perf_remove_from_context()
> > > So above case, the enable time is fixed in here.
> >
> > OK, how's this then?
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 065f9188b44a..96a00034b26f 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2406,7 +2406,8 @@ ctx_time_update_event(struct perf_event_context *ctx, struct perf_event *event)
>
> #define DETACH_GROUP 0x01UL
> #define DETACH_CHILD 0x02UL
> -#define DETACH_DEAD 0x04UL
> +#define DETACH_EXIT 0x04UL
> +#define DETACH_DEAD 0x048L
>
> /*
> * Cross CPU call to remove a performance event
> @@ -2421,6 +2422,7 @@ __perf_remove_from_context(struct perf_event *event,
> void *info)
> {
> struct perf_event_pmu_context *pmu_ctx = event->pmu_ctx;
> + enum perf_event_state state = PERF_EVENT_STATE_OFF;
> unsigned long flags = (unsigned long)info;
>
> ctx_time_update(cpuctx, ctx);
> @@ -2429,16 +2431,20 @@ __perf_remove_from_context(struct perf_event *event,
> * Ensure event_sched_out() switches to OFF, at the very least
> * this avoids raising perf_pending_task() at this time.
> */
> - if (flags & DETACH_DEAD)
> + if (flags & DETACH_EXIT)
> + state = PERF_EVENT_STATE_EXIT;
> + if (flags & DETACH_DEAD) {
> event->pending_disable = 1;
> + state = PERF_EVENT_STATE_DEAD;
> + }
> event_sched_out(event, ctx);
> if (flags & DETACH_GROUP)
> perf_group_detach(event);
> if (flags & DETACH_CHILD)
> perf_child_detach(event);
> list_del_event(event, ctx);
> - if (flags & DETACH_DEAD)
> - event->state = PERF_EVENT_STATE_DEAD;
> +
> + perf_event_set_state(event, min(event->state, state));
>
> if (!pmu_ctx->nr_events) {
> pmu_ctx->rotate_necessary = 0;
> @@ -13371,12 +13377,7 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> mutex_lock(&parent_event->child_mutex);
> }
>
> - perf_remove_from_context(event, detach_flags);
> -
> - raw_spin_lock_irq(&ctx->lock);
> - if (event->state > PERF_EVENT_STATE_EXIT)
> - perf_event_set_state(event, PERF_EVENT_STATE_EXIT);
> - raw_spin_unlock_irq(&ctx->lock);
> + perf_remove_from_context(event, detach_flags | DETACH_EXIT);
>
> /*
> * Child events can be freed.
Yes, I think this looks good to me to solve these error prints :)
Hi Peter,
> > > This doesn't make sense; e1 should never reach event_sched_in() and it
> > > should already be INACTIVE.
> > >
> > > Notable events are created INACTIVE when !attr->disabled.
> >
> > But in perf stat code, it via enable_counter(), so it's set with
> > INACTIVE.
>
> your text above references ctx_sched_in(), what you're now saying is
> __perf_event_enable(); *that* will indeed set INACTIVE, but it will then
> also fail event_filter_match() and never even reschedule.
>
> > > Also, scheduling should not get beyond merge_sched_in()'s
> > > event_filter_match(), which will find the CPU is a mismatch and stop
> > > right there.
> > >
> > > This also means the event (e1) does not get to go on flexible_active
> > > (see below).
> >
> > No, when perf stat command with above, the cpu sets as == -1,
> > So, It doesn't filter out in event_filter_match(). so it enter into
> > merge_sched_in() and get into event_sched_in().
>
> Hurmph, I thought the hybrid stuff used to set CPU.
>
> Let me try and remember how the hybrid stuff works again. Ah
> pmu::filter(), that's called in visit_groups_merge() and should stop
> right there if the PMU doesn't work on that CPU.
>
> Is your hybrid PMu not set up right?
Oh, I missed that about the filter. Thanks for correcting this.
But it is not related to this error.
The error occurs at "task exit" -- the last chunk of enable time fails to
be accounted for the inactive event at exit.
> > > > To address this, update total_enable_time in event_sched_out() when event state
> > > > is PERF_EVENT_STATE_INACTIVE.
> > >
> > > This is a giant jump that I'm not following. Notably ctx_sched_out()
> > > will only iterate pmu_ctx->{pinned,flexible}_active and that list should
> > > only include ACTIVE events.
> > > So how does handling INACTIVE in event_sched_out() even begin to help?
> >
> > the answer is in the perf_event_exit_event()'s
> > perf_remove_from_context(). in here
> > event_sched_out() is called via __perf_remove_from_context()
> > So above case, the enable time is fixed in here.
>
> OK, how's this then?
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 065f9188b44a..d12b402f9751 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2422,6 +2422,7 @@ __perf_remove_from_context(struct perf_event *event,
> {
> struct perf_event_pmu_context *pmu_ctx = event->pmu_ctx;
> unsigned long flags = (unsigned long)info;
> + enum perf_event_state state = PERF_EVENT_STATE_OFF;
>
> ctx_time_update(cpuctx, ctx);
>
> @@ -2438,7 +2439,9 @@ __perf_remove_from_context(struct perf_event *event,
> perf_child_detach(event);
> list_del_event(event, ctx);
> if (flags & DETACH_DEAD)
> - event->state = PERF_EVENT_STATE_DEAD;
> + state = PERF_EVENT_STATE_DEAD;
> +
> + perf_event_set_state(event, state);
>
It works, but what about this?
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 065f9188b44a..71ed8f847b04 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2432,6 +2432,7 @@ __perf_remove_from_context(struct perf_event *event,
if (flags & DETACH_DEAD)
event->pending_disable = 1;
event_sched_out(event, ctx);
+ perf_event_update_time(event);
if (flags & DETACH_GROUP)
perf_group_detach(event);
if (flags & DETACH_CHILD)
Thanks
On Fri, Dec 20, 2024 at 03:26:42PM +0000, Yeoreum Yun wrote:
> > OK, how's this then?
> >
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 065f9188b44a..d12b402f9751 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -2422,6 +2422,7 @@ __perf_remove_from_context(struct perf_event *event,
> > {
> > struct perf_event_pmu_context *pmu_ctx = event->pmu_ctx;
> > unsigned long flags = (unsigned long)info;
> > + enum perf_event_state state = PERF_EVENT_STATE_OFF;
> >
> > ctx_time_update(cpuctx, ctx);
> >
> > @@ -2438,7 +2439,9 @@ __perf_remove_from_context(struct perf_event *event,
> > perf_child_detach(event);
> > list_del_event(event, ctx);
> > if (flags & DETACH_DEAD)
> > - event->state = PERF_EVENT_STATE_DEAD;
> > + state = PERF_EVENT_STATE_DEAD;
> > +
> > + perf_event_set_state(event, state);
> >
>
> It works. but what about this?
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 065f9188b44a..71ed8f847b04 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2432,6 +2432,7 @@ __perf_remove_from_context(struct perf_event *event,
> if (flags & DETACH_DEAD)
> event->pending_disable = 1;
> event_sched_out(event, ctx);
> + perf_event_update_time(event);
> if (flags & DETACH_GROUP)
> perf_group_detach(event);
> if (flags & DETACH_CHILD)
Well, the rule is that timekeeping is tied to state change. And as you
can see __perf_remove_from_context() violates that.
Probably because someone (probably me) figured that the event would not
be observed again after detach or something like that.
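Concretely, the offender is the direct state write that the patches above
replace -- it never goes through perf_event_set_state(), so the final
delta is never folded into the times:

	if (flags & DETACH_DEAD)
		event->state = PERF_EVENT_STATE_DEAD;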
Hi Peter.
> On Fri, Dec 20, 2024 at 03:26:42PM +0000, Yeoreum Yun wrote:
>
> > > OK, how's this then?
> > >
> > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > index 065f9188b44a..d12b402f9751 100644
> > > --- a/kernel/events/core.c
> > > +++ b/kernel/events/core.c
> > > @@ -2422,6 +2422,7 @@ __perf_remove_from_context(struct perf_event *event,
> > > {
> > > struct perf_event_pmu_context *pmu_ctx = event->pmu_ctx;
> > > unsigned long flags = (unsigned long)info;
> > > + enum perf_event_state state = PERF_EVENT_STATE_OFF;
> > >
> > > ctx_time_update(cpuctx, ctx);
> > >
> > > @@ -2438,7 +2439,9 @@ __perf_remove_from_context(struct perf_event *event,
> > > perf_child_detach(event);
> > > list_del_event(event, ctx);
> > > if (flags & DETACH_DEAD)
> > > - event->state = PERF_EVENT_STATE_DEAD;
> > > + state = PERF_EVENT_STATE_DEAD;
> > > +
> > > + perf_event_set_state(event, state);
> > >
> >
> > It works. but what about this?
> >
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 065f9188b44a..71ed8f847b04 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -2432,6 +2432,7 @@ __perf_remove_from_context(struct perf_event *event,
> > if (flags & DETACH_DEAD)
> > event->pending_disable = 1;
> > event_sched_out(event, ctx);
> > + perf_event_update_time(event);
> > if (flags & DETACH_GROUP)
> > perf_group_detach(event);
> > if (flags & DETACH_CHILD)
>
This patch doesn't work when the event is a child event.
For a parent event, around list_del_event() the total_enable_time is
updated properly by changing the state to PERF_EVENT_STATE_OFF.
However, a child event's total_enable_time is added to the parent before
list_del_event(), so the missing total_enable_time isn't added to the
parent event and the error print still happens.
So I don't think it's possible to update the time with set_state alone;
instead, I think total_enable_time should be updated before the child's
total_enable_time is added to the parent's child_total_enable_time,
like:
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 065f9188b44a..d27717c44924 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -13337,6 +13337,7 @@ static void sync_child_event(struct perf_event *child_event)
}
child_val = perf_event_count(child_event, false);
+ perf_event_update_time(child_event);
/*
* Add back the child's count to the parent's count:
Thanks.
On Fri, Dec 20, 2024 at 04:23:39PM +0000, Yeoreum Yun wrote:
> > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > index 065f9188b44a..71ed8f847b04 100644
> > > --- a/kernel/events/core.c
> > > +++ b/kernel/events/core.c
> > > @@ -2432,6 +2432,7 @@ __perf_remove_from_context(struct perf_event *event,
> > > if (flags & DETACH_DEAD)
> > > event->pending_disable = 1;
> > > event_sched_out(event, ctx);
> > > + perf_event_update_time(event);
> > > if (flags & DETACH_GROUP)
> > > perf_group_detach(event);
> > > if (flags & DETACH_CHILD)
> >
>
> This patch doesn't work when the event is child event.
> In case of parent's event, when you see the list_del_event(),
> the total_enable_time is updated properly by changing state with
> PERF_EVENT_STATE_OFF.
>
> However, child event's total_enable_time is added before list_del_event.
> So, the missing total_enable_time isn't added to parents event and the
> error print happens.
>
> So, I think it wouldn't be possible to update time with set_state.
> instead I think it should update total_enable_time before
> child's total_enable_time is added to parents' child_total_enable_time
>
> like
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 065f9188b44a..d27717c44924 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -13337,6 +13337,7 @@ static void sync_child_event(struct perf_event *child_event)
> }
>
> child_val = perf_event_count(child_event, false);
> + perf_event_update_time(child_event);
>
> /*
> * Add back the child's count to the parent's count:
Well, that again violates the rule that we update time on state change.
AFAICT there is no issue with simply moving the perf_event_set_state()
up a few lines in __perf_remove_from_context().
Notably event_sched_out() will already put us in OFF state; and nothing
after that cares about further states AFAICT.
So isn't the below the simpler solution?
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2438,14 +2438,13 @@ __perf_remove_from_context(struct perf_e
state = PERF_EVENT_STATE_DEAD;
}
event_sched_out(event, ctx);
+ perf_event_set_state(event, min(event->state, state));
if (flags & DETACH_GROUP)
perf_group_detach(event);
if (flags & DETACH_CHILD)
perf_child_detach(event);
list_del_event(event, ctx);
- perf_event_set_state(event, min(event->state, state));
-
if (!pmu_ctx->nr_events) {
pmu_ctx->rotate_necessary = 0;
> > It works. but what about this?
> >
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 065f9188b44a..71ed8f847b04 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -2432,6 +2432,7 @@ __perf_remove_from_context(struct perf_event *event,
> > 	if (flags & DETACH_DEAD)
> > 		event->pending_disable = 1;
> > 	event_sched_out(event, ctx);
> > +	perf_event_update_time(event);
> > 	if (flags & DETACH_GROUP)
> > 		perf_group_detach(event);
> > 	if (flags & DETACH_CHILD)
>
> Well, the rule is that timekeeping is tied to state change. And as you
> can see __perf_remove_from_context() violates that.
>
> Probably because someone (probably me) figured that the event would not
> be observed again after detach or something like that.

Ah, thanks for letting me know about this rule.
I agree with your last suggestion. Could I send the patch again with it?

Thanks