[PATCH V2] hrtimer: Check running timer state

Enlin Mu posted 1 patch 2 months, 3 weeks ago
kernel/time/hrtimer.c | 1 +
1 file changed, 1 insertion(+)
[PATCH V2] hrtimer: Check running timer state
Posted by Enlin Mu 2 months, 3 weeks ago
When the running timer is not NULL, print debugging information.
The test code is roughly as follows:

static struct hrtimer serial_timer;
enum hrtimer_restart serial_timer_handler(struct hrtimer * timer)
{
	local_irq_disable();
	......
	do_someting();
	copy_data_with_dma();
	......
	hrtimer_forward_now(*serial_timer, ns_to_ktime(1000*2000));
	local_irq_enable();
	return HRTIMER_RESTART;
}

static int serial_start(struct uart_port *port)
{
	......
	......
	hrtime_init(&serial_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	ktime = no_to_ktime(1000*2000);
	serial_timer.function = serial_timer_handler;
	hrtimer_start(&serial_timer, ktime, HRTIMER_MODE_REL);
	......
	return 0;
}

static void serial_shutdown(struct uart_port *port)
{
	......
	hrtimer_cancle(&serial_timer);
	......
	serial_release_dma(port);
	......
}

Kernel panic - not syning: Asynchronous SError Interrupt
CPU: 7 PID: 7082 Comm: binder:542_1
Call trace:
    dump_backtrace+0xec/0x138
    show_stack+0x18/0x28
    show_stack_lvl+0x60/0x84
    dump_stack+0x18/0x24
    panic+0x164/0x37c
    nmi_panic+0x8c/0x90
    arm64_serror_panic+0x6c/0x94
    do_serror+0xbc/0xc8
    el1h_64_error_handler+0x34/0x4c
    el1h_64_error+0x68/0x6c
    readl+0x44/0x1dl
    dma_resume+0x34/0x68
    serial_timer_handler+0x9c/0x38c
    __hrtimer_run_queues+0x1f0/0x400
    hrtimer_interrupt+0xdc/0x39c
    arch_timer_hanler_phys+0x50/0x64
    handler_percpu_devid_irq+0xb4/0x258
    generic_handle_domain_irq+0x58/0x80
    gic_handle_irq+0x4c/0x114
    call_on_irq_stack+0x3c/0x74
    do_interrupt_handler+0x4c/0x84
    el1_interrupt+0x34/0x58
    .......

The cpu6 canceled serial_timer and released dma,but the
serial_timer still ran many times on CPU7 until a panic occurred.
The reason for the panic is that serial_timer accessed the
released dma,But the serial_timer had been canceled for
some time now on cpu6.
The cpu6 can successfully cancel the serial_timer because the
running timer has changed and it is another timer(such as hrtimer_usb).
When the serial_timer is enable to interrupt, the next hrtimer
(such as hrtimer_usb) on cpu7 preempts the return of ther serial_timer,
causing a change in the running timer.

Signed-off-by: Enlin Mu <enlin.mu@linux.dev>
Signed-off-by: Enlin Mu <enlin.mu@unisoc.com>
Signed-off-by: Enlin Mu <enlin.mu@linux.dev>
---
 kernel/time/hrtimer.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 88aa062b8a55..b7f00c39dd97 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1743,6 +1743,7 @@ static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
 	lockdep_assert_held(&cpu_base->lock);
 
 	debug_deactivate(timer);
+	WARN_ON_ONCE(base->running);
 	base->running = timer;
 
 	/*
-- 
2.39.5
Re: [PATCH V2] hrtimer: Check running timer state
Posted by Thomas Gleixner 2 months, 3 weeks ago
On Fri, Nov 14 2025 at 19:42, Enlin Mu wrote:
> When the running timer is not NULL, print debugging information.

What for?

> The test code is roughly as follows:
>
> static struct hrtimer serial_timer;
> enum hrtimer_restart serial_timer_handler(struct hrtimer * timer)
> {
> 	local_irq_disable();

What is this for?

> 	......
> 	do_someting();
> 	copy_data_with_dma();
> 	......
> 	hrtimer_forward_now(*serial_timer, ns_to_ktime(1000*2000));
> 	local_irq_enable();

And this?

> 	return HRTIMER_RESTART;
> }
>
> static int serial_start(struct uart_port *port)
> {
> 	......
> 	......
> 	hrtime_init(&serial_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);

That function does not exist.

> 	ktime = no_to_ktime(1000*2000);
> 	serial_timer.function = serial_timer_handler;
> 	hrtimer_start(&serial_timer, ktime, HRTIMER_MODE_REL);
> 	......
> 	return 0;
> }
>
> static void serial_shutdown(struct uart_port *port)
> {
> 	......
> 	hrtimer_cancle(&serial_timer);
> 	......
> 	serial_release_dma(port);
> 	......
> }

> The cpu6 canceled serial_timer and released dma,but the
> serial_timer still ran many times on CPU7 until a panic occurred.
> The reason for the panic is that serial_timer accessed the
> released dma,But the serial_timer had been canceled for
> some time now on cpu6.

You still fail to explain how the timer can still run after being
canceled.

> The cpu6 can successfully cancel the serial_timer because the
> running timer has changed and it is another timer(such as
> hrtimer_usb).

After that the timer _cannot_ be running anymore unless some other code
re-arms it afterwards.

> When the serial_timer is enable to interrupt, the next hrtimer
> (such as hrtimer_usb) on cpu7 preempts the return of ther serial_timer,
> causing a change in the running timer.

Then fix your timer callback. The callback is invoked in hard interrupt
context and the callback enables interrupts, which is a NONO. You
clearly never ran your code with lockdep enabled. It would have told you
so.

> Signed-off-by: Enlin Mu <enlin.mu@linux.dev>
> Signed-off-by: Enlin Mu <enlin.mu@unisoc.com>
> Signed-off-by: Enlin Mu <enlin.mu@linux.dev>

Interesting Signed-off-by chain. Seems you're co-developing this patch
with your Alter ego.

Thanks,

        tglx
Re: [PATCH V2] hrtimer: Check running timer state
Posted by enlin.mu 2 months, 3 weeks ago

On 2025/11/15 02:34, Thomas Gleixner wrote:
> On Fri, Nov 14 2025 at 19:42, Enlin Mu wrote:
>> When the running timer is not NULL, print debugging information.
> 
> What for?
> 
>> The test code is roughly as follows:
>>
>> static struct hrtimer serial_timer;
>> enum hrtimer_restart serial_timer_handler(struct hrtimer * timer)
>> {
>> 	local_irq_disable();
> 
> What is this for?
Hi Thomas

This code comes from my colleauge who is not familiar with the
running machanism of hrtimer, which is why there is this logical error.

> 
>> 	......
>> 	do_someting();
>> 	copy_data_with_dma();
>> 	......
>> 	hrtimer_forward_now(*serial_timer, ns_to_ktime(1000*2000));
>> 	local_irq_enable();
> 
> And this?
> 
>> 	return HRTIMER_RESTART;
>> }
>>
>> static int serial_start(struct uart_port *port)
>> {
>> 	......
>> 	......
>> 	hrtime_init(&serial_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> 
> That function does not exist.
> 
>> 	ktime = no_to_ktime(1000*2000);
>> 	serial_timer.function = serial_timer_handler;
>> 	hrtimer_start(&serial_timer, ktime, HRTIMER_MODE_REL);
>> 	......
>> 	return 0;
>> }
>>
>> static void serial_shutdown(struct uart_port *port)
>> {
>> 	......
>> 	hrtimer_cancle(&serial_timer);
>> 	......
>> 	serial_release_dma(port);
>> 	......
>> }
> 
>> The cpu6 canceled serial_timer and released dma,but the
>> serial_timer still ran many times on CPU7 until a panic occurred.
>> The reason for the panic is that serial_timer accessed the
>> released dma,But the serial_timer had been canceled for
>> some time now on cpu6.
> 
> You still fail to explain how the timer can still run after being
> canceled.

Hi Thomas

The serial_timer on cpu6 has not been properly cancelled.
The serial_timer is once again entered into the active queue on cpu7.
---------------------------
on cpu7

step 1:  preempted by hrtimer_usb

static void __run_hrtimer(.....)
{
	......
	base->running = timer;   //timer is serial_timer;

	.....
	.....
	restart = fn(timer);   // fn is serial_timer_handler
	....  ==> irq is enable.  serial_timer is preempted by hrtimer_usb
	....
}


step 2: hrtimer_usb return

static void __run_hrtimer(.....)
{
	......
	base->running = timer;   //timer is hrtimer_usb;

	.....
	.....
	restart = fn(timer);
	....
}


step 3: serial_timer continues to execute

static void __run_hrtimer(.....)
{
	......
	 (restart != HRTIMER_NORESTART &&
		timer->state & HRTIMER_STATE_ENQUEUED))
			(timer, base, HRTIMER_MODE_ABS);  //timer is serial_timer

}

step 4: serial_timer run again.
	.......

........

step n: serial_timer run again.
	.......

step n+1: serial_timer run again.
	.......
	panic from serial_timer;

---------------------------
on cpu6

step 1:
     hrtimer_cancel


step 2:
     hrtimer_try_to_cancel

step 3:
     hrtimer_active return false, because running timer is hrtimer_usb 
on cpu7(step 2 on cpu7)

step 4:
     hrtimer_cancel returns 0

step 5:
     serial_timer is added active queue on cpu 7

---------------------------
Due to English not being my native language, I am unable to
explain this exception with a more complex description, and
can only provide a simple step-by-step explanation.

> 
>> The cpu6 can successfully cancel the serial_timer because the
>> running timer has changed and it is another timer(such as
>> hrtimer_usb).
> 
> After that the timer _cannot_ be running anymore unless some other code
> re-arms it afterwards.
> 
>> When the serial_timer is enable to interrupt, the next hrtimer
>> (such as hrtimer_usb) on cpu7 preempts the return of ther serial_timer,
>> causing a change in the running timer.
> 
> Then fix your timer callback. The callback is invoked in hard interrupt
> context and the callback enables interrupts, which is a NONO. You
> clearly never ran your code with lockdep enabled. It would have told you
> so.
> 
>> Signed-off-by: Enlin Mu <enlin.mu@linux.dev>
>> Signed-off-by: Enlin Mu <enlin.mu@unisoc.com>
>> Signed-off-by: Enlin Mu <enlin.mu@linux.dev>
> 
> Interesting Signed-off-by chain. Seems you're co-developing this patch
> with your Alter ego.
> 
> Thanks,
> 
>          tglx
Re: [PATCH V2] hrtimer: Check running timer state
Posted by Thomas Gleixner 2 months, 3 weeks ago
On Mon, Nov 17 2025 at 07:58, enlin mu wrote:
> On 2025/11/15 02:34, Thomas Gleixner wrote:
> The serial_timer on cpu6 has not been properly cancelled.
> The serial_timer is once again entered into the active queue on cpu7.
> ---------------------------
> on cpu7
>
> step 1:  preempted by hrtimer_usb
>
> static void __run_hrtimer(.....)
> {
> 	......
> 	base->running = timer;   //timer is serial_timer;
>
> 	.....
> 	.....
> 	restart = fn(timer);   // fn is serial_timer_handler
> 	....  ==> irq is enable.  serial_timer is preempted by hrtimer_usb

I told you before that enabling interrupts in the callback is
wrong. That's what needs to be fixed.

Run your experimental code with lockdep enabled. That will tell you
what's wrong.

Thanks,

        tglx