clocksource/drivers/timer-rtl-otto: enhancements

[PATCH 1/4] clocksource/drivers/timer-rtl-otto: work around dying timers

Posted by Markus Stockhausen 6 months, 1 week ago

The OpenWrt distribution has switched from longterm kernel 6.6 to
6.12. Reports show that devices with the Realtek Otto switch platform
hang during operation and are rebooted by the watchdog. After ruling
out other possible causes, the Otto timer is to blame. The platform
currently consists of 4 targets with different hardware revisions.
It is not 100% clear which devices and revisions are affected.

Analysis shows:

A more aggressive sched/deadline handling leads to more timer starts
with small intervals. This increases the chance of hitting the bug. See
https://marc.info/?l=linux-kernel&m=175276556023276&w=2

Looking for the root cause, a hardware limitation was found on some
devices. There is a small chance that a timer ends without firing an
interrupt if it is reprogrammed within the 5us before its expiration
time. Work around this issue by introducing a bounce() function. It
restarts the timer directly before the normal restart functions as
follows:

- Stop the timer
- Restart the timer with a slow frequency
- The target time will be >5us away
- The subsequent normal restart is outside the critical window

Downstream has already tested and confirmed a patch. See
https://github.com/openwrt/openwrt/pull/19468
https://forum.openwrt.org/t/support-for-rtl838x-based-managed-switches/57875/3788

Tested-by: Stephen Howell <howels@allthatwemight.be>
Tested-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
---
 drivers/clocksource/timer-rtl-otto.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/clocksource/timer-rtl-otto.c b/drivers/clocksource/timer-rtl-otto.c
index 8a3068b36e75..8be45a11fb8b 100644
--- a/drivers/clocksource/timer-rtl-otto.c
+++ b/drivers/clocksource/timer-rtl-otto.c
@@ -38,6 +38,7 @@
 #define RTTM_BIT_COUNT		28
 #define RTTM_MIN_DELTA		8
 #define RTTM_MAX_DELTA		CLOCKSOURCE_MASK(28)
+#define RTTM_MAX_DIVISOR	GENMASK(15, 0)
 
 /*
  * Timers are derived from the LXB clock frequency. Usually this is a fixed
@@ -112,6 +113,22 @@ static irqreturn_t rttm_timer_interrupt(int irq, void *dev_id)
 	return IRQ_HANDLED;
 }
 
+static void rttm_bounce_timer(void __iomem *base, u32 mode)
+{
+	/*
+	 * When a running timer has less than ~5us left, a stop/start sequence
+	 * might fail. While the details are unknown the most evident effect is
+	 * that the subsequent interrupt will not be fired.
+	 *
+	 * As a workaround issue an intermediate restart with a very slow
+	 * frequency of ~3kHz keeping the target counter (>=8). So the follow
+	 * up restart will always be issued outside the critical window.
+	 */
+
+	rttm_disable_timer(base);
+	rttm_enable_timer(base, mode, RTTM_MAX_DIVISOR);
+}
+
 static void rttm_stop_timer(void __iomem *base)
 {
 	rttm_disable_timer(base);
@@ -129,6 +146,7 @@ static int rttm_next_event(unsigned long delta, struct clock_event_device *clkev
 	struct timer_of *to = to_timer_of(clkevt);
 
 	RTTM_DEBUG(to->of_base.base);
+	rttm_bounce_timer(to->of_base.base, RTTM_CTRL_COUNTER);
 	rttm_stop_timer(to->of_base.base);
 	rttm_set_period(to->of_base.base, delta);
 	rttm_start_timer(to, RTTM_CTRL_COUNTER);
@@ -141,6 +159,7 @@ static int rttm_state_oneshot(struct clock_event_device *clkevt)
 	struct timer_of *to = to_timer_of(clkevt);
 
 	RTTM_DEBUG(to->of_base.base);
+	rttm_bounce_timer(to->of_base.base, RTTM_CTRL_COUNTER);
 	rttm_stop_timer(to->of_base.base);
 	rttm_set_period(to->of_base.base, RTTM_TICKS_PER_SEC / HZ);
 	rttm_start_timer(to, RTTM_CTRL_COUNTER);
@@ -153,6 +172,7 @@ static int rttm_state_periodic(struct clock_event_device *clkevt)
 	struct timer_of *to = to_timer_of(clkevt);
 
 	RTTM_DEBUG(to->of_base.base);
+	rttm_bounce_timer(to->of_base.base, RTTM_CTRL_TIMER);
 	rttm_stop_timer(to->of_base.base);
 	rttm_set_period(to->of_base.base, RTTM_TICKS_PER_SEC / HZ);
 	rttm_start_timer(to, RTTM_CTRL_TIMER);
-- 
2.47.0

Re: [PATCH 1/4] clocksource/drivers/timer-rtl-otto: work around dying timers

Posted by Daniel Lezcano 5 months ago

On 04/08/2025 10:03, Markus Stockhausen wrote:
> The OpenWrt distribution has switched from kernel longterm 6.6 to
> 6.12. Reports show that devices with the Realtek Otto switch platform
> die during operation and are rebooted by the watchdog. Sorting out
> other possible reasons the Otto timer is to blame. The platform
> currently consists of 4 targets with different hardware revisions.
> It is not 100% clear which devices and revisions are affected.
> 
> Analysis shows:
> 
> A more aggressive sched/deadline handling leads to more timer starts
> with small intervals. This increases the bug chances. See
> https://marc.info/?l=linux-kernel&m=175276556023276&w=2
> 
> Focusing on the real issue a hardware limitation on some devices was
> found. There is a minimal chance that a timer ends without firing an
> interrupt if it is reprogrammed within the 5us before its expiration
> time.

Is it possible that the timer IRQ flag is reset when setting the new
counter value?

That is: while we are in the code path with interrupts disabled, the timer
expires within these 5us and the IRQ flag is raised; then the driver sets a
new value and the flag is reset automatically, thus losing the current
timer expiration?





-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

AW: [PATCH 1/4] clocksource/drivers/timer-rtl-otto: work around dying timers

Posted by markus.stockhausen@gmx.de 5 months ago

> From: Daniel Lezcano <daniel.lezcano@linaro.org>
> Sent: Wednesday, 10 September 2025 11:03
> 
> On 04/08/2025 10:03, Markus Stockhausen wrote:
> > The OpenWrt distribution has switched from kernel longterm 6.6 to
> > 6.12. Reports show that devices with the Realtek Otto switch platform
> > die during operation and are rebooted by the watchdog. Sorting out
> > other possible reasons the Otto timer is to blame. The platform
> > currently consists of 4 targets with different hardware revisions.
> > It is not 100% clear which devices and revisions are affected.
> > 
> > Analysis shows:
> > 
> > A more aggressive sched/deadline handling leads to more timer starts
> > with small intervals. This increases the bug chances. See
> > https://marc.info/?l=linux-kernel&m=175276556023276&w=2
> > 
> > Focusing on the real issue a hardware limitation on some devices was
> > found. There is a minimal chance that a timer ends without firing an
> > interrupt if it is reprogrammed within the 5us before its expiration
> > time.
>
> Is it possible the timer IRQ flag is reset when setting the new counter 
> value ?
>
> While in the code path with the interrupt disabled, the timer expires in 
> these 5us, the IRQ flag is raised, then the driver sets a new value and 
> this flag is reset automatically, thus losing the current timer expiration ?

Something like this ...

During my analysis I tried a lot of things to identify the situation that
leads to this error, especially just before the reprogramming command:

static inline void rttm_enable_timer(void __iomem *base, u32 mode, u32 divisor)
{
  iowrite32(RTTM_CTRL_ENABLE | mode | divisor, base + RTTM_CTRL);
}

What I tried: 

1. Read out the current (remaining) timer value: In the error cases
this can give any value between 1 (=320ns) and 15 (=4800ns).

2. Check if IRQ flag is already set and IRQ might trigger next. This was 
never the case. 

3. Reorder reprogramming sequence (as far as possible). Only the
double reprogramming helped here.
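
The arithmetic behind item 1 can be sketched as follows. This is only an
illustration, assuming a tick length of 320 ns (as implied by "1 (=320ns)")
and a critical window of roughly 5 us. The helper names are hypothetical
and are not part of the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: one timer tick lasts 320 ns (3.125 MHz tick rate). */
#define TICK_NS            320u
/* Assumption: the critical window covers about the last 5 us before expiry. */
#define CRITICAL_WINDOW_NS 5000u

/* The largest remaining value seen in error cases (15) lies inside the window. */
static_assert(15u * TICK_NS < CRITICAL_WINDOW_NS,
	      "15 ticks must be shorter than the critical window");

/* Hypothetical helper: convert a remaining counter value to nanoseconds. */
static inline uint64_t ticks_to_ns(uint32_t ticks)
{
	return (uint64_t)ticks * TICK_NS;
}

/*
 * Hypothetical helper: true if a running timer with this remaining count
 * would expire within the critical window. A count of 0 is treated as
 * "not running" here.
 */
static inline bool in_critical_window(uint32_t remaining_ticks)
{
	return remaining_ticks > 0 &&
	       ticks_to_ns(remaining_ticks) < CRITICAL_WINDOW_NS;
}
```

With these assumptions the observed remaining values 1..15 all fall inside
the window, while 16 ticks (5120 ns) would already be outside of it.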

So there is nothing we can do to actively identify and work around the
buggy situation. There is some hardware limitation between expiring timers
and reprogramming. In the absence of an erratum, the current bugfix is the
only (and best) solution I have.

Markus

Re: AW: [PATCH 1/4] clocksource/drivers/timer-rtl-otto: work around dying timers

Posted by Daniel Lezcano 5 months ago

On 10/09/2025 12:16, markus.stockhausen@gmx.de wrote:
>> From: Daniel Lezcano <daniel.lezcano@linaro.org>
>> Sent: Wednesday, 10 September 2025 11:03
>>
>> On 04/08/2025 10:03, Markus Stockhausen wrote:
>>> The OpenWrt distribution has switched from kernel longterm 6.6 to
>>> 6.12. Reports show that devices with the Realtek Otto switch platform
>>> die during operation and are rebooted by the watchdog. Sorting out
>>> other possible reasons the Otto timer is to blame. The platform
>>> currently consists of 4 targets with different hardware revisions.
>>> It is not 100% clear which devices and revisions are affected.
>>>
>>> Analysis shows:
>>>
>>> A more aggressive sched/deadline handling leads to more timer starts
>>> with small intervals. This increases the bug chances. See
>>> https://marc.info/?l=linux-kernel&m=175276556023276&w=2
>>>
>>> Focusing on the real issue a hardware limitation on some devices was
>>> found. There is a minimal chance that a timer ends without firing an
>>> interrupt if it is reprogrammed within the 5us before its expiration
>>> time.
>>
>> Is it possible the timer IRQ flag is reset when setting the new counter
>> value ?
>>
>> While in the code path with the interrupt disabled, the timer expires in
>> these 5us, the IRQ flag is raised, then the driver sets a new value and
>> this flag is reset automatically, thus losing the current timer expiration ?
> 
> Something like this ...
> 
> During my analysis I tried a lot of things to identify the situation that
> leads to this error. Especially just before the reprogramming command
> 
> static inline void rttm_enable_timer(void __iomem *base, u32 mode, u32 divisor)
> {
>    iowrite32(RTTM_CTRL_ENABLE | mode | divisor, base + RTTM_CTRL);
> }
> 
> What I tried:
> 
> 1. Read out the current (remaining) timer value: In the error cases
> this can give any value between 1 (=320ns) and 15 (=4800ns).
> 
> 2. Check if IRQ flag is already set and IRQ might trigger next. This was
> never the case.

It would have been interesting, when we are in the buggy time range, to
wait with a delay (5us), check the IRQ flag as the current timer should
have expired, then set the counter and recheck the IRQ flag.


> 3. Reorder reprogramming sequence (as far as possible). Only the
> double reprogramming helped here.
> 
> So nothing we can do to actively identify and work around the buggy
> situation. There is some hardware limitation between expiring timers
> and reprogramming. Due to missing erratum the current bugfix is the
> only (and best) solution I have.
> 
> Markus
> 



AW: AW: [PATCH 1/4] clocksource/drivers/timer-rtl-otto: work around dying timers

Posted by markus.stockhausen@gmx.de 5 months ago

> From: Daniel Lezcano <daniel.lezcano@linaro.org>
> Sent: Wednesday, 10 September 2025 18:39
>
> > What I tried:
> > 
> > 1. Read out the current (remaining) timer value: In the error cases
> > this can give any value between 1 (=320ns) and 15 (=4800ns).
> > 
> > 2. Check if IRQ flag is already set and IRQ might trigger next. This was
> > never the case.
>
> It would have been interesting to check if we are in the time bug range 
> to wait with a delay (5us), check the IRQ flag as the current timer 
> should have expired, then set the counter and recheck the IRQ flag.

It has been 2 months since I dived deep into this case. Finding a
reproducer, adding lightweight logging and finding a solution by
trial and error was really hard. In the end I was happy to have a
fix that was intensively tested.

For some notes see
https://github.com/openwrt/openwrt/pull/19468#issuecomment-3095570297

From what I remember:

- I started on a multithreading SoC and went over to a single
core SoC to reduce side effects during analysis. 

- The timer never died when it was reprogrammed from
an interrupt of a just finished timer. The reason was always
a reprogramming from outside the interrupt->reprogram
call sequence.

- Reprogramming itself always worked fine. A timer with <5us left was
restarted with a timer >5us. The new timer started to count.
No interrupt flag seemed to be magically toggled during this
process. There was no active IRQ notification directly after the
reprogramming. That was what I expected.

- But in rare cases the new timer did not trigger the subsequent
interrupt. I was totally confused that the future interrupt of 
a newly started timer did not work.

Graphically:

- timer run ---+-------------------->|
               | issue stop & start 
               | timer run ------------------>|
                                              | no IRQ here

My conclusion was: if we "kill" a running timer and restart it,
and it then does not fire an interrupt after the newly set time,
something must be broken. The ending timer and the stop/start
sequence (which consists of two register writes) interfere in
some way, whatever that might be.
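
To illustrate why the intermediate "bounce" restart moves the follow-up
restart out of the critical window, here is a small host-side model of the
register write sequence. It is a sketch only: the bit position of the
enable flag, the base clock of 200 MHz, the normal divisor of 64 and the
behaviour of the disable step (writing 0) are assumptions for illustration,
and all fake_* names are hypothetical. Only the maximum divisor
(GENMASK(15, 0)) and the "enable | mode | divisor" write are taken from
the patch.

```c
#include <assert.h>
#include <stdint.h>

#define CTRL_ENABLE    (1u << 28)    /* assumed bit position */
#define MODE_COUNTER   0u            /* assumed mode encoding */
#define MAX_DIVISOR    0xffffu       /* GENMASK(15, 0) from the patch */
#define NORMAL_DIVISOR 64u           /* assumed: 200 MHz / 64 = 3.125 MHz */
#define BASE_CLOCK_HZ  200000000u    /* assumed base clock */

/* Fake timer: remembers the last register values and counts control writes. */
struct fake_timer {
	uint32_t ctrl;
	uint32_t data;
	uint32_t ctrl_writes;
};

static inline void fake_disable(struct fake_timer *t)
{
	t->ctrl = 0;
	t->ctrl_writes++;
}

static inline void fake_enable(struct fake_timer *t, uint32_t mode,
			       uint32_t divisor)
{
	t->ctrl = CTRL_ENABLE | mode | divisor;
	t->ctrl_writes++;
}

/* Intermediate restart with the slowest possible tick rate. */
static inline void fake_bounce(struct fake_timer *t, uint32_t mode)
{
	fake_disable(t);
	fake_enable(t, mode, MAX_DIVISOR);
}

/* Bounce first, then perform the normal stop / set period / start. */
static inline void fake_next_event(struct fake_timer *t, uint32_t delta)
{
	fake_bounce(t, MODE_COUNTER);
	fake_disable(t);
	t->data = delta;
	fake_enable(t, MODE_COUNTER, NORMAL_DIVISOR);
}

/* Tick rate in Hz for a given divisor. */
static inline uint32_t divided_hz(uint32_t divisor)
{
	assert(divisor != 0);
	return BASE_CLOCK_HZ / divisor;
}

/* Duration of `ticks` timer ticks in microseconds, rounded down. */
static inline uint64_t ticks_to_us(uint32_t ticks, uint32_t divisor)
{
	return (uint64_t)ticks * 1000000u * divisor / BASE_CLOCK_HZ;
}
```

Under these assumptions the bounced timer runs at about 3 kHz, so even a
single remaining tick lasts more than 300 us. The normal stop/start that
follows a few register writes later therefore never hits a timer that is
within 5 us of expiring.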

Markus

Re: AW: AW: [PATCH 1/4] clocksource/drivers/timer-rtl-otto: work around dying timers

Posted by Daniel Lezcano 5 months ago

On 10/09/2025 20:16, markus.stockhausen@gmx.de wrote:
>> From: Daniel Lezcano <daniel.lezcano@linaro.org>
>> Sent: Wednesday, 10 September 2025 18:39
>>
>>> What I tried:
>>>
>>> 1. Read out the current (remaining) timer value: In the error cases
>>> this can give any value between 1 (=320ns) and 15 (=4800ns).
>>>
>>> 2. Check if IRQ flag is already set and IRQ might trigger next. This was
>>> never the case.
>>
>> It would have been interesting to check if we are in the time bug range
>> to wait with a delay (5us), check the IRQ flag as the current timer
>> should have expired, then set the counter and recheck the IRQ flag.
> 
> It's been 2 months that I dived deep into this case. Finding a
> reproducer, adding lightweight logging and try&error a solution
> was really hard. In the end I was happy to have a fix that was
> intensively tested.

I understand. No worries, I applied the series; it is in the compilation
batch.

> For some notes see
> https://github.com/openwrt/openwrt/pull/19468#issuecomment-3095570297
> 
>  From what I remember:
> 
> - I started on a multithreading SoC and went over to a single
> core SoC to reduce side effects during analysis.
> 
> - The timer never died when it was reprogrammed from
> an interrupt of a just finished timer. The reason was always
> a reprogramming from outside the interrupt->reprogram
> call sequence.
> 
> - Reprogramming always worked fine. A timer with <5us left, was
> restarted with a timer >5us. The new timer started to count.
> No interrupt flag seemed to be magically toggled during this
> process. There was no active IRQ notification directly after the
> reprogramming. That was how I expected it.
> 
> - But in rare cases the new timer did not trigger the subsequent
> interrupt. I was totally confused that the future interrupt of
> a newly started timer did not work.
> 
> Graphically:
> 
> - timer run ---+-------------------->|
>                 | issue stop & start
>                 | timer run ------------------>|
>                                                | no IRQ here
> 
> Conclusion was for me: If we "kill" a running timer and restart
> it and it will not fire an interrupt after the newly set time,
> then something must be somehow broken. The ending timer and
> the stop/start sequence (that consists of two register writes)
> have some interference. Whatever it might be.

Mmh, I think I initially misunderstood the problem. Thanks for clarifying.


