[PATCH] clocksource: timer-tegra186: Enable WDT at probe

Kartik Rajput posted 1 patch 3 months, 1 week ago
There is a newer version of this series
drivers/clocksource/timer-tegra186.c | 42 ++++++++++++++++++++++++++++
1 file changed, 42 insertions(+)
[PATCH] clocksource: timer-tegra186: Enable WDT at probe
Posted by Kartik Rajput 3 months, 1 week ago
Currently, if the system crashes or hangs during kernel boot before
userspace initializes and configures the watchdog timer, then the
watchdog won’t be able to recover the system as it’s not running. This
becomes crucial during an over-the-air update, where if the newly
updated kernel crashes on boot, the watchdog is needed to reset the
device and boot into an alternative system partition. If the watchdog
is disabled in such scenarios, it can lead to the system getting
bricked.

Enable the WDT during driver probe to allow recovery from any crash/hang
seen during early kernel boot. Also, disable interrupts once userspace
starts pinging the watchdog.

Signed-off-by: Kartik Rajput <kkartik@nvidia.com>
---
 drivers/clocksource/timer-tegra186.c | 42 ++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/drivers/clocksource/timer-tegra186.c b/drivers/clocksource/timer-tegra186.c
index e5394f98a02e..59abb5dab8f1 100644
--- a/drivers/clocksource/timer-tegra186.c
+++ b/drivers/clocksource/timer-tegra186.c
@@ -57,6 +57,8 @@
 #define WDTUR 0x00c
 #define  WDTUR_UNLOCK_PATTERN 0x0000c45a
 
+#define WDT_DEFAULT_TIMEOUT 120
+
 struct tegra186_timer_soc {
 	unsigned int num_timers;
 	unsigned int num_wdts;
@@ -74,6 +76,7 @@ struct tegra186_wdt {
 
 	void __iomem *regs;
 	unsigned int index;
+	bool enable_irq;
 	bool locked;
 
 	struct tegra186_tmr *tmr;
@@ -174,6 +177,12 @@ static void tegra186_wdt_enable(struct tegra186_wdt *wdt)
 		value &= ~WDTCR_PERIOD_MASK;
 		value |= WDTCR_PERIOD(1);
 
+		/* configure local interrupt for WDT petting */
+		if (wdt->enable_irq)
+			value |= WDTCR_LOCAL_INT_ENABLE;
+		else
+			value &= ~WDTCR_LOCAL_INT_ENABLE;
+
 		/* enable system POR reset */
 		value |= WDTCR_SYSTEM_POR_RESET_ENABLE;
 
@@ -205,6 +214,10 @@ static int tegra186_wdt_ping(struct watchdog_device *wdd)
 {
 	struct tegra186_wdt *wdt = to_tegra186_wdt(wdd);
 
+	/* Disable WDT interrupt once userspace takes over. */
+	if (wdt->enable_irq)
+		wdt->enable_irq = false;
+
 	tegra186_wdt_disable(wdt);
 	tegra186_wdt_enable(wdt);
 
@@ -315,6 +328,8 @@ static struct tegra186_wdt *tegra186_wdt_create(struct tegra186_timer *tegra,
 	if (value & WDTCR_LOCAL_INT_ENABLE)
 		wdt->locked = true;
 
+	wdt->enable_irq = true;
+
 	source = value & WDTCR_TIMER_SOURCE_MASK;
 
 	wdt->tmr = tegra186_tmr_create(tegra, source);
@@ -339,6 +354,13 @@ static struct tegra186_wdt *tegra186_wdt_create(struct tegra186_timer *tegra,
 		return ERR_PTR(err);
 	}
 
+	/*
+	 * Start the watchdog to recover the system if it crashes before
+	 * userspace initialize the WDT.
+	 */
+	tegra186_wdt_set_timeout(&wdt->base, WDT_DEFAULT_TIMEOUT);
+	tegra186_wdt_start(&wdt->base);
+
 	return wdt;
 }
 
@@ -415,10 +437,21 @@ static int tegra186_timer_usec_init(struct tegra186_timer *tegra)
 	return clocksource_register_hz(&tegra->usec, USEC_PER_SEC);
 }
 
+static irqreturn_t tegra186_timer_irq(int irq, void *data)
+{
+	struct tegra186_timer *tegra = data;
+
+	tegra186_wdt_disable(tegra->wdt);
+	tegra186_wdt_enable(tegra->wdt);
+
+	return IRQ_HANDLED;
+}
+
 static int tegra186_timer_probe(struct platform_device *pdev)
 {
 	struct device *dev = &pdev->dev;
 	struct tegra186_timer *tegra;
+	unsigned int irq;
 	int err;
 
 	tegra = devm_kzalloc(dev, sizeof(*tegra), GFP_KERNEL);
@@ -437,6 +470,15 @@ static int tegra186_timer_probe(struct platform_device *pdev)
 	if (err < 0)
 		return err;
 
+	irq = err;
+
+	err = devm_request_irq(dev, irq, tegra186_timer_irq, 0,
+			       "tegra186-timer", tegra);
+	if (err < 0) {
+		dev_err(dev, "failed to request IRQ#%u: %d\n", irq, err);
+		return err;
+	}
+
 	/* create a watchdog using a preconfigured timer */
 	tegra->wdt = tegra186_wdt_create(tegra, 0);
 	if (IS_ERR(tegra->wdt)) {
-- 
2.43.0

Re: [PATCH] clocksource: timer-tegra186: Enable WDT at probe
Posted by Jon Hunter 3 months ago

On 30/06/2025 12:01, Kartik Rajput wrote:
> Currently, if the system crashes or hangs during kernel boot before
> userspace initializes and configures the watchdog timer, then the
> watchdog won’t be able to recover the system as it’s not running. This
> becomes crucial during an over-the-air update, where if the newly
> updated kernel crashes on boot, the watchdog is needed to reset the
> device and boot into an alternative system partition. If the watchdog
> is disabled in such scenarios, it can lead to the system getting
> bricked.
> 
> Enable the WDT during driver probe to allow recovery from any crash/hang
> seen during early kernel boot. Also, disable interrupts once userspace
> starts pinging the watchdog.
> 
> Signed-off-by: Kartik Rajput <kkartik@nvidia.com>
> ---
>   drivers/clocksource/timer-tegra186.c | 42 ++++++++++++++++++++++++++++
>   1 file changed, 42 insertions(+)
> 
> diff --git a/drivers/clocksource/timer-tegra186.c b/drivers/clocksource/timer-tegra186.c
> index e5394f98a02e..59abb5dab8f1 100644
> --- a/drivers/clocksource/timer-tegra186.c
> +++ b/drivers/clocksource/timer-tegra186.c
> @@ -57,6 +57,8 @@
>   #define WDTUR 0x00c
>   #define  WDTUR_UNLOCK_PATTERN 0x0000c45a
>   
> +#define WDT_DEFAULT_TIMEOUT 120
> +
>   struct tegra186_timer_soc {
>   	unsigned int num_timers;
>   	unsigned int num_wdts;
> @@ -74,6 +76,7 @@ struct tegra186_wdt {
>   
>   	void __iomem *regs;
>   	unsigned int index;
> +	bool enable_irq;
>   	bool locked;
>   
>   	struct tegra186_tmr *tmr;
> @@ -174,6 +177,12 @@ static void tegra186_wdt_enable(struct tegra186_wdt *wdt)
>   		value &= ~WDTCR_PERIOD_MASK;
>   		value |= WDTCR_PERIOD(1);
>   
> +		/* configure local interrupt for WDT petting */

It might be a bit clearer if this comment states ...

'If enable_irq is set then enable the watchdog IRQ for kernel petting, 
otherwise userspace is responsible for petting the watchdog.'

> +		if (wdt->enable_irq)
> +			value |= WDTCR_LOCAL_INT_ENABLE;
> +		else
> +			value &= ~WDTCR_LOCAL_INT_ENABLE;
> +
>   		/* enable system POR reset */
>   		value |= WDTCR_SYSTEM_POR_RESET_ENABLE;
>   
> @@ -205,6 +214,10 @@ static int tegra186_wdt_ping(struct watchdog_device *wdd)
>   {
>   	struct tegra186_wdt *wdt = to_tegra186_wdt(wdd);
>   
> +	/* Disable WDT interrupt once userspace takes over. */

Technically userspace is taking over at this point and so we should be 
more assertive here ...

'Disable the watchdog IRQ now userspace is taking over'

> +	if (wdt->enable_irq)
> +		wdt->enable_irq = false;
> +
>   	tegra186_wdt_disable(wdt);
>   	tegra186_wdt_enable(wdt);
>   
> @@ -315,6 +328,8 @@ static struct tegra186_wdt *tegra186_wdt_create(struct tegra186_timer *tegra,
>   	if (value & WDTCR_LOCAL_INT_ENABLE)
>   		wdt->locked = true;
>   
> +	wdt->enable_irq = true;
> +
>   	source = value & WDTCR_TIMER_SOURCE_MASK;
>   
>   	wdt->tmr = tegra186_tmr_create(tegra, source);
> @@ -339,6 +354,13 @@ static struct tegra186_wdt *tegra186_wdt_create(struct tegra186_timer *tegra,
>   		return ERR_PTR(err);
>   	}
>   
> +	/*
> +	 * Start the watchdog to recover the system if it crashes before
> +	 * userspace initialize the WDT.
> +	 */
> +	tegra186_wdt_set_timeout(&wdt->base, WDT_DEFAULT_TIMEOUT);
> +	tegra186_wdt_start(&wdt->base);
> +
>   	return wdt;
>   }
>   
> @@ -415,10 +437,21 @@ static int tegra186_timer_usec_init(struct tegra186_timer *tegra)
>   	return clocksource_register_hz(&tegra->usec, USEC_PER_SEC);
>   }
>   
> +static irqreturn_t tegra186_timer_irq(int irq, void *data)
> +{
> +	struct tegra186_timer *tegra = data;
> +
> +	tegra186_wdt_disable(tegra->wdt);
> +	tegra186_wdt_enable(tegra->wdt);
> +
> +	return IRQ_HANDLED;
> +}
> +
>   static int tegra186_timer_probe(struct platform_device *pdev)
>   {
>   	struct device *dev = &pdev->dev;
>   	struct tegra186_timer *tegra;
> +	unsigned int irq;
>   	int err;
>   
>   	tegra = devm_kzalloc(dev, sizeof(*tegra), GFP_KERNEL);
> @@ -437,6 +470,15 @@ static int tegra186_timer_probe(struct platform_device *pdev)
>   	if (err < 0)
>   		return err;
>   
> +	irq = err;
> +
> +	err = devm_request_irq(dev, irq, tegra186_timer_irq, 0,
> +			       "tegra186-timer", tegra);
> +	if (err < 0) {
> +		dev_err(dev, "failed to request IRQ#%u: %d\n", irq, err);
> +		return err;
> +	}
> +
>   	/* create a watchdog using a preconfigured timer */
>   	tegra->wdt = tegra186_wdt_create(tegra, 0);
>   	if (IS_ERR(tegra->wdt)) {

-- 
nvpublic

Re: [PATCH] clocksource: timer-tegra186: Enable WDT at probe
Posted by Thierry Reding 3 months ago
On Thu, Jul 03, 2025 at 11:36:16AM +0100, Jon Hunter wrote:
> 
> 
> On 30/06/2025 12:01, Kartik Rajput wrote:
> > Currently, if the system crashes or hangs during kernel boot before
> > userspace initializes and configures the watchdog timer, then the
> > watchdog won’t be able to recover the system as it’s not running. This
> > becomes crucial during an over-the-air update, where if the newly
> > updated kernel crashes on boot, the watchdog is needed to reset the
> > device and boot into an alternative system partition. If the watchdog
> > is disabled in such scenarios, it can lead to the system getting
> > bricked.
> > 
> > Enable the WDT during driver probe to allow recovery from any crash/hang
> > seen during early kernel boot. Also, disable interrupts once userspace
> > starts pinging the watchdog.
> > 
> > Signed-off-by: Kartik Rajput <kkartik@nvidia.com>
> > ---
> >   drivers/clocksource/timer-tegra186.c | 42 ++++++++++++++++++++++++++++
> >   1 file changed, 42 insertions(+)
> > 
> > diff --git a/drivers/clocksource/timer-tegra186.c b/drivers/clocksource/timer-tegra186.c
> > index e5394f98a02e..59abb5dab8f1 100644
> > --- a/drivers/clocksource/timer-tegra186.c
> > +++ b/drivers/clocksource/timer-tegra186.c
> > @@ -57,6 +57,8 @@
> >   #define WDTUR 0x00c
> >   #define  WDTUR_UNLOCK_PATTERN 0x0000c45a
> > +#define WDT_DEFAULT_TIMEOUT 120
> > +
> >   struct tegra186_timer_soc {
> >   	unsigned int num_timers;
> >   	unsigned int num_wdts;
> > @@ -74,6 +76,7 @@ struct tegra186_wdt {
> >   	void __iomem *regs;
> >   	unsigned int index;
> > +	bool enable_irq;
> >   	bool locked;
> >   	struct tegra186_tmr *tmr;
> > @@ -174,6 +177,12 @@ static void tegra186_wdt_enable(struct tegra186_wdt *wdt)
> >   		value &= ~WDTCR_PERIOD_MASK;
> >   		value |= WDTCR_PERIOD(1);
> > +		/* configure local interrupt for WDT petting */
> 
> It might be a bit clearer if this comment states ...
> 
> 'If enable_irq is set then enable the watchdog IRQ for kernel petting,
> otherwise userspace is responsible for petting the watchdog.'
> 
> > +		if (wdt->enable_irq)
> > +			value |= WDTCR_LOCAL_INT_ENABLE;
> > +		else
> > +			value &= ~WDTCR_LOCAL_INT_ENABLE;
> > +
> >   		/* enable system POR reset */
> >   		value |= WDTCR_SYSTEM_POR_RESET_ENABLE;
> > @@ -205,6 +214,10 @@ static int tegra186_wdt_ping(struct watchdog_device *wdd)
> >   {
> >   	struct tegra186_wdt *wdt = to_tegra186_wdt(wdd);
> > +	/* Disable WDT interrupt once userspace takes over. */
> 
> Technically userspace is taking over at this point and so we should be more
> assertive here ...
> 
> 'Disable the watchdog IRQ now userspace is taking over'
> 
> > +	if (wdt->enable_irq)
> > +		wdt->enable_irq = false;
> > +
> >   	tegra186_wdt_disable(wdt);
> >   	tegra186_wdt_enable(wdt);
> > @@ -315,6 +328,8 @@ static struct tegra186_wdt *tegra186_wdt_create(struct tegra186_timer *tegra,
> >   	if (value & WDTCR_LOCAL_INT_ENABLE)
> >   		wdt->locked = true;
> > +	wdt->enable_irq = true;
> > +
> >   	source = value & WDTCR_TIMER_SOURCE_MASK;
> >   	wdt->tmr = tegra186_tmr_create(tegra, source);
> > @@ -339,6 +354,13 @@ static struct tegra186_wdt *tegra186_wdt_create(struct tegra186_timer *tegra,
> >   		return ERR_PTR(err);
> >   	}
> > +	/*
> > +	 * Start the watchdog to recover the system if it crashes before
> > +	 * userspace initialize the WDT.
> > +	 */
> > +	tegra186_wdt_set_timeout(&wdt->base, WDT_DEFAULT_TIMEOUT);
> > +	tegra186_wdt_start(&wdt->base);

If we need to stick to the single watchdog, then it's probably better to
explicitly enable the local interrupt here and explicitly disable it
when userspace takes over. That would allow us to avoid tracking this in
the enable_irq state variable.

Thierry
Re: [PATCH] clocksource: timer-tegra186: Enable WDT at probe
Posted by Thierry Reding 3 months, 1 week ago
On Mon, Jun 30, 2025 at 04:31:35PM +0530, Kartik Rajput wrote:
> Currently, if the system crashes or hangs during kernel boot before
> userspace initializes and configures the watchdog timer, then the
> watchdog won’t be able to recover the system as it’s not running. This
> becomes crucial during an over-the-air update, where if the newly
> updated kernel crashes on boot, the watchdog is needed to reset the
> device and boot into an alternative system partition. If the watchdog
> is disabled in such scenarios, it can lead to the system getting
> bricked.
> 
> Enable the WDT during driver probe to allow recovery from any crash/hang
> seen during early kernel boot. Also, disable interrupts once userspace
> starts pinging the watchdog.
> 
> Signed-off-by: Kartik Rajput <kkartik@nvidia.com>
> ---
>  drivers/clocksource/timer-tegra186.c | 42 ++++++++++++++++++++++++++++
>  1 file changed, 42 insertions(+)

This seems dangerous to me. It means that if the operating system
doesn't start some sort of watchdog service in userspace that pings the
watchdog, the system will reboot 120 seconds after the watchdog probe.

I think the convention is to enable the watchdog only via userspace
request, precisely to avoid this kind of unexpected behavior.

If we really want this, maybe a better solution would be to add one more
watchdog that is specifically kernel-controlled? We have 3 watchdog
instances on all these newer chips but currently only use one of them,
so grabbing another that is then entirely under the control of the
kernel should work for these cases.

Thierry
Re: [PATCH] clocksource: timer-tegra186: Enable WDT at probe
Posted by Kartik Rajput 3 months ago
On Thu, 2025-07-03 at 08:55 +0200, Thierry Reding wrote:
> On Mon, Jun 30, 2025 at 04:31:35PM +0530, Kartik Rajput wrote:
> > Currently, if the system crashes or hangs during kernel boot before
> > userspace initializes and configures the watchdog timer, then the
> > watchdog won’t be able to recover the system as it’s not running.
> > This
> > becomes crucial during an over-the-air update, where if the newly
> > updated kernel crashes on boot, the watchdog is needed to reset the
> > device and boot into an alternative system partition. If the
> > watchdog
> > is disabled in such scenarios, it can lead to the system getting
> > bricked.
> > 
> > Enable the WDT during driver probe to allow recovery from any
> > crash/hang
> > seen during early kernel boot. Also, disable interrupts once
> > userspace
> > starts pinging the watchdog.
> > 
> > Signed-off-by: Kartik Rajput <kkartik@nvidia.com>
> > ---
> >  drivers/clocksource/timer-tegra186.c | 42
> > ++++++++++++++++++++++++++++
> >  1 file changed, 42 insertions(+)
> 
> This seems dangerous to me. It means that if the operating system
> doesn't start some sort of watchdog service in userspace that pings
> the
> watchdog, the system will reboot 120 seconds after the watchdog
> probe.
> 

This is not the case. The driver keeps petting the watchdog with the
ISR until userspace takes over. So, the watchdog does not expires even
if userspace takes more time to load.

Thanks & Regards,
Kartik

Re: [PATCH] clocksource: timer-tegra186: Enable WDT at probe
Posted by Jon Hunter 3 months ago

On 03/07/2025 07:55, Thierry Reding wrote:
> On Mon, Jun 30, 2025 at 04:31:35PM +0530, Kartik Rajput wrote:
>> Currently, if the system crashes or hangs during kernel boot before
>> userspace initializes and configures the watchdog timer, then the
>> watchdog won’t be able to recover the system as it’s not running. This
>> becomes crucial during an over-the-air update, where if the newly
>> updated kernel crashes on boot, the watchdog is needed to reset the
>> device and boot into an alternative system partition. If the watchdog
>> is disabled in such scenarios, it can lead to the system getting
>> bricked.
>>
>> Enable the WDT during driver probe to allow recovery from any crash/hang
>> seen during early kernel boot. Also, disable interrupts once userspace
>> starts pinging the watchdog.
>>
>> Signed-off-by: Kartik Rajput <kkartik@nvidia.com>
>> ---
>>   drivers/clocksource/timer-tegra186.c | 42 ++++++++++++++++++++++++++++
>>   1 file changed, 42 insertions(+)
> 
> This seems dangerous to me. It means that if the operating system
> doesn't start some sort of watchdog service in userspace that pings the
> watchdog, the system will reboot 120 seconds after the watchdog probe.


I don't believe that will happen with this change. The kernel will 
continue to pet the watchdog until userspace takes over with this 
change. At least that is my understanding.

Jon

-- 
nvpublic

Re: [PATCH] clocksource: timer-tegra186: Enable WDT at probe
Posted by Thierry Reding 3 months ago
On Thu, Jul 03, 2025 at 08:55:04AM +0100, Jon Hunter wrote:
> 
> 
> On 03/07/2025 07:55, Thierry Reding wrote:
> > On Mon, Jun 30, 2025 at 04:31:35PM +0530, Kartik Rajput wrote:
> > > Currently, if the system crashes or hangs during kernel boot before
> > > userspace initializes and configures the watchdog timer, then the
> > > watchdog won’t be able to recover the system as it’s not running. This
> > > becomes crucial during an over-the-air update, where if the newly
> > > updated kernel crashes on boot, the watchdog is needed to reset the
> > > device and boot into an alternative system partition. If the watchdog
> > > is disabled in such scenarios, it can lead to the system getting
> > > bricked.
> > > 
> > > Enable the WDT during driver probe to allow recovery from any crash/hang
> > > seen during early kernel boot. Also, disable interrupts once userspace
> > > starts pinging the watchdog.
> > > 
> > > Signed-off-by: Kartik Rajput <kkartik@nvidia.com>
> > > ---
> > >   drivers/clocksource/timer-tegra186.c | 42 ++++++++++++++++++++++++++++
> > >   1 file changed, 42 insertions(+)
> > 
> > This seems dangerous to me. It means that if the operating system
> > doesn't start some sort of watchdog service in userspace that pings the
> > watchdog, the system will reboot 120 seconds after the watchdog probe.
> 
> 
> I don't believe that will happen with this change. The kernel will continue
> to pet the watchdog until userspace takes over with this change. At least
> that is my understanding.

Ah yes... I skipped over that IRQ handling bit. However, I think this
still violates the assumptions because the driver will keep petting the
watchdog no matter what, which means that we now have no way of forcing
a reset of the system when userspace hangs. As long as just a tiny part
of the kernel keeps running, the watchdog would keep getting petted and
prevent it from resetting the system.

Using a second watchdog still seems like a more robust alternative. Or
maybe we can find a way to remove the kernel petting once userspace
starts the watchdog.

Thierry
Re: [PATCH] clocksource: timer-tegra186: Enable WDT at probe
Posted by Jon Hunter 3 months ago

On 03/07/2025 11:12, Thierry Reding wrote:
> On Thu, Jul 03, 2025 at 08:55:04AM +0100, Jon Hunter wrote:
>>
>>
>> On 03/07/2025 07:55, Thierry Reding wrote:
>>> On Mon, Jun 30, 2025 at 04:31:35PM +0530, Kartik Rajput wrote:
>>>> Currently, if the system crashes or hangs during kernel boot before
>>>> userspace initializes and configures the watchdog timer, then the
>>>> watchdog won’t be able to recover the system as it’s not running. This
>>>> becomes crucial during an over-the-air update, where if the newly
>>>> updated kernel crashes on boot, the watchdog is needed to reset the
>>>> device and boot into an alternative system partition. If the watchdog
>>>> is disabled in such scenarios, it can lead to the system getting
>>>> bricked.
>>>>
>>>> Enable the WDT during driver probe to allow recovery from any crash/hang
>>>> seen during early kernel boot. Also, disable interrupts once userspace
>>>> starts pinging the watchdog.
>>>>
>>>> Signed-off-by: Kartik Rajput <kkartik@nvidia.com>
>>>> ---
>>>>    drivers/clocksource/timer-tegra186.c | 42 ++++++++++++++++++++++++++++
>>>>    1 file changed, 42 insertions(+)
>>>
>>> This seems dangerous to me. It means that if the operating system
>>> doesn't start some sort of watchdog service in userspace that pings the
>>> watchdog, the system will reboot 120 seconds after the watchdog probe.
>>
>>
>> I don't believe that will happen with this change. The kernel will continue
>> to pet the watchdog until userspace takes over with this change. At least
>> that is my understanding.
> 
> Ah yes... I skipped over that IRQ handling bit. However, I think this
> still violates the assumptions because the driver will keep petting the
> watchdog no matter what, which means that we now have no way of forcing
> a reset of the system when userspace hangs. As long as just a tiny part
> of the kernel keeps running, the watchdog would keep getting petted and
> prevent it from resetting the system.
> 
> Using a second watchdog still seems like a more robust alternative. Or
> maybe we can find a way to remove the kernel petting once userspace
> starts the watchdog.

Once userspace calls the "->ping" callback then, 'enable_irq' is set to 
false and when 'tegra186_wdt_enable()' is called this will disable the 
IRQ so that the kernel no longer pets the watchdog. So this should 
disable kernel petting once userspace is up and running.

Cheers!
Jon

-- 
nvpublic

Re: [PATCH] clocksource: timer-tegra186: Enable WDT at probe
Posted by Thierry Reding 3 months ago
On Thu, Jul 03, 2025 at 11:26:28AM +0100, Jon Hunter wrote:
> 
> 
> On 03/07/2025 11:12, Thierry Reding wrote:
> > On Thu, Jul 03, 2025 at 08:55:04AM +0100, Jon Hunter wrote:
> > > 
> > > 
> > > On 03/07/2025 07:55, Thierry Reding wrote:
> > > > On Mon, Jun 30, 2025 at 04:31:35PM +0530, Kartik Rajput wrote:
> > > > > Currently, if the system crashes or hangs during kernel boot before
> > > > > userspace initializes and configures the watchdog timer, then the
> > > > > watchdog won’t be able to recover the system as it’s not running. This
> > > > > becomes crucial during an over-the-air update, where if the newly
> > > > > updated kernel crashes on boot, the watchdog is needed to reset the
> > > > > device and boot into an alternative system partition. If the watchdog
> > > > > is disabled in such scenarios, it can lead to the system getting
> > > > > bricked.
> > > > > 
> > > > > Enable the WDT during driver probe to allow recovery from any crash/hang
> > > > > seen during early kernel boot. Also, disable interrupts once userspace
> > > > > starts pinging the watchdog.
> > > > > 
> > > > > Signed-off-by: Kartik Rajput <kkartik@nvidia.com>
> > > > > ---
> > > > >    drivers/clocksource/timer-tegra186.c | 42 ++++++++++++++++++++++++++++
> > > > >    1 file changed, 42 insertions(+)
> > > > 
> > > > This seems dangerous to me. It means that if the operating system
> > > > doesn't start some sort of watchdog service in userspace that pings the
> > > > watchdog, the system will reboot 120 seconds after the watchdog probe.
> > > 
> > > 
> > > I don't believe that will happen with this change. The kernel will continue
> > > to pet the watchdog until userspace takes over with this change. At least
> > > that is my understanding.
> > 
> > Ah yes... I skipped over that IRQ handling bit. However, I think this
> > still violates the assumptions because the driver will keep petting the
> > watchdog no matter what, which means that we now have no way of forcing
> > a reset of the system when userspace hangs. As long as just a tiny part
> > of the kernel keeps running, the watchdog would keep getting petted and
> > prevent it from resetting the system.
> > 
> > Using a second watchdog still seems like a more robust alternative. Or
> > maybe we can find a way to remove the kernel petting once userspace
> > starts the watchdog.
> 
> Once userspace calls the "->ping" callback then, 'enable_irq' is set to
> false and when 'tegra186_wdt_enable()' is called this will disable the IRQ
> so that the kernel no longer pets the watchdog. So this should disable
> kernel petting once userspace is up and running.

I clearly can't read code today. Seems generally fine, then, but I'm
actually really enthused now about using a second watchdog for kernel
petting. Since we don't use any of the other two watchdogs, is there
any reason why we can't cleanly separate both use-cases? It would let
us avoid some of these special cases that are not intuitive to
understand.

Thierry
Re: [PATCH] clocksource: timer-tegra186: Enable WDT at probe
Posted by Jon Hunter 3 months ago
On 03/07/2025 14:36, Thierry Reding wrote:

...

> I clearly can't read code today. Seems generally fine, then, but I'm
> actually really enthused now about using a second watchdog for kernel
> petting. Since we don't use any of the other two watchdogs, is there
> any reason why we can't cleanly separate both use-cases? It would let
> us avoid some of these special cases that are not intuitive to
> understand.

The only reason would be if for some reason the other are all allocated 
for other uses outside of the kernel. We are currently only using the 
one for the kernel so that it would mean updating all the device trees 
for all platforms to support this too.

I was also thinking about how do we identify/select if a watchdog is pet 
by the kernel or userspace? I was thinking that the presence of the 
'interrupt' property in device-tree could be used; if present the kernel 
pets and if not assume userspace pets. However, the 'interrupt' property 
is currently marked as required and not optional.

-- 
nvpublic
Re: [PATCH] clocksource: timer-tegra186: Enable WDT at probe
Posted by Thierry Reding 3 months ago
On Thu, Jul 03, 2025 at 03:11:57PM +0100, Jon Hunter wrote:
> 
> On 03/07/2025 14:36, Thierry Reding wrote:
> 
> ...
> 
> > I clearly can't read code today. Seems generally fine, then, but I'm
> > actually really enthused now about using a second watchdog for kernel
> > petting. Since we don't use any of the other two watchdogs, is there
> > any reason why we can't cleanly separate both use-cases? It would let
> > us avoid some of these special cases that are not intuitive to
> > understand.
> 
> The only reason would be if for some reason the other are all allocated for
> other uses outside of the kernel. We are currently only using the one for
> the kernel so that it would mean updating all the device trees for all
> platforms to support this too.
> 
> I was also thinking about how do we identify/select if a watchdog is pet by
> the kernel or userspace? I was thinking that the presence of the 'interrupt'
> property in device-tree could be used; if present the kernel pets and if not
> assume userspace pets. However, the 'interrupt' property is currently marked
> as required and not optional.

The other two instances are part of the TKE block, too. It should be as
simple as doing something like this:

	tegra->wdt_kernel = tegra186_wdt_create(tegra, 1);

and using that instead of tegra->wdt.

Thierry
Re: [PATCH] clocksource: timer-tegra186: Enable WDT at probe
Posted by Jon Hunter 3 months ago
On 03/07/2025 15:19, Thierry Reding wrote:
> On Thu, Jul 03, 2025 at 03:11:57PM +0100, Jon Hunter wrote:
>>
>> On 03/07/2025 14:36, Thierry Reding wrote:
>>
>> ...
>>
>>> I clearly can't read code today. Seems generally fine, then, but I'm
>>> actually really enthused now about using a second watchdog for kernel
>>> petting. Since we don't use any of the other two watchdogs, is there
>>> any reason why we can't cleanly separate both use-cases? It would let
>>> us avoid some of these special cases that are not intuitive to
>>> understand.
>>
>> The only reason would be if for some reason the other are all allocated for
>> other uses outside of the kernel. We are currently only using the one for
>> the kernel so that it would mean updating all the device trees for all
>> platforms to support this too.
>>
>> I was also thinking about how do we identify/select if a watchdog is pet by
>> the kernel or userspace? I was thinking that the presence of the 'interrupt'
>> property in device-tree could be used; if present the kernel pets and if not
>> assume userspace pets. However, the 'interrupt' property is currently marked
>> as required and not optional.
> 
> The other two instances are part of the TKE block, too. It should be as
> simple as doing something like this:
> 
> 	tegra->wdt_kernel = tegra186_wdt_create(tegra, 1);
> 
> and using that instead of tegra->wdt.


OK, well that is much simpler than what I was thinking!

Jon

-- 
nvpublic