notifiers: Add oops check in blocking_notifier_call_chain()

[PATCH stable] notifiers: Add oops check in blocking_notifier_call_chain()

Posted by Yi Yang 3 months, 3 weeks ago

hrtimer_interrupt() runs with interrupts disabled (raw_spin_lock_irqsave).
When an oops is triggered in that context, the oops path reaches
blocking_notifier_call_chain(), which calls down_read() and in turn
_cond_resched(), ultimately leading to a hard lockup.

Call Stack:
hrtimer_interrupt//raw_spin_lock_irqsave
__hrtimer_run_queues
page_fault
do_page_fault
bad_area_nosemaphore
no_context
oops_end
bust_spinlocks
unblank_screen
do_unblank_screen
fbcon_blank
fb_notifier_call_chain
blocking_notifier_call_chain
down_read
_cond_resched

If the system is in an oops state, use down_read_trylock instead of a
blocking lock acquisition. If the trylock fails, skip executing the
notifier callbacks to avoid potential deadlocks or unsafe operations
during the oops handling process.

Cc: stable@vger.kernel.org      # 6.6
Fixes: fe9d4f576324 ("Add kernel/notifier.c")
Signed-off-by: Yi Yang <yiyang13@huawei.com>
---
 kernel/notifier.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/kernel/notifier.c b/kernel/notifier.c
index b3ce28f39eb6..ebff2315fac2 100644
--- a/kernel/notifier.c
+++ b/kernel/notifier.c
@@ -384,9 +384,18 @@ int blocking_notifier_call_chain(struct blocking_notifier_head *nh,
 	 * is, we re-check the list after having taken the lock anyway:
 	 */
 	if (rcu_access_pointer(nh->head)) {
-		down_read(&nh->rwsem);
-		ret = notifier_call_chain(&nh->head, val, v, -1, NULL);
-		up_read(&nh->rwsem);
+		if (!oops_in_progress) {
+			down_read(&nh->rwsem);
+			ret = notifier_call_chain(&nh->head, val, v, -1, NULL);
+			up_read(&nh->rwsem);
+		} else {
+			if (down_read_trylock(&nh->rwsem)) {
+				ret = notifier_call_chain(&nh->head, val, v, -1, NULL);
+				up_read(&nh->rwsem);
+			} else {
+				ret = NOTIFY_BAD;
+			}
+		}
 	}
 	return ret;
 }
-- 
2.25.1
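
For illustration only, the trylock-or-skip behaviour described in the
commit message can be modeled in userspace. This is a minimal sketch,
assuming a toy reader/writer lock built on C11 atomics. The names
chain_lock, in_emergency, call_chain() and CHAIN_SKIPPED are hypothetical
stand-ins for nh->rwsem, oops_in_progress, blocking_notifier_call_chain()
and NOTIFY_BAD; none of them are kernel APIs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical return codes; CHAIN_SKIPPED plays the role of NOTIFY_BAD. */
enum { CHAIN_DONE = 0, CHAIN_SKIPPED = -1 };

/* Toy reader/writer lock: -1 = held by a writer, n >= 0 = n readers. */
static atomic_int chain_lock;
static bool in_emergency;	/* plays the role of oops_in_progress */
static int callbacks_run;	/* counts how often the callbacks ran */

static bool read_trylock(void)
{
	int readers = atomic_load(&chain_lock);

	while (readers >= 0) {
		/* A failed CAS reloads 'readers', so the loop re-checks it. */
		if (atomic_compare_exchange_weak(&chain_lock, &readers,
						 readers + 1))
			return true;
	}
	return false;	/* a writer holds the lock */
}

static void read_lock(void)
{
	while (!read_trylock())
		;	/* stand-in for the sleeping that down_read() may do */
}

static void read_unlock(void)
{
	int prev = atomic_fetch_sub(&chain_lock, 1);

	assert(prev > 0);	/* must have been held by a reader */
	(void)prev;
}

static bool write_trylock(void)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&chain_lock, &expected, -1);
}

static void write_unlock(void)
{
	atomic_store(&chain_lock, 0);
}

static int run_callbacks(void)
{
	callbacks_run++;
	return CHAIN_DONE;
}

/* Wait for the lock normally; in emergency mode try once, else skip. */
static int call_chain(void)
{
	int ret;

	if (!in_emergency)
		read_lock();
	else if (!read_trylock())
		return CHAIN_SKIPPED;

	ret = run_callbacks();
	read_unlock();
	return ret;
}
```

In this model, calling call_chain() while a writer holds chain_lock and
in_emergency is set returns CHAIN_SKIPPED immediately instead of waiting.
Once the writer releases the lock, the callbacks run as usual.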

Re: [PATCH stable] notifiers: Add oops check in blocking_notifier_call_chain()

Posted by Andrew Morton 3 months, 3 weeks ago

On Fri, 17 Oct 2025 06:17:40 +0000 Yi Yang <yiyang13@huawei.com> wrote:

> In hrtimer_interrupt(), interrupts are disabled when acquiring a spinlock,
> which subsequently triggers an oops. During the oops call chain,
> blocking_notifier_call_chain() invokes _cond_resched, ultimately leading
> to a hard lockup.
> 
> Call Stack:
> hrtimer_interrupt//raw_spin_lock_irqsave
> __hrtimer_run_queues
> page_fault
> do_page_fault
> bad_area_nosemaphore
> no_context
> oops_end
> bust_spinlocks
> unblank_screen
> do_unblank_screen
> fbcon_blank
> fb_notifier_call_chain
> blocking_notifier_call_chain
> down_read
> _cond_resched

Seems this trace is upside-down relative to what we usually see.

Is the unaltered dmesg output available?

> If the system is in an oops state, use down_read_trylock instead of a
> blocking lock acquisition. If the trylock fails, skip executing the
> notifier callbacks to avoid potential deadlocks or unsafe operations
> during the oops handling process.
> 
> ...
>
> --- a/kernel/notifier.c
> +++ b/kernel/notifier.c
> @@ -384,9 +384,18 @@ int blocking_notifier_call_chain(struct blocking_notifier_head *nh,
>  	 * is, we re-check the list after having taken the lock anyway:
>  	 */
>  	if (rcu_access_pointer(nh->head)) {
> -		down_read(&nh->rwsem);
> -		ret = notifier_call_chain(&nh->head, val, v, -1, NULL);
> -		up_read(&nh->rwsem);
> +		if (!oops_in_progress) {
> +			down_read(&nh->rwsem);
> +			ret = notifier_call_chain(&nh->head, val, v, -1, NULL);
> +			up_read(&nh->rwsem);
> +		} else {
> +			if (down_read_trylock(&nh->rwsem)) {
> +				ret = notifier_call_chain(&nh->head, val, v, -1, NULL);
> +				up_read(&nh->rwsem);
> +			} else {
> +				ret = NOTIFY_BAD;
> +			}
> +		}
>  	}
>  	return ret;

Am I correct in believing that fb_notifier_call_chain() is only ever
called if defined(CONFIG_GUMSTIX_AM200EPD)?

I wonder what that call is for, and if we can simply remove it.

Re: [PATCH stable] notifiers: Add oops check in blocking_notifier_call_chain()

Posted by yiyang (D) 3 months, 2 weeks ago

On 2025/10/18 6:25, Andrew Morton wrote:
> On Fri, 17 Oct 2025 06:17:40 +0000 Yi Yang <yiyang13@huawei.com> wrote:
> 
>> In hrtimer_interrupt(), interrupts are disabled when acquiring a spinlock,
>> which subsequently triggers an oops. During the oops call chain,
>> blocking_notifier_call_chain() invokes _cond_resched, ultimately leading
>> to a hard lockup.
>>
>> Call Stack:
>> hrtimer_interrupt//raw_spin_lock_irqsave
>> __hrtimer_run_queues
>> page_fault
>> do_page_fault
>> bad_area_nosemaphore
>> no_context
>> oops_end
>> bust_spinlocks
>> unblank_screen
>> do_unblank_screen
>> fbcon_blank
>> fb_notifier_call_chain
>> blocking_notifier_call_chain
>> down_read
>> _cond_resched
> 
> Seems this trace is upside-down relative to what we usually see.
> 
> Is the unaltered dmesg output available?
>
Below is an excerpt from the original error message:

  #0[ffff8a317f6c3ac0] __cond_resched at ffffffffa10d29a6
  #1[ffff8a317f6c3ad8] _cond_resched at ffffffffa17292cf
  #2[ffff8a317f6c3ae8] down_read at ffffffffa1728022
  #3[ffff8a317f6c3b00] __blocking_notifier_call_chain at ffffffffa10c5c37
  #4[ffff8a317f6c3b40] blocking_notifier_call_chain at ffffffffa10c5c86
  #5[ffff8a317f6c3b50] fb_notifier_call_chain at ffffffffa13c83eb
  #6[ffff8a317f6c3b60] fb_blank at ffffffffa13c88eb
  #7[ffff8a317f6c3ba0] fbcon_blank at ffffffffa13d4a4b
  #8[ffff8a317f6c3ca0] do_unblank_screen at ffffffffa144cb30
  #9[ffff8a317f6c3cc0] unblank_screen at ffffffffa144cbf0
#10[ffff8a317f6c3ce0] oops_end at ffffffffa172d6d5
#11[ffff8a317f6c3d08] no_context at ffffffffa171cebc
#12[ffff8a317f6c3d58] __bad_area_nosemaphore at ffffffffa171cf53
#13[ffff8a317f6c3da8] bad_area_nosemaphore at ffffffffa171d0c4
#14[ffff8a317f6c3db8] __do_page_fault at ffffffffa17306b0
#15[ffff8a317f6c3e20] do_page_fault at ffffffffa1730895
#16[ffff8a317f6c3e50] page_fault at ffffffffa172c768

>> If the system is in an oops state, use down_read_trylock instead of a
>> blocking lock acquisition. If the trylock fails, skip executing the
>> notifier callbacks to avoid potential deadlocks or unsafe operations
>> during the oops handling process.
>>
>> ...
>>
>> --- a/kernel/notifier.c
>> +++ b/kernel/notifier.c
>> @@ -384,9 +384,18 @@ int blocking_notifier_call_chain(struct blocking_notifier_head *nh,
>>   	 * is, we re-check the list after having taken the lock anyway:
>>   	 */
>>   	if (rcu_access_pointer(nh->head)) {
>> -		down_read(&nh->rwsem);
>> -		ret = notifier_call_chain(&nh->head, val, v, -1, NULL);
>> -		up_read(&nh->rwsem);
>> +		if (!oops_in_progress) {
>> +			down_read(&nh->rwsem);
>> +			ret = notifier_call_chain(&nh->head, val, v, -1, NULL);
>> +			up_read(&nh->rwsem);
>> +		} else {
>> +			if (down_read_trylock(&nh->rwsem)) {
>> +				ret = notifier_call_chain(&nh->head, val, v, -1, NULL);
>> +				up_read(&nh->rwsem);
>> +			} else {
>> +				ret = NOTIFY_BAD;
>> +			}
>> +		}
>>   	}
>>   	return ret;
> 
> Am I correct in believing that fb_notifier_call_chain() is only ever
> called if defined(CONFIG_GUMSTIX_AM200EPD)?
> 
fb_notifier_call_chain() is called from both fb_blank() and
fb_set_var(), so it is not limited to builds with
defined(CONFIG_GUMSTIX_AM200EPD).

> I wonder what that call is for, and if we can simply remove it.

The call on the failing path is
`fb_notifier_call_chain(FB_EVENT_BLANK, &event);`.
It runs the notifier callbacks that handle the FB_EVENT_BLANK event.

The FB_EVENT_BLANK event appears to signal a change in the screen's
blanking state.
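
As a rough picture of what "invoking the notifier chain" means here, the
following is a minimal userspace model, not the kernel implementation.
All names (struct notifier, chain_register(), chain_call(), EVENT_BLANK,
backlight_cb()) are hypothetical and only mimic the shape of the kernel
notifier API: subscribers register a callback, and the caller walks the
list, passing the event code and a data pointer to each one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event and return codes; not the kernel's values. */
enum { EVENT_BLANK = 1, EVENT_OTHER = 2 };
enum { CB_DONE = 0, CB_STOP = 1 };

struct notifier {
	int (*call)(struct notifier *self, unsigned long event, void *data);
	struct notifier *next;
};

static struct notifier *chain_head;

/* Add a subscriber at the front of the singly linked chain. */
static void chain_register(struct notifier *n)
{
	assert(n && n->call);
	n->next = chain_head;
	chain_head = n;
}

/* Call every subscriber in turn until one asks to stop. */
static int chain_call(unsigned long event, void *data)
{
	int ret = CB_DONE;

	for (struct notifier *n = chain_head; n; n = n->next) {
		ret = n->call(n, event, data);
		if (ret == CB_STOP)
			break;
	}
	return ret;
}

/* Example subscriber: ignores everything except EVENT_BLANK. */
static int last_blank_mode = -1;

static int backlight_cb(struct notifier *self, unsigned long event,
			void *data)
{
	(void)self;
	if (event != EVENT_BLANK)
		return CB_DONE;
	last_blank_mode = *(int *)data;	/* record the requested mode */
	return CB_DONE;
}

static struct notifier backlight_notifier = { .call = backlight_cb };
```

After chain_register(&backlight_notifier), a call such as
chain_call(EVENT_BLANK, &mode) reaches backlight_cb(), while other event
codes pass through without changing last_blank_mode.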

Re: [PATCH stable] notifiers: Add oops check in blocking_notifier_call_chain()

Posted by yiyang (D) 2 months, 3 weeks ago

On 2025/10/22 11:36, yiyang (D) wrote:
> On 2025/10/18 6:25, Andrew Morton wrote:
>> On Fri, 17 Oct 2025 06:17:40 +0000 Yi Yang <yiyang13@huawei.com> wrote:
>>
>>> In hrtimer_interrupt(), interrupts are disabled when acquiring a 
>>> spinlock,
>>> which subsequently triggers an oops. During the oops call chain,
>>> blocking_notifier_call_chain() invokes _cond_resched, ultimately leading
>>> to a hard lockup.
>>>
>>> Call Stack:
>>> hrtimer_interrupt//raw_spin_lock_irqsave
>>> __hrtimer_run_queues
>>> page_fault
>>> do_page_fault
>>> bad_area_nosemaphore
>>> no_context
>>> oops_end
>>> bust_spinlocks
>>> unblank_screen
>>> do_unblank_screen
>>> fbcon_blank
>>> fb_notifier_call_chain
>>> blocking_notifier_call_chain
>>> down_read
>>> _cond_resched
>>
>> Seems this trace is upside-down relative to what we usually see.
>>
>> Is the unaltered dmesg output available?
>>
> Below is an excerpt from the original error message:
> 
>   #0[ffff8a317f6c3ac0] __cond_resched at ffffffffa10d29a6
>   #1[ffff8a317f6c3ad8] _cond_resched at ffffffffa17292cf
>   #2[ffff8a317f6c3ae8] down_read at ffffffffa1728022
>   #3[ffff8a317f6c3b00] __blocking_notifier_call_chain at ffffffffa10c5c37
>   #4[ffff8a317f6c3b40] blocking_notifier_call_chain at ffffffffa10c5c86
>   #5[ffff8a317f6c3b50] fb_notifier_call_chain at ffffffffa13c83eb
>   #6[ffff8a317f6c3b60] fb_blank at ffffffffa13c88eb
>   #7[ffff8a317f6c3ba0] fbcon_blank at ffffffffa13d4a4b
>   #8[ffff8a317f6c3ca0] do_unblank_screen at ffffffffa144cb30
>   #9[ffff8a317f6c3cc0] unblank_screen at ffffffffa144cbf0
> #10[ffff8a317f6c3ce0] oops_end at ffffffffa172d6d5
> #11[ffff8a317f6c3d08] no_context at ffffffffa171cebc
> #12[ffff8a317f6c3d58] __bad_area_nosemaphore at ffffffffa171cf53
> #13[ffff8a317f6c3da8] bad_area_nosemaphore at ffffffffa171d0c4
> #14[ffff8a317f6c3db8] __do_page_fault at ffffffffa17306b0
> #15[ffff8a317f6c3e20] do_page_fault at ffffffffa1730895
> #16[ffff8a317f6c3e50] page_fault at ffffffffa172c768
> 
>>> If the system is in an oops state, use down_read_trylock instead of a
>>> blocking lock acquisition. If the trylock fails, skip executing the
>>> notifier callbacks to avoid potential deadlocks or unsafe operations
>>> during the oops handling process.
>>>
>>> ...
>>>
>>> --- a/kernel/notifier.c
>>> +++ b/kernel/notifier.c
>>> @@ -384,9 +384,18 @@ int blocking_notifier_call_chain(struct 
>>> blocking_notifier_head *nh,
>>>        * is, we re-check the list after having taken the lock anyway:
>>>        */
>>>       if (rcu_access_pointer(nh->head)) {
>>> -        down_read(&nh->rwsem);
>>> -        ret = notifier_call_chain(&nh->head, val, v, -1, NULL);
>>> -        up_read(&nh->rwsem);
>>> +        if (!oops_in_progress) {
>>> +            down_read(&nh->rwsem);
>>> +            ret = notifier_call_chain(&nh->head, val, v, -1, NULL);
>>> +            up_read(&nh->rwsem);
>>> +        } else {
>>> +            if (down_read_trylock(&nh->rwsem)) {
>>> +                ret = notifier_call_chain(&nh->head, val, v, -1, NULL);
>>> +                up_read(&nh->rwsem);
>>> +            } else {
>>> +                ret = NOTIFY_BAD;
>>> +            }
>>> +        }
>>>       }
>>>       return ret;
>>
>> Am I correct in believing that fb_notifier_call_chain() is only ever
>> called if defined(CONFIG_GUMSTIX_AM200EPD)?
>>
> fb_notifier_call_chain() is called in both the fb_blank() and 
> fb_set_var() functions, and it is not only called when 
> defined(CONFIG_GUMSTIX_AM200EPD).
>> I wonder what that call is for, and if we can simply remove it.
> The function called when an issue occurs is 
> `fb_notifier_call_chain(FB_EVENT_BLANK, &event);`.
> The purpose of this function is to invoke the notification chain that 
> has registered for the FB_EVENT_BLANK event.
> 
> The FB_EVENT_BLANK event appears to indicate a screen-related state.
> 
Do you think this patch should be merged into the 6.6 stable branch (or
earlier versions)?
Currently, when an oops occurs, the actual panic stack trace is not
printed because the oops path blocks in the notifier chain.