[PATCH] mailbox: pcc: Fix can't clear level interrupt of type3 in cornor case

Huisong Li posted 1 patch 9 months, 3 weeks ago
drivers/mailbox/pcc.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
[PATCH] mailbox: pcc: Fix can't clear level interrupt of type3 in cornor case
Posted by Huisong Li 9 months, 3 weeks ago
The mbox_chan_received_data() will call Rx callback of mbox client driver
using type3 to set the flag of command completion. Then driver can continue
to do something like sending a new command. In this case, the rest of the
interrupt handler function may be concurrent with pcc_send_data().

The 'chan_in_use' flag of a channel is true after sending a command. And
the flag of the new command may be cleared by the running interrupt handler
in cornor case. As a result, the interrupt being level triggered can't be
cleared in pcc_mbox_irq() and it will be disabled after the number of
handled times exceeds the specified value. The error log is as follows:

[519082.811553] kunpeng_hccs HISI04B2:00: PCC command executed timeout!
[519082.828532] kunpeng_hccs HISI04B2:00: get port link status info failed, ret = -110.
[519082.833438] irq 13: nobody cared (try booting with the "irqpoll" option)
[519082.844622] CPU: 304 PID: 15206 Comm: systemd-journal Kdump: loaded Tainted: G           OE     5.10.0 #5
[519082.854959] Hardware name: To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M., BIOS Nezha B800 V3.1.0 01/02/2024
[519082.867467] Call trace:
[519082.870709]  dump_backtrace+0x0/0x210
[519082.875145]  show_stack+0x1c/0x2c
[519082.879240]  dump_stack+0xec/0x130
[519082.883421]  __report_bad_irq+0x50/0x190
[519082.888122]  note_interrupt+0x1e4/0x260
[519082.892740]  handle_irq_event+0x144/0x17c
[519082.897519]  handle_fasteoi_irq+0xd0/0x240
[519082.902386]  __handle_domain_irq+0x80/0xf0
[519082.907255]  gic_handle_irq+0x74/0x2d0
[519082.911774]  el1_irq+0xbc/0x140
[519082.915698]  mnt_clone_write+0x0/0x70
[519082.920131]  file_update_time+0xcc/0x160
[519082.924832]  fault_dirty_shared_page+0xe8/0x150
[519082.930133]  do_shared_fault+0x80/0x1d0
[519082.934737]  do_fault+0x118/0x1a4
[519082.938821]  handle_pte_fault+0x154/0x230
[519082.943600]  __handle_mm_fault+0x1ac/0x390
[519082.948465]  handle_mm_fault+0xf0/0x250
[519082.953075]  do_page_fault+0x184/0x454
[519082.957595]  do_translation_fault+0xac/0xd4
[519082.962555]  do_mem_abort+0x44/0xb4
[519082.966817]  el0_da+0x40/0x74
[519082.970554]  el0_sync_handler+0x60/0xb4
[519082.975160]  el0_sync+0x168/0x180
[519082.979243] handlers:
[519082.982300] [<0000000039882697>] pcc_mbox_irq
[519082.987433] Disabling IRQ #13

To solve this issue, pcc_mbox_irq() clear 'chann_in_use' flag immediately
after clearing interrupt ack register.

Signed-off-by: Huisong Li <lihuisong@huawei.com>
---
 drivers/mailbox/pcc.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/mailbox/pcc.c b/drivers/mailbox/pcc.c
index 82102a4c5d68..077ff98366cb 100644
--- a/drivers/mailbox/pcc.c
+++ b/drivers/mailbox/pcc.c
@@ -333,10 +333,20 @@ static irqreturn_t pcc_mbox_irq(int irq, void *p)
 	if (pcc_chan_reg_read_modify_write(&pchan->plat_irq_ack))
 		return IRQ_NONE;
 
+	/*
+	 * The mbox_chan_received_data() will call Rx callback of mbox
+	 * client driver using type3 to set the flag of command completion.
+	 * Then driver can continue to do something like sending a new
+	 * command. In this case, the rest of the interrupt handler
+	 * function may be concurrent with pcc_send_data().
+	 * To avoid the 'chan_in_use' flag of new command being cleared by
+	 * interrupt handler, clear this flag immediately after clearing
+	 * interrupt ack register.
+	 */
+	pchan->chan_in_use = false;
 	mbox_chan_received_data(chan, NULL);
 
 	check_and_ack(pchan, chan);
-	pchan->chan_in_use = false;
 
 	return IRQ_HANDLED;
 }
-- 
2.22.0
Re: [PATCH] mailbox: pcc: Fix can't clear level interrupt of type3 in cornor case
Posted by Sudeep Holla 9 months, 3 weeks ago
On Thu, Feb 27, 2025 at 03:23:41PM +0800, Huisong Li wrote:
> The mbox_chan_received_data() will call Rx callback of mbox client driver
> using type3 to set the flag of command completion. Then driver can continue
> to do something like sending a new command. In this case, the rest of the
> interrupt handler function may be concurrent with pcc_send_data().
>

Understood and valid issue/bug.

> The 'chan_in_use' flag of a channel is true after sending a command. And
> the flag of the new command may be cleared by the running interrupt handler
> in cornor case. As a result, the interrupt being level triggered can't be
> cleared in pcc_mbox_irq() and it will be disabled after the number of
> handled times exceeds the specified value. The error log is as follows:
>
> [519082.811553] kunpeng_hccs HISI04B2:00: PCC command executed timeout!
   ^^^^
These timestamps are useless, needs to be dropped.

> [519082.828532] kunpeng_hccs HISI04B2:00: get port link status info failed, ret = -110.
> [519082.833438] irq 13: nobody cared (try booting with the "irqpoll" option)
> [519082.844622] CPU: 304 PID: 15206 Comm: systemd-journal Kdump: loaded Tainted: G           OE     5.10.0 #5
> [519082.854959] Hardware name: To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M., BIOS Nezha B800 V3.1.0 01/02/2024

"To be filled by O.E.M." interesting. Either as silicon vendor, some prefer
to leave it this way to ensure the O.E.M fill them correctly or the firmware
engineers are not bothered to get these right as nothing breaks without
these.

Anyways, good example of what not to have in the products, as it is completely
useless.

[...]

>
> To solve this issue, pcc_mbox_irq() clear 'chann_in_use' flag immediately
> after clearing interrupt ack register.
>

This may be correct way of fixing the issue here, but I am questioning the
existence of this flag now. I have some rework on this file. I will pick
this up if I think this is needed as it may conflict with my changes or
we will drop the flag completely. Give a week or so, will post the changes
and we can take it from there.

--
Regards,
Sudeep
Re: [PATCH] mailbox: pcc: Fix can't clear level interrupt of type3 in cornor case
Posted by lihuisong (C) 9 months, 3 weeks ago
在 2025/3/1 0:32, Sudeep Holla 写道:
> On Thu, Feb 27, 2025 at 03:23:41PM +0800, Huisong Li wrote:
>> The mbox_chan_received_data() will call Rx callback of mbox client driver
>> using type3 to set the flag of command completion. Then driver can continue
>> to do something like sending a new command. In this case, the rest of the
>> interrupt handler function may be concurrent with pcc_send_data().
>>
> Understood and valid issue/bug.
>
>> The 'chan_in_use' flag of a channel is true after sending a command. And
>> the flag of the new command may be cleared by the running interrupt handler
>> in cornor case. As a result, the interrupt being level triggered can't be
>> cleared in pcc_mbox_irq() and it will be disabled after the number of
>> handled times exceeds the specified value. The error log is as follows:
>>
>> [519082.811553] kunpeng_hccs HISI04B2:00: PCC command executed timeout!
>     ^^^^
> These timestamps are useless, needs to be dropped.
Got it. Thanks.
>
>> [519082.828532] kunpeng_hccs HISI04B2:00: get port link status info failed, ret = -110.
>> [519082.833438] irq 13: nobody cared (try booting with the "irqpoll" option)
>> [519082.844622] CPU: 304 PID: 15206 Comm: systemd-journal Kdump: loaded Tainted: G           OE     5.10.0 #5
>> [519082.854959] Hardware name: To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M., BIOS Nezha B800 V3.1.0 01/02/2024
> "To be filled by O.E.M." interesting. Either as silicon vendor, some prefer
> to leave it this way to ensure the O.E.M fill them correctly or the firmware
> engineers are not bothered to get these right as nothing breaks without
> these.
>
> Anyways, good example of what not to have in the products, as it is completely
> useless.
>
> [...]
Thanks for pointing it out.
Please ignore this. will drop the line.
>
>> To solve this issue, pcc_mbox_irq() clear 'chann_in_use' flag immediately
>> after clearing interrupt ack register.
>>
> This may be correct way of fixing the issue here, but I am questioning the
> existence of this flag now. I have some rework on this file. I will pick
This flag is for shared interrupt case on type3. please see:
3db174e478cb ("mailbox: pcc: Support shared interrupt for multiple 
subspaces")

We may need to fix it first before your refactoring this file. After 
all, it's an issue.
A little modification is more easily to backport and merge.
If it's ok for you, I'll update this commit log.
> this up if I think this is needed as it may conflict with my changes or
> we will drop the flag completely. Give a week or so, will post the changes
> and we can take it from there.
>
Looking forward to your rework.


/Huisong

>
> .
Re: [PATCH] mailbox: pcc: Fix can't clear level interrupt of type3 in cornor case
Posted by Sudeep Holla 9 months, 3 weeks ago
On Sun, Mar 02, 2025 at 03:30:01PM +0800, lihuisong (C) wrote:
>
> 在 2025/3/1 0:32, Sudeep Holla 写道:

[...]

> > This may be correct way of fixing the issue here, but I am questioning the
> > existence of this flag now. I have some rework on this file. I will pick
> This flag is for shared interrupt case on type3. please see:
> 3db174e478cb ("mailbox: pcc: Support shared interrupt for multiple
> subspaces")
>

Yes, I looked at all the history of this patch and I saw I had suggested
it. So it took a while to recall why it is a must. I was reworking some
of the recent change added which I couldn't review and got merged. I was
convinced for a short period this flag is not needed but I was wrong.

I will repost the patch with minor updated to the commit message as part
of my series soon to avoid any conflicts.

--
Regards,
Sudeep
Re: [PATCH] mailbox: pcc: Fix can't clear level interrupt of type3 in cornor case
Posted by lihuisong (C) 9 months, 2 weeks ago
在 2025/3/2 22:47, Sudeep Holla 写道:
> On Sun, Mar 02, 2025 at 03:30:01PM +0800, lihuisong (C) wrote:
>> 在 2025/3/1 0:32, Sudeep Holla 写道:
> [...]
>
>>> This may be correct way of fixing the issue here, but I am questioning the
>>> existence of this flag now. I have some rework on this file. I will pick
>> This flag is for shared interrupt case on type3. please see:
>> 3db174e478cb ("mailbox: pcc: Support shared interrupt for multiple
>> subspaces")
>>
> Yes, I looked at all the history of this patch and I saw I had suggested
> it. So it took a while to recall why it is a must. I was reworking some
> of the recent change added which I couldn't review and got merged. I was
I konw and understand.
> convinced for a short period this flag is not needed but I was wrong.
>
> I will repost the patch with minor updated to the commit message as part
> of my series soon to avoid any conflicts.
>
ok, looking forward to your series.
> .