[PATCH] mailbox: pcc: Fix probabilistic command execution timeout

Huisong Li posted 1 patch 2 months ago
drivers/mailbox/pcc.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
[PATCH] mailbox: pcc: Fix probabilistic command execution timeout
Posted by Huisong Li 2 months ago
In some scenarios, PCC command may experience probabilistic timeout.
This is primarily caused by the chan_in_use flag being updated after
ringing the doorbell, coupled with a lack of proper memory barriers
across CPU cores.

On fast platforms, a race condition occurs: the platform processing
completes and triggers an interrupt before the local CPU sets
chan_in_use to true. When the interrupt handler pcc_mbox_irq() runs,
it reads chan_in_use as false and incorrectly ignores the interrupt.

This patch fixes the race by:
1. Moving the chan_in_use update before ringing the doorbell.
2. Using smp_store_release() to ensure the flag update is visible
   to other cores before subsequent hardware or software actions
   are triggered.
3. Using smp_load_acquire() in the interrupt handler to ensure the
   latest flag value is read before deciding to skip the interrupt.

Signed-off-by: Huisong Li <lihuisong@huawei.com>
---
 drivers/mailbox/pcc.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/mailbox/pcc.c b/drivers/mailbox/pcc.c
index 636879ae1db7..f45fccd635b6 100644
--- a/drivers/mailbox/pcc.c
+++ b/drivers/mailbox/pcc.c
@@ -320,8 +320,13 @@ static irqreturn_t pcc_mbox_irq(int irq, void *p)
 	if (pcc_chan_reg_read_modify_write(&pchan->plat_irq_ack))
 		return IRQ_NONE;
 
+	/*
+	 * For Master Subspaces, we must ensure we see the latest chan_in_use
+	 * value updated by pcc_send_data() on another core. smp_load_acquire
+	 * provides the necessary barrier to avoid probabilistic IRQ misses.
+	 */
 	if (pchan->type == ACPI_PCCT_TYPE_EXT_PCC_MASTER_SUBSPACE &&
-	    !pchan->chan_in_use)
+	    !smp_load_acquire(&pchan->chan_in_use))
 		return IRQ_NONE;
 
 	if (!pcc_mbox_cmd_complete_check(pchan))
@@ -336,7 +341,7 @@ static irqreturn_t pcc_mbox_irq(int irq, void *p)
 	 * where the flag is set again to start new transfer. This is
 	 * required to avoid any possible race in updatation of this flag.
 	 */
-	pchan->chan_in_use = false;
+	smp_store_release(&pchan->chan_in_use, false);
 	mbox_chan_received_data(chan, NULL);
 	mbox_chan_txdone(chan, 0);
 
@@ -438,9 +443,16 @@ static int pcc_send_data(struct mbox_chan *chan, void *data)
 	if (ret)
 		return ret;
 
+	/*
+	 * Set chan_in_use before ringing the doorbell. Using smp_store_release
+	 * ensures the flag is visible to the interrupt handler before the
+	 * hardware is actually notified.
+	 */
+	if (pchan->plat_irq > 0)
+		smp_store_release(&pchan->chan_in_use, true);
 	ret = pcc_chan_reg_read_modify_write(&pchan->db);
-	if (!ret && pchan->plat_irq > 0)
-		pchan->chan_in_use = true;
+	if (ret && pchan->plat_irq > 0)
+		smp_store_release(&pchan->chan_in_use, false);
 
 	return ret;
 }
-- 
2.33.0
Re: [PATCH] mailbox: pcc: Fix probabilistic command execution timeout
Posted by Sudeep Holla 1 month, 3 weeks ago
On Fri, Apr 17, 2026 at 11:14:29AM +0800, Huisong Li wrote:
> In some scenarios, PCC command may experience probabilistic timeout.
> This is primarily caused by the chan_in_use flag being updated after
> ringing the doorbell, coupled with a lack of proper memory barriers
> across CPU cores.
> 
> On fast platforms, a race condition occurs: the platform processing
> completes and triggers an interrupt before the local CPU sets
> chan_in_use to true. When the interrupt handler pcc_mbox_irq() runs,
> it reads chan_in_use as false and incorrectly ignores the interrupt.
> 
> This patch fixes the race by:
> 1. Moving the chan_in_use update before ringing the doorbell.
> 2. Using smp_store_release() to ensure the flag update is visible
>    to other cores before subsequent hardware or software actions
>    are triggered.
> 3. Using smp_load_acquire() in the interrupt handler to ensure the
>    latest flag value is read before deciding to skip the interrupt.
> 

Are you seeing the issue on real platforms or you are just reviewing the
code. I would like to test it on the platform I use but I don't have it
handy, so may take some time.

I have added Robbie King who also has helped in testing this PCC driver
in the past.

-- 
Regards,
Sudeep
Re: [PATCH] mailbox: pcc: Fix probabilistic command execution timeout
Posted by Robbie King 1 month, 2 weeks ago
On 4/21/2026 6:05 AM, Sudeep Holla wrote:
> On Fri, Apr 17, 2026 at 11:14:29AM +0800, Huisong Li wrote:
>> In some scenarios, PCC command may experience probabilistic timeout.
>> This is primarily caused by the chan_in_use flag being updated after
>> ringing the doorbell, coupled with a lack of proper memory barriers
>> across CPU cores.
>>
>> On fast platforms, a race condition occurs: the platform processing
>> completes and triggers an interrupt before the local CPU sets
>> chan_in_use to true. When the interrupt handler pcc_mbox_irq() runs,
>> it reads chan_in_use as false and incorrectly ignores the interrupt.
>>
>> This patch fixes the race by:
>> 1. Moving the chan_in_use update before ringing the doorbell.
>> 2. Using smp_store_release() to ensure the flag update is visible
>>    to other cores before subsequent hardware or software actions
>>    are triggered.
>> 3. Using smp_load_acquire() in the interrupt handler to ensure the
>>    latest flag value is read before deciding to skip the interrupt.
>>
> 
> Are you seeing the issue on real platforms or you are just reviewing the
> code. I would like to test it on the platform I use but I don't have it
> handy, so may take some time.
> 
> I have added Robbie King who also has helped in testing this PCC driver
> in the past.
> 

I was unable to apply the patch to our current kernel version.  We are in the
process of updating our kernel modules to support 7.0, once that effort is
finished I can run a few of our current regressions against the patch.
Re: [PATCH] mailbox: pcc: Fix probabilistic command execution timeout
Posted by lihuisong (C) 2 weeks, 6 days ago
On 4/27/2026 10:28 PM, Robbie King wrote:
> On 4/21/2026 6:05 AM, Sudeep Holla wrote:
>> On Fri, Apr 17, 2026 at 11:14:29AM +0800, Huisong Li wrote:
>>> In some scenarios, PCC command may experience probabilistic timeout.
>>> This is primarily caused by the chan_in_use flag being updated after
>>> ringing the doorbell, coupled with a lack of proper memory barriers
>>> across CPU cores.
>>>
>>> On fast platforms, a race condition occurs: the platform processing
>>> completes and triggers an interrupt before the local CPU sets
>>> chan_in_use to true. When the interrupt handler pcc_mbox_irq() runs,
>>> it reads chan_in_use as false and incorrectly ignores the interrupt.
>>>
>>> This patch fixes the race by:
>>> 1. Moving the chan_in_use update before ringing the doorbell.
>>> 2. Using smp_store_release() to ensure the flag update is visible
>>>     to other cores before subsequent hardware or software actions
>>>     are triggered.
>>> 3. Using smp_load_acquire() in the interrupt handler to ensure the
>>>     latest flag value is read before deciding to skip the interrupt.
>>>
>> Are you seeing the issue on real platforms or you are just reviewing the
>> code. I would like to test it on the platform I use but I don't have it
>> handy, so may take some time.
>>
>> I have added Robbie King who also has helped in testing this PCC driver
>> in the past.
>>
> I was unable to apply the patch to our current kernel version.  We are in the
> process of updating our kernel modules to support 7.0, once that effort is
> finished I can run a few of our current regressions against the patch.
Hi Robbie King,
Can you help test this patch now?
Or we can fix the conflict to test it on your current kernel.
Re: [PATCH] mailbox: pcc: Fix probabilistic command execution timeout
Posted by lihuisong (C) 1 month, 3 weeks ago
On 4/21/2026 6:05 PM, Sudeep Holla wrote:
> On Fri, Apr 17, 2026 at 11:14:29AM +0800, Huisong Li wrote:
>> In some scenarios, PCC command may experience probabilistic timeout.
>> This is primarily caused by the chan_in_use flag being updated after
>> ringing the doorbell, coupled with a lack of proper memory barriers
>> across CPU cores.
>>
>> On fast platforms, a race condition occurs: the platform processing
>> completes and triggers an interrupt before the local CPU sets
>> chan_in_use to true. When the interrupt handler pcc_mbox_irq() runs,
>> it reads chan_in_use as false and incorrectly ignores the interrupt.
>>
>> This patch fixes the race by:
>> 1. Moving the chan_in_use update before ringing the doorbell.
>> 2. Using smp_store_release() to ensure the flag update is visible
>>     to other cores before subsequent hardware or software actions
>>     are triggered.
>> 3. Using smp_load_acquire() in the interrupt handler to ensure the
>>     latest flag value is read before deciding to skip the interrupt.
>>
> Are you seeing the issue on real platforms or you are just reviewing the
> code. I would like to test it on the platform I use but I don't have it
> handy, so may take some time.
Yeah, this is a real issue on my platform. It is probabilistic.
And the problem is gone after this modification.

Gemini AI help me to analyze and suggest that I fix it as this patch done.

>
> I have added Robbie King who also has helped in testing this PCC driver
> in the past.
Thanks. Please help review this patch.
>