drivers/mailbox/pcc.c | 20 ++++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-)
In some scenarios, PCC command may experience probabilistic timeout.
This is primarily caused by the chan_in_use flag being updated after
ringing the doorbell, coupled with a lack of proper memory barriers
across CPU cores.
On fast platforms, a race condition occurs: the platform processing
completes and triggers an interrupt before the local CPU sets
chan_in_use to true. When the interrupt handler pcc_mbox_irq() runs,
it reads chan_in_use as false and incorrectly ignores the interrupt.
This patch fixes the race by:
1. Moving the chan_in_use update before ringing the doorbell.
2. Using smp_store_release() to ensure the flag update is visible
to other cores before subsequent hardware or software actions
are triggered.
3. Using smp_load_acquire() in the interrupt handler to ensure the
latest flag value is read before deciding to skip the interrupt.
Signed-off-by: Huisong Li <lihuisong@huawei.com>
---
drivers/mailbox/pcc.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
diff --git a/drivers/mailbox/pcc.c b/drivers/mailbox/pcc.c
index 636879ae1db7..f45fccd635b6 100644
--- a/drivers/mailbox/pcc.c
+++ b/drivers/mailbox/pcc.c
@@ -320,8 +320,13 @@ static irqreturn_t pcc_mbox_irq(int irq, void *p)
if (pcc_chan_reg_read_modify_write(&pchan->plat_irq_ack))
return IRQ_NONE;
+ /*
+ * For Master Subspaces, we must ensure we see the latest chan_in_use
+ * value updated by pcc_send_data() on another core. smp_load_acquire
+ * provides the necessary barrier to avoid probabilistic IRQ misses.
+ */
if (pchan->type == ACPI_PCCT_TYPE_EXT_PCC_MASTER_SUBSPACE &&
- !pchan->chan_in_use)
+ !smp_load_acquire(&pchan->chan_in_use))
return IRQ_NONE;
if (!pcc_mbox_cmd_complete_check(pchan))
@@ -336,7 +341,7 @@ static irqreturn_t pcc_mbox_irq(int irq, void *p)
* where the flag is set again to start new transfer. This is
* required to avoid any possible race in updatation of this flag.
*/
- pchan->chan_in_use = false;
+ smp_store_release(&pchan->chan_in_use, false);
mbox_chan_received_data(chan, NULL);
mbox_chan_txdone(chan, 0);
@@ -438,9 +443,16 @@ static int pcc_send_data(struct mbox_chan *chan, void *data)
if (ret)
return ret;
+ /*
+ * Set chan_in_use before ringing the doorbell. Using smp_store_release
+ * ensures the flag is visible to the interrupt handler before the
+ * hardware is actually notified.
+ */
+ if (pchan->plat_irq > 0)
+ smp_store_release(&pchan->chan_in_use, true);
ret = pcc_chan_reg_read_modify_write(&pchan->db);
- if (!ret && pchan->plat_irq > 0)
- pchan->chan_in_use = true;
+ if (ret && pchan->plat_irq > 0)
+ smp_store_release(&pchan->chan_in_use, false);
return ret;
}
--
2.33.0
On Fri, Apr 17, 2026 at 11:14:29AM +0800, Huisong Li wrote: > In some scenarios, PCC command may experience probabilistic timeout. > This is primarily caused by the chan_in_use flag being updated after > ringing the doorbell, coupled with a lack of proper memory barriers > across CPU cores. > > On fast platforms, a race condition occurs: the platform processing > completes and triggers an interrupt before the local CPU sets > chan_in_use to true. When the interrupt handler pcc_mbox_irq() runs, > it reads chan_in_use as false and incorrectly ignores the interrupt. > > This patch fixes the race by: > 1. Moving the chan_in_use update before ringing the doorbell. > 2. Using smp_store_release() to ensure the flag update is visible > to other cores before subsequent hardware or software actions > are triggered. > 3. Using smp_load_acquire() in the interrupt handler to ensure the > latest flag value is read before deciding to skip the interrupt. > Are you seeing the issue on real platforms or you are just reviewing the code. I would like to test it on the platform I use but I don't have it handy, so may take some time. I have added Robbie King who also has helped in testing this PCC driver in the past. -- Regards, Sudeep
On 4/21/2026 6:05 AM, Sudeep Holla wrote: > On Fri, Apr 17, 2026 at 11:14:29AM +0800, Huisong Li wrote: >> In some scenarios, PCC command may experience probabilistic timeout. >> This is primarily caused by the chan_in_use flag being updated after >> ringing the doorbell, coupled with a lack of proper memory barriers >> across CPU cores. >> >> On fast platforms, a race condition occurs: the platform processing >> completes and triggers an interrupt before the local CPU sets >> chan_in_use to true. When the interrupt handler pcc_mbox_irq() runs, >> it reads chan_in_use as false and incorrectly ignores the interrupt. >> >> This patch fixes the race by: >> 1. Moving the chan_in_use update before ringing the doorbell. >> 2. Using smp_store_release() to ensure the flag update is visible >> to other cores before subsequent hardware or software actions >> are triggered. >> 3. Using smp_load_acquire() in the interrupt handler to ensure the >> latest flag value is read before deciding to skip the interrupt. >> > > Are you seeing the issue on real platforms or you are just reviewing the > code. I would like to test it on the platform I use but I don't have it > handy, so may take some time. > > I have added Robbie King who also has helped in testing this PCC driver > in the past. > I was unable to apply the patch to our current kernel version. We are in the process of updating our kernel modules to support 7.0, once that effort is finished I can run a few of our current regressions against the patch.
On 4/27/2026 10:28 PM, Robbie King wrote: > On 4/21/2026 6:05 AM, Sudeep Holla wrote: >> On Fri, Apr 17, 2026 at 11:14:29AM +0800, Huisong Li wrote: >>> In some scenarios, PCC command may experience probabilistic timeout. >>> This is primarily caused by the chan_in_use flag being updated after >>> ringing the doorbell, coupled with a lack of proper memory barriers >>> across CPU cores. >>> >>> On fast platforms, a race condition occurs: the platform processing >>> completes and triggers an interrupt before the local CPU sets >>> chan_in_use to true. When the interrupt handler pcc_mbox_irq() runs, >>> it reads chan_in_use as false and incorrectly ignores the interrupt. >>> >>> This patch fixes the race by: >>> 1. Moving the chan_in_use update before ringing the doorbell. >>> 2. Using smp_store_release() to ensure the flag update is visible >>> to other cores before subsequent hardware or software actions >>> are triggered. >>> 3. Using smp_load_acquire() in the interrupt handler to ensure the >>> latest flag value is read before deciding to skip the interrupt. >>> >> Are you seeing the issue on real platforms or you are just reviewing the >> code. I would like to test it on the platform I use but I don't have it >> handy, so may take some time. >> >> I have added Robbie King who also has helped in testing this PCC driver >> in the past. >> > I was unable to apply the patch to our current kernel version. We are in the > process of updating our kernel modules to support 7.0, once that effort is > finished I can run a few of our current regressions against the patch. Hi Robbie King, Can you help test this patch now? Or we can fix the conflict to test it on your current kernel.
On 4/21/2026 6:05 PM, Sudeep Holla wrote: > On Fri, Apr 17, 2026 at 11:14:29AM +0800, Huisong Li wrote: >> In some scenarios, PCC command may experience probabilistic timeout. >> This is primarily caused by the chan_in_use flag being updated after >> ringing the doorbell, coupled with a lack of proper memory barriers >> across CPU cores. >> >> On fast platforms, a race condition occurs: the platform processing >> completes and triggers an interrupt before the local CPU sets >> chan_in_use to true. When the interrupt handler pcc_mbox_irq() runs, >> it reads chan_in_use as false and incorrectly ignores the interrupt. >> >> This patch fixes the race by: >> 1. Moving the chan_in_use update before ringing the doorbell. >> 2. Using smp_store_release() to ensure the flag update is visible >> to other cores before subsequent hardware or software actions >> are triggered. >> 3. Using smp_load_acquire() in the interrupt handler to ensure the >> latest flag value is read before deciding to skip the interrupt. >> > Are you seeing the issue on real platforms or you are just reviewing the > code. I would like to test it on the platform I use but I don't have it > handy, so may take some time. Yeah, this is a real issue on my platform. It is probabilistic. And the problem is gone after this modification. Gemini AI help me to analyze and suggest that I fix it as this patch done. > > I have added Robbie King who also has helped in testing this PCC driver > in the past. Thanks. Please help review this patch. >
© 2016 - 2026 Red Hat, Inc.