RFC: mailbox: Fix NULL message support in mbox_send_message()

[PATCH] RFC: mailbox: Fix NULL message support in mbox_send_message()

Posted by jassisinghbrar@gmail.com 4 weeks ago

From: Jassi Brar <jassisinghbrar@gmail.com>

The active_req field serves double duty as both the "is a TX in
flight" flag (NULL means idle) and the storage for the in-flight
message pointer. When a client sends NULL via mbox_send_message(),
active_req is set to NULL, which the framework misinterprets as
"no active request." This breaks the TX state machine by:

 - tx_tick() short-circuits on (!mssg), skipping the tx_done
   callback and the tx_complete completion
 - txdone_hrtimer() skips the channel entirely since active_req
   is NULL, so poll-based TX-done detection never fires.

Fix this by introducing a MBOX_NO_MSG sentinel value that means
"no active request," freeing NULL to be valid message data. The
sentinel is internal to the mailbox core and is never exposed to
controller drivers or clients.

The only tradeoff is that 'MBOX_NO_MSG' can not be used as a message
by clients.

Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>
---
 drivers/mailbox/mailbox.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
index 617ba505691d..d06f6d49deaf 100644
--- a/drivers/mailbox/mailbox.c
+++ b/drivers/mailbox/mailbox.c
@@ -20,6 +20,9 @@
 
 #include "mailbox.h"
 
+/* Sentinel value distinguishing "no active request" from "NULL message data" */
+#define MBOX_NO_MSG	((void *)-1)
+
 static LIST_HEAD(mbox_cons);
 static DEFINE_MUTEX(con_mutex);
 
@@ -52,7 +55,7 @@ static void msg_submit(struct mbox_chan *chan)
 	int err = -EBUSY;
 
 	scoped_guard(spinlock_irqsave, &chan->lock) {
-		if (!chan->msg_count || chan->active_req)
+		if (!chan->msg_count || chan->active_req != MBOX_NO_MSG)
 			break;
 
 		count = chan->msg_count;
@@ -87,13 +90,13 @@ static void tx_tick(struct mbox_chan *chan, int r)
 
 	scoped_guard(spinlock_irqsave, &chan->lock) {
 		mssg = chan->active_req;
-		chan->active_req = NULL;
+		chan->active_req = MBOX_NO_MSG;
 	}
 
 	/* Submit next message */
 	msg_submit(chan);
 
-	if (!mssg)
+	if (mssg == MBOX_NO_MSG)
 		return;
 
 	/* Notify the client */
@@ -114,7 +117,7 @@ static enum hrtimer_restart txdone_hrtimer(struct hrtimer *hrtimer)
 	for (i = 0; i < mbox->num_chans; i++) {
 		struct mbox_chan *chan = &mbox->chans[i];
 
-		if (chan->active_req && chan->cl) {
+		if (chan->active_req != MBOX_NO_MSG && chan->cl) {
 			txdone = chan->mbox->ops->last_tx_done(chan);
 			if (txdone)
 				tx_tick(chan, 0);
@@ -319,7 +322,7 @@ static int __mbox_bind_client(struct mbox_chan *chan, struct mbox_client *cl)
 	scoped_guard(spinlock_irqsave, &chan->lock) {
 		chan->msg_free = 0;
 		chan->msg_count = 0;
-		chan->active_req = NULL;
+		chan->active_req = MBOX_NO_MSG;
 		chan->cl = cl;
 		init_completion(&chan->tx_complete);
 
@@ -477,7 +480,7 @@ void mbox_free_channel(struct mbox_chan *chan)
 	/* The queued TX requests are simply aborted, no callbacks are made */
 	scoped_guard(spinlock_irqsave, &chan->lock) {
 		chan->cl = NULL;
-		chan->active_req = NULL;
+		chan->active_req = MBOX_NO_MSG;
 		if (chan->txdone_method == TXDONE_BY_ACK)
 			chan->txdone_method = TXDONE_BY_POLL;
 	}
-- 
2.43.0

Re: [PATCH] RFC: mailbox: Fix NULL message support in mbox_send_message()

Posted by Joonwon Kang 3 weeks, 4 days ago

> The active_req field serves double duty as both the "is a TX in
> flight" flag (NULL means idle) and the storage for the in-flight
> message pointer. When a client sends NULL via mbox_send_message(),
> active_req is set to NULL, which the framework misinterprets as
> "no active request." This breaks the TX state machine by:
> 
>  - tx_tick() short-circuits on (!mssg), skipping the tx_done
>    callback and the tx_complete completion
>  - txdone_hrtimer() skips the channel entirely since active_req
>    is NULL, so poll-based TX-done detection never fires.
> 
> Fix this by introducing a MBOX_NO_MSG sentinel value that means
> "no active request," freeing NULL to be valid message data. The
> sentinel is internal to the mailbox core and is never exposed to
> controller drivers or clients.

The following drivers are currently using ->active_req which now could be
assigned MBOX_NO_MSG.
- drivers/mailbox/tegra-hsp.c
- drivers/mailbox/mtk-vcp-mailbox.c

One of them is using ->active_req to wait until the channel is empty. In
this case, strictly speaking, that controller driver should be aware of
the sentinel value MBOX_NO_MSG, which means the sentinel value should be
exposed to the controller. Or, if a future controller driver to come is to
use ->active_req for the same purpose for doorbell or non-doorbell, it
should also be aware of the sentinel value anyway.

However, I believe that it is not intuitive to the controller developers
that a pointer value could be other value than a real memory address,
NULL or error encoded value, which is -1(== MBOX_NO_MSG). For this reason,
I think it will be better to change the type of ->active_req to give a
better indication to the controller developers, e.g. to integer as in the
original patch
https://lore.kernel.org/all/20251126045926.2413532-1-joonwonkang@google.com/.

Or, we could change those drivers not to use ->active_req, hide
->active_req entirely in the mailbox core and keep this patch.

Thanks.

> 
> The only tradeoff is that 'MBOX_NO_MSG' can not be used as a message
> by clients.

Re: [PATCH] RFC: mailbox: Fix NULL message support in mbox_send_message()

Posted by Jassi Brar 3 weeks ago

On Fri, Mar 13, 2026 at 5:12 AM Joonwon Kang <joonwonkang@google.com> wrote:
>
> > The active_req field serves double duty as both the "is a TX in
> > flight" flag (NULL means idle) and the storage for the in-flight
> > message pointer. When a client sends NULL via mbox_send_message(),
> > active_req is set to NULL, which the framework misinterprets as
> > "no active request." This breaks the TX state machine by:
> >
> >  - tx_tick() short-circuits on (!mssg), skipping the tx_done
> >    callback and the tx_complete completion
> >  - txdone_hrtimer() skips the channel entirely since active_req
> >    is NULL, so poll-based TX-done detection never fires.
> >
> > Fix this by introducing a MBOX_NO_MSG sentinel value that means
> > "no active request," freeing NULL to be valid message data. The
> > sentinel is internal to the mailbox core and is never exposed to
> > controller drivers or clients.
>
> The following drivers are currently using ->active_req which now could be
> assigned MBOX_NO_MSG.
> - drivers/mailbox/tegra-hsp.c
> - drivers/mailbox/mtk-vcp-mailbox.c
>
Good catch, Thanks.
mtk-vcp-mailbox.c is fine as is - it assumes only valid requests. This
patch will not introduce any change for it.
Yes, tegra-hsp.c should now track MBOX_NO_MSG instead of NULL. Single
line change.

> One of them is using ->active_req to wait until the channel is empty. In
> this case, strictly speaking, that controller driver should be aware of
> the sentinel value MBOX_NO_MSG, which means the sentinel value should be
> exposed to the controller. Or, if a future controller driver to come is to
> use ->active_req for the same purpose for doorbell or non-doorbell, it
> should also be aware of the sentinel value anyway.
>
> However, I believe that it is not intuitive to the controller developers
> that a pointer value could be other value than a real memory address,
> NULL or error encoded value, which is -1(== MBOX_NO_MSG). For this reason,
> I think it will be better to change the type of ->active_req to give a
> better indication to the controller developers, e.g. to integer as in the
> original patch
> https://lore.kernel.org/all/20251126045926.2413532-1-joonwonkang@google.com/.
>
> Or, we could change those drivers not to use ->active_req, hide
> ->active_req entirely in the mailbox core and keep this patch.
>
Yes, ideally controller drivers should not track active_req.  But this
patch will cause least churn it seems.

Thanks
Jassi

Re: [PATCH] RFC: mailbox: Fix NULL message support in mbox_send_message()

Posted by Joonwon Kang 3 weeks ago

> On Fri, Mar 13, 2026 at 5:12 AM Joonwon Kang <joonwonkang@google.com> wrote:
> >
> > > The active_req field serves double duty as both the "is a TX in
> > > flight" flag (NULL means idle) and the storage for the in-flight
> > > message pointer. When a client sends NULL via mbox_send_message(),
> > > active_req is set to NULL, which the framework misinterprets as
> > > "no active request." This breaks the TX state machine by:
> > >
> > >  - tx_tick() short-circuits on (!mssg), skipping the tx_done
> > >    callback and the tx_complete completion
> > >  - txdone_hrtimer() skips the channel entirely since active_req
> > >    is NULL, so poll-based TX-done detection never fires.
> > >
> > > Fix this by introducing a MBOX_NO_MSG sentinel value that means
> > > "no active request," freeing NULL to be valid message data. The
> > > sentinel is internal to the mailbox core and is never exposed to
> > > controller drivers or clients.
> >
> > The following drivers are currently using ->active_req which now could be
> > assigned MBOX_NO_MSG.
> > - drivers/mailbox/tegra-hsp.c
> > - drivers/mailbox/mtk-vcp-mailbox.c
> >
> Good catch, Thanks.
> mtk-vcp-mailbox.c is fine as is - it assumes only valid requests. This
> patch will not introduce any change for it.
> Yes, tegra-hsp.c should now track MBOX_NO_MSG instead of NULL. Single
> line change.
> 
A more important aspect than single line change would be that you are
creating a new API contract that the controller drivers who are to use
->active_req should now have new knowledge that the pointer value could
be unconventional value such as -1. Without this additional knowledge,
they may easily fail checking if the channel is empty.

> > One of them is using ->active_req to wait until the channel is empty. In
> > this case, strictly speaking, that controller driver should be aware of
> > the sentinel value MBOX_NO_MSG, which means the sentinel value should be
> > exposed to the controller. Or, if a future controller driver to come is to
> > use ->active_req for the same purpose for doorbell or non-doorbell, it
> > should also be aware of the sentinel value anyway.
> >
> > However, I believe that it is not intuitive to the controller developers
> > that a pointer value could be other value than a real memory address,
> > NULL or error encoded value, which is -1(== MBOX_NO_MSG). For this reason,
> > I think it will be better to change the type of ->active_req to give a
> > better indication to the controller developers, e.g. to integer as in the
> > original patch
> > https://lore.kernel.org/all/20251126045926.2413532-1-joonwonkang@google.com/.
> >
> > Or, we could change those drivers not to use ->active_req, hide
> > ->active_req entirely in the mailbox core and keep this patch.
> >
> Yes, ideally controller drivers should not track active_req.  But this
> patch will cause least churn it seems.

If we will take those two drivers as an exceptional and not recommended
case that uses ->active_req directly, it will be fine not to change them
except for tegra-hsp.c to check MBOX_NO_MSG.

If that is the case and you are to keep this patch, please keep in mind to
leave the original patch link in the commit message for better trackability
of the discussion and solutions and for credit for the contribution of
finding and analyzing this issue and proposing the first solution, which
I believe is also important for future contributions to come.

Thanks.

Re: [PATCH] RFC: mailbox: Fix NULL message support in mbox_send_message()

Posted by Doug Anderson 2 weeks, 4 days ago

Hi,

On Mon, Mar 16, 2026 at 10:03 PM Joonwon Kang <joonwonkang@google.com> wrote:
>
> > On Fri, Mar 13, 2026 at 5:12 AM Joonwon Kang <joonwonkang@google.com> wrote:
> > >
> > > > The active_req field serves double duty as both the "is a TX in
> > > > flight" flag (NULL means idle) and the storage for the in-flight
> > > > message pointer. When a client sends NULL via mbox_send_message(),
> > > > active_req is set to NULL, which the framework misinterprets as
> > > > "no active request." This breaks the TX state machine by:
> > > >
> > > >  - tx_tick() short-circuits on (!mssg), skipping the tx_done
> > > >    callback and the tx_complete completion
> > > >  - txdone_hrtimer() skips the channel entirely since active_req
> > > >    is NULL, so poll-based TX-done detection never fires.
> > > >
> > > > Fix this by introducing a MBOX_NO_MSG sentinel value that means
> > > > "no active request," freeing NULL to be valid message data. The
> > > > sentinel is internal to the mailbox core and is never exposed to
> > > > controller drivers or clients.
> > >
> > > The following drivers are currently using ->active_req which now could be
> > > assigned MBOX_NO_MSG.
> > > - drivers/mailbox/tegra-hsp.c
> > > - drivers/mailbox/mtk-vcp-mailbox.c
> > >
> > Good catch, Thanks.
> > mtk-vcp-mailbox.c is fine as is - it assumes only valid requests. This
> > patch will not introduce any change for it.
> > Yes, tegra-hsp.c should now track MBOX_NO_MSG instead of NULL. Single
> > line change.
> >
> A more important aspect than single line change would be that you are
> creating a new API contract that the controller drivers who are to use
> ->active_req should now have new knowledge that the pointer value could
> be unconventional value such as -1. Without this additional knowledge,
> they may easily fail checking if the channel is empty.
>
> > > One of them is using ->active_req to wait until the channel is empty. In
> > > this case, strictly speaking, that controller driver should be aware of
> > > the sentinel value MBOX_NO_MSG, which means the sentinel value should be
> > > exposed to the controller. Or, if a future controller driver to come is to
> > > use ->active_req for the same purpose for doorbell or non-doorbell, it
> > > should also be aware of the sentinel value anyway.
> > >
> > > However, I believe that it is not intuitive to the controller developers
> > > that a pointer value could be other value than a real memory address,
> > > NULL or error encoded value, which is -1(== MBOX_NO_MSG). For this reason,
> > > I think it will be better to change the type of ->active_req to give a
> > > better indication to the controller developers, e.g. to integer as in the
> > > original patch
> > > https://lore.kernel.org/all/20251126045926.2413532-1-joonwonkang@google.com/.
> > >
> > > Or, we could change those drivers not to use ->active_req, hide
> > > ->active_req entirely in the mailbox core and keep this patch.
> > >
> > Yes, ideally controller drivers should not track active_req.  But this
> > patch will cause least churn it seems.
>
> If we will take those two drivers as an exceptional and not recommended
> case that uses ->active_req directly, it will be fine not to change them
> except for tegra-hsp.c to check MBOX_NO_MSG.
>
> If that is the case and you are to keep this patch, please keep in mind to
> leave the original patch link in the commit message for better trackability
> of the discussion and solutions and for credit for the contribution of
> finding and analyzing this issue and proposing the first solution, which
> I believe is also important for future contributions to come.

So what's the next steps here, then? I don't think I adequately
explained exactly the needs of my mbox client, but I also don't think
it's a big deal. I'm convinced it should be fine even if NULL messages
get queued.

Jassi: are you going to send a new version of your patch? ...or are
you expecting Joonwon to post a new version? ...or do you want me to
post some variant of these patches?

-Doug

Re: [PATCH] RFC: mailbox: Fix NULL message support in mbox_send_message()

Posted by Jassi Brar 2 weeks, 3 days ago

On Fri, Mar 20, 2026 at 4:03 PM Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Mon, Mar 16, 2026 at 10:03 PM Joonwon Kang <joonwonkang@google.com> wrote:
> >
> > > On Fri, Mar 13, 2026 at 5:12 AM Joonwon Kang <joonwonkang@google.com> wrote:
> > > >
> > > > > The active_req field serves double duty as both the "is a TX in
> > > > > flight" flag (NULL means idle) and the storage for the in-flight
> > > > > message pointer. When a client sends NULL via mbox_send_message(),
> > > > > active_req is set to NULL, which the framework misinterprets as
> > > > > "no active request." This breaks the TX state machine by:
> > > > >
> > > > >  - tx_tick() short-circuits on (!mssg), skipping the tx_done
> > > > >    callback and the tx_complete completion
> > > > >  - txdone_hrtimer() skips the channel entirely since active_req
> > > > >    is NULL, so poll-based TX-done detection never fires.
> > > > >
> > > > > Fix this by introducing a MBOX_NO_MSG sentinel value that means
> > > > > "no active request," freeing NULL to be valid message data. The
> > > > > sentinel is internal to the mailbox core and is never exposed to
> > > > > controller drivers or clients.
> > > >
> > > > The following drivers are currently using ->active_req which now could be
> > > > assigned MBOX_NO_MSG.
> > > > - drivers/mailbox/tegra-hsp.c
> > > > - drivers/mailbox/mtk-vcp-mailbox.c
> > > >
> > > Good catch, Thanks.
> > > mtk-vcp-mailbox.c is fine as is - it assumes only valid requests. This
> > > patch will not introduce any change for it.
> > > Yes, tegra-hsp.c should now track MBOX_NO_MSG instead of NULL. Single
> > > line change.
> > >
> > A more important aspect than single line change would be that you are
> > creating a new API contract that the controller drivers who are to use
> > ->active_req should now have new knowledge that the pointer value could
> > be unconventional value such as -1. Without this additional knowledge,
> > they may easily fail checking if the channel is empty.
> >
> > > > One of them is using ->active_req to wait until the channel is empty. In
> > > > this case, strictly speaking, that controller driver should be aware of
> > > > the sentinel value MBOX_NO_MSG, which means the sentinel value should be
> > > > exposed to the controller. Or, if a future controller driver to come is to
> > > > use ->active_req for the same purpose for doorbell or non-doorbell, it
> > > > should also be aware of the sentinel value anyway.
> > > >
> > > > However, I believe that it is not intuitive to the controller developers
> > > > that a pointer value could be other value than a real memory address,
> > > > NULL or error encoded value, which is -1(== MBOX_NO_MSG). For this reason,
> > > > I think it will be better to change the type of ->active_req to give a
> > > > better indication to the controller developers, e.g. to integer as in the
> > > > original patch
> > > > https://lore.kernel.org/all/20251126045926.2413532-1-joonwonkang@google.com/.
> > > >
> > > > Or, we could change those drivers not to use ->active_req, hide
> > > > ->active_req entirely in the mailbox core and keep this patch.
> > > >
> > > Yes, ideally controller drivers should not track active_req.  But this
> > > patch will cause least churn it seems.
> >
> > If we will take those two drivers as an exceptional and not recommended
> > case that uses ->active_req directly, it will be fine not to change them
> > except for tegra-hsp.c to check MBOX_NO_MSG.
> >
> > If that is the case and you are to keep this patch, please keep in mind to
> > leave the original patch link in the commit message for better trackability
> > of the discussion and solutions and for credit for the contribution of
> > finding and analyzing this issue and proposing the first solution, which
> > I believe is also important for future contributions to come.
>
> So what's the next steps here, then? I don't think I adequately
> explained exactly the needs of my mbox client, but I also don't think
> it's a big deal. I'm convinced it should be fine even if NULL messages
> get queued.
>
I think it will be simpler than you imagine, but yes I too don't think
it's a big deal. We can iron out details later.

> Jassi: are you going to send a new version of your patch? ...or are
> you expecting Joonwon to post a new version? ...or do you want me to
> post some variant of these patches?
>
Honestly I only care about minimal churn to code and API. I think
simply using a different sentinel value is simpler than tracking the
number of submissions... which is neither strictly required nor
reduces the changes we have to make.
So I plan to submit my RFC as a patch, the trivial change to
tegra-hsp.c and resubmit the qcom fix with your ack.

Regards,
Jassi

Re: [PATCH] RFC: mailbox: Fix NULL message support in mbox_send_message()

Posted by Joonwon Kang 1 week, 5 days ago

> On Fri, Mar 20, 2026 at 4:03 PM Doug Anderson <dianders@chromium.org> wrote:
> >
> > Hi,
> >
> > On Mon, Mar 16, 2026 at 10:03 PM Joonwon Kang <joonwonkang@google.com> wrote:
> > >
> > > > On Fri, Mar 13, 2026 at 5:12 AM Joonwon Kang <joonwonkang@google.com> wrote:
> > > > >
> > > > > > The active_req field serves double duty as both the "is a TX in
> > > > > > flight" flag (NULL means idle) and the storage for the in-flight
> > > > > > message pointer. When a client sends NULL via mbox_send_message(),
> > > > > > active_req is set to NULL, which the framework misinterprets as
> > > > > > "no active request." This breaks the TX state machine by:
> > > > > >
> > > > > >  - tx_tick() short-circuits on (!mssg), skipping the tx_done
> > > > > >    callback and the tx_complete completion
> > > > > >  - txdone_hrtimer() skips the channel entirely since active_req
> > > > > >    is NULL, so poll-based TX-done detection never fires.
> > > > > >
> > > > > > Fix this by introducing a MBOX_NO_MSG sentinel value that means
> > > > > > "no active request," freeing NULL to be valid message data. The
> > > > > > sentinel is internal to the mailbox core and is never exposed to
> > > > > > controller drivers or clients.
> > > > >
> > > > > The following drivers are currently using ->active_req which now could be
> > > > > assigned MBOX_NO_MSG.
> > > > > - drivers/mailbox/tegra-hsp.c
> > > > > - drivers/mailbox/mtk-vcp-mailbox.c
> > > > >
> > > > Good catch, Thanks.
> > > > mtk-vcp-mailbox.c is fine as is - it assumes only valid requests. This
> > > > patch will not introduce any change for it.
> > > > Yes, tegra-hsp.c should now track MBOX_NO_MSG instead of NULL. Single
> > > > line change.
> > > >
> > > A more important aspect than single line change would be that you are
> > > creating a new API contract that the controller drivers who are to use
> > > ->active_req should now have new knowledge that the pointer value could
> > > be unconventional value such as -1. Without this additional knowledge,
> > > they may easily fail checking if the channel is empty.
> > >
> > > > > One of them is using ->active_req to wait until the channel is empty. In
> > > > > this case, strictly speaking, that controller driver should be aware of
> > > > > the sentinel value MBOX_NO_MSG, which means the sentinel value should be
> > > > > exposed to the controller. Or, if a future controller driver to come is to
> > > > > use ->active_req for the same purpose for doorbell or non-doorbell, it
> > > > > should also be aware of the sentinel value anyway.
> > > > >
> > > > > However, I believe that it is not intuitive to the controller developers
> > > > > that a pointer value could be other value than a real memory address,
> > > > > NULL or error encoded value, which is -1(== MBOX_NO_MSG). For this reason,
> > > > > I think it will be better to change the type of ->active_req to give a
> > > > > better indication to the controller developers, e.g. to integer as in the
> > > > > original patch
> > > > > https://lore.kernel.org/all/20251126045926.2413532-1-joonwonkang@google.com/.
> > > > >
> > > > > Or, we could change those drivers not to use ->active_req, hide
> > > > > ->active_req entirely in the mailbox core and keep this patch.
> > > > >
> > > > Yes, ideally controller drivers should not track active_req.  But this
> > > > patch will cause least churn it seems.
> > >
> > > If we will take those two drivers as an exceptional and not recommended
> > > case that uses ->active_req directly, it will be fine not to change them
> > > except for tegra-hsp.c to check MBOX_NO_MSG.
> > >
> > > If that is the case and you are to keep this patch, please keep in mind to
> > > leave the original patch link in the commit message for better trackability
> > > of the discussion and solutions and for credit for the contribution of
> > > finding and analyzing this issue and proposing the first solution, which
> > > I believe is also important for future contributions to come.
> >
> > So what's the next steps here, then? I don't think I adequately
> > explained exactly the needs of my mbox client, but I also don't think
> > it's a big deal. I'm convinced it should be fine even if NULL messages
> > get queued.
> >
> I think it will be simpler than you imagine, but yes I too don't think
> it's a big deal. We can iron out details later.
> 
> > Jassi: are you going to send a new version of your patch? ...or are
> > you expecting Joonwon to post a new version? ...or do you want me to
> > post some variant of these patches?
> >
> Honestly I only care about minimal churn to code and API. I think
> simply using a different sentinel value is simpler than tracking the
> number of submissions... which is neither strictly required nor
> reduces the changes we have to make.
> So I plan to submit my RFC as a patch, the trivial change to
> tegra-hsp.c and resubmit the qcom fix with your ack.
> 
> Regards,
> Jassi

I have reviewed the new patch.

Regarding the issue I brought up on how this alternative patch had been
uploaded without letting the original patch author know and without any tag for
better tracking or credit, I hope next time you could use "Reported-by:" or
"Link:" tag at least to encourage contributors to keep spending effort finding
and analyzing issues in the mailbox framework. I believe it could be better for
community and quality of code as also guided in the submitting-patches doc.

Sincerely,
Joonwon Kang

Re: [PATCH] RFC: mailbox: Fix NULL message support in mbox_send_message()

Posted by Jassi Brar 1 week, 4 days ago

On Thu, Mar 26, 2026 at 2:31 AM Joonwon Kang <joonwonkang@google.com> wrote:
>
> > On Fri, Mar 20, 2026 at 4:03 PM Doug Anderson <dianders@chromium.org> wrote:
> > >
> > > Hi,
> > >
> > > On Mon, Mar 16, 2026 at 10:03 PM Joonwon Kang <joonwonkang@google.com> wrote:
> > > >
> > > > > On Fri, Mar 13, 2026 at 5:12 AM Joonwon Kang <joonwonkang@google.com> wrote:
> > > > > >
> > > > > > > The active_req field serves double duty as both the "is a TX in
> > > > > > > flight" flag (NULL means idle) and the storage for the in-flight
> > > > > > > message pointer. When a client sends NULL via mbox_send_message(),
> > > > > > > active_req is set to NULL, which the framework misinterprets as
> > > > > > > "no active request." This breaks the TX state machine by:
> > > > > > >
> > > > > > >  - tx_tick() short-circuits on (!mssg), skipping the tx_done
> > > > > > >    callback and the tx_complete completion
> > > > > > >  - txdone_hrtimer() skips the channel entirely since active_req
> > > > > > >    is NULL, so poll-based TX-done detection never fires.
> > > > > > >
> > > > > > > Fix this by introducing a MBOX_NO_MSG sentinel value that means
> > > > > > > "no active request," freeing NULL to be valid message data. The
> > > > > > > sentinel is internal to the mailbox core and is never exposed to
> > > > > > > controller drivers or clients.
> > > > > >
> > > > > > The following drivers are currently using ->active_req which now could be
> > > > > > assigned MBOX_NO_MSG.
> > > > > > - drivers/mailbox/tegra-hsp.c
> > > > > > - drivers/mailbox/mtk-vcp-mailbox.c
> > > > > >
> > > > > Good catch, Thanks.
> > > > > mtk-vcp-mailbox.c is fine as is - it assumes only valid requests. This
> > > > > patch will not introduce any change for it.
> > > > > Yes, tegra-hsp.c should now track MBOX_NO_MSG instead of NULL. Single
> > > > > line change.
> > > > >
> > > > A more important aspect than single line change would be that you are
> > > > creating a new API contract that the controller drivers who are to use
> > > > ->active_req should now have new knowledge that the pointer value could
> > > > be unconventional value such as -1. Without this additional knowledge,
> > > > they may easily fail checking if the channel is empty.
> > > >
> > > > > > One of them is using ->active_req to wait until the channel is empty. In
> > > > > > this case, strictly speaking, that controller driver should be aware of
> > > > > > the sentinel value MBOX_NO_MSG, which means the sentinel value should be
> > > > > > exposed to the controller. Or, if a future controller driver to come is to
> > > > > > use ->active_req for the same purpose for doorbell or non-doorbell, it
> > > > > > should also be aware of the sentinel value anyway.
> > > > > >
> > > > > > However, I believe that it is not intuitive to the controller developers
> > > > > > that a pointer value could be other value than a real memory address,
> > > > > > NULL or error encoded value, which is -1(== MBOX_NO_MSG). For this reason,
> > > > > > I think it will be better to change the type of ->active_req to give a
> > > > > > better indication to the controller developers, e.g. to integer as in the
> > > > > > original patch
> > > > > > https://lore.kernel.org/all/20251126045926.2413532-1-joonwonkang@google.com/.
> > > > > >
> > > > > > Or, we could change those drivers not to use ->active_req, hide
> > > > > > ->active_req entirely in the mailbox core and keep this patch.
> > > > > >
> > > > > Yes, ideally controller drivers should not track active_req.  But this
> > > > > patch will cause least churn it seems.
> > > >
> > > > If we will take those two drivers as an exceptional and not recommended
> > > > case that uses ->active_req directly, it will be fine not to change them
> > > > except for tegra-hsp.c to check MBOX_NO_MSG.
> > > >
> > > > If that is the case and you are to keep this patch, please keep in mind to
> > > > leave the original patch link in the commit message for better trackability
> > > > of the discussion and solutions and for credit for the contribution of
> > > > finding and analyzing this issue and proposing the first solution, which
> > > > I believe is also important for future contributions to come.
> > >
> > > So what's the next steps here, then? I don't think I adequately
> > > explained exactly the needs of my mbox client, but I also don't think
> > > it's a big deal. I'm convinced it should be fine even if NULL messages
> > > get queued.
> > >
> > I think it will be simpler than you imagine, but yes I too don't think
> > it's a big deal. We can iron out details later.
> >
> > > Jassi: are you going to send a new version of your patch? ...or are
> > > you expecting Joonwon to post a new version? ...or do you want me to
> > > post some variant of these patches?
> > >
> > Honestly I only care about minimal churn to code and API. I think
> > simply using a different sentinel value is simpler than tracking the
> > number of submissions... which is neither strictly required nor
> > reduces the changes we have to make.
> > So I plan to submit my RFC as a patch, the trivial change to
> > tegra-hsp.c and resubmit the qcom fix with your ack.
> >
> > Regards,
> > Jassi
>
> I have reviewed the new patch.
>
> Regarding the issue I brought up on how this alternative patch had been
> uploaded without letting the original patch author know and without any tag for
> better tracking or credit, I hope next time you could use "Reported-by:" or
> "Link:" tag at least to encourage contributors to keep spending effort finding
> and analyzing issues in the mailbox framework. I believe it could be better for
> community and quality of code as also guided in the submitting-patches doc.
>
I am sorry for the oversight. I should have attributed you as the
reporter. I forgot but I will add it before applying.

Thanks,
Jassi

Re: [PATCH] RFC: mailbox: Fix NULL message support in mbox_send_message()

Posted by Doug Anderson 3 weeks, 4 days ago

Hi,

On Fri, Mar 13, 2026 at 3:12 AM Joonwon Kang <joonwonkang@google.com> wrote:
>
> > The active_req field serves double duty as both the "is a TX in
> > flight" flag (NULL means idle) and the storage for the in-flight
> > message pointer. When a client sends NULL via mbox_send_message(),
> > active_req is set to NULL, which the framework misinterprets as
> > "no active request." This breaks the TX state machine by:
> >
> >  - tx_tick() short-circuits on (!mssg), skipping the tx_done
> >    callback and the tx_complete completion
> >  - txdone_hrtimer() skips the channel entirely since active_req
> >    is NULL, so poll-based TX-done detection never fires.
> >
> > Fix this by introducing a MBOX_NO_MSG sentinel value that means
> > "no active request," freeing NULL to be valid message data. The
> > sentinel is internal to the mailbox core and is never exposed to
> > controller drivers or clients.
>
> The following drivers are currently using ->active_req which now could be
> assigned MBOX_NO_MSG.
> - drivers/mailbox/tegra-hsp.c
> - drivers/mailbox/mtk-vcp-mailbox.c
>
> One of them is using ->active_req to wait until the channel is empty. In
> this case, strictly speaking, that controller driver should be aware of
> the sentinel value MBOX_NO_MSG, which means the sentinel value should be
> exposed to the controller. Or, if a future controller driver to come is to
> use ->active_req for the same purpose for doorbell or non-doorbell, it
> should also be aware of the sentinel value anyway.
>
> However, I believe that it is not intuitive to the controller developers
> that a pointer value could be other value than a real memory address,
> NULL or error encoded value, which is -1(== MBOX_NO_MSG). For this reason,
> I think it will be better to change the type of ->active_req to give a
> better indication to the controller developers, e.g. to integer as in the
> original patch
> https://lore.kernel.org/all/20251126045926.2413532-1-joonwonkang@google.com/.
>
> Or, we could change those drivers not to use ->active_req, hide
> ->active_req entirely in the mailbox core and keep this patch.

You make some good points here.

Looking at your patch, you're keeping "active_req" as an index into
"chan->msg_data". It seems like we could maybe get the best of both
worlds if we started with your solution but keep "active_req" as a
pointer. It can be a pointer to one of the elements in
"chan->msg_data". Then NULL can still mean that no request is active.

-Doug

Re: [PATCH] RFC: mailbox: Fix NULL message support in

Posted by Joonwon Kang 3 weeks, 5 days ago

> The active_req field serves double duty as both the "is a TX in
> flight" flag (NULL means idle) and the storage for the in-flight
> message pointer. When a client sends NULL via mbox_send_message(),
> active_req is set to NULL, which the framework misinterprets as
> "no active request." This breaks the TX state machine by:
> 
>  - tx_tick() short-circuits on (!mssg), skipping the tx_done
>    callback and the tx_complete completion
>  - txdone_hrtimer() skips the channel entirely since active_req
>    is NULL, so poll-based TX-done detection never fires.
> 
> Fix this by introducing a MBOX_NO_MSG sentinel value that means
> "no active request," freeing NULL to be valid message data. The
> sentinel is internal to the mailbox core and is never exposed to
> controller drivers or clients.
> 
> The only tradeoff is that 'MBOX_NO_MSG' can not be used as a message
> by clients.

I was surprised to find out this alternative patch while we have been waiting
a few months for your response to my previous patch for this issue on
https://lore.kernel.org/all/20251126045926.2413532-1-joonwonkang@google.com/.

Will you handle this issue on your end? If so, could you leave the original
discussion link to the commit message so that we could better track it? Also,
could you reply to the original patches next time with what you have decided
or at least cc the people who posted the original patches to your new patches
for better collaboration?

Thanks.

Re: [PATCH] RFC: mailbox: Fix NULL message support in mbox_send_message()

Posted by Doug Anderson 4 weeks ago

Hi,

On Tue, Mar 10, 2026 at 4:46 PM <jassisinghbrar@gmail.com> wrote:
>
> From: Jassi Brar <jassisinghbrar@gmail.com>
>
> The active_req field serves double duty as both the "is a TX in
> flight" flag (NULL means idle) and the storage for the in-flight
> message pointer. When a client sends NULL via mbox_send_message(),
> active_req is set to NULL, which the framework misinterprets as
> "no active request." This breaks the TX state machine by:
>
>  - tx_tick() short-circuits on (!mssg), skipping the tx_done
>    callback and the tx_complete completion
>  - txdone_hrtimer() skips the channel entirely since active_req
>    is NULL, so poll-based TX-done detection never fires.
>
> Fix this by introducing a MBOX_NO_MSG sentinel value that means
> "no active request," freeing NULL to be valid message data. The
> sentinel is internal to the mailbox core and is never exposed to
> controller drivers or clients.
>
> The only tradeoff is that 'MBOX_NO_MSG' can not be used as a message
> by clients.
>
> Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>
> ---
>  drivers/mailbox/mailbox.c | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)

While I can certainly be corrected, I suspect this patch will break a
lot of users.

My analysis of people that were passing NULL mbox messages is that
they _wanted_ the current behavior. They didn't want NULL messages to
be queued up as would happen with this patch. Instead, they just
wanted to immediately ring the doorbell again to signal an interrupt
to the other side.

My analysis of clients is why I chose not to implement it this way and
isntead introduced the mbox_ring_doorbell(), keeping it as similar to
the existing NULL behavior as possible.

-Doug

Re: [PATCH] RFC: mailbox: Fix NULL message support in mbox_send_message()

Posted by Jassi Brar 4 weeks ago

On Tue, Mar 10, 2026 at 6:52 PM Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Tue, Mar 10, 2026 at 4:46 PM <jassisinghbrar@gmail.com> wrote:
> >
> > From: Jassi Brar <jassisinghbrar@gmail.com>
> >
> > The active_req field serves double duty as both the "is a TX in
> > flight" flag (NULL means idle) and the storage for the in-flight
> > message pointer. When a client sends NULL via mbox_send_message(),
> > active_req is set to NULL, which the framework misinterprets as
> > "no active request." This breaks the TX state machine by:
> >
> >  - tx_tick() short-circuits on (!mssg), skipping the tx_done
> >    callback and the tx_complete completion
> >  - txdone_hrtimer() skips the channel entirely since active_req
> >    is NULL, so poll-based TX-done detection never fires.
> >
> > Fix this by introducing a MBOX_NO_MSG sentinel value that means
> > "no active request," freeing NULL to be valid message data. The
> > sentinel is internal to the mailbox core and is never exposed to
> > controller drivers or clients.
> >
> > The only tradeoff is that 'MBOX_NO_MSG' can not be used as a message
> > by clients.
> >
> > Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>
> > ---
> >  drivers/mailbox/mailbox.c | 15 +++++++++------
> >  1 file changed, 9 insertions(+), 6 deletions(-)
>
> While I can certainly be corrected, I suspect this patch will break a
> lot of users.
>
> My analysis of people that were passing NULL mbox messages is that
> they _wanted_ the current behavior. They didn't want NULL messages to
> be queued up as would happen with this patch. Instead, they just
> wanted to immediately ring the doorbell again to signal an interrupt
> to the other side.
>
They won't be queued up because they call mbox_client_txdone()
immediately after mbox_send_message()
From my analysis of the 14 clients, there was just one
(irq-qcom-mpm.c) that was not doing it by oversight, and I submitted a
patch.

Can you please point me to the driver you are concerned about? Perhaps
I am overlooking something.

Regards,
Jassi

Re: [PATCH] RFC: mailbox: Fix NULL message support in mbox_send_message()

Posted by Doug Anderson 4 weeks ago

Hi,

On Tue, Mar 10, 2026 at 4:59 PM Jassi Brar <jassisinghbrar@gmail.com> wrote:
>
> On Tue, Mar 10, 2026 at 6:52 PM Doug Anderson <dianders@chromium.org> wrote:
> >
> > Hi,
> >
> > On Tue, Mar 10, 2026 at 4:46 PM <jassisinghbrar@gmail.com> wrote:
> > >
> > > From: Jassi Brar <jassisinghbrar@gmail.com>
> > >
> > > The active_req field serves double duty as both the "is a TX in
> > > flight" flag (NULL means idle) and the storage for the in-flight
> > > message pointer. When a client sends NULL via mbox_send_message(),
> > > active_req is set to NULL, which the framework misinterprets as
> > > "no active request." This breaks the TX state machine by:
> > >
> > >  - tx_tick() short-circuits on (!mssg), skipping the tx_done
> > >    callback and the tx_complete completion
> > >  - txdone_hrtimer() skips the channel entirely since active_req
> > >    is NULL, so poll-based TX-done detection never fires.
> > >
> > > Fix this by introducing a MBOX_NO_MSG sentinel value that means
> > > "no active request," freeing NULL to be valid message data. The
> > > sentinel is internal to the mailbox core and is never exposed to
> > > controller drivers or clients.
> > >
> > > The only tradeoff is that 'MBOX_NO_MSG' can not be used as a message
> > > by clients.
> > >
> > > Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>
> > > ---
> > >  drivers/mailbox/mailbox.c | 15 +++++++++------
> > >  1 file changed, 9 insertions(+), 6 deletions(-)
> >
> > While I can certainly be corrected, I suspect this patch will break a
> > lot of users.
> >
> > My analysis of people that were passing NULL mbox messages is that
> > they _wanted_ the current behavior. They didn't want NULL messages to
> > be queued up as would happen with this patch. Instead, they just
> > wanted to immediately ring the doorbell again to signal an interrupt
> > to the other side.
> >
> They won't be queued up because they call mbox_client_txdone()
> immediately after mbox_send_message()
> From my analysis of the 14 clients, there was just one
> (irq-qcom-mpm.c) that was not doing it by oversight, and I submitted a
> patch.
>
> Can you please point me to the driver you are concerned about? Perhaps
> I am overlooking something.

Ah, interesting! I hadn't thought about it that way...

I think there are at least a few others that would need to change,
maybe? It looks like these ones:

drivers/soc/xilinx/zynqmp_power.c
drivers/remoteproc/xlnx_r5_remoteproc.c
drivers/firmware/imx/imx-dsp.c

In my case, I have a mailbox driver that's currently downstream
(though I hope to change that). My mailbox controller has an interrupt
for txdone, so mailbox clients _shouldn't_ call mbox_client_txdone().
Some clients of this mailbox client want the txdone interrupt, but
some clients of it just care about sending doorbells.

This driver would be able to send "txdone" for doorbells (or it could
have a "doorbell done"), but no clients that send doorbells actually
care about this signal. That being said, I believe clients do care
about doorbells not being queued.

-Doug

Re: [PATCH] RFC: mailbox: Fix NULL message support in mbox_send_message()

Posted by Jassi Brar 4 weeks ago

On Tue, Mar 10, 2026 at 7:15 PM Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Tue, Mar 10, 2026 at 4:59 PM Jassi Brar <jassisinghbrar@gmail.com> wrote:
> >
> > On Tue, Mar 10, 2026 at 6:52 PM Doug Anderson <dianders@chromium.org> wrote:
> > >
> > > Hi,
> > >
> > > On Tue, Mar 10, 2026 at 4:46 PM <jassisinghbrar@gmail.com> wrote:
> > > >
> > > > From: Jassi Brar <jassisinghbrar@gmail.com>
> > > >
> > > > The active_req field serves double duty as both the "is a TX in
> > > > flight" flag (NULL means idle) and the storage for the in-flight
> > > > message pointer. When a client sends NULL via mbox_send_message(),
> > > > active_req is set to NULL, which the framework misinterprets as
> > > > "no active request." This breaks the TX state machine by:
> > > >
> > > >  - tx_tick() short-circuits on (!mssg), skipping the tx_done
> > > >    callback and the tx_complete completion
> > > >  - txdone_hrtimer() skips the channel entirely since active_req
> > > >    is NULL, so poll-based TX-done detection never fires.
> > > >
> > > > Fix this by introducing a MBOX_NO_MSG sentinel value that means
> > > > "no active request," freeing NULL to be valid message data. The
> > > > sentinel is internal to the mailbox core and is never exposed to
> > > > controller drivers or clients.
> > > >
> > > > The only tradeoff is that 'MBOX_NO_MSG' can not be used as a message
> > > > by clients.
> > > >
> > > > Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>
> > > > ---
> > > >  drivers/mailbox/mailbox.c | 15 +++++++++------
> > > >  1 file changed, 9 insertions(+), 6 deletions(-)
> > >
> > > While I can certainly be corrected, I suspect this patch will break a
> > > lot of users.
> > >
> > > My analysis of people that were passing NULL mbox messages is that
> > > they _wanted_ the current behavior. They didn't want NULL messages to
> > > be queued up as would happen with this patch. Instead, they just
> > > wanted to immediately ring the doorbell again to signal an interrupt
> > > to the other side.
> > >
> > They won't be queued up because they call mbox_client_txdone()
> > immediately after mbox_send_message()
> > From my analysis of the 14 clients, there was just one
> > (irq-qcom-mpm.c) that was not doing it by oversight, and I submitted a
> > patch.
> >
> > Can you please point me to the driver you are concerned about? Perhaps
> > I am overlooking something.
>
> Ah, interesting! I hadn't thought about it that way...
>
> I think there are at least a few others that would need to change,
> maybe? It looks like these ones:
>
> drivers/soc/xilinx/zynqmp_power.c
> drivers/remoteproc/xlnx_r5_remoteproc.c
The underlying driver for both is drivers/mailbox/zynqmp-ipi-mailbox.c
which sets
txdone_poll = true and implements the .last_tx_done() callback.
The client (zynqmp_power.c) should not call mbox_client_txdone()

> drivers/firmware/imx/imx-dsp.c
The underlying driver is drivers/mailbox/imx-mailbox.c which sets
'txdone_irq = true'
and ticks the state machine with mbox_chan_txdone()
Again the client need not call mbox_client_txdone()

All three clients above are currently failed by the core which
misinterprets NULL messages.
This patch will fix such clients too, besides avoiding a new api.

>
> In my case, I have a mailbox driver that's currently downstream
> (though I hope to change that). My mailbox controller has an interrupt
> for txdone, so mailbox clients _shouldn't_ call mbox_client_txdone().
> Some clients of this mailbox client want the txdone interrupt, but
> some clients of it just care about sending doorbells.
>
So imx-dsp.c like? Please let me know what is lacking in the core to
fully support that. Happy to look into it.

Regards,
Jassi

Re: [PATCH] RFC: mailbox: Fix NULL message support in mbox_send_message()

Posted by Doug Anderson 3 weeks, 6 days ago

Hi,

On Tue, Mar 10, 2026 at 5:46 PM Jassi Brar <jassisinghbrar@gmail.com> wrote:
>
> On Tue, Mar 10, 2026 at 7:15 PM Doug Anderson <dianders@chromium.org> wrote:
> >
> > Hi,
> >
> > On Tue, Mar 10, 2026 at 4:59 PM Jassi Brar <jassisinghbrar@gmail.com> wrote:
> > >
> > > On Tue, Mar 10, 2026 at 6:52 PM Doug Anderson <dianders@chromium.org> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Tue, Mar 10, 2026 at 4:46 PM <jassisinghbrar@gmail.com> wrote:
> > > > >
> > > > > From: Jassi Brar <jassisinghbrar@gmail.com>
> > > > >
> > > > > The active_req field serves double duty as both the "is a TX in
> > > > > flight" flag (NULL means idle) and the storage for the in-flight
> > > > > message pointer. When a client sends NULL via mbox_send_message(),
> > > > > active_req is set to NULL, which the framework misinterprets as
> > > > > "no active request." This breaks the TX state machine by:
> > > > >
> > > > >  - tx_tick() short-circuits on (!mssg), skipping the tx_done
> > > > >    callback and the tx_complete completion
> > > > >  - txdone_hrtimer() skips the channel entirely since active_req
> > > > >    is NULL, so poll-based TX-done detection never fires.
> > > > >
> > > > > Fix this by introducing a MBOX_NO_MSG sentinel value that means
> > > > > "no active request," freeing NULL to be valid message data. The
> > > > > sentinel is internal to the mailbox core and is never exposed to
> > > > > controller drivers or clients.
> > > > >
> > > > > The only tradeoff is that 'MBOX_NO_MSG' can not be used as a message
> > > > > by clients.
> > > > >
> > > > > Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>
> > > > > ---
> > > > >  drivers/mailbox/mailbox.c | 15 +++++++++------
> > > > >  1 file changed, 9 insertions(+), 6 deletions(-)
> > > >
> > > > While I can certainly be corrected, I suspect this patch will break a
> > > > lot of users.
> > > >
> > > > My analysis of people that were passing NULL mbox messages is that
> > > > they _wanted_ the current behavior. They didn't want NULL messages to
> > > > be queued up as would happen with this patch. Instead, they just
> > > > wanted to immediately ring the doorbell again to signal an interrupt
> > > > to the other side.
> > > >
> > > They won't be queued up because they call mbox_client_txdone()
> > > immediately after mbox_send_message()
> > > From my analysis of the 14 clients, there was just one
> > > (irq-qcom-mpm.c) that was not doing it by oversight, and I submitted a
> > > patch.
> > >
> > > Can you please point me to the driver you are concerned about? Perhaps
> > > I am overlooking something.
> >
> > Ah, interesting! I hadn't thought about it that way...
> >
> > I think there are at least a few others that would need to change,
> > maybe? It looks like these ones:
> >
> > drivers/soc/xilinx/zynqmp_power.c
> > drivers/remoteproc/xlnx_r5_remoteproc.c
> The underlying driver for both is drivers/mailbox/zynqmp-ipi-mailbox.c
> which sets
> txdone_poll = true and implements the .last_tx_done() callback.
> The client (zynqmp_power.c) should not call mbox_client_txdone()
>
> > drivers/firmware/imx/imx-dsp.c
> The underlying driver is drivers/mailbox/imx-mailbox.c which sets
> 'txdone_irq = true'
> and ticks the state machine with mbox_chan_txdone()
> Again the client need not call mbox_client_txdone()
>
> All three clients above are currently failed by the core which
> misinterprets NULL messages.
> This patch will fix such clients too, besides avoiding a new api.

...but doesn't that mean that the behavior of these clients will
change? Previously all calls to mbox_send_message() immediately called
through to the mailbox controller. Now, if the previous message hasn't
finished, the "NULL" messages will queue up.

For instance, let's say that our client calls mbox_send_message() 2
times in quick succession (10 us apart). Let's say that the remote
processor takes 1 ms to react.

Previously, the 2 calls to mbox_send_message() would probably be
coalesced. Each would call through to the mailbox controller, which
would make sure the doorbell was asserted. After 1 ms, the remote
processor would confirm the single doorbell it saw.

Now, the first call will ring the doorbell. The second one will queue
up. After 1 ms, the remote processor will confirm the doorbell, which
will cause the second doorbell to be sent.


Looking at the 3 drivers in question, I _guess_ maybe that situation
never comes up for them? So maybe they're all fine? I don't have tons
of confidence that I understand enough about these clients to say for
sure that this doesn't happen...


> > In my case, I have a mailbox driver that's currently downstream
> > (though I hope to change that). My mailbox controller has an interrupt
> > for txdone, so mailbox clients _shouldn't_ call mbox_client_txdone().
> > Some clients of this mailbox client want the txdone interrupt, but
> > some clients of it just care about sending doorbells.
> >
> So imx-dsp.c like? Please let me know what is lacking in the core to
> fully support that. Happy to look into it.

I know that with my downstream mailbox client, if I let NULL messages
queue up I end up with a queue of a dozen or so NULL messages. The
downstream client is really taking advantage (AKA abusing) the core's
current NULL behavior. It truly does want the "mbox_ring_doorbell"
concept of just making sure the doorbell is asserted and returning
immediately. If the core changes to start queuing NULL messages, we'll
have to figure out some sort of workaround...

-Doug

Re: [PATCH] RFC: mailbox: Fix NULL message support in

Posted by Joonwon Kang 3 weeks, 5 days ago

> Hi,
> 
> On Tue, Mar 10, 2026 at 5:46 PM Jassi Brar <jassisinghbrar@gmail.com> wrote:
> >
> > On Tue, Mar 10, 2026 at 7:15 PM Doug Anderson <dianders@chromium.org> wrote:
> > >
> > > Hi,
> > >
> > > On Tue, Mar 10, 2026 at 4:59 PM Jassi Brar <jassisinghbrar@gmail.com> wrote:
> > > >
> > > > On Tue, Mar 10, 2026 at 6:52 PM Doug Anderson <dianders@chromium.org> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > On Tue, Mar 10, 2026 at 4:46 PM <jassisinghbrar@gmail.com> wrote:
> > > > > >
> > > > > > From: Jassi Brar <jassisinghbrar@gmail.com>
> > > > > >
> > > > > > The active_req field serves double duty as both the "is a TX in
> > > > > > flight" flag (NULL means idle) and the storage for the in-flight
> > > > > > message pointer. When a client sends NULL via mbox_send_message(),
> > > > > > active_req is set to NULL, which the framework misinterprets as
> > > > > > "no active request." This breaks the TX state machine by:
> > > > > >
> > > > > >  - tx_tick() short-circuits on (!mssg), skipping the tx_done
> > > > > >    callback and the tx_complete completion
> > > > > >  - txdone_hrtimer() skips the channel entirely since active_req
> > > > > >    is NULL, so poll-based TX-done detection never fires.
> > > > > >
> > > > > > Fix this by introducing a MBOX_NO_MSG sentinel value that means
> > > > > > "no active request," freeing NULL to be valid message data. The
> > > > > > sentinel is internal to the mailbox core and is never exposed to
> > > > > > controller drivers or clients.
> > > > > >
> > > > > > The only tradeoff is that 'MBOX_NO_MSG' can not be used as a message
> > > > > > by clients.
> > > > > >
> > > > > > Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>
> > > > > > ---
> > > > > >  drivers/mailbox/mailbox.c | 15 +++++++++------
> > > > > >  1 file changed, 9 insertions(+), 6 deletions(-)
> > > > >
> > > > > While I can certainly be corrected, I suspect this patch will break a
> > > > > lot of users.
> > > > >
> > > > > My analysis of people that were passing NULL mbox messages is that
> > > > > they _wanted_ the current behavior. They didn't want NULL messages to
> > > > > be queued up as would happen with this patch. Instead, they just
> > > > > wanted to immediately ring the doorbell again to signal an interrupt
> > > > > to the other side.
> > > > >
> > > > They won't be queued up because they call mbox_client_txdone()
> > > > immediately after mbox_send_message()
> > > > From my analysis of the 14 clients, there was just one
> > > > (irq-qcom-mpm.c) that was not doing it by oversight, and I submitted a
> > > > patch.
> > > >
> > > > Can you please point me to the driver you are concerned about? Perhaps
> > > > I am overlooking something.
> > >
> > > Ah, interesting! I hadn't thought about it that way...
> > >
> > > I think there are at least a few others that would need to change,
> > > maybe? It looks like these ones:
> > >
> > > drivers/soc/xilinx/zynqmp_power.c
> > > drivers/remoteproc/xlnx_r5_remoteproc.c
> > The underlying driver for both is drivers/mailbox/zynqmp-ipi-mailbox.c
> > which sets
> > txdone_poll = true and implements the .last_tx_done() callback.
> > The client (zynqmp_power.c) should not call mbox_client_txdone()
> >
> > > drivers/firmware/imx/imx-dsp.c
> > The underlying driver is drivers/mailbox/imx-mailbox.c which sets
> > 'txdone_irq = true'
> > and ticks the state machine with mbox_chan_txdone()
> > Again the client need not call mbox_client_txdone()
> >
> > All three clients above are currently failed by the core which
> > misinterprets NULL messages.
> > This patch will fix such clients too, besides avoiding a new api.
> 
> ...but doesn't that mean that the behavior of these clients will
> change? Previously all calls to mbox_send_message() immediately called
> through to the mailbox controller. Now, if the previous message hasn't
> finished, the "NULL" messages will queue up.
> 
> For instance, let's say that our client calls mbox_send_message() 2
> times in quick succession (10 us apart). Let's say that the remote
> processor takes 1 ms to react.
> 
> Previously, the 2 calls to mbox_send_message() would probably be
> coalesced. Each would call through to the mailbox controller, which
> would make sure the doorbell was asserted. After 1 ms, the remote
> processor would confirm the single doorbell it saw.
> 
> Now, the first call will ring the doorbell. The second one will queue
> up. After 1 ms, the remote processor will confirm the doorbell, which
> will cause the second doorbell to be sent.
> 
> 
> Looking at the 3 drivers in question, I _guess_ maybe that situation
> never comes up for them? So maybe they're all fine? I don't have tons
> of confidence that I understand enough about these clients to say for
> sure that this doesn't happen...
> 

If multiple threads are ringing the doorbell through the same channel many
times within a very short period of time, I think the issue you are
concerned about could occur in theory with this patch even if the doorbell
ack could be ignored. If that is the case, I believe it will be clients to
handle the send failure, e.g. by retrying.

> 
> > > In my case, I have a mailbox driver that's currently downstream
> > > (though I hope to change that). My mailbox controller has an interrupt
> > > for txdone, so mailbox clients _shouldn't_ call mbox_client_txdone().
> > > Some clients of this mailbox client want the txdone interrupt, but
> > > some clients of it just care about sending doorbells.
> > >
> > So imx-dsp.c like? Please let me know what is lacking in the core to
> > fully support that. Happy to look into it.
> 
> I know that with my downstream mailbox client, if I let NULL messages
> queue up I end up with a queue of a dozen or so NULL messages. The
> downstream client is really taking advantage (AKA abusing) the core's
> current NULL behavior. It truly does want the "mbox_ring_doorbell"
> concept of just making sure the doorbell is asserted and returning
> immediately. If the core changes to start queuing NULL messages, we'll
> have to figure out some sort of workaround...

According to the Jassi's comment on https://lore.kernel.org/all/CABb+yY18OEvfc8DDUiZqVeQtkmwcOFCSTMT7KoXb1LVA3RuxdA@mail.gmail.com/
"A controller driver is supposed to either expect data or not, but not both",
client drivers should send to a controller either non-NULL or NULL data,
not both in a mixed fashion.

With this setup, if the doorbell acknowledgement from the remote could be
ignored by the controller for your case, I believe you could separate a
controller for doorbell from the one for non-doorbell and set the
controller as follows.
```
mbox_controller->txdone_poll = true;
mbox_controller->txpoll_period = 0;
mbox_controller->ops->last_tx_done = ... just returns true ...
```
Then it will dequeue the NULL messages as fast as the hrtimer's resolution
interval. Or, you could just use mbox_send_message() and
mbox_client_txdone() pair.

On the other hand, if the doorbell acknowledgement is to be respected by
the controller(this is my case), the controller should wait until the
ongoing doorbell tx is successfully done or not, and this patch could do
it whereas mbox_ring_doorbell() couldn't.

Re: [PATCH] RFC: mailbox: Fix NULL message support in mbox_send_message()

Posted by Jassi Brar 3 weeks, 6 days ago

On Tue, Mar 10, 2026 at 9:00 PM Doug Anderson <dianders@chromium.org> wrote:
>
> > > Ah, interesting! I hadn't thought about it that way...
> > >
> > > I think there are at least a few others that would need to change,
> > > maybe? It looks like these ones:
> > >
> > > drivers/soc/xilinx/zynqmp_power.c
> > > drivers/remoteproc/xlnx_r5_remoteproc.c
> > The underlying driver for both is drivers/mailbox/zynqmp-ipi-mailbox.c
> > which sets
> > txdone_poll = true and implements the .last_tx_done() callback.
> > The client (zynqmp_power.c) should not call mbox_client_txdone()
> >
> > > drivers/firmware/imx/imx-dsp.c
> > The underlying driver is drivers/mailbox/imx-mailbox.c which sets
> > 'txdone_irq = true'
> > and ticks the state machine with mbox_chan_txdone()
> > Again the client need not call mbox_client_txdone()
> >
> > All three clients above are currently failed by the core which
> > misinterprets NULL messages.
> > This patch will fix such clients too, besides avoiding a new api.
>
> ...but doesn't that mean that the behavior of these clients will
> change? Previously all calls to mbox_send_message() immediately called
> through to the mailbox controller. Now, if the previous message hasn't
> finished, the "NULL" messages will queue up.
>
I don't see any problem. As we have checked, all clients call
mbox_client_txdone() after each mbox_send_message() because the
underlying controller just sets the bit and the doorbell is rung. The
clients correctly drive the state-machine. So they are already doing
the right thing and should be fine.

> For instance, let's say that our client calls mbox_send_message() 2
> times in quick succession (10 us apart). Let's say that the remote
> processor takes 1 ms to react.
>
> Previously, the 2 calls to mbox_send_message() would probably be
> coalesced. Each would call through to the mailbox controller, which
> would make sure the doorbell was asserted. After 1 ms, the remote
> processor would confirm the single doorbell it saw.
>
> Now, the first call will ring the doorbell. The second one will queue
> up. After 1 ms, the remote processor will confirm the doorbell, which
> will cause the second doorbell to be sent.
>
Am I overlooking some code? I found no existing client relying on that
bug, luckily. Even if some did, it would be better to fix that client
instead of moving the subsystem around for it.

>
> Looking at the 3 drivers in question, I _guess_ maybe that situation
> never comes up for them? So maybe they're all fine? I don't have tons
> of confidence that I understand enough about these clients to say for
> sure that this doesn't happen...
>
Those 3 clients don't run the state machine (tick), the underlying
controllers do.


>
> > > In my case, I have a mailbox driver that's currently downstream
> > > (though I hope to change that). My mailbox controller has an interrupt
> > > for txdone, so mailbox clients _shouldn't_ call mbox_client_txdone().
> > > Some clients of this mailbox client want the txdone interrupt, but
> > > some clients of it just care about sending doorbells.
> > >
> > So imx-dsp.c like? Please let me know what is lacking in the core to
> > fully support that. Happy to look into it.
>
> I know that with my downstream mailbox client, if I let NULL messages
> queue up I end up with a queue of a dozen or so NULL messages. The
> downstream client is really taking advantage (AKA abusing) the core's
> current NULL behavior. It truly does want the "mbox_ring_doorbell"
> concept of just making sure the doorbell is asserted and returning
> immediately. If the core changes to start queuing NULL messages, we'll
> have to figure out some sort of workaround...
>
You said your controller has an irq raised for the doorbell sent (?).
If so, your clients should want to wait for that confirmation (the
controller driver would tick via mbox_chan_txdone).
If there is no irq involved and your controller can just set some bit
and the doorbell is guaranteed, then you just need to tie
mbox_send_message() & mbox_client_txdone() , which is the right way
and how other such clients work.
In any case, I will be happy to look at your draft code and make sure
your usecase is supported.

Regards,
Jassi

Re: [PATCH] RFC: mailbox: Fix NULL message support in mbox_send_message()

Posted by Doug Anderson 3 weeks, 5 days ago

Hi,

On Tue, Mar 10, 2026 at 8:42 PM Jassi Brar <jassisinghbrar@gmail.com> wrote:
>
> > > > In my case, I have a mailbox driver that's currently downstream
> > > > (though I hope to change that). My mailbox controller has an interrupt
> > > > for txdone, so mailbox clients _shouldn't_ call mbox_client_txdone().
> > > > Some clients of this mailbox client want the txdone interrupt, but
> > > > some clients of it just care about sending doorbells.
> > > >
> > > So imx-dsp.c like? Please let me know what is lacking in the core to
> > > fully support that. Happy to look into it.
> >
> > I know that with my downstream mailbox client, if I let NULL messages
> > queue up I end up with a queue of a dozen or so NULL messages. The
> > downstream client is really taking advantage (AKA abusing) the core's
> > current NULL behavior. It truly does want the "mbox_ring_doorbell"
> > concept of just making sure the doorbell is asserted and returning
> > immediately. If the core changes to start queuing NULL messages, we'll
> > have to figure out some sort of workaround...
> >
> You said your controller has an irq raised for the doorbell sent (?).

It has an interrupt for when the remote side receives the doorbell
(when it clears the IRQ on its side).

> If so, your clients should want to wait for that confirmation (the
> controller driver would tick via mbox_chan_txdone).

Ah, I think I see what you're saying. Functionally, I think it should
work. Pseudocode for the client:

def ring_doorbell():
  if not previous_doorbell_acked:
    another_doorbell_pending = true
  else:
    previous_doorbell_acked = false;
    mbox_send_message(NULL)

def tx_done():
  if another_doorbell_pending:
    another_doorbell_pending = false
    mbox_send_message(NULL)
  else:
    previous_doorbell_acked = true;

I think that will ensure the other side always gets at least one
future interrupt when the doorbell is rung.

I guess another option would be for the mailbox controller to issue
"tx_done" immediately for NULL messages. With no data to transmit, I
guess one could say that as soon as the interrupt is raised that the
data is "transmitted", so maybe this would be OK too?

-Doug

Re: [PATCH] RFC: mailbox: Fix NULL message support in mbox_send_message()

Posted by Jassi Brar 3 weeks ago

On Thu, Mar 12, 2026 at 3:59 PM Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Tue, Mar 10, 2026 at 8:42 PM Jassi Brar <jassisinghbrar@gmail.com> wrote:
> >
> > > > > In my case, I have a mailbox driver that's currently downstream
> > > > > (though I hope to change that). My mailbox controller has an interrupt
> > > > > for txdone, so mailbox clients _shouldn't_ call mbox_client_txdone().
> > > > > Some clients of this mailbox client want the txdone interrupt, but
> > > > > some clients of it just care about sending doorbells.
> > > > >
> > > > So imx-dsp.c like? Please let me know what is lacking in the core to
> > > > fully support that. Happy to look into it.
> > >
> > > I know that with my downstream mailbox client, if I let NULL messages
> > > queue up I end up with a queue of a dozen or so NULL messages. The
> > > downstream client is really taking advantage (AKA abusing) the core's
> > > current NULL behavior. It truly does want the "mbox_ring_doorbell"
> > > concept of just making sure the doorbell is asserted and returning
> > > immediately. If the core changes to start queuing NULL messages, we'll
> > > have to figure out some sort of workaround...
> > >
> > You said your controller has an irq raised for the doorbell sent (?).
>
> It has an interrupt for when the remote side receives the doorbell
> (when it clears the IRQ on its side).
>
For one message we may not care if the irq gets cleared on the remote
side, but for many messages the client should want to ensure the next
message is sent only after the remote has acked the last sent doorbell
(?)
Otherwise, depending upon the h/w and f/w implementation on the remote
side, a doorbell may be 'missed'. But you know your setup better, take
these as just notes.


> > If so, your clients should want to wait for that confirmation (the
> > controller driver would tick via mbox_chan_txdone).
>
> Ah, I think I see what you're saying. Functionally, I think it should
> work. Pseudocode for the client:
>
> def ring_doorbell():
>   if not previous_doorbell_acked:
>     another_doorbell_pending = true
>   else:
>     previous_doorbell_acked = false;
>     mbox_send_message(NULL)
>
> def tx_done():
>   if another_doorbell_pending:
>     another_doorbell_pending = false
>     mbox_send_message(NULL)
>   else:
>     previous_doorbell_acked = true;
>
If you get interrupt from remote, and want to respect that, then your
controller will set txdone_irq=true and call mbox_chan_txdone() Client
shouldn't call mbox_client_txdone()

> I think that will ensure the other side always gets at least one
> future interrupt when the doorbell is rung.
>
> I guess another option would be for the mailbox controller to issue
> "tx_done" immediately for NULL messages. With no data to transmit, I
> guess one could say that as soon as the interrupt is raised that the
> data is "transmitted", so maybe this would be OK too?
>
That is ok, but as I said depending on how the remote h/w and f/w is
set up, multiple back to back doorbells may 'overwrite' some. Not sure
if that will be a problem for your clients.

Cheers!
Jassi

Re: [PATCH] RFC: mailbox: Fix NULL message support in mbox_send_message()

Posted by Joonwon Kang 3 weeks, 4 days ago

> Hi,
> 
> On Tue, Mar 10, 2026 at 8:42 PM Jassi Brar <jassisinghbrar@gmail.com> wrote:
> >
> > > > > In my case, I have a mailbox driver that's currently downstream
> > > > > (though I hope to change that). My mailbox controller has an interrupt
> > > > > for txdone, so mailbox clients _shouldn't_ call mbox_client_txdone().
> > > > > Some clients of this mailbox client want the txdone interrupt, but
> > > > > some clients of it just care about sending doorbells.
> > > > >
> > > > So imx-dsp.c like? Please let me know what is lacking in the core to
> > > > fully support that. Happy to look into it.
> > >
> > > I know that with my downstream mailbox client, if I let NULL messages
> > > queue up I end up with a queue of a dozen or so NULL messages. The
> > > downstream client is really taking advantage (AKA abusing) the core's
> > > current NULL behavior. It truly does want the "mbox_ring_doorbell"
> > > concept of just making sure the doorbell is asserted and returning
> > > immediately. If the core changes to start queuing NULL messages, we'll
> > > have to figure out some sort of workaround...
> > >
> > You said your controller has an irq raised for the doorbell sent (?).
> 
> It has an interrupt for when the remote side receives the doorbell
> (when it clears the IRQ on its side).
> 
> > If so, your clients should want to wait for that confirmation (the
> > controller driver would tick via mbox_chan_txdone).
> 
> Ah, I think I see what you're saying. Functionally, I think it should
> work. Pseudocode for the client:
> 
> def ring_doorbell():
>   if not previous_doorbell_acked:
>     another_doorbell_pending = true
>   else:
>     previous_doorbell_acked = false;
>     mbox_send_message(NULL)
> 
> def tx_done():
>   if another_doorbell_pending:
>     another_doorbell_pending = false
>     mbox_send_message(NULL)
>   else:
>     previous_doorbell_acked = true;
> 
> I think that will ensure the other side always gets at least one
> future interrupt when the doorbell is rung.
> 
> I guess another option would be for the mailbox controller to issue
> "tx_done" immediately for NULL messages. With no data to transmit, I
> guess one could say that as soon as the interrupt is raised that the
> data is "transmitted", so maybe this would be OK too?

One thing to note is that if the mailbox controller issues "tx_done"
immediately in the same thread that is executing
mbox_controller->send_data(), you will encounter deadlock(refer to
chan->lock on the code). Thus, it should be issued after ->send_data() or
in another thread. Either way, it is unavoidable to have window to queue
up the NULL messages in the multi-threads situation anyway. So, the key in
that case will be how quickly the queue is dequeued or how to recover from
the tx failure due to the overflow, I think.

Re: [PATCH] RFC: mailbox: Fix NULL message support in mbox_send_message()

Posted by Doug Anderson 3 weeks, 4 days ago

Hi,

On Fri, Mar 13, 2026 at 1:44 AM Joonwon Kang <joonwonkang@google.com> wrote:
>
> > Hi,
> >
> > On Tue, Mar 10, 2026 at 8:42 PM Jassi Brar <jassisinghbrar@gmail.com> wrote:
> > >
> > > > > > In my case, I have a mailbox driver that's currently downstream
> > > > > > (though I hope to change that). My mailbox controller has an interrupt
> > > > > > for txdone, so mailbox clients _shouldn't_ call mbox_client_txdone().
> > > > > > Some clients of this mailbox client want the txdone interrupt, but
> > > > > > some clients of it just care about sending doorbells.
> > > > > >
> > > > > So imx-dsp.c like? Please let me know what is lacking in the core to
> > > > > fully support that. Happy to look into it.
> > > >
> > > > I know that with my downstream mailbox client, if I let NULL messages
> > > > queue up I end up with a queue of a dozen or so NULL messages. The
> > > > downstream client is really taking advantage (AKA abusing) the core's
> > > > current NULL behavior. It truly does want the "mbox_ring_doorbell"
> > > > concept of just making sure the doorbell is asserted and returning
> > > > immediately. If the core changes to start queuing NULL messages, we'll
> > > > have to figure out some sort of workaround...
> > > >
> > > You said your controller has an irq raised for the doorbell sent (?).
> >
> > It has an interrupt for when the remote side receives the doorbell
> > (when it clears the IRQ on its side).
> >
> > > If so, your clients should want to wait for that confirmation (the
> > > controller driver would tick via mbox_chan_txdone).
> >
> > Ah, I think I see what you're saying. Functionally, I think it should
> > work. Pseudocode for the client:
> >
> > def ring_doorbell():
> >   if not previous_doorbell_acked:
> >     another_doorbell_pending = true
> >   else:
> >     previous_doorbell_acked = false;
> >     mbox_send_message(NULL)
> >
> > def tx_done():
> >   if another_doorbell_pending:
> >     another_doorbell_pending = false
> >     mbox_send_message(NULL)
> >   else:
> >     previous_doorbell_acked = true;
> >
> > I think that will ensure the other side always gets at least one
> > future interrupt when the doorbell is rung.
> >
> > I guess another option would be for the mailbox controller to issue
> > "tx_done" immediately for NULL messages. With no data to transmit, I
> > guess one could say that as soon as the interrupt is raised that the
> > data is "transmitted", so maybe this would be OK too?
>
> One thing to note is that if the mailbox controller issues "tx_done"
> immediately in the same thread that is executing
> mbox_controller->send_data(), you will encounter deadlock(refer to
> chan->lock on the code). Thus, it should be issued after ->send_data() or
> in another thread. Either way, it is unavoidable to have window to queue
> up the NULL messages in the multi-threads situation anyway. So, the key in
> that case will be how quickly the queue is dequeued or how to recover from
> the tx failure due to the overflow, I think.

Ah, right. I guess maybe the answer will just be the one expressed by
my pseudocode above. It will have some slightly different timings, but
I think it should still be fine. :-)

-Doug