drivers/mailbox/mailbox.c | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-)
From: Jassi Brar <jassisinghbrar@gmail.com>
The active_req field serves double duty as both the "is a TX in
flight" flag (NULL means idle) and the storage for the in-flight
message pointer. When a client sends NULL via mbox_send_message(),
active_req is set to NULL, which the framework misinterprets as
"no active request." This breaks the TX state machine by:
- tx_tick() short-circuits on (!mssg), skipping the tx_done
callback and the tx_complete completion
- txdone_hrtimer() skips the channel entirely since active_req
is NULL, so poll-based TX-done detection never fires.
Fix this by introducing a MBOX_NO_MSG sentinel value that means
"no active request," freeing NULL to be valid message data. The
sentinel is internal to the mailbox core and is never exposed to
controller drivers or clients.
The only tradeoff is that 'MBOX_NO_MSG' can not be used as a message
by clients.
Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com>
---
drivers/mailbox/mailbox.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
index 617ba505691d..d06f6d49deaf 100644
--- a/drivers/mailbox/mailbox.c
+++ b/drivers/mailbox/mailbox.c
@@ -20,6 +20,9 @@
#include "mailbox.h"
+/* Sentinel value distinguishing "no active request" from "NULL message data" */
+#define MBOX_NO_MSG ((void *)-1)
+
static LIST_HEAD(mbox_cons);
static DEFINE_MUTEX(con_mutex);
@@ -52,7 +55,7 @@ static void msg_submit(struct mbox_chan *chan)
int err = -EBUSY;
scoped_guard(spinlock_irqsave, &chan->lock) {
- if (!chan->msg_count || chan->active_req)
+ if (!chan->msg_count || chan->active_req != MBOX_NO_MSG)
break;
count = chan->msg_count;
@@ -87,13 +90,13 @@ static void tx_tick(struct mbox_chan *chan, int r)
scoped_guard(spinlock_irqsave, &chan->lock) {
mssg = chan->active_req;
- chan->active_req = NULL;
+ chan->active_req = MBOX_NO_MSG;
}
/* Submit next message */
msg_submit(chan);
- if (!mssg)
+ if (mssg == MBOX_NO_MSG)
return;
/* Notify the client */
@@ -114,7 +117,7 @@ static enum hrtimer_restart txdone_hrtimer(struct hrtimer *hrtimer)
for (i = 0; i < mbox->num_chans; i++) {
struct mbox_chan *chan = &mbox->chans[i];
- if (chan->active_req && chan->cl) {
+ if (chan->active_req != MBOX_NO_MSG && chan->cl) {
txdone = chan->mbox->ops->last_tx_done(chan);
if (txdone)
tx_tick(chan, 0);
@@ -319,7 +322,7 @@ static int __mbox_bind_client(struct mbox_chan *chan, struct mbox_client *cl)
scoped_guard(spinlock_irqsave, &chan->lock) {
chan->msg_free = 0;
chan->msg_count = 0;
- chan->active_req = NULL;
+ chan->active_req = MBOX_NO_MSG;
chan->cl = cl;
init_completion(&chan->tx_complete);
@@ -477,7 +480,7 @@ void mbox_free_channel(struct mbox_chan *chan)
/* The queued TX requests are simply aborted, no callbacks are made */
scoped_guard(spinlock_irqsave, &chan->lock) {
chan->cl = NULL;
- chan->active_req = NULL;
+ chan->active_req = MBOX_NO_MSG;
if (chan->txdone_method == TXDONE_BY_ACK)
chan->txdone_method = TXDONE_BY_POLL;
}
--
2.43.0
> The active_req field serves double duty as both the "is a TX in > flight" flag (NULL means idle) and the storage for the in-flight > message pointer. When a client sends NULL via mbox_send_message(), > active_req is set to NULL, which the framework misinterprets as > "no active request." This breaks the TX state machine by: > > - tx_tick() short-circuits on (!mssg), skipping the tx_done > callback and the tx_complete completion > - txdone_hrtimer() skips the channel entirely since active_req > is NULL, so poll-based TX-done detection never fires. > > Fix this by introducing a MBOX_NO_MSG sentinel value that means > "no active request," freeing NULL to be valid message data. The > sentinel is internal to the mailbox core and is never exposed to > controller drivers or clients. The following drivers are currently using ->active_req which now could be assigned MBOX_NO_MSG. - drivers/mailbox/tegra-hsp.c - drivers/mailbox/mtk-vcp-mailbox.c One of them is using ->active_req to wait until the channel is empty. In this case, strictly speaking, that controller driver should be aware of the sentinel value MBOX_NO_MSG, which means the sentinel value should be exposed to the controller. Or, if a future controller driver to come is to use ->active_req for the same purpose for doorbell or non-doorbell, it should also be aware of the sentinel value anyway. However, I believe that it is not intuitive to the controller developers that a pointer value could be other value than a real memory address, NULL or error encoded value, which is -1(== MBOX_NO_MSG). For this reason, I think it will be better to change the type of ->active_req to give a better indication to the controller developers, e.g. to integer as in the original patch https://lore.kernel.org/all/20251126045926.2413532-1-joonwonkang@google.com/. Or, we could change those drivers not to use ->active_req, hide ->active_req entirely in the mailbox core and keep this patch. Thanks. > > The only tradeoff is that 'MBOX_NO_MSG' can not be used as a message > by clients.
On Fri, Mar 13, 2026 at 5:12 AM Joonwon Kang <joonwonkang@google.com> wrote: > > > The active_req field serves double duty as both the "is a TX in > > flight" flag (NULL means idle) and the storage for the in-flight > > message pointer. When a client sends NULL via mbox_send_message(), > > active_req is set to NULL, which the framework misinterprets as > > "no active request." This breaks the TX state machine by: > > > > - tx_tick() short-circuits on (!mssg), skipping the tx_done > > callback and the tx_complete completion > > - txdone_hrtimer() skips the channel entirely since active_req > > is NULL, so poll-based TX-done detection never fires. > > > > Fix this by introducing a MBOX_NO_MSG sentinel value that means > > "no active request," freeing NULL to be valid message data. The > > sentinel is internal to the mailbox core and is never exposed to > > controller drivers or clients. > > The following drivers are currently using ->active_req which now could be > assigned MBOX_NO_MSG. > - drivers/mailbox/tegra-hsp.c > - drivers/mailbox/mtk-vcp-mailbox.c > Good catch, Thanks. mtk-vcp-mailbox.c is fine as is - it assumes only valid requests. This patch will not introduce any change for it. Yes, tegra-hsp.c should now track MBOX_NO_MSG instead of NULL. Single line change. > One of them is using ->active_req to wait until the channel is empty. In > this case, strictly speaking, that controller driver should be aware of > the sentinel value MBOX_NO_MSG, which means the sentinel value should be > exposed to the controller. Or, if a future controller driver to come is to > use ->active_req for the same purpose for doorbell or non-doorbell, it > should also be aware of the sentinel value anyway. > > However, I believe that it is not intuitive to the controller developers > that a pointer value could be other value than a real memory address, > NULL or error encoded value, which is -1(== MBOX_NO_MSG). For this reason, > I think it will be better to change the type of ->active_req to give a > better indication to the controller developers, e.g. to integer as in the > original patch > https://lore.kernel.org/all/20251126045926.2413532-1-joonwonkang@google.com/. > > Or, we could change those drivers not to use ->active_req, hide > ->active_req entirely in the mailbox core and keep this patch. > Yes, ideally controller drivers should not track active_req. But this patch will cause least churn it seems. Thanks Jassi
> On Fri, Mar 13, 2026 at 5:12 AM Joonwon Kang <joonwonkang@google.com> wrote: > > > > > The active_req field serves double duty as both the "is a TX in > > > flight" flag (NULL means idle) and the storage for the in-flight > > > message pointer. When a client sends NULL via mbox_send_message(), > > > active_req is set to NULL, which the framework misinterprets as > > > "no active request." This breaks the TX state machine by: > > > > > > - tx_tick() short-circuits on (!mssg), skipping the tx_done > > > callback and the tx_complete completion > > > - txdone_hrtimer() skips the channel entirely since active_req > > > is NULL, so poll-based TX-done detection never fires. > > > > > > Fix this by introducing a MBOX_NO_MSG sentinel value that means > > > "no active request," freeing NULL to be valid message data. The > > > sentinel is internal to the mailbox core and is never exposed to > > > controller drivers or clients. > > > > The following drivers are currently using ->active_req which now could be > > assigned MBOX_NO_MSG. > > - drivers/mailbox/tegra-hsp.c > > - drivers/mailbox/mtk-vcp-mailbox.c > > > Good catch, Thanks. > mtk-vcp-mailbox.c is fine as is - it assumes only valid requests. This > patch will not introduce any change for it. > Yes, tegra-hsp.c should now track MBOX_NO_MSG instead of NULL. Single > line change. > A more important aspect than single line change would be that you are creating a new API contract that the controller drivers who are to use ->active_req should now have new knowledge that the pointer value could be unconventional value such as -1. Without this additional knowledge, they may easily fail checking if the channel is empty. > > One of them is using ->active_req to wait until the channel is empty. In > > this case, strictly speaking, that controller driver should be aware of > > the sentinel value MBOX_NO_MSG, which means the sentinel value should be > > exposed to the controller. Or, if a future controller driver to come is to > > use ->active_req for the same purpose for doorbell or non-doorbell, it > > should also be aware of the sentinel value anyway. > > > > However, I believe that it is not intuitive to the controller developers > > that a pointer value could be other value than a real memory address, > > NULL or error encoded value, which is -1(== MBOX_NO_MSG). For this reason, > > I think it will be better to change the type of ->active_req to give a > > better indication to the controller developers, e.g. to integer as in the > > original patch > > https://lore.kernel.org/all/20251126045926.2413532-1-joonwonkang@google.com/. > > > > Or, we could change those drivers not to use ->active_req, hide > > ->active_req entirely in the mailbox core and keep this patch. > > > Yes, ideally controller drivers should not track active_req. But this > patch will cause least churn it seems. If we will take those two drivers as an exceptional and not recommended case that uses ->active_req directly, it will be fine not to change them except for tegra-hsp.c to check MBOX_NO_MSG. If that is the case and you are to keep this patch, please keep in mind to leave the original patch link in the commit message for better trackability of the discussion and solutions and for credit for the contribution of finding and analyzing this issue and proposing the first solution, which I believe is also important for future contributions to come. Thanks.
Hi, On Mon, Mar 16, 2026 at 10:03 PM Joonwon Kang <joonwonkang@google.com> wrote: > > > On Fri, Mar 13, 2026 at 5:12 AM Joonwon Kang <joonwonkang@google.com> wrote: > > > > > > > The active_req field serves double duty as both the "is a TX in > > > > flight" flag (NULL means idle) and the storage for the in-flight > > > > message pointer. When a client sends NULL via mbox_send_message(), > > > > active_req is set to NULL, which the framework misinterprets as > > > > "no active request." This breaks the TX state machine by: > > > > > > > > - tx_tick() short-circuits on (!mssg), skipping the tx_done > > > > callback and the tx_complete completion > > > > - txdone_hrtimer() skips the channel entirely since active_req > > > > is NULL, so poll-based TX-done detection never fires. > > > > > > > > Fix this by introducing a MBOX_NO_MSG sentinel value that means > > > > "no active request," freeing NULL to be valid message data. The > > > > sentinel is internal to the mailbox core and is never exposed to > > > > controller drivers or clients. > > > > > > The following drivers are currently using ->active_req which now could be > > > assigned MBOX_NO_MSG. > > > - drivers/mailbox/tegra-hsp.c > > > - drivers/mailbox/mtk-vcp-mailbox.c > > > > > Good catch, Thanks. > > mtk-vcp-mailbox.c is fine as is - it assumes only valid requests. This > > patch will not introduce any change for it. > > Yes, tegra-hsp.c should now track MBOX_NO_MSG instead of NULL. Single > > line change. > > > A more important aspect than single line change would be that you are > creating a new API contract that the controller drivers who are to use > ->active_req should now have new knowledge that the pointer value could > be unconventional value such as -1. Without this additional knowledge, > they may easily fail checking if the channel is empty. > > > > One of them is using ->active_req to wait until the channel is empty. In > > > this case, strictly speaking, that controller driver should be aware of > > > the sentinel value MBOX_NO_MSG, which means the sentinel value should be > > > exposed to the controller. Or, if a future controller driver to come is to > > > use ->active_req for the same purpose for doorbell or non-doorbell, it > > > should also be aware of the sentinel value anyway. > > > > > > However, I believe that it is not intuitive to the controller developers > > > that a pointer value could be other value than a real memory address, > > > NULL or error encoded value, which is -1(== MBOX_NO_MSG). For this reason, > > > I think it will be better to change the type of ->active_req to give a > > > better indication to the controller developers, e.g. to integer as in the > > > original patch > > > https://lore.kernel.org/all/20251126045926.2413532-1-joonwonkang@google.com/. > > > > > > Or, we could change those drivers not to use ->active_req, hide > > > ->active_req entirely in the mailbox core and keep this patch. > > > > > Yes, ideally controller drivers should not track active_req. But this > > patch will cause least churn it seems. > > If we will take those two drivers as an exceptional and not recommended > case that uses ->active_req directly, it will be fine not to change them > except for tegra-hsp.c to check MBOX_NO_MSG. > > If that is the case and you are to keep this patch, please keep in mind to > leave the original patch link in the commit message for better trackability > of the discussion and solutions and for credit for the contribution of > finding and analyzing this issue and proposing the first solution, which > I believe is also important for future contributions to come. So what's the next steps here, then? I don't think I adequately explained exactly the needs of my mbox client, but I also don't think it's a big deal. I'm convinced it should be fine even if NULL messages get queued. Jassi: are you going to send a new version of your patch? ...or are you expecting Joonwon to post a new version? ...or do you want me to post some variant of these patches? -Doug
On Fri, Mar 20, 2026 at 4:03 PM Doug Anderson <dianders@chromium.org> wrote: > > Hi, > > On Mon, Mar 16, 2026 at 10:03 PM Joonwon Kang <joonwonkang@google.com> wrote: > > > > > On Fri, Mar 13, 2026 at 5:12 AM Joonwon Kang <joonwonkang@google.com> wrote: > > > > > > > > > The active_req field serves double duty as both the "is a TX in > > > > > flight" flag (NULL means idle) and the storage for the in-flight > > > > > message pointer. When a client sends NULL via mbox_send_message(), > > > > > active_req is set to NULL, which the framework misinterprets as > > > > > "no active request." This breaks the TX state machine by: > > > > > > > > > > - tx_tick() short-circuits on (!mssg), skipping the tx_done > > > > > callback and the tx_complete completion > > > > > - txdone_hrtimer() skips the channel entirely since active_req > > > > > is NULL, so poll-based TX-done detection never fires. > > > > > > > > > > Fix this by introducing a MBOX_NO_MSG sentinel value that means > > > > > "no active request," freeing NULL to be valid message data. The > > > > > sentinel is internal to the mailbox core and is never exposed to > > > > > controller drivers or clients. > > > > > > > > The following drivers are currently using ->active_req which now could be > > > > assigned MBOX_NO_MSG. > > > > - drivers/mailbox/tegra-hsp.c > > > > - drivers/mailbox/mtk-vcp-mailbox.c > > > > > > > Good catch, Thanks. > > > mtk-vcp-mailbox.c is fine as is - it assumes only valid requests. This > > > patch will not introduce any change for it. > > > Yes, tegra-hsp.c should now track MBOX_NO_MSG instead of NULL. Single > > > line change. > > > > > A more important aspect than single line change would be that you are > > creating a new API contract that the controller drivers who are to use > > ->active_req should now have new knowledge that the pointer value could > > be unconventional value such as -1. Without this additional knowledge, > > they may easily fail checking if the channel is empty. > > > > > > One of them is using ->active_req to wait until the channel is empty. In > > > > this case, strictly speaking, that controller driver should be aware of > > > > the sentinel value MBOX_NO_MSG, which means the sentinel value should be > > > > exposed to the controller. Or, if a future controller driver to come is to > > > > use ->active_req for the same purpose for doorbell or non-doorbell, it > > > > should also be aware of the sentinel value anyway. > > > > > > > > However, I believe that it is not intuitive to the controller developers > > > > that a pointer value could be other value than a real memory address, > > > > NULL or error encoded value, which is -1(== MBOX_NO_MSG). For this reason, > > > > I think it will be better to change the type of ->active_req to give a > > > > better indication to the controller developers, e.g. to integer as in the > > > > original patch > > > > https://lore.kernel.org/all/20251126045926.2413532-1-joonwonkang@google.com/. > > > > > > > > Or, we could change those drivers not to use ->active_req, hide > > > > ->active_req entirely in the mailbox core and keep this patch. > > > > > > > Yes, ideally controller drivers should not track active_req. But this > > > patch will cause least churn it seems. > > > > If we will take those two drivers as an exceptional and not recommended > > case that uses ->active_req directly, it will be fine not to change them > > except for tegra-hsp.c to check MBOX_NO_MSG. > > > > If that is the case and you are to keep this patch, please keep in mind to > > leave the original patch link in the commit message for better trackability > > of the discussion and solutions and for credit for the contribution of > > finding and analyzing this issue and proposing the first solution, which > > I believe is also important for future contributions to come. > > So what's the next steps here, then? I don't think I adequately > explained exactly the needs of my mbox client, but I also don't think > it's a big deal. I'm convinced it should be fine even if NULL messages > get queued. > I think it will be simpler than you imagine, but yes I too don't think it's a big deal. We can iron out details later. > Jassi: are you going to send a new version of your patch? ...or are > you expecting Joonwon to post a new version? ...or do you want me to > post some variant of these patches? > Honestly I only care about minimal churn to code and API. I think simply using a different sentinel value is simpler than tracking the number of submissions... which is neither strictly required nor reduces the changes we have to make. So I plan to submit my RFC as a patch, the trivial change to tegra-hsp.c and resubmit the qcom fix with your ack. Regards, Jassi
> On Fri, Mar 20, 2026 at 4:03 PM Doug Anderson <dianders@chromium.org> wrote: > > > > Hi, > > > > On Mon, Mar 16, 2026 at 10:03 PM Joonwon Kang <joonwonkang@google.com> wrote: > > > > > > > On Fri, Mar 13, 2026 at 5:12 AM Joonwon Kang <joonwonkang@google.com> wrote: > > > > > > > > > > > The active_req field serves double duty as both the "is a TX in > > > > > > flight" flag (NULL means idle) and the storage for the in-flight > > > > > > message pointer. When a client sends NULL via mbox_send_message(), > > > > > > active_req is set to NULL, which the framework misinterprets as > > > > > > "no active request." This breaks the TX state machine by: > > > > > > > > > > > > - tx_tick() short-circuits on (!mssg), skipping the tx_done > > > > > > callback and the tx_complete completion > > > > > > - txdone_hrtimer() skips the channel entirely since active_req > > > > > > is NULL, so poll-based TX-done detection never fires. > > > > > > > > > > > > Fix this by introducing a MBOX_NO_MSG sentinel value that means > > > > > > "no active request," freeing NULL to be valid message data. The > > > > > > sentinel is internal to the mailbox core and is never exposed to > > > > > > controller drivers or clients. > > > > > > > > > > The following drivers are currently using ->active_req which now could be > > > > > assigned MBOX_NO_MSG. > > > > > - drivers/mailbox/tegra-hsp.c > > > > > - drivers/mailbox/mtk-vcp-mailbox.c > > > > > > > > > Good catch, Thanks. > > > > mtk-vcp-mailbox.c is fine as is - it assumes only valid requests. This > > > > patch will not introduce any change for it. > > > > Yes, tegra-hsp.c should now track MBOX_NO_MSG instead of NULL. Single > > > > line change. > > > > > > > A more important aspect than single line change would be that you are > > > creating a new API contract that the controller drivers who are to use > > > ->active_req should now have new knowledge that the pointer value could > > > be unconventional value such as -1. Without this additional knowledge, > > > they may easily fail checking if the channel is empty. > > > > > > > > One of them is using ->active_req to wait until the channel is empty. In > > > > > this case, strictly speaking, that controller driver should be aware of > > > > > the sentinel value MBOX_NO_MSG, which means the sentinel value should be > > > > > exposed to the controller. Or, if a future controller driver to come is to > > > > > use ->active_req for the same purpose for doorbell or non-doorbell, it > > > > > should also be aware of the sentinel value anyway. > > > > > > > > > > However, I believe that it is not intuitive to the controller developers > > > > > that a pointer value could be other value than a real memory address, > > > > > NULL or error encoded value, which is -1(== MBOX_NO_MSG). For this reason, > > > > > I think it will be better to change the type of ->active_req to give a > > > > > better indication to the controller developers, e.g. to integer as in the > > > > > original patch > > > > > https://lore.kernel.org/all/20251126045926.2413532-1-joonwonkang@google.com/. > > > > > > > > > > Or, we could change those drivers not to use ->active_req, hide > > > > > ->active_req entirely in the mailbox core and keep this patch. > > > > > > > > > Yes, ideally controller drivers should not track active_req. But this > > > > patch will cause least churn it seems. > > > > > > If we will take those two drivers as an exceptional and not recommended > > > case that uses ->active_req directly, it will be fine not to change them > > > except for tegra-hsp.c to check MBOX_NO_MSG. > > > > > > If that is the case and you are to keep this patch, please keep in mind to > > > leave the original patch link in the commit message for better trackability > > > of the discussion and solutions and for credit for the contribution of > > > finding and analyzing this issue and proposing the first solution, which > > > I believe is also important for future contributions to come. > > > > So what's the next steps here, then? I don't think I adequately > > explained exactly the needs of my mbox client, but I also don't think > > it's a big deal. I'm convinced it should be fine even if NULL messages > > get queued. > > > I think it will be simpler than you imagine, but yes I too don't think > it's a big deal. We can iron out details later. > > > Jassi: are you going to send a new version of your patch? ...or are > > you expecting Joonwon to post a new version? ...or do you want me to > > post some variant of these patches? > > > Honestly I only care about minimal churn to code and API. I think > simply using a different sentinel value is simpler than tracking the > number of submissions... which is neither strictly required nor > reduces the changes we have to make. > So I plan to submit my RFC as a patch, the trivial change to > tegra-hsp.c and resubmit the qcom fix with your ack. > > Regards, > Jassi I have reviewed the new patch. Regarding the issue I brought up on how this alternative patch had been uploaded without letting the original patch author know and without any tag for better tracking or credit, I hope next time you could use "Reported-by:" or "Link:" tag at least to encourage contributors to keep spending effort finding and analyzing issues in the mailbox framework. I believe it could be better for community and quality of code as also guided in the submitting-patches doc. Sincerely, Joonwon Kang
On Thu, Mar 26, 2026 at 2:31 AM Joonwon Kang <joonwonkang@google.com> wrote: > > > On Fri, Mar 20, 2026 at 4:03 PM Doug Anderson <dianders@chromium.org> wrote: > > > > > > Hi, > > > > > > On Mon, Mar 16, 2026 at 10:03 PM Joonwon Kang <joonwonkang@google.com> wrote: > > > > > > > > > On Fri, Mar 13, 2026 at 5:12 AM Joonwon Kang <joonwonkang@google.com> wrote: > > > > > > > > > > > > > The active_req field serves double duty as both the "is a TX in > > > > > > > flight" flag (NULL means idle) and the storage for the in-flight > > > > > > > message pointer. When a client sends NULL via mbox_send_message(), > > > > > > > active_req is set to NULL, which the framework misinterprets as > > > > > > > "no active request." This breaks the TX state machine by: > > > > > > > > > > > > > > - tx_tick() short-circuits on (!mssg), skipping the tx_done > > > > > > > callback and the tx_complete completion > > > > > > > - txdone_hrtimer() skips the channel entirely since active_req > > > > > > > is NULL, so poll-based TX-done detection never fires. > > > > > > > > > > > > > > Fix this by introducing a MBOX_NO_MSG sentinel value that means > > > > > > > "no active request," freeing NULL to be valid message data. The > > > > > > > sentinel is internal to the mailbox core and is never exposed to > > > > > > > controller drivers or clients. > > > > > > > > > > > > The following drivers are currently using ->active_req which now could be > > > > > > assigned MBOX_NO_MSG. > > > > > > - drivers/mailbox/tegra-hsp.c > > > > > > - drivers/mailbox/mtk-vcp-mailbox.c > > > > > > > > > > > Good catch, Thanks. > > > > > mtk-vcp-mailbox.c is fine as is - it assumes only valid requests. This > > > > > patch will not introduce any change for it. > > > > > Yes, tegra-hsp.c should now track MBOX_NO_MSG instead of NULL. Single > > > > > line change. > > > > > > > > > A more important aspect than single line change would be that you are > > > > creating a new API contract that the controller drivers who are to use > > > > ->active_req should now have new knowledge that the pointer value could > > > > be unconventional value such as -1. Without this additional knowledge, > > > > they may easily fail checking if the channel is empty. > > > > > > > > > > One of them is using ->active_req to wait until the channel is empty. In > > > > > > this case, strictly speaking, that controller driver should be aware of > > > > > > the sentinel value MBOX_NO_MSG, which means the sentinel value should be > > > > > > exposed to the controller. Or, if a future controller driver to come is to > > > > > > use ->active_req for the same purpose for doorbell or non-doorbell, it > > > > > > should also be aware of the sentinel value anyway. > > > > > > > > > > > > However, I believe that it is not intuitive to the controller developers > > > > > > that a pointer value could be other value than a real memory address, > > > > > > NULL or error encoded value, which is -1(== MBOX_NO_MSG). For this reason, > > > > > > I think it will be better to change the type of ->active_req to give a > > > > > > better indication to the controller developers, e.g. to integer as in the > > > > > > original patch > > > > > > https://lore.kernel.org/all/20251126045926.2413532-1-joonwonkang@google.com/. > > > > > > > > > > > > Or, we could change those drivers not to use ->active_req, hide > > > > > > ->active_req entirely in the mailbox core and keep this patch. > > > > > > > > > > > Yes, ideally controller drivers should not track active_req. But this > > > > > patch will cause least churn it seems. > > > > > > > > If we will take those two drivers as an exceptional and not recommended > > > > case that uses ->active_req directly, it will be fine not to change them > > > > except for tegra-hsp.c to check MBOX_NO_MSG. > > > > > > > > If that is the case and you are to keep this patch, please keep in mind to > > > > leave the original patch link in the commit message for better trackability > > > > of the discussion and solutions and for credit for the contribution of > > > > finding and analyzing this issue and proposing the first solution, which > > > > I believe is also important for future contributions to come. > > > > > > So what's the next steps here, then? I don't think I adequately > > > explained exactly the needs of my mbox client, but I also don't think > > > it's a big deal. I'm convinced it should be fine even if NULL messages > > > get queued. > > > > > I think it will be simpler than you imagine, but yes I too don't think > > it's a big deal. We can iron out details later. > > > > > Jassi: are you going to send a new version of your patch? ...or are > > > you expecting Joonwon to post a new version? ...or do you want me to > > > post some variant of these patches? > > > > > Honestly I only care about minimal churn to code and API. I think > > simply using a different sentinel value is simpler than tracking the > > number of submissions... which is neither strictly required nor > > reduces the changes we have to make. > > So I plan to submit my RFC as a patch, the trivial change to > > tegra-hsp.c and resubmit the qcom fix with your ack. > > > > Regards, > > Jassi > > I have reviewed the new patch. > > Regarding the issue I brought up on how this alternative patch had been > uploaded without letting the original patch author know and without any tag for > better tracking or credit, I hope next time you could use "Reported-by:" or > "Link:" tag at least to encourage contributors to keep spending effort finding > and analyzing issues in the mailbox framework. I believe it could be better for > community and quality of code as also guided in the submitting-patches doc. > I am sorry for the oversight. I should have attributed you as the reporter. I forgot but I will add it before applying. Thanks, Jassi
Hi, On Fri, Mar 13, 2026 at 3:12 AM Joonwon Kang <joonwonkang@google.com> wrote: > > > The active_req field serves double duty as both the "is a TX in > > flight" flag (NULL means idle) and the storage for the in-flight > > message pointer. When a client sends NULL via mbox_send_message(), > > active_req is set to NULL, which the framework misinterprets as > > "no active request." This breaks the TX state machine by: > > > > - tx_tick() short-circuits on (!mssg), skipping the tx_done > > callback and the tx_complete completion > > - txdone_hrtimer() skips the channel entirely since active_req > > is NULL, so poll-based TX-done detection never fires. > > > > Fix this by introducing a MBOX_NO_MSG sentinel value that means > > "no active request," freeing NULL to be valid message data. The > > sentinel is internal to the mailbox core and is never exposed to > > controller drivers or clients. > > The following drivers are currently using ->active_req which now could be > assigned MBOX_NO_MSG. > - drivers/mailbox/tegra-hsp.c > - drivers/mailbox/mtk-vcp-mailbox.c > > One of them is using ->active_req to wait until the channel is empty. In > this case, strictly speaking, that controller driver should be aware of > the sentinel value MBOX_NO_MSG, which means the sentinel value should be > exposed to the controller. Or, if a future controller driver to come is to > use ->active_req for the same purpose for doorbell or non-doorbell, it > should also be aware of the sentinel value anyway. > > However, I believe that it is not intuitive to the controller developers > that a pointer value could be other value than a real memory address, > NULL or error encoded value, which is -1(== MBOX_NO_MSG). For this reason, > I think it will be better to change the type of ->active_req to give a > better indication to the controller developers, e.g. to integer as in the > original patch > https://lore.kernel.org/all/20251126045926.2413532-1-joonwonkang@google.com/. > > Or, we could change those drivers not to use ->active_req, hide > ->active_req entirely in the mailbox core and keep this patch. You make some good points here. Looking at your patch, you're keeping "active_req" as an index into "chan->msg_data". It seems like we could maybe get the best of both worlds if we started with your solution but keep "active_req" as a pointer. It can be a pointer to one of the elements in "chan->msg_data". Then NULL can still mean that no request is active. -Doug
> The active_req field serves double duty as both the "is a TX in > flight" flag (NULL means idle) and the storage for the in-flight > message pointer. When a client sends NULL via mbox_send_message(), > active_req is set to NULL, which the framework misinterprets as > "no active request." This breaks the TX state machine by: > > - tx_tick() short-circuits on (!mssg), skipping the tx_done > callback and the tx_complete completion > - txdone_hrtimer() skips the channel entirely since active_req > is NULL, so poll-based TX-done detection never fires. > > Fix this by introducing a MBOX_NO_MSG sentinel value that means > "no active request," freeing NULL to be valid message data. The > sentinel is internal to the mailbox core and is never exposed to > controller drivers or clients. > > The only tradeoff is that 'MBOX_NO_MSG' can not be used as a message > by clients. I was surprised to find out this alternative patch while we have been waiting a few months for your response to my previous patch for this issue on https://lore.kernel.org/all/20251126045926.2413532-1-joonwonkang@google.com/. Will you handle this issue on your end? If so, could you leave the original discussion link to the commit message so that we could better track it? Also, could you reply to the original patches next time with what you have decided or at least cc the people who posted the original patches to your new patches for better collaboration? Thanks.
Hi, On Tue, Mar 10, 2026 at 4:46 PM <jassisinghbrar@gmail.com> wrote: > > From: Jassi Brar <jassisinghbrar@gmail.com> > > The active_req field serves double duty as both the "is a TX in > flight" flag (NULL means idle) and the storage for the in-flight > message pointer. When a client sends NULL via mbox_send_message(), > active_req is set to NULL, which the framework misinterprets as > "no active request." This breaks the TX state machine by: > > - tx_tick() short-circuits on (!mssg), skipping the tx_done > callback and the tx_complete completion > - txdone_hrtimer() skips the channel entirely since active_req > is NULL, so poll-based TX-done detection never fires. > > Fix this by introducing a MBOX_NO_MSG sentinel value that means > "no active request," freeing NULL to be valid message data. The > sentinel is internal to the mailbox core and is never exposed to > controller drivers or clients. > > The only tradeoff is that 'MBOX_NO_MSG' can not be used as a message > by clients. > > Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com> > --- > drivers/mailbox/mailbox.c | 15 +++++++++------ > 1 file changed, 9 insertions(+), 6 deletions(-) While I can certainly be corrected, I suspect this patch will break a lot of users. My analysis of people that were passing NULL mbox messages is that they _wanted_ the current behavior. They didn't want NULL messages to be queued up as would happen with this patch. Instead, they just wanted to immediately ring the doorbell again to signal an interrupt to the other side. My analysis of clients is why I chose not to implement it this way and isntead introduced the mbox_ring_doorbell(), keeping it as similar to the existing NULL behavior as possible. -Doug
On Tue, Mar 10, 2026 at 6:52 PM Doug Anderson <dianders@chromium.org> wrote: > > Hi, > > On Tue, Mar 10, 2026 at 4:46 PM <jassisinghbrar@gmail.com> wrote: > > > > From: Jassi Brar <jassisinghbrar@gmail.com> > > > > The active_req field serves double duty as both the "is a TX in > > flight" flag (NULL means idle) and the storage for the in-flight > > message pointer. When a client sends NULL via mbox_send_message(), > > active_req is set to NULL, which the framework misinterprets as > > "no active request." This breaks the TX state machine by: > > > > - tx_tick() short-circuits on (!mssg), skipping the tx_done > > callback and the tx_complete completion > > - txdone_hrtimer() skips the channel entirely since active_req > > is NULL, so poll-based TX-done detection never fires. > > > > Fix this by introducing a MBOX_NO_MSG sentinel value that means > > "no active request," freeing NULL to be valid message data. The > > sentinel is internal to the mailbox core and is never exposed to > > controller drivers or clients. > > > > The only tradeoff is that 'MBOX_NO_MSG' can not be used as a message > > by clients. > > > > Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com> > > --- > > drivers/mailbox/mailbox.c | 15 +++++++++------ > > 1 file changed, 9 insertions(+), 6 deletions(-) > > While I can certainly be corrected, I suspect this patch will break a > lot of users. > > My analysis of people that were passing NULL mbox messages is that > they _wanted_ the current behavior. They didn't want NULL messages to > be queued up as would happen with this patch. Instead, they just > wanted to immediately ring the doorbell again to signal an interrupt > to the other side. > They won't be queued up because they call mbox_client_txdone() immediately after mbox_send_message() From my analysis of the 14 clients, there was just one (irq-qcom-mpm.c) that was not doing it by oversight, and I submitted a patch. Can you please point me to the driver you are concerned about? Perhaps I am overlooking something. Regards, Jassi
Hi, On Tue, Mar 10, 2026 at 4:59 PM Jassi Brar <jassisinghbrar@gmail.com> wrote: > > On Tue, Mar 10, 2026 at 6:52 PM Doug Anderson <dianders@chromium.org> wrote: > > > > Hi, > > > > On Tue, Mar 10, 2026 at 4:46 PM <jassisinghbrar@gmail.com> wrote: > > > > > > From: Jassi Brar <jassisinghbrar@gmail.com> > > > > > > The active_req field serves double duty as both the "is a TX in > > > flight" flag (NULL means idle) and the storage for the in-flight > > > message pointer. When a client sends NULL via mbox_send_message(), > > > active_req is set to NULL, which the framework misinterprets as > > > "no active request." This breaks the TX state machine by: > > > > > > - tx_tick() short-circuits on (!mssg), skipping the tx_done > > > callback and the tx_complete completion > > > - txdone_hrtimer() skips the channel entirely since active_req > > > is NULL, so poll-based TX-done detection never fires. > > > > > > Fix this by introducing a MBOX_NO_MSG sentinel value that means > > > "no active request," freeing NULL to be valid message data. The > > > sentinel is internal to the mailbox core and is never exposed to > > > controller drivers or clients. > > > > > > The only tradeoff is that 'MBOX_NO_MSG' can not be used as a message > > > by clients. > > > > > > Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com> > > > --- > > > drivers/mailbox/mailbox.c | 15 +++++++++------ > > > 1 file changed, 9 insertions(+), 6 deletions(-) > > > > While I can certainly be corrected, I suspect this patch will break a > > lot of users. > > > > My analysis of people that were passing NULL mbox messages is that > > they _wanted_ the current behavior. They didn't want NULL messages to > > be queued up as would happen with this patch. Instead, they just > > wanted to immediately ring the doorbell again to signal an interrupt > > to the other side. > > > They won't be queued up because they call mbox_client_txdone() > immediately after mbox_send_message() > From my analysis of the 14 clients, there was just one > (irq-qcom-mpm.c) that was not doing it by oversight, and I submitted a > patch. > > Can you please point me to the driver you are concerned about? Perhaps > I am overlooking something. Ah, interesting! I hadn't thought about it that way... I think there are at least a few others that would need to change, maybe? It looks like these ones: drivers/soc/xilinx/zynqmp_power.c drivers/remoteproc/xlnx_r5_remoteproc.c drivers/firmware/imx/imx-dsp.c In my case, I have a mailbox driver that's currently downstream (though I hope to change that). My mailbox controller has an interrupt for txdone, so mailbox clients _shouldn't_ call mbox_client_txdone(). Some clients of this mailbox client want the txdone interrupt, but some clients of it just care about sending doorbells. This driver would be able to send "txdone" for doorbells (or it could have a "doorbell done"), but no clients that send doorbells actually care about this signal. That being said, I believe clients do care about doorbells not being queued. -Doug
On Tue, Mar 10, 2026 at 7:15 PM Doug Anderson <dianders@chromium.org> wrote: > > Hi, > > On Tue, Mar 10, 2026 at 4:59 PM Jassi Brar <jassisinghbrar@gmail.com> wrote: > > > > On Tue, Mar 10, 2026 at 6:52 PM Doug Anderson <dianders@chromium.org> wrote: > > > > > > Hi, > > > > > > On Tue, Mar 10, 2026 at 4:46 PM <jassisinghbrar@gmail.com> wrote: > > > > > > > > From: Jassi Brar <jassisinghbrar@gmail.com> > > > > > > > > The active_req field serves double duty as both the "is a TX in > > > > flight" flag (NULL means idle) and the storage for the in-flight > > > > message pointer. When a client sends NULL via mbox_send_message(), > > > > active_req is set to NULL, which the framework misinterprets as > > > > "no active request." This breaks the TX state machine by: > > > > > > > > - tx_tick() short-circuits on (!mssg), skipping the tx_done > > > > callback and the tx_complete completion > > > > - txdone_hrtimer() skips the channel entirely since active_req > > > > is NULL, so poll-based TX-done detection never fires. > > > > > > > > Fix this by introducing a MBOX_NO_MSG sentinel value that means > > > > "no active request," freeing NULL to be valid message data. The > > > > sentinel is internal to the mailbox core and is never exposed to > > > > controller drivers or clients. > > > > > > > > The only tradeoff is that 'MBOX_NO_MSG' can not be used as a message > > > > by clients. > > > > > > > > Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com> > > > > --- > > > > drivers/mailbox/mailbox.c | 15 +++++++++------ > > > > 1 file changed, 9 insertions(+), 6 deletions(-) > > > > > > While I can certainly be corrected, I suspect this patch will break a > > > lot of users. > > > > > > My analysis of people that were passing NULL mbox messages is that > > > they _wanted_ the current behavior. They didn't want NULL messages to > > > be queued up as would happen with this patch. Instead, they just > > > wanted to immediately ring the doorbell again to signal an interrupt > > > to the other side. > > > > > They won't be queued up because they call mbox_client_txdone() > > immediately after mbox_send_message() > > From my analysis of the 14 clients, there was just one > > (irq-qcom-mpm.c) that was not doing it by oversight, and I submitted a > > patch. > > > > Can you please point me to the driver you are concerned about? Perhaps > > I am overlooking something. > > Ah, interesting! I hadn't thought about it that way... > > I think there are at least a few others that would need to change, > maybe? It looks like these ones: > > drivers/soc/xilinx/zynqmp_power.c > drivers/remoteproc/xlnx_r5_remoteproc.c The underlying driver for both is drivers/mailbox/zynqmp-ipi-mailbox.c which sets txdone_poll = true and implements the .last_tx_done() callback. The client (zynqmp_power.c) should not call mbox_client_txdone() > drivers/firmware/imx/imx-dsp.c The underlying driver is drivers/mailbox/imx-mailbox.c which sets 'txdone_irq = true' and ticks the state machine with mbox_chan_txdone() Again the client need not call mbox_client_txdone() All three clients above are currently failed by the core which misinterprets NULL messages. This patch will fix such clients too, besides avoiding a new api. > > In my case, I have a mailbox driver that's currently downstream > (though I hope to change that). My mailbox controller has an interrupt > for txdone, so mailbox clients _shouldn't_ call mbox_client_txdone(). > Some clients of this mailbox client want the txdone interrupt, but > some clients of it just care about sending doorbells. > So imx-dsp.c like? Please let me know what is lacking in the core to fully support that. Happy to look into it. Regards, Jassi
Hi, On Tue, Mar 10, 2026 at 5:46 PM Jassi Brar <jassisinghbrar@gmail.com> wrote: > > On Tue, Mar 10, 2026 at 7:15 PM Doug Anderson <dianders@chromium.org> wrote: > > > > Hi, > > > > On Tue, Mar 10, 2026 at 4:59 PM Jassi Brar <jassisinghbrar@gmail.com> wrote: > > > > > > On Tue, Mar 10, 2026 at 6:52 PM Doug Anderson <dianders@chromium.org> wrote: > > > > > > > > Hi, > > > > > > > > On Tue, Mar 10, 2026 at 4:46 PM <jassisinghbrar@gmail.com> wrote: > > > > > > > > > > From: Jassi Brar <jassisinghbrar@gmail.com> > > > > > > > > > > The active_req field serves double duty as both the "is a TX in > > > > > flight" flag (NULL means idle) and the storage for the in-flight > > > > > message pointer. When a client sends NULL via mbox_send_message(), > > > > > active_req is set to NULL, which the framework misinterprets as > > > > > "no active request." This breaks the TX state machine by: > > > > > > > > > > - tx_tick() short-circuits on (!mssg), skipping the tx_done > > > > > callback and the tx_complete completion > > > > > - txdone_hrtimer() skips the channel entirely since active_req > > > > > is NULL, so poll-based TX-done detection never fires. > > > > > > > > > > Fix this by introducing a MBOX_NO_MSG sentinel value that means > > > > > "no active request," freeing NULL to be valid message data. The > > > > > sentinel is internal to the mailbox core and is never exposed to > > > > > controller drivers or clients. > > > > > > > > > > The only tradeoff is that 'MBOX_NO_MSG' can not be used as a message > > > > > by clients. > > > > > > > > > > Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com> > > > > > --- > > > > > drivers/mailbox/mailbox.c | 15 +++++++++------ > > > > > 1 file changed, 9 insertions(+), 6 deletions(-) > > > > > > > > While I can certainly be corrected, I suspect this patch will break a > > > > lot of users. > > > > > > > > My analysis of people that were passing NULL mbox messages is that > > > > they _wanted_ the current behavior. They didn't want NULL messages to > > > > be queued up as would happen with this patch. Instead, they just > > > > wanted to immediately ring the doorbell again to signal an interrupt > > > > to the other side. > > > > > > > They won't be queued up because they call mbox_client_txdone() > > > immediately after mbox_send_message() > > > From my analysis of the 14 clients, there was just one > > > (irq-qcom-mpm.c) that was not doing it by oversight, and I submitted a > > > patch. > > > > > > Can you please point me to the driver you are concerned about? Perhaps > > > I am overlooking something. > > > > Ah, interesting! I hadn't thought about it that way... > > > > I think there are at least a few others that would need to change, > > maybe? It looks like these ones: > > > > drivers/soc/xilinx/zynqmp_power.c > > drivers/remoteproc/xlnx_r5_remoteproc.c > The underlying driver for both is drivers/mailbox/zynqmp-ipi-mailbox.c > which sets > txdone_poll = true and implements the .last_tx_done() callback. > The client (zynqmp_power.c) should not call mbox_client_txdone() > > > drivers/firmware/imx/imx-dsp.c > The underlying driver is drivers/mailbox/imx-mailbox.c which sets > 'txdone_irq = true' > and ticks the state machine with mbox_chan_txdone() > Again the client need not call mbox_client_txdone() > > All three clients above are currently failed by the core which > misinterprets NULL messages. > This patch will fix such clients too, besides avoiding a new api. ...but doesn't that mean that the behavior of these clients will change? Previously all calls to mbox_send_message() immediately called through to the mailbox controller. Now, if the previous message hasn't finished, the "NULL" messages will queue up. For instance, let's say that our client calls mbox_send_message() 2 times in quick succession (10 us apart). Let's say that the remote processor takes 1 ms to react. Previously, the 2 calls to mbox_send_message() would probably be coalesced. Each would call through to the mailbox controller, which would make sure the doorbell was asserted. After 1 ms, the remote processor would confirm the single doorbell it saw. Now, the first call will ring the doorbell. The second one will queue up. After 1 ms, the remote processor will confirm the doorbell, which will cause the second doorbell to be sent. Looking at the 3 drivers in question, I _guess_ maybe that situation never comes up for them? So maybe they're all fine? I don't have tons of confidence that I understand enough about these clients to say for sure that this doesn't happen... > > In my case, I have a mailbox driver that's currently downstream > > (though I hope to change that). My mailbox controller has an interrupt > > for txdone, so mailbox clients _shouldn't_ call mbox_client_txdone(). > > Some clients of this mailbox client want the txdone interrupt, but > > some clients of it just care about sending doorbells. > > > So imx-dsp.c like? Please let me know what is lacking in the core to > fully support that. Happy to look into it. I know that with my downstream mailbox client, if I let NULL messages queue up I end up with a queue of a dozen or so NULL messages. The downstream client is really taking advantage (AKA abusing) the core's current NULL behavior. It truly does want the "mbox_ring_doorbell" concept of just making sure the doorbell is asserted and returning immediately. If the core changes to start queuing NULL messages, we'll have to figure out some sort of workaround... -Doug
> Hi, > > On Tue, Mar 10, 2026 at 5:46 PM Jassi Brar <jassisinghbrar@gmail.com> wrote: > > > > On Tue, Mar 10, 2026 at 7:15 PM Doug Anderson <dianders@chromium.org> wrote: > > > > > > Hi, > > > > > > On Tue, Mar 10, 2026 at 4:59 PM Jassi Brar <jassisinghbrar@gmail.com> wrote: > > > > > > > > On Tue, Mar 10, 2026 at 6:52 PM Doug Anderson <dianders@chromium.org> wrote: > > > > > > > > > > Hi, > > > > > > > > > > On Tue, Mar 10, 2026 at 4:46 PM <jassisinghbrar@gmail.com> wrote: > > > > > > > > > > > > From: Jassi Brar <jassisinghbrar@gmail.com> > > > > > > > > > > > > The active_req field serves double duty as both the "is a TX in > > > > > > flight" flag (NULL means idle) and the storage for the in-flight > > > > > > message pointer. When a client sends NULL via mbox_send_message(), > > > > > > active_req is set to NULL, which the framework misinterprets as > > > > > > "no active request." This breaks the TX state machine by: > > > > > > > > > > > > - tx_tick() short-circuits on (!mssg), skipping the tx_done > > > > > > callback and the tx_complete completion > > > > > > - txdone_hrtimer() skips the channel entirely since active_req > > > > > > is NULL, so poll-based TX-done detection never fires. > > > > > > > > > > > > Fix this by introducing a MBOX_NO_MSG sentinel value that means > > > > > > "no active request," freeing NULL to be valid message data. The > > > > > > sentinel is internal to the mailbox core and is never exposed to > > > > > > controller drivers or clients. > > > > > > > > > > > > The only tradeoff is that 'MBOX_NO_MSG' can not be used as a message > > > > > > by clients. > > > > > > > > > > > > Signed-off-by: Jassi Brar <jassisinghbrar@gmail.com> > > > > > > --- > > > > > > drivers/mailbox/mailbox.c | 15 +++++++++------ > > > > > > 1 file changed, 9 insertions(+), 6 deletions(-) > > > > > > > > > > While I can certainly be corrected, I suspect this patch will break a > > > > > lot of users. > > > > > > > > > > My analysis of people that were passing NULL mbox messages is that > > > > > they _wanted_ the current behavior. They didn't want NULL messages to > > > > > be queued up as would happen with this patch. Instead, they just > > > > > wanted to immediately ring the doorbell again to signal an interrupt > > > > > to the other side. > > > > > > > > > They won't be queued up because they call mbox_client_txdone() > > > > immediately after mbox_send_message() > > > > From my analysis of the 14 clients, there was just one > > > > (irq-qcom-mpm.c) that was not doing it by oversight, and I submitted a > > > > patch. > > > > > > > > Can you please point me to the driver you are concerned about? Perhaps > > > > I am overlooking something. > > > > > > Ah, interesting! I hadn't thought about it that way... > > > > > > I think there are at least a few others that would need to change, > > > maybe? It looks like these ones: > > > > > > drivers/soc/xilinx/zynqmp_power.c > > > drivers/remoteproc/xlnx_r5_remoteproc.c > > The underlying driver for both is drivers/mailbox/zynqmp-ipi-mailbox.c > > which sets > > txdone_poll = true and implements the .last_tx_done() callback. > > The client (zynqmp_power.c) should not call mbox_client_txdone() > > > > > drivers/firmware/imx/imx-dsp.c > > The underlying driver is drivers/mailbox/imx-mailbox.c which sets > > 'txdone_irq = true' > > and ticks the state machine with mbox_chan_txdone() > > Again the client need not call mbox_client_txdone() > > > > All three clients above are currently failed by the core which > > misinterprets NULL messages. > > This patch will fix such clients too, besides avoiding a new api. > > ...but doesn't that mean that the behavior of these clients will > change? Previously all calls to mbox_send_message() immediately called > through to the mailbox controller. Now, if the previous message hasn't > finished, the "NULL" messages will queue up. > > For instance, let's say that our client calls mbox_send_message() 2 > times in quick succession (10 us apart). Let's say that the remote > processor takes 1 ms to react. > > Previously, the 2 calls to mbox_send_message() would probably be > coalesced. Each would call through to the mailbox controller, which > would make sure the doorbell was asserted. After 1 ms, the remote > processor would confirm the single doorbell it saw. > > Now, the first call will ring the doorbell. The second one will queue > up. After 1 ms, the remote processor will confirm the doorbell, which > will cause the second doorbell to be sent. > > > Looking at the 3 drivers in question, I _guess_ maybe that situation > never comes up for them? So maybe they're all fine? I don't have tons > of confidence that I understand enough about these clients to say for > sure that this doesn't happen... > If multiple threads are ringing the doorbell through the same channel many times within a very short period of time, I think the issue you are concerned about could occur in theory with this patch even if the doorbell ack could be ignored. If that is the case, I believe it will be clients to handle the send failure, e.g. by retrying. > > > > In my case, I have a mailbox driver that's currently downstream > > > (though I hope to change that). My mailbox controller has an interrupt > > > for txdone, so mailbox clients _shouldn't_ call mbox_client_txdone(). > > > Some clients of this mailbox client want the txdone interrupt, but > > > some clients of it just care about sending doorbells. > > > > > So imx-dsp.c like? Please let me know what is lacking in the core to > > fully support that. Happy to look into it. > > I know that with my downstream mailbox client, if I let NULL messages > queue up I end up with a queue of a dozen or so NULL messages. The > downstream client is really taking advantage (AKA abusing) the core's > current NULL behavior. It truly does want the "mbox_ring_doorbell" > concept of just making sure the doorbell is asserted and returning > immediately. If the core changes to start queuing NULL messages, we'll > have to figure out some sort of workaround... According to the Jassi's comment on https://lore.kernel.org/all/CABb+yY18OEvfc8DDUiZqVeQtkmwcOFCSTMT7KoXb1LVA3RuxdA@mail.gmail.com/ "A controller driver is supposed to either expect data or not, but not both", client drivers should send to a controller either non-NULL or NULL data, not both in a mixed fashion. With this setup, if the doorbell acknowledgement from the remote could be ignored by the controller for your case, I believe you could separate a controller for doorbell from the one for non-doorbell and set the controller as follows. ``` mbox_controller->txdone_poll = true; mbox_controller->txpoll_period = 0; mbox_controller->ops->last_tx_done = ... just returns true ... ``` Then it will dequeue the NULL messages as fast as the hrtimer's resolution interval. Or, you could just use mbox_send_message() and mbox_client_txdone() pair. On the other hand, if the doorbell acknowledgement is to be respected by the controller(this is my case), the controller should wait until the ongoing doorbell tx is successfully done or not, and this patch could do it whereas mbox_ring_doorbell() couldn't.
On Tue, Mar 10, 2026 at 9:00 PM Doug Anderson <dianders@chromium.org> wrote: > > > > Ah, interesting! I hadn't thought about it that way... > > > > > > I think there are at least a few others that would need to change, > > > maybe? It looks like these ones: > > > > > > drivers/soc/xilinx/zynqmp_power.c > > > drivers/remoteproc/xlnx_r5_remoteproc.c > > The underlying driver for both is drivers/mailbox/zynqmp-ipi-mailbox.c > > which sets > > txdone_poll = true and implements the .last_tx_done() callback. > > The client (zynqmp_power.c) should not call mbox_client_txdone() > > > > > drivers/firmware/imx/imx-dsp.c > > The underlying driver is drivers/mailbox/imx-mailbox.c which sets > > 'txdone_irq = true' > > and ticks the state machine with mbox_chan_txdone() > > Again the client need not call mbox_client_txdone() > > > > All three clients above are currently failed by the core which > > misinterprets NULL messages. > > This patch will fix such clients too, besides avoiding a new api. > > ...but doesn't that mean that the behavior of these clients will > change? Previously all calls to mbox_send_message() immediately called > through to the mailbox controller. Now, if the previous message hasn't > finished, the "NULL" messages will queue up. > I don't see any problem. As we have checked, all clients call mbox_client_txdone() after each mbox_send_message() because the underlying controller just sets the bit and the doorbell is rung. The clients correctly drive the state-machine. So they are already doing the right thing and should be fine. > For instance, let's say that our client calls mbox_send_message() 2 > times in quick succession (10 us apart). Let's say that the remote > processor takes 1 ms to react. > > Previously, the 2 calls to mbox_send_message() would probably be > coalesced. Each would call through to the mailbox controller, which > would make sure the doorbell was asserted. After 1 ms, the remote > processor would confirm the single doorbell it saw. > > Now, the first call will ring the doorbell. The second one will queue > up. After 1 ms, the remote processor will confirm the doorbell, which > will cause the second doorbell to be sent. > Am I overlooking some code? I found no existing client relying on that bug, luckily. Even if some did, it would be better to fix that client instead of moving the subsystem around for it. > > Looking at the 3 drivers in question, I _guess_ maybe that situation > never comes up for them? So maybe they're all fine? I don't have tons > of confidence that I understand enough about these clients to say for > sure that this doesn't happen... > Those 3 clients don't run the state machine (tick), the underlying controllers do. > > > > In my case, I have a mailbox driver that's currently downstream > > > (though I hope to change that). My mailbox controller has an interrupt > > > for txdone, so mailbox clients _shouldn't_ call mbox_client_txdone(). > > > Some clients of this mailbox client want the txdone interrupt, but > > > some clients of it just care about sending doorbells. > > > > > So imx-dsp.c like? Please let me know what is lacking in the core to > > fully support that. Happy to look into it. > > I know that with my downstream mailbox client, if I let NULL messages > queue up I end up with a queue of a dozen or so NULL messages. The > downstream client is really taking advantage (AKA abusing) the core's > current NULL behavior. It truly does want the "mbox_ring_doorbell" > concept of just making sure the doorbell is asserted and returning > immediately. If the core changes to start queuing NULL messages, we'll > have to figure out some sort of workaround... > You said your controller has an irq raised for the doorbell sent (?). If so, your clients should want to wait for that confirmation (the controller driver would tick via mbox_chan_txdone). If there is no irq involved and your controller can just set some bit and the doorbell is guaranteed, then you just need to tie mbox_send_message() & mbox_client_txdone() , which is the right way and how other such clients work. In any case, I will be happy to look at your draft code and make sure your usecase is supported. Regards, Jassi
Hi,
On Tue, Mar 10, 2026 at 8:42 PM Jassi Brar <jassisinghbrar@gmail.com> wrote:
>
> > > > In my case, I have a mailbox driver that's currently downstream
> > > > (though I hope to change that). My mailbox controller has an interrupt
> > > > for txdone, so mailbox clients _shouldn't_ call mbox_client_txdone().
> > > > Some clients of this mailbox client want the txdone interrupt, but
> > > > some clients of it just care about sending doorbells.
> > > >
> > > So imx-dsp.c like? Please let me know what is lacking in the core to
> > > fully support that. Happy to look into it.
> >
> > I know that with my downstream mailbox client, if I let NULL messages
> > queue up I end up with a queue of a dozen or so NULL messages. The
> > downstream client is really taking advantage (AKA abusing) the core's
> > current NULL behavior. It truly does want the "mbox_ring_doorbell"
> > concept of just making sure the doorbell is asserted and returning
> > immediately. If the core changes to start queuing NULL messages, we'll
> > have to figure out some sort of workaround...
> >
> You said your controller has an irq raised for the doorbell sent (?).
It has an interrupt for when the remote side receives the doorbell
(when it clears the IRQ on its side).
> If so, your clients should want to wait for that confirmation (the
> controller driver would tick via mbox_chan_txdone).
Ah, I think I see what you're saying. Functionally, I think it should
work. Pseudocode for the client:
def ring_doorbell():
if not previous_doorbell_acked:
another_doorbell_pending = true
else:
previous_doorbell_acked = false;
mbox_send_message(NULL)
def tx_done():
if another_doorbell_pending:
another_doorbell_pending = false
mbox_send_message(NULL)
else:
previous_doorbell_acked = true;
I think that will ensure the other side always gets at least one
future interrupt when the doorbell is rung.
I guess another option would be for the mailbox controller to issue
"tx_done" immediately for NULL messages. With no data to transmit, I
guess one could say that as soon as the interrupt is raised that the
data is "transmitted", so maybe this would be OK too?
-Doug
On Thu, Mar 12, 2026 at 3:59 PM Doug Anderson <dianders@chromium.org> wrote: > > Hi, > > On Tue, Mar 10, 2026 at 8:42 PM Jassi Brar <jassisinghbrar@gmail.com> wrote: > > > > > > > In my case, I have a mailbox driver that's currently downstream > > > > > (though I hope to change that). My mailbox controller has an interrupt > > > > > for txdone, so mailbox clients _shouldn't_ call mbox_client_txdone(). > > > > > Some clients of this mailbox client want the txdone interrupt, but > > > > > some clients of it just care about sending doorbells. > > > > > > > > > So imx-dsp.c like? Please let me know what is lacking in the core to > > > > fully support that. Happy to look into it. > > > > > > I know that with my downstream mailbox client, if I let NULL messages > > > queue up I end up with a queue of a dozen or so NULL messages. The > > > downstream client is really taking advantage (AKA abusing) the core's > > > current NULL behavior. It truly does want the "mbox_ring_doorbell" > > > concept of just making sure the doorbell is asserted and returning > > > immediately. If the core changes to start queuing NULL messages, we'll > > > have to figure out some sort of workaround... > > > > > You said your controller has an irq raised for the doorbell sent (?). > > It has an interrupt for when the remote side receives the doorbell > (when it clears the IRQ on its side). > For one message we may not care if the irq gets cleared on the remote side, but for many messages the client should want to ensure the next message is sent only after the remote has acked the last sent doorbell (?) Otherwise, depending upon the h/w and f/w implementation on the remote side, a doorbell may be 'missed'. But you know your setup better, take these as just notes. > > If so, your clients should want to wait for that confirmation (the > > controller driver would tick via mbox_chan_txdone). > > Ah, I think I see what you're saying. Functionally, I think it should > work. Pseudocode for the client: > > def ring_doorbell(): > if not previous_doorbell_acked: > another_doorbell_pending = true > else: > previous_doorbell_acked = false; > mbox_send_message(NULL) > > def tx_done(): > if another_doorbell_pending: > another_doorbell_pending = false > mbox_send_message(NULL) > else: > previous_doorbell_acked = true; > If you get interrupt from remote, and want to respect that, then your controller will set txdone_irq=true and call mbox_chan_txdone() Client shouldn't call mbox_client_txdone() > I think that will ensure the other side always gets at least one > future interrupt when the doorbell is rung. > > I guess another option would be for the mailbox controller to issue > "tx_done" immediately for NULL messages. With no data to transmit, I > guess one could say that as soon as the interrupt is raised that the > data is "transmitted", so maybe this would be OK too? > That is ok, but as I said depending on how the remote h/w and f/w is set up, multiple back to back doorbells may 'overwrite' some. Not sure if that will be a problem for your clients. Cheers! Jassi
> Hi, > > On Tue, Mar 10, 2026 at 8:42 PM Jassi Brar <jassisinghbrar@gmail.com> wrote: > > > > > > > In my case, I have a mailbox driver that's currently downstream > > > > > (though I hope to change that). My mailbox controller has an interrupt > > > > > for txdone, so mailbox clients _shouldn't_ call mbox_client_txdone(). > > > > > Some clients of this mailbox client want the txdone interrupt, but > > > > > some clients of it just care about sending doorbells. > > > > > > > > > So imx-dsp.c like? Please let me know what is lacking in the core to > > > > fully support that. Happy to look into it. > > > > > > I know that with my downstream mailbox client, if I let NULL messages > > > queue up I end up with a queue of a dozen or so NULL messages. The > > > downstream client is really taking advantage (AKA abusing) the core's > > > current NULL behavior. It truly does want the "mbox_ring_doorbell" > > > concept of just making sure the doorbell is asserted and returning > > > immediately. If the core changes to start queuing NULL messages, we'll > > > have to figure out some sort of workaround... > > > > > You said your controller has an irq raised for the doorbell sent (?). > > It has an interrupt for when the remote side receives the doorbell > (when it clears the IRQ on its side). > > > If so, your clients should want to wait for that confirmation (the > > controller driver would tick via mbox_chan_txdone). > > Ah, I think I see what you're saying. Functionally, I think it should > work. Pseudocode for the client: > > def ring_doorbell(): > if not previous_doorbell_acked: > another_doorbell_pending = true > else: > previous_doorbell_acked = false; > mbox_send_message(NULL) > > def tx_done(): > if another_doorbell_pending: > another_doorbell_pending = false > mbox_send_message(NULL) > else: > previous_doorbell_acked = true; > > I think that will ensure the other side always gets at least one > future interrupt when the doorbell is rung. > > I guess another option would be for the mailbox controller to issue > "tx_done" immediately for NULL messages. With no data to transmit, I > guess one could say that as soon as the interrupt is raised that the > data is "transmitted", so maybe this would be OK too? One thing to note is that if the mailbox controller issues "tx_done" immediately in the same thread that is executing mbox_controller->send_data(), you will encounter deadlock(refer to chan->lock on the code). Thus, it should be issued after ->send_data() or in another thread. Either way, it is unavoidable to have window to queue up the NULL messages in the multi-threads situation anyway. So, the key in that case will be how quickly the queue is dequeued or how to recover from the tx failure due to the overflow, I think.
Hi, On Fri, Mar 13, 2026 at 1:44 AM Joonwon Kang <joonwonkang@google.com> wrote: > > > Hi, > > > > On Tue, Mar 10, 2026 at 8:42 PM Jassi Brar <jassisinghbrar@gmail.com> wrote: > > > > > > > > > In my case, I have a mailbox driver that's currently downstream > > > > > > (though I hope to change that). My mailbox controller has an interrupt > > > > > > for txdone, so mailbox clients _shouldn't_ call mbox_client_txdone(). > > > > > > Some clients of this mailbox client want the txdone interrupt, but > > > > > > some clients of it just care about sending doorbells. > > > > > > > > > > > So imx-dsp.c like? Please let me know what is lacking in the core to > > > > > fully support that. Happy to look into it. > > > > > > > > I know that with my downstream mailbox client, if I let NULL messages > > > > queue up I end up with a queue of a dozen or so NULL messages. The > > > > downstream client is really taking advantage (AKA abusing) the core's > > > > current NULL behavior. It truly does want the "mbox_ring_doorbell" > > > > concept of just making sure the doorbell is asserted and returning > > > > immediately. If the core changes to start queuing NULL messages, we'll > > > > have to figure out some sort of workaround... > > > > > > > You said your controller has an irq raised for the doorbell sent (?). > > > > It has an interrupt for when the remote side receives the doorbell > > (when it clears the IRQ on its side). > > > > > If so, your clients should want to wait for that confirmation (the > > > controller driver would tick via mbox_chan_txdone). > > > > Ah, I think I see what you're saying. Functionally, I think it should > > work. Pseudocode for the client: > > > > def ring_doorbell(): > > if not previous_doorbell_acked: > > another_doorbell_pending = true > > else: > > previous_doorbell_acked = false; > > mbox_send_message(NULL) > > > > def tx_done(): > > if another_doorbell_pending: > > another_doorbell_pending = false > > mbox_send_message(NULL) > > else: > > previous_doorbell_acked = true; > > > > I think that will ensure the other side always gets at least one > > future interrupt when the doorbell is rung. > > > > I guess another option would be for the mailbox controller to issue > > "tx_done" immediately for NULL messages. With no data to transmit, I > > guess one could say that as soon as the interrupt is raised that the > > data is "transmitted", so maybe this would be OK too? > > One thing to note is that if the mailbox controller issues "tx_done" > immediately in the same thread that is executing > mbox_controller->send_data(), you will encounter deadlock(refer to > chan->lock on the code). Thus, it should be issued after ->send_data() or > in another thread. Either way, it is unavoidable to have window to queue > up the NULL messages in the multi-threads situation anyway. So, the key in > that case will be how quickly the queue is dequeued or how to recover from > the tx failure due to the overflow, I think. Ah, right. I guess maybe the answer will just be the one expressed by my pseudocode above. It will have some slightly different timings, but I think it should still be fine. :-) -Doug
© 2016 - 2026 Red Hat, Inc.