mailbox: Document the behavior of mbox_send_message() w/ NULL mssg

[PATCH] mailbox: Document the behavior of mbox_send_message() w/ NULL mssg

Posted by Douglas Anderson 1 month, 3 weeks ago

The way the mailbox core behaves when you pass a NULL `mssg` parameter
to mbox_send_message() seems a little questionable. Specifically, the
mailbox core stores the currently active message directly in its
`active_req` field. In at least two places it decides that if this
field is `NULL` then there is no active request. That means if `mssg`
is NULL it will always think there is no active request. The two
places where it does this are:

1. If a client calls mbox_send_message(), if `active_req` is NULL then
   it will call the mailbox controller to send the new message even if
   the mailbox controller hasn't yet called mbox_chan_txdone() on the
   previous (NULL) message.
2. The mailbox core will never call the client's `tx_done()` callback
   with a NULL message because `tx_tick()` returns early whenever the
   message is NULL.

The above could be seen as bugs and perhaps could be fixed. However,
doing a `git grep mbox_send_message.*NULL` shows 14 hits in mainline
today and people may be relying on the current behavior. It is,
perhaps, better to accept the current behavior.

The current behavior can actually serve the purpose of providing a
simple way to assert an edge-triggered interrupt to the remote
processor on the other side of the mailbox. Specifically:

1. Like a normal edge-triggered interrupt, if multiple edges arrive
   before the interrupt is Acked they are coalesced.
2. Like a normal edge-triggered interrupt, as long as the receiver
   (the remote processor in this case) "Ack"s the interrupt _before_
   checking for work and the sender (the mailbox client in this case)
   posts the interrupt _after_ adding new work then we can always be
   certain that new work will be noticed. This assumes that the
   mailbox clienut and remote processor have some out-of-band way to
   communicate work and the mailbox is just being used as an
   interrupt.

Document the current behavior so that people can rely on it and know
that it will keep working the same way.

NOTE: if a given mailbox client mixes and matches some NULL and some
non-NULL messages, things could get loopy without additional code
changes and rules. Without code changes, if we transfer a non-NULL
message then we'd stop coalescing future NULL messages until the queue
clears. Also: if we were transferring a NULL message and a non-NULL
came in, we'd send it right away but potentially report `tx_done()`
too early. For now, document mixing and matching NULL and non-NULL
messages as undefined.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---
This feels hacky, but I'm worried that if we do something else we'll
break people. I haven't spent tons of time analyzing all of the
existing mailbox users that pass NULL for the message, but I do know
that at least downstream some Pixel code does it and seems to rely on
the current "don't queue up another message but instead just make sure
the remote processor will get an interrupt in the future."

 drivers/mailbox/mailbox.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/mailbox/mailbox.c b/drivers/mailbox/mailbox.c
index 2acc6ec229a4..80861caeb848 100644
--- a/drivers/mailbox/mailbox.c
+++ b/drivers/mailbox/mailbox.c
@@ -238,6 +238,15 @@ EXPORT_SYMBOL_GPL(mbox_client_peek_data);
  * This function could be called from atomic context as it simply
  * queues the data and returns a token against the request.
  *
+ * NOTE: If 'mssg' is NULL, the function has some rather different behavior.
+ *	 - The mailbox controller will be informed of the message but it
+ *	   won't be considered "busy". Future messages will continue to be
+ *	   passed onto the controller even if it never called mbox_chan_txdone()
+ *	 - The client's tx_done() callback will never be called.
+ *	 The above rules allow asserting an "edge-triggered" interrupt to the
+ *	 remote processor in a race-free way. Behavior is undefined if a given
+ *	 channel sometimes has NULL message and sometimes doesn't.
+ *
  * Return: Non-negative integer for successful submission (non-blocking mode)
  *	or transmission over chan (blocking mode).
  *	Negative value denotes failure.
-- 
2.52.0.322.g1dd061c0dc-goog

Re: [PATCH] mailbox: Document the behavior of mbox_send_message() w/ NULL mssg

Posted by Jassi Brar 1 month, 2 weeks ago

On Thu, Dec 18, 2025 at 3:40 PM Douglas Anderson <dianders@chromium.org> wrote:
>
> The way the mailbox core behaves when you pass a NULL `mssg` parameter
> to mbox_send_message() seems a little questionable. Specifically, the
> mailbox core stores the currently active message directly in its
> `active_req` field. In at least two places it decides that if this
> field is `NULL` then there is no active request. That means if `mssg`
> is NULL it will always think there is no active request. The two
> places where it does this are:
>
> 1. If a client calls mbox_send_message(), if `active_req` is NULL then
>    it will call the mailbox controller to send the new message even if
>    the mailbox controller hasn't yet called mbox_chan_txdone() on the
>    previous (NULL) message.
> 2. The mailbox core will never call the client's `tx_done()` callback
>    with a NULL message because `tx_tick()` returns early whenever the
>    message is NULL.
>
> The above could be seen as bugs and perhaps could be fixed. However,
> doing a `git grep mbox_send_message.*NULL` shows 14 hits in mainline
> today and people may be relying on the current behavior. It is,
> perhaps, better to accept the current behavior.
>
> The current behavior can actually serve the purpose of providing a
> simple way to assert an edge-triggered interrupt to the remote
> processor on the other side of the mailbox. Specifically:
>
> 1. Like a normal edge-triggered interrupt, if multiple edges arrive
>    before the interrupt is Acked they are coalesced.
> 2. Like a normal edge-triggered interrupt, as long as the receiver
>    (the remote processor in this case) "Ack"s the interrupt _before_
>    checking for work and the sender (the mailbox client in this case)
>    posts the interrupt _after_ adding new work then we can always be
>    certain that new work will be noticed. This assumes that the
>    mailbox clienut and remote processor have some out-of-band way to
>    communicate work and the mailbox is just being used as an
>    interrupt.
>
> Document the current behavior so that people can rely on it and know
> that it will keep working the same way.
>
> NOTE: if a given mailbox client mixes and matches some NULL and some
> non-NULL messages, things could get loopy without additional code
> changes and rules. Without code changes, if we transfer a non-NULL
> message then we'd stop coalescing future NULL messages until the queue
> clears. Also: if we were transferring a NULL message and a non-NULL
> came in, we'd send it right away but potentially report `tx_done()`
> too early. For now, document mixing and matching NULL and non-NULL
> messages as undefined.
>
So we are talking about  'doorbell' type controllers that don't transfer any
data but only raise an irq to the other end. Ideally the client would pass the
doorbell info via mssg in runtime, but there may be a driver that does not need
that info from the client. I believe some doorbell clients do call
 mbox_send_message(chan, 0) , the arbitrary mssg 0 is unused by the driver
and everything works fine.
 I agree that is not ideal - we should differentiate between nothing and
'null' data. Is that the issue you want addressed?

Thanks.

Re: [PATCH] mailbox: Document the behavior of mbox_send_message() w/ NULL mssg

Posted by Doug Anderson 1 month ago

Hi,

On Sat, Dec 20, 2025 at 3:26 PM Jassi Brar <jassisinghbrar@gmail.com> wrote:
>
> On Thu, Dec 18, 2025 at 3:40 PM Douglas Anderson <dianders@chromium.org> wrote:
> >
> > The way the mailbox core behaves when you pass a NULL `mssg` parameter
> > to mbox_send_message() seems a little questionable. Specifically, the
> > mailbox core stores the currently active message directly in its
> > `active_req` field. In at least two places it decides that if this
> > field is `NULL` then there is no active request. That means if `mssg`
> > is NULL it will always think there is no active request. The two
> > places where it does this are:
> >
> > 1. If a client calls mbox_send_message(), if `active_req` is NULL then
> >    it will call the mailbox controller to send the new message even if
> >    the mailbox controller hasn't yet called mbox_chan_txdone() on the
> >    previous (NULL) message.
> > 2. The mailbox core will never call the client's `tx_done()` callback
> >    with a NULL message because `tx_tick()` returns early whenever the
> >    message is NULL.
> >
> > The above could be seen as bugs and perhaps could be fixed. However,
> > doing a `git grep mbox_send_message.*NULL` shows 14 hits in mainline
> > today and people may be relying on the current behavior. It is,
> > perhaps, better to accept the current behavior.
> >
> > The current behavior can actually serve the purpose of providing a
> > simple way to assert an edge-triggered interrupt to the remote
> > processor on the other side of the mailbox. Specifically:
> >
> > 1. Like a normal edge-triggered interrupt, if multiple edges arrive
> >    before the interrupt is Acked they are coalesced.
> > 2. Like a normal edge-triggered interrupt, as long as the receiver
> >    (the remote processor in this case) "Ack"s the interrupt _before_
> >    checking for work and the sender (the mailbox client in this case)
> >    posts the interrupt _after_ adding new work then we can always be
> >    certain that new work will be noticed. This assumes that the
> >    mailbox clienut and remote processor have some out-of-band way to
> >    communicate work and the mailbox is just being used as an
> >    interrupt.
> >
> > Document the current behavior so that people can rely on it and know
> > that it will keep working the same way.
> >
> > NOTE: if a given mailbox client mixes and matches some NULL and some
> > non-NULL messages, things could get loopy without additional code
> > changes and rules. Without code changes, if we transfer a non-NULL
> > message then we'd stop coalescing future NULL messages until the queue
> > clears. Also: if we were transferring a NULL message and a non-NULL
> > came in, we'd send it right away but potentially report `tx_done()`
> > too early. For now, document mixing and matching NULL and non-NULL
> > messages as undefined.
> >
> So we are talking about  'doorbell' type controllers that don't transfer any
> data but only raise an irq to the other end.

Essentially. I'm still a bit new to the mailbox world, but at least in
the mailbox controller I'm more familiar with (and planning to
upstream soon) it's not actually a different type of controller,
though. The controller has shared memory to store a message and then
has a doorbell feature. If the remote processor doesn't need any data
but just needs an interrupt then we can have the mailbox controller
just ring the doorbell without any associated message.

> Ideally the client would pass the
> doorbell info via mssg in runtime, but there may be a driver that does not need
> that info from the client.

Right. Since 'mssg' is defined as arbitrary data it does seem
legitimate for clients to pass NULL for it. It's just unfortunate that
the core behaves differently when `mssg` is NULL...

> I believe some doorbell clients do call
>  mbox_send_message(chan, 0) , the arbitrary mssg 0 is unused by the driver
> and everything works fine.

Right (well, "NULL" instead of "0"). It's actually quite common. Using
`git grep -A2 mbox_send_message.*NULL`, I see:

* drivers/acpi/acpi_pcc.c
* drivers/firmware/arm_scmi/transports/mailbox.c
* drivers/firmware/imx/imx-dsp.c
* drivers/firmware/tegra/bpmp-tegra186.c
* drivers/irqchip/irq-qcom-mpm.c
* drivers/remoteproc/xlnx_r5_remoteproc.c
* drivers/rpmsg/qcom_glink_rpm.c
* drivers/rpmsg/qcom_glink_smem.c
* drivers/rpmsg/qcom_smd.c
* drivers/soc/qcom/qcom_aoss.c
* drivers/soc/qcom/smp2p.c
* drivers/soc/qcom/smsm.c
* drivers/soc/ti/wkup_m3_ipc.c
* drivers/soc/xilinx/zynqmp_power.c

...or did you mean something different when you said drivers are
passing "0" as an argument?

>  I agree that is not ideal - we should differentiate between nothing and
> 'null' data. Is that the issue you want addressed?

I guess I'm just trying to clarify what the behavior _should_ be when
`mssg` is NULL. At first I saw the current behavior as a bug. ...but
then I saw the number of people that were using NULL `mssg` and I
decided that fixing the current behavior might cause problems.

Maybe it's good just to clarify the difference in behavior so we're
all on the same page.

Let's imagine that a mailbox client quickly queues up 3 messages:
  static const u8 bogus_data[] = { 0 };
  mbox_send_message(chan, (void*) bogus_data);
  mbox_send_message(chan, (void*) bogus_data);
  mbox_send_message(chan, (void*) bogus_data);

What will that do? Assuming that the Apps Processor (AP) is much
faster than the remote processor, it will start sending the first
message right away and the other two will be queued. When the first
message is sent, the second message will be sent. When the second
message is sent, the third message will be sent. The remote processor
is guaranteed to see 3 messages.

Writing out a big diagram, I believe this will happen:

AP                                  remote
----------------------------------- -------------------
mbox_send_message(&bogus_data)
  add_to_rbuf()
  msg_submit()
    ops->send_data()
      ...
      drv_trigger_doorbell()
mbox_send_message(&bogus_data)
  add_to_rbuf()
  msg_submit()
    NOP because active_req
mbox_send_message(&bogus_data)
  add_to_rbuf()
  msg_submit()
    NOP because active_req
                                    handle_doorbell()
                                      ack_doorbell()
drv_handle_doorbell_ack()
  mbox_chan_txdone()
    tx_tick()
      msg_submit()
        ops->send_data()
          ...
          drv_trigger_doorbell()
                                    handle_doorbell()
                                      ack_doorbell()
drv_handle_doorbell_ack()
  mbox_chan_txdone()
    tx_tick()
      msg_submit()
        ops->send_data()
          ...
          drv_trigger_doorbell()
                                    handle_doorbell()
                                      ack_doorbell()
drv_handle_doorbell_ack()
  mbox_chan_txdone()
    tx_tick()

Now, let's do the same but with NULL:
  mbox_send_message(chan, NULL);
  mbox_send_message(chan, NULL);
  mbox_send_message(chan, NULL);

Again, assuming that the AP is much faster and that the doorbell is a
normal edge-triggered interrupt, today this will happen:

AP                                  remote
----------------------------------- -------------------
mbox_send_message(NULL)
  add_to_rbuf()
  msg_submit()
    ops->send_data()
      ...
      drv_trigger_doorbell()
mbox_send_message(NULL)
  add_to_rbuf()
  msg_submit()
    ops->send_data()
      ...
      drv_trigger_doorbell()
mbox_send_message(NULL)
  add_to_rbuf()
  msg_submit()
    ops->send_data()
      ...
      drv_trigger_doorbell()
                                    handle_doorbell()
                                      ack_doorbell()
drv_handle_doorbell_ack()
  mbox_chan_txdone()
    tx_tick()

If the remote processor is much faster than the AP or the AP gets
interrupted between the `mbox_send_message(NULL)` calls then obviously
things could look a bit different, but hopefully the above makes it
clear enough?

As I said, at first I viewed the difference between the "NULL" and
"bogus_data" cases as a bug and thought maybe I should fix it, but
there truly could be cases where someone might want the existing
behavior.

Presumably anyone who is passing NULL for `mssg` has some other OOB
(out of band) way to pass data to the remote processor and the
doorbell is just telling the remote proc to look at the new OOB data.
With that, let's look at a (made up) example where the AP is trying to
let the remote processor know whenever the "event_counter" changed.
I'd imagine something like this:

AP                                  remote
----------------------------------- -------------------
shared_oob_event_counter++;
mbox_send_message(NULL)
shared_oob_event_counter++;
mbox_send_message(NULL)
shared_oob_event_counter++;
mbox_send_message(NULL)
                                    handle_doorbell()
                                      read_and_handle_event_counter()
                                      ack_doorbell()

In other words the remote processor sees the doorbell, handles all the
changes to the event counter, and then acks. If we instead fixed the
NULL case to actually queue up additional messages then the remote
processor would get 3 doorbells and the "event_counter" wouldn't
change at all for the last two...

I don't know for sure, but I'm imagining that at least some of the
existing 14 drivers that are passing NULL for `mssg` may be relying on
the existing behavior... From some quick spot checking:

smp2p_update_bits() - Updates (out of band) shared memory and uses
mbox_send_message(NULL) to kick the remote processor. That sounds very
similar to my made up "event_counter" example.

smsm_update_bits() seems similar

Many of the other examples I looked at looked like they weren't
relying on the existing behavior, but I didn't look at every one and
it's harder to trace all the usage of some of the drivers...

Anyway, I'm not 100% set on any one solution, I just see that the NULL
case is (unexpectedly) different and either want it documented or to
fix it so it's not different (ideally without angry people with
pitchforks coming after me because I changed behavior). :-)

-Doug

Re: [PATCH] mailbox: Document the behavior of mbox_send_message() w/ NULL mssg

Posted by Jassi Brar 1 month ago

On Mon, Jan 5, 2026 at 1:00 PM Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Sat, Dec 20, 2025 at 3:26 PM Jassi Brar <jassisinghbrar@gmail.com> wrote:
> >
> > On Thu, Dec 18, 2025 at 3:40 PM Douglas Anderson <dianders@chromium.org> wrote:
> > >
> > > The way the mailbox core behaves when you pass a NULL `mssg` parameter
> > > to mbox_send_message() seems a little questionable. Specifically, the
> > > mailbox core stores the currently active message directly in its
> > > `active_req` field. In at least two places it decides that if this
> > > field is `NULL` then there is no active request. That means if `mssg`
> > > is NULL it will always think there is no active request. The two
> > > places where it does this are:
> > >
> > > 1. If a client calls mbox_send_message(), if `active_req` is NULL then
> > >    it will call the mailbox controller to send the new message even if
> > >    the mailbox controller hasn't yet called mbox_chan_txdone() on the
> > >    previous (NULL) message.
> > > 2. The mailbox core will never call the client's `tx_done()` callback
> > >    with a NULL message because `tx_tick()` returns early whenever the
> > >    message is NULL.
> > >
> > > The above could be seen as bugs and perhaps could be fixed. However,
> > > doing a `git grep mbox_send_message.*NULL` shows 14 hits in mainline
> > > today and people may be relying on the current behavior. It is,
> > > perhaps, better to accept the current behavior.
> > >
> > > The current behavior can actually serve the purpose of providing a
> > > simple way to assert an edge-triggered interrupt to the remote
> > > processor on the other side of the mailbox. Specifically:
> > >
> > > 1. Like a normal edge-triggered interrupt, if multiple edges arrive
> > >    before the interrupt is Acked they are coalesced.
> > > 2. Like a normal edge-triggered interrupt, as long as the receiver
> > >    (the remote processor in this case) "Ack"s the interrupt _before_
> > >    checking for work and the sender (the mailbox client in this case)
> > >    posts the interrupt _after_ adding new work then we can always be
> > >    certain that new work will be noticed. This assumes that the
> > >    mailbox clienut and remote processor have some out-of-band way to
> > >    communicate work and the mailbox is just being used as an
> > >    interrupt.
> > >
> > > Document the current behavior so that people can rely on it and know
> > > that it will keep working the same way.
> > >
> > > NOTE: if a given mailbox client mixes and matches some NULL and some
> > > non-NULL messages, things could get loopy without additional code
> > > changes and rules. Without code changes, if we transfer a non-NULL
> > > message then we'd stop coalescing future NULL messages until the queue
> > > clears. Also: if we were transferring a NULL message and a non-NULL
> > > came in, we'd send it right away but potentially report `tx_done()`
> > > too early. For now, document mixing and matching NULL and non-NULL
> > > messages as undefined.
> > >
> > So we are talking about  'doorbell' type controllers that don't transfer any
> > data but only raise an irq to the other end.
>
> Essentially. I'm still a bit new to the mailbox world, but at least in
> the mailbox controller I'm more familiar with (and planning to
> upstream soon) it's not actually a different type of controller,
> though. The controller has shared memory to store a message and then
> has a doorbell feature. If the remote processor doesn't need any data
> but just needs an interrupt then we can have the mailbox controller
> just ring the doorbell without any associated message.
>
>
> > Ideally the client would pass the
> > doorbell info via mssg in runtime, but there may be a driver that does not need
> > that info from the client.
>
> Right. Since 'mssg' is defined as arbitrary data it does seem
> legitimate for clients to pass NULL for it. It's just unfortunate that
> the core behaves differently when `mssg` is NULL...
>
>
> > I believe some doorbell clients do call
> >  mbox_send_message(chan, 0) , the arbitrary mssg 0 is unused by the driver
> > and everything works fine.
>
> Right (well, "NULL" instead of "0"). It's actually quite common. Using
> `git grep -A2 mbox_send_message.*NULL`, I see:
>
> * drivers/acpi/acpi_pcc.c
> * drivers/firmware/arm_scmi/transports/mailbox.c
> * drivers/firmware/imx/imx-dsp.c
> * drivers/firmware/tegra/bpmp-tegra186.c
> * drivers/irqchip/irq-qcom-mpm.c
> * drivers/remoteproc/xlnx_r5_remoteproc.c
> * drivers/rpmsg/qcom_glink_rpm.c
> * drivers/rpmsg/qcom_glink_smem.c
> * drivers/rpmsg/qcom_smd.c
> * drivers/soc/qcom/qcom_aoss.c
> * drivers/soc/qcom/smp2p.c
> * drivers/soc/qcom/smsm.c
> * drivers/soc/ti/wkup_m3_ipc.c
> * drivers/soc/xilinx/zynqmp_power.c
>
> ...or did you mean something different when you said drivers are
> passing "0" as an argument?
>
Actually I meant to say some 'constant' value, but yes 0 and NULL
don't make a difference.
The core needs to track the in-flight request in the 'active_req'
pointer which is also checked
to be non-null as a marker of busyness. That _can be_ a problem when
the client sends NULL
as the message, depending upon the submission pattern and speed.
Clients that call with NULL are likely simple one-message-at-a-time
users, but I haven't checked.
If it is a doorbell controller, the mssg may still carry the pointer
to data in shmem that the driver
can copy from tx_prepare() callback.


> >  I agree that is not ideal - we should differentiate between nothing and
> > 'null' data. Is that the issue you want addressed?
>
> I guess I'm just trying to clarify what the behavior _should_ be when
> `mssg` is NULL. At first I saw the current behavior as a bug. ...but
> then I saw the number of people that were using NULL `mssg` and I
> decided that fixing the current behavior might cause problems.
>
> Maybe it's good just to clarify the difference in behavior so we're
> all on the same page.
>
> Let's imagine that a mailbox client quickly queues up 3 messages:
>   static const u8 bogus_data[] = { 0 };
>   mbox_send_message(chan, (void*) bogus_data);
>   mbox_send_message(chan, (void*) bogus_data);
>   mbox_send_message(chan, (void*) bogus_data);
>
> What will that do? Assuming that the Apps Processor (AP) is much
> faster than the remote processor, it will start sending the first
> message right away and the other two will be queued. When the first
> message is sent, the second message will be sent. When the second
> message is sent, the third message will be sent. The remote processor
> is guaranteed to see 3 messages.
>
> Writing out a big diagram, I believe this will happen:
>
> AP                                  remote
> ----------------------------------- -------------------
> mbox_send_message(&bogus_data)
>   add_to_rbuf()
>   msg_submit()
>     ops->send_data()
>       ...
>       drv_trigger_doorbell()
> mbox_send_message(&bogus_data)
>   add_to_rbuf()
>   msg_submit()
>     NOP because active_req
> mbox_send_message(&bogus_data)
>   add_to_rbuf()
>   msg_submit()
>     NOP because active_req
>                                     handle_doorbell()
>                                       ack_doorbell()
> drv_handle_doorbell_ack()
>   mbox_chan_txdone()
>     tx_tick()
>       msg_submit()
>         ops->send_data()
>           ...
>           drv_trigger_doorbell()
>                                     handle_doorbell()
>                                       ack_doorbell()
> drv_handle_doorbell_ack()
>   mbox_chan_txdone()
>     tx_tick()
>       msg_submit()
>         ops->send_data()
>           ...
>           drv_trigger_doorbell()
>                                     handle_doorbell()
>                                       ack_doorbell()
> drv_handle_doorbell_ack()
>   mbox_chan_txdone()
>     tx_tick()
>
>
> Now, let's do the same but with NULL:
>   mbox_send_message(chan, NULL);
>   mbox_send_message(chan, NULL);
>   mbox_send_message(chan, NULL);
>
> Again, assuming that the AP is much faster and that the doorbell is a
> normal edge-triggered interrupt, today this will happen:
>
> AP                                  remote
> ----------------------------------- -------------------
> mbox_send_message(NULL)
>   add_to_rbuf()
>   msg_submit()
>     ops->send_data()
>       ...
>       drv_trigger_doorbell()
> mbox_send_message(NULL)
>   add_to_rbuf()
>   msg_submit()
>     ops->send_data()
>       ...
>       drv_trigger_doorbell()
> mbox_send_message(NULL)
>   add_to_rbuf()
>   msg_submit()
>     ops->send_data()
>       ...
>       drv_trigger_doorbell()
>                                     handle_doorbell()
>                                       ack_doorbell()
> drv_handle_doorbell_ack()
>   mbox_chan_txdone()
>     tx_tick()
>
>
> If the remote processor is much faster than the AP or the AP gets
> interrupted between the `mbox_send_message(NULL)` calls then obviously
> things could look a bit different, but hopefully the above makes it
> clear enough?
>
> As I said, at first I viewed the difference between the "NULL" and
> "bogus_data" cases as a bug and thought maybe I should fix it, but
> there truly could be cases where someone might want the existing
> behavior.
>
> Presumably anyone who is passing NULL for `mssg` has some other OOB
> (out of band) way to pass data to the remote processor and the
> doorbell is just telling the remote proc to look at the new OOB data.
> With that, let's look at a (made up) example where the AP is trying to
> let the remote processor know whenever the "event_counter" changed.
> I'd imagine something like this:
>
> AP                                  remote
> ----------------------------------- -------------------
> shared_oob_event_counter++;
> mbox_send_message(NULL)
> shared_oob_event_counter++;
> mbox_send_message(NULL)
> shared_oob_event_counter++;
> mbox_send_message(NULL)
>                                     handle_doorbell()
>                                       read_and_handle_event_counter()
>                                       ack_doorbell()
>
> In other words the remote processor sees the doorbell, handles all the
> changes to the event counter, and then acks. If we instead fixed the
> NULL case to actually queue up additional messages then the remote
> processor would get 3 doorbells and the "event_counter" wouldn't
> change at all for the last two...
>
> I don't know for sure, but I'm imagining that at least some of the
> existing 14 drivers that are passing NULL for `mssg` may be relying on
> the existing behavior... From some quick spot checking:
>
> smp2p_update_bits() - Updates (out of band) shared memory and uses
> mbox_send_message(NULL) to kick the remote processor. That sounds very
> similar to my made up "event_counter" example.
>
> smsm_update_bits() seems similar
>
> Many of the other examples I looked at looked like they weren't
> relying on the existing behavior, but I didn't look at every one and
> it's harder to trace all the usage of some of the drivers...
>
>
> Anyway, I'm not 100% set on any one solution, I just see that the NULL
> case is (unexpectedly) different and either want it documented or to
> fix it so it's not different (ideally without angry people with
> pitchforks coming after me because I changed behavior). :-)
>
Yes, I too think we should leave the existing api alone to be safe.
For new pure-doorbell clients, how about something like
   #define MBOX_NODATA   ERR_PTR(-ENOMEM)
   mailbox_ring_doorbell(chan)
   {
       mailbox_send_message(chan, MBOX_NODATA)
   }
because the underlying controller driver anyway ignores the submitted
'mssg' pointer.

And of course add the clarification comment in this patch.

Thanks

Re: [PATCH] mailbox: Document the behavior of mbox_send_message() w/ NULL mssg

Posted by Doug Anderson 1 month ago

Hi,

On Mon, Jan 5, 2026 at 5:57 PM Jassi Brar <jassisinghbrar@gmail.com> wrote:
>
> > Right (well, "NULL" instead of "0"). It's actually quite common. Using
> > `git grep -A2 mbox_send_message.*NULL`, I see:
> >
> > * drivers/acpi/acpi_pcc.c
> > * drivers/firmware/arm_scmi/transports/mailbox.c
> > * drivers/firmware/imx/imx-dsp.c
> > * drivers/firmware/tegra/bpmp-tegra186.c
> > * drivers/irqchip/irq-qcom-mpm.c
> > * drivers/remoteproc/xlnx_r5_remoteproc.c
> > * drivers/rpmsg/qcom_glink_rpm.c
> > * drivers/rpmsg/qcom_glink_smem.c
> > * drivers/rpmsg/qcom_smd.c
> > * drivers/soc/qcom/qcom_aoss.c
> > * drivers/soc/qcom/smp2p.c
> > * drivers/soc/qcom/smsm.c
> > * drivers/soc/ti/wkup_m3_ipc.c
> > * drivers/soc/xilinx/zynqmp_power.c
> >
> > ...or did you mean something different when you said drivers are
> > passing "0" as an argument?
> >
> Actually I meant to say some 'constant' value, but yes 0 and NULL
> don't make a difference.
> The core needs to track the in-flight request in the 'active_req'
> pointer which is also checked
> to be non-null as a marker of busyness. That _can be_ a problem when
> the client sends NULL
> as the message, depending upon the submission pattern and speed.
> Clients that call with NULL are likely simple one-message-at-a-time
> users, but I haven't checked.

I guess my worry is that some of them are _not_ one-message-at-a-time
and are relying on the current behavior of the core when `mssg` is
NULL. At least one downstream user I've studied was relying on this...


> > Anyway, I'm not 100% set on any one solution, I just see that the NULL
> > case is (unexpectedly) different and either want it documented or to
> > fix it so it's not different (ideally without angry people with
> > pitchforks coming after me because I changed behavior). :-)
> >
> Yes, I too think we should leave the existing api alone to be safe.
> For new pure-doorbell clients, how about something like
>    #define MBOX_NODATA   ERR_PTR(-ENOMEM)
>    mailbox_ring_doorbell(chan)
>    {
>        mailbox_send_message(chan, MBOX_NODATA)
>    }
> because the underlying controller driver anyway ignores the submitted
> 'mssg' pointer.
>
> And of course add the clarification comment in this patch.

OK, how about this for a plan:

1. A patch to introduce the new mbox_ring_doorbell() API, which will
behave nearly the same as the existing mailbox_send_message(chan,
NULL) case.

2. A series of patches transitioning all existing upstream users that
are passing NULL to use mbox_ring_doorbell().

3. A patch making it an error to pass NULL as the `mmsg`.

Does that work for you?

-Doug

Re: [PATCH] mailbox: Document the behavior of mbox_send_message() w/ NULL mssg

Posted by Jassi Brar 1 month ago

On Tue, Jan 6, 2026 at 11:42 AM Doug Anderson <dianders@chromium.org> wrote:
>
> > > Anyway, I'm not 100% set on any one solution, I just see that the NULL
> > > case is (unexpectedly) different and either want it documented or to
> > > fix it so it's not different (ideally without angry people with
> > > pitchforks coming after me because I changed behavior). :-)
> > >
> > Yes, I too think we should leave the existing api alone to be safe.
> > For new pure-doorbell clients, how about something like
> >    #define MBOX_NODATA   ERR_PTR(-ENOMEM)
> >    mailbox_ring_doorbell(chan)
> >    {
> >        mailbox_send_message(chan, MBOX_NODATA)
> >    }
> > because the underlying controller driver anyway ignores the submitted
> > 'mssg' pointer.
> >
> > And of course add the clarification comment in this patch.
>
> OK, how about this for a plan:
>
> 1. A patch to introduce the new mbox_ring_doorbell() API, which will
> behave nearly the same as the existing mailbox_send_message(chan,
> NULL) case.
>
> 2. A series of patches transitioning all existing upstream users that
> are passing NULL to use mbox_ring_doorbell().
>
> 3. A patch making it an error to pass NULL as the `mmsg`.
>
> Does that work for you?
>
Sounds good. Thank you!