[PATCH v2 00/13] OMAP mailbox FIFO removal

Andrew Davis posted 13 patches 1 year, 10 months ago
drivers/mailbox/Kconfig        |   9 -
drivers/mailbox/omap-mailbox.c | 519 +++++++--------------------------
include/linux/omap-mailbox.h   |  13 -
3 files changed, 108 insertions(+), 433 deletions(-)
[PATCH v2 00/13] OMAP mailbox FIFO removal
Posted by Andrew Davis 1 year, 10 months ago
Hello all,

Core of this series is the last patch removing the message FIFO
from OMAP mailbox. This hurts our real-time performance. It was a
legacy leftover from before the common mailbox framework anyway.

The rest of the patches are cleanups found along the way.

Thanks,
Andrew

Changes for v2:
 - Use threaded irq as suggested by Hari and to
     fix possible "scheduling while atomic" issue
 - Use oneshot irq as we do not want to enable the
     irq again until we clear our the messages
 - Rebase on v6.9-rc3

Andrew Davis (13):
  mailbox: omap: Remove unused omap_mbox_{enable,disable}_irq()
    functions
  mailbox: omap: Remove unused omap_mbox_request_channel() function
  mailbox: omap: Move omap_mbox_irq_t into driver
  mailbox: omap: Move fifo size check to point of use
  mailbox: omap: Remove unneeded header omap-mailbox.h
  mailbox: omap: Remove device class
  mailbox: omap: Use devm_pm_runtime_enable() helper
  mailbox: omap: Merge mailbox child node setup loops
  mailbox: omap: Use function local struct mbox_controller
  mailbox: omap: Use mbox_controller channel list directly
  mailbox: omap: Remove mbox_chan_to_omap_mbox()
  mailbox: omap: Reverse FIFO busy check logic
  mailbox: omap: Remove kernel FIFO message queuing

 drivers/mailbox/Kconfig        |   9 -
 drivers/mailbox/omap-mailbox.c | 519 +++++++--------------------------
 include/linux/omap-mailbox.h   |  13 -
 3 files changed, 108 insertions(+), 433 deletions(-)

-- 
2.39.2
Re: [PATCH v2 00/13] OMAP mailbox FIFO removal
Posted by Dominic Rath 1 year, 7 months ago
Hello Andrew,

On 10.04.2024 15:59, Andrew Davis wrote:
> Changes for v2:
>   - Use threaded irq as suggested by Hari and to
>       fix possible "scheduling while atomic" issue

sorry for beeing late, I noticed this already got merged.

I was wondering what the reason was for ending up with the
threaded irq.

In your v1 thread your final conclusion appeared to be

> So for now I just kept this using the regular IRQ context as before.

We looked into this some time ago, and noticed that the IRQ approach 
caused problems in the virtio/rpmsg code. I'd like to understand if your 
change was for the same reason, or something else we missed before.

Regards,

Dominic
Re: [PATCH v2 00/13] OMAP mailbox FIFO removal
Posted by Andrew Davis 1 year, 7 months ago
On 6/13/24 2:58 AM, Dominic Rath wrote:
> Hello Andrew,
> 
> On 10.04.2024 15:59, Andrew Davis wrote:
>> Changes for v2:
>>   - Use threaded irq as suggested by Hari and to
>>       fix possible "scheduling while atomic" issue
> 
> sorry for beeing late, I noticed this already got merged.
> 
> I was wondering what the reason was for ending up with the
> threaded irq.
> 
> In your v1 thread your final conclusion appeared to be
> 
>> So for now I just kept this using the regular IRQ context as before.
> 
> We looked into this some time ago, and noticed that the IRQ approach caused problems in the virtio/rpmsg code. I'd like to understand if your change was for the same reason, or something else we missed before.
> 

It is most likely the same reason. Seems despite its name, rproc_vq_interrupt() cannot
be called from an IRQ/atomic context. As the following backtrace shows, that function
calls down into functions which are not IRQ safe. So we needed to keep it threaded:

[    5.389374] BUG: scheduling while atomic: (udev-worker)/232/0x00010002
[    5.395917] Modules linked in: videobuf2_dma_contig videobuf2_memops videobuf2_v4l2 phy_j721e_wiz display_connector omap_mailbox(+) videodev tps6594_i2c(+) videobuf2_common phy_can_transceiver at24 cd6
[    5.433562] CPU: 0 PID: 232 Comm: (udev-worker) Not tainted 6.10.0-rc1-next-20240528-dirty #10
[    5.442158] Hardware name: Texas Instruments AM69 SK (DT)
[    5.447540] Call trace:
[    5.449976]  dump_backtrace+0x94/0xec
[    5.453640]  show_stack+0x18/0x24
[    5.456944]  dump_stack_lvl+0x78/0x90
[    5.460598]  dump_stack+0x18/0x24
[    5.463900]  __schedule_bug+0x50/0x68
[    5.467552]  __schedule+0x80c/0xb0c
[    5.471029]  schedule+0x34/0x104
[    5.474243]  schedule_preempt_disabled+0x24/0x40
[    5.478845]  rwsem_down_write_slowpath+0x31c/0x56c
[    5.483622]  down_write+0x90/0x94
[    5.486924]  kernfs_add_one+0x3c/0x148
[    5.490661]  kernfs_create_dir_ns+0x50/0x94
[    5.494830]  sysfs_create_dir_ns+0x70/0x10c
[    5.498999]  kobject_add_internal+0x98/0x26c
[    5.503254]  kobject_add+0x9c/0x10c
[    5.506729]  device_add+0xc0/0x790
[    5.510120]  rpmsg_register_device_override+0x10c/0x1c0
[    5.515333]  rpmsg_register_device+0x14/0x20
[    5.519590]  virtio_rpmsg_create_channel+0xb0/0x104
[    5.524452]  rpmsg_create_channel+0x28/0x60
[    5.528622]  rpmsg_ns_cb+0x100/0x1dc
[    5.532185]  rpmsg_recv_done+0x114/0x2e4
[    5.536094]  vring_interrupt+0x68/0xa4
[    5.539833]  rproc_vq_interrupt+0x2c/0x48
[    5.543830]  k3_r5_rproc_mbox_callback+0x84/0x90 [ti_k3_r5_remoteproc]
[    5.550348]  mbox_chan_received_data+0x1c/0x2c
[    5.554779]  mbox_interrupt+0xa0/0x17c [omap_mailbox]
[    5.559820]  __handle_irq_event_percpu+0x48/0x13c
[    5.564511]  handle_irq_event+0x4c/0xac

Andrew

> Regards,
> 
> Dominic
Re: [PATCH v2 00/13] OMAP mailbox FIFO removal
Posted by Dominic Rath 1 year, 7 months ago
On 13.06.2024 14:22, Andrew Davis wrote:
>> We looked into this some time ago, and noticed that the IRQ approach 
>> caused problems in the virtio/rpmsg code. I'd like to understand if 
>> your change was for the same reason, or something else we missed before.
>>
> 
> It is most likely the same reason. Seems despite its name, 
> rproc_vq_interrupt() cannot
> be called from an IRQ/atomic context. As the following backtrace shows, 
> that function
> calls down into functions which are not IRQ safe. So we needed to keep 
> it threaded:
> 
> [    5.389374] BUG: scheduling while atomic: (udev-worker)/232/0x00010002
> [    5.395917] Modules linked in: videobuf2_dma_contig videobuf2_memops 
> videobuf2_v4l2 phy_j721e_wiz display_connector omap_mailbox(+) videodev 
> tps6594_i2c(+) videobuf2_common phy_can_transceiver at24 cd6
> [    5.433562] CPU: 0 PID: 232 Comm: (udev-worker) Not tainted 
> 6.10.0-rc1-next-20240528-dirty #10
> [    5.442158] Hardware name: Texas Instruments AM69 SK (DT)
> [    5.447540] Call trace:
> [    5.449976]  dump_backtrace+0x94/0xec
> [    5.453640]  show_stack+0x18/0x24
> [    5.456944]  dump_stack_lvl+0x78/0x90
> [    5.460598]  dump_stack+0x18/0x24
> [    5.463900]  __schedule_bug+0x50/0x68
> [    5.467552]  __schedule+0x80c/0xb0c
> [    5.471029]  schedule+0x34/0x104
> [    5.474243]  schedule_preempt_disabled+0x24/0x40
> [    5.478845]  rwsem_down_write_slowpath+0x31c/0x56c
> [    5.483622]  down_write+0x90/0x94
> [    5.486924]  kernfs_add_one+0x3c/0x148
> [    5.490661]  kernfs_create_dir_ns+0x50/0x94
> [    5.494830]  sysfs_create_dir_ns+0x70/0x10c
> [    5.498999]  kobject_add_internal+0x98/0x26c
> [    5.503254]  kobject_add+0x9c/0x10c
> [    5.506729]  device_add+0xc0/0x790
> [    5.510120]  rpmsg_register_device_override+0x10c/0x1c0
> [    5.515333]  rpmsg_register_device+0x14/0x20
> [    5.519590]  virtio_rpmsg_create_channel+0xb0/0x104
> [    5.524452]  rpmsg_create_channel+0x28/0x60
> [    5.528622]  rpmsg_ns_cb+0x100/0x1dc
> [    5.532185]  rpmsg_recv_done+0x114/0x2e4
> [    5.536094]  vring_interrupt+0x68/0xa4
> [    5.539833]  rproc_vq_interrupt+0x2c/0x48
> [    5.543830]  k3_r5_rproc_mbox_callback+0x84/0x90 [ti_k3_r5_remoteproc]
> [    5.550348]  mbox_chan_received_data+0x1c/0x2c
> [    5.554779]  mbox_interrupt+0xa0/0x17c [omap_mailbox]
> [    5.559820]  __handle_irq_event_percpu+0x48/0x13c
> [    5.564511]  handle_irq_event+0x4c/0xac
> 

I looked into this a bit more closely, together with the colleague who 
implemented our internal workaround, and we came up with one more concern:

Have you considered that this synchronous path from the (threaded) IRQ 
draining the mailbox down to the (potentially blocking) rpmsg_* calls 
might let the hardware mailbox grow full?

 From what I remember the vring (?) has room for 512 messages, but the 
hardware mailbox on e.g. the AM64x can only handle four messages. At 
least with the current implementation on TI's MCU+ SDK running on the 
R5f that could cause the R5f to block, waiting for room in the hardware 
mailbox, while there are plenty of vring buffers available.

Best Regards,

Dominic
Re: [PATCH v2 00/13] OMAP mailbox FIFO removal
Posted by Andrew Davis 1 year, 7 months ago
On 6/26/24 9:39 AM, Dominic Rath wrote:
> On 13.06.2024 14:22, Andrew Davis wrote:
>>> We looked into this some time ago, and noticed that the IRQ approach caused problems in the virtio/rpmsg code. I'd like to understand if your change was for the same reason, or something else we missed before.
>>>
>>
>> It is most likely the same reason. Seems despite its name, rproc_vq_interrupt() cannot
>> be called from an IRQ/atomic context. As the following backtrace shows, that function
>> calls down into functions which are not IRQ safe. So we needed to keep it threaded:
>>
>> [    5.389374] BUG: scheduling while atomic: (udev-worker)/232/0x00010002
>> [    5.395917] Modules linked in: videobuf2_dma_contig videobuf2_memops videobuf2_v4l2 phy_j721e_wiz display_connector omap_mailbox(+) videodev tps6594_i2c(+) videobuf2_common phy_can_transceiver at24 cd6
>> [    5.433562] CPU: 0 PID: 232 Comm: (udev-worker) Not tainted 6.10.0-rc1-next-20240528-dirty #10
>> [    5.442158] Hardware name: Texas Instruments AM69 SK (DT)
>> [    5.447540] Call trace:
>> [    5.449976]  dump_backtrace+0x94/0xec
>> [    5.453640]  show_stack+0x18/0x24
>> [    5.456944]  dump_stack_lvl+0x78/0x90
>> [    5.460598]  dump_stack+0x18/0x24
>> [    5.463900]  __schedule_bug+0x50/0x68
>> [    5.467552]  __schedule+0x80c/0xb0c
>> [    5.471029]  schedule+0x34/0x104
>> [    5.474243]  schedule_preempt_disabled+0x24/0x40
>> [    5.478845]  rwsem_down_write_slowpath+0x31c/0x56c
>> [    5.483622]  down_write+0x90/0x94
>> [    5.486924]  kernfs_add_one+0x3c/0x148
>> [    5.490661]  kernfs_create_dir_ns+0x50/0x94
>> [    5.494830]  sysfs_create_dir_ns+0x70/0x10c
>> [    5.498999]  kobject_add_internal+0x98/0x26c
>> [    5.503254]  kobject_add+0x9c/0x10c
>> [    5.506729]  device_add+0xc0/0x790
>> [    5.510120]  rpmsg_register_device_override+0x10c/0x1c0
>> [    5.515333]  rpmsg_register_device+0x14/0x20
>> [    5.519590]  virtio_rpmsg_create_channel+0xb0/0x104
>> [    5.524452]  rpmsg_create_channel+0x28/0x60
>> [    5.528622]  rpmsg_ns_cb+0x100/0x1dc
>> [    5.532185]  rpmsg_recv_done+0x114/0x2e4
>> [    5.536094]  vring_interrupt+0x68/0xa4
>> [    5.539833]  rproc_vq_interrupt+0x2c/0x48
>> [    5.543830]  k3_r5_rproc_mbox_callback+0x84/0x90 [ti_k3_r5_remoteproc]
>> [    5.550348]  mbox_chan_received_data+0x1c/0x2c
>> [    5.554779]  mbox_interrupt+0xa0/0x17c [omap_mailbox]
>> [    5.559820]  __handle_irq_event_percpu+0x48/0x13c
>> [    5.564511]  handle_irq_event+0x4c/0xac
>>
> 
> I looked into this a bit more closely, together with the colleague who implemented our internal workaround, and we came up with one more concern:
> 
> Have you considered that this synchronous path from the (threaded) IRQ draining the mailbox down to the (potentially blocking) rpmsg_* calls might let the hardware mailbox grow full?
> 
>  From what I remember the vring (?) has room for 512 messages, but the hardware mailbox on e.g. the AM64x can only handle four messages. At least with the current implementation on TI's MCU+ SDK running on the R5f that could cause the R5f to block, waiting for room in the hardware mailbox, while there are plenty of vring buffers available.
> 

We would like to switch back to the non-threaded handler at some point. As you mention doing this
in a threaded way increase the risk of the hardware message queue filling and blocking the remote side.
(Plus the threaded handling can add latency to the message handling which should be avoided for real-time
reasons)

The fix might be to have rpmsg_recv_done() callback simply queue the message and then schedule another
worker to actually do the next stage processing. That way complex actions on messages do not block
vring_interrupt() which should be treated like an atomic context call.

Anyway for now, I'd expect the much faster host core (2xA53 @ 1GHZ in AM64) to be able to outpace the
R5s and keep the mailbox drained. Are you actually running into this issue or is the concern based on
ensuring RT performance(not blocking on mailbox queue filled) on the R5 side?

Andrew

> Best Regards,
> 
> Dominic
Re: [PATCH v2 00/13] OMAP mailbox FIFO removal
Posted by Dominic Rath 1 year, 7 months ago
On 13.06.2024 14:22, Andrew Davis wrote:
>> We looked into this some time ago, and noticed that the IRQ approach 
>> caused problems in the virtio/rpmsg code. I'd like to understand if 
>> your change was for the same reason, or something else we missed before.
>>
> 
> It is most likely the same reason. Seems despite its name, 
> rproc_vq_interrupt() cannot
> be called from an IRQ/atomic context. As the following backtrace shows, 
> that function
> calls down into functions which are not IRQ safe. So we needed to keep 
> it threaded:

Thanks for confirming. This is exactly what we've been seeing.

Regards,

Dominic