[v11] crypto/dmaengine: qce: introduce BAM locking and use DMA for register I/O

[PATCH RFC v11 00/12] crypto/dmaengine: qce: introduce BAM locking and use DMA for register I/O

Posted by Bartosz Golaszewski 1 month, 1 week ago

NOTE: Please note that even though this is version 11, I changed the
prefix to RFC as this is an entirely new approach resulting from
discussions under v9. I AM AWARE of the existing memory leaks in the
last patch of this series - I'm sending it because I want to first
discuss the approach and get a green light from Vinod as well as Mani
and Bjorn. Especially when it comes to communicating the address for the
dummy rights from the client to the BAM driver.
/NOTE

Currently the QCE crypto driver accesses the crypto engine registers
directly via CPU. Trust Zone may perform crypto operations simultaneously
resulting in a race condition. To remedy that, let's introduce support
for BAM locking/unlocking to the driver. The BAM driver will now wrap
any existing issued descriptor chains with additional descriptors
performing the locking when the client starts the transaction
(dmaengine_issue_pending()). The client wanting to profit from locking
needs to switch to performing register I/O over DMA and communicate the
address to which to perform the dummy writes via a call to
dmaengine_slave_config().

In the specific case of the BAM DMA this translates to sending command
descriptors performing dummy writes with the relevant flags set. The BAM
will then lock all other pipes not related to the current pipe group, and
keep handling the current pipe only until it sees the the unlock bit.

In order for the locking to work correctly, we also need to switch to
using DMA for all register I/O.

On top of this, the series contains some additional tweaks and
refactoring.

The goal of this is not to improve the performance but to prepare the
driver for supporting decryption into secure buffers in the future.

Tested with tcrypt.ko, kcapi and cryptsetup.

Shout out to Daniel and Udit from Qualcomm for helping me out with some
DMA issues we encountered.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
---
Changes in v11:
- Use new approach, not requiring the client to be involved in locking.
- Add a patch constifying dma_descriptor_metadata_ops
- Rebase on top of v7.0-rc1
- Link to v10: https://lore.kernel.org/r/20251219-qcom-qce-cmd-descr-v10-0-ff7e4bf7dad4@oss.qualcomm.com

Changes in v10:
- Move DESC_FLAG_(UN)LOCK BIT definitions from patch 2 to 3
- Add a patch constifying the dma engine metadata as the first in the
  series
- Use the VERSION register for dummy lock/unlock writes
- Link to v9: https://lore.kernel.org/r/20251128-qcom-qce-cmd-descr-v9-0-9a5f72b89722@linaro.org

Changes in v9:
- Drop the global, generic LOCK/UNLOCK flags and instead use DMA
  descriptor metadata ops to pass BAM-specific information from the QCE
  to the DMA engine
- Link to v8: https://lore.kernel.org/r/20251106-qcom-qce-cmd-descr-v8-0-ecddca23ca26@linaro.org

Changes in v8:
- Rework the command descriptor logic and drop a lot of unneeded code
- Use the physical address for BAM command descriptor access, not the
  mapped DMA address
- Fix the problems with iommu faults on newer platforms
- Generalize the LOCK/UNLOCK flags in dmaengine and reword the docs and
  commit messages
- Make the BAM locking logic stricter in the DMA engine driver
- Add some additional minor QCE driver refactoring changes to the series
- Lots of small reworks and tweaks to rebase on current mainline and fix
  previous issues
- Link to v7: https://lore.kernel.org/all/20250311-qce-cmd-descr-v7-0-db613f5d9c9f@linaro.org/

Changes in v7:
- remove unused code: writing to multiple registers was not used in v6,
  neither were the functions for reading registers over BAM DMA-
- remove
- don't read the SW_VERSION register needlessly in the BAM driver,
  instead: encode the information on whether the IP supports BAM locking
  in device match data
- shrink code where possible with logic modifications (for instance:
  change the implementation of qce_write() instead of replacing it
  everywhere with a new symbol)
- remove duplicated error messages
- rework commit messages
- a lot of shuffling code around for easier review and a more
  streamlined series
- Link to v6: https://lore.kernel.org/all/20250115103004.3350561-1-quic_mdalam@quicinc.com/

Changes in v6:
- change "BAM" to "DMA"
- Ensured this series is compilable with the current Linux-next tip of
  the tree (TOT).

Changes in v5:
- Added DMA_PREP_LOCK and DMA_PREP_UNLOCK flag support in separate patch
- Removed DMA_PREP_LOCK & DMA_PREP_UNLOCK flag
- Added FIELD_GET and GENMASK macro to extract major and minor version

Changes in v4:
- Added feature description and test hardware
  with test command
- Fixed patch version numbering
- Dropped dt-binding patch
- Dropped device tree changes
- Added BAM_SW_VERSION register read
- Handled the error path for the api dma_map_resource()
  in probe
- updated the commit messages for batter redability
- Squash the change where qce_bam_acquire_lock() and
  qce_bam_release_lock() api got introduce to the change where
  the lock/unlock flag get introced
- changed cover letter subject heading to
  "dmaengine: qcom: bam_dma: add cmd descriptor support"
- Added the very initial post for BAM lock/unlock patch link
  as v1 to track this feature

Changes in v3:
- https://lore.kernel.org/lkml/183d4f5e-e00a-8ef6-a589-f5704bc83d4a@quicinc.com/
- Addressed all the comments from v2
- Added the dt-binding
- Fix alignment issue
- Removed type casting from qce_write_reg_dma()
  and qce_read_reg_dma()
- Removed qce_bam_txn = dma->qce_bam_txn; line from
  qce_alloc_bam_txn() api and directly returning
  dma->qce_bam_txn

Changes in v2:
- https://lore.kernel.org/lkml/20231214114239.2635325-1-quic_mdalam@quicinc.com/
- Initial set of patches for cmd descriptor support
- Add client driver to use BAM lock/unlock feature
- Added register read/write via BAM in QCE Crypto driver
  to use BAM lock/unlock feature

---
Bartosz Golaszewski (12):
      crypto: qce - Include algapi.h in the core.h header
      crypto: qce - Remove unused ignore_buf
      crypto: qce - Simplify arguments of devm_qce_dma_request()
      crypto: qce - Use existing devres APIs in devm_qce_dma_request()
      crypto: qce - Map crypto memory for DMA
      crypto: qce - Add BAM DMA support for crypto register I/O
      crypto: qce - Communicate the base physical address to the dmaengine
      dmaengine: constify struct dma_descriptor_metadata_ops
      dmaengine: qcom: bam_dma: convert tasklet to a BH workqueue
      dmaengine: qcom: bam_dma: Extend the driver's device match data
      dmaengine: qcom: bam_dma: Add pipe_lock_supported flag support
      dmaengine: qcom: bam_dma: add support for BAM locking

 drivers/crypto/qce/aead.c       |   2 -
 drivers/crypto/qce/common.c     |  20 +++--
 drivers/crypto/qce/core.c       |  29 ++++++-
 drivers/crypto/qce/core.h       |  11 +++
 drivers/crypto/qce/dma.c        | 151 ++++++++++++++++++++++++++++++-------
 drivers/crypto/qce/dma.h        |  11 ++-
 drivers/crypto/qce/sha.c        |   2 -
 drivers/crypto/qce/skcipher.c   |   2 -
 drivers/dma/qcom/bam_dma.c      | 163 ++++++++++++++++++++++++++++++++++------
 drivers/dma/ti/k3-udma.c        |   2 +-
 drivers/dma/xilinx/xilinx_dma.c |   2 +-
 include/linux/dmaengine.h       |   2 +-
 12 files changed, 321 insertions(+), 76 deletions(-)
---
base-commit: d8cfb11ffd167a413269993d7e24d307ad47aa49
change-id: 20251103-qcom-qce-cmd-descr-c5e9b11fe609

Best regards,
-- 
Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>

Re: [PATCH RFC v11 00/12] crypto/dmaengine: qce: introduce BAM locking and use DMA for register I/O

Posted by Manivannan Sadhasivam 1 month ago

On Mon, Mar 02, 2026 at 04:57:13PM +0100, Bartosz Golaszewski wrote:
> NOTE: Please note that even though this is version 11, I changed the
> prefix to RFC as this is an entirely new approach resulting from
> discussions under v9. I AM AWARE of the existing memory leaks in the
> last patch of this series - I'm sending it because I want to first
> discuss the approach and get a green light from Vinod as well as Mani
> and Bjorn. Especially when it comes to communicating the address for the
> dummy rights from the client to the BAM driver.
> /NOTE
> 
> Currently the QCE crypto driver accesses the crypto engine registers
> directly via CPU. Trust Zone may perform crypto operations simultaneously
> resulting in a race condition. To remedy that, let's introduce support
> for BAM locking/unlocking to the driver. The BAM driver will now wrap
> any existing issued descriptor chains with additional descriptors
> performing the locking when the client starts the transaction
> (dmaengine_issue_pending()). The client wanting to profit from locking
> needs to switch to performing register I/O over DMA and communicate the
> address to which to perform the dummy writes via a call to
> dmaengine_slave_config().
> 

Thanks for moving the LOCK/UNLOCK bits out of client to the BAM driver. It looks
neat now. I understand the limitation that for LOCK/UNLOCK, BAM needs to perform
a dummy write to an address in the client register space. So in this case, you
can also use the previous metadata approach to pass the scratchpad register to
the BAM driver from clients. The BAM driver can use this register to perform
LOCK/UNLOCK.

It may sound like I'm suggesting a part of your previous design, but it fits the
design more cleanly IMO. The BAM performs LOCK/UNLOCK on its own, but it gets
the scratchpad register address from the clients through the metadata once.

It is very unfortunate that the IP doesn't accept '0' address for LOCK/UNLOCK or
some of them cannot append LOCK/UNLOCK to the actual CMD descriptors passed from
the clients. These would've made the code/design even more cleaner.

- Mani

-- 
மணிவண்ணன் சதாசிவம்

Re: [PATCH RFC v11 00/12] crypto/dmaengine: qce: introduce BAM locking and use DMA for register I/O

Posted by Stephan Gerhold 1 month ago

On Tue, Mar 03, 2026 at 06:13:56PM +0530, Manivannan Sadhasivam wrote:
> On Mon, Mar 02, 2026 at 04:57:13PM +0100, Bartosz Golaszewski wrote:
> > NOTE: Please note that even though this is version 11, I changed the
> > prefix to RFC as this is an entirely new approach resulting from
> > discussions under v9. I AM AWARE of the existing memory leaks in the
> > last patch of this series - I'm sending it because I want to first
> > discuss the approach and get a green light from Vinod as well as Mani
> > and Bjorn. Especially when it comes to communicating the address for the
> > dummy rights from the client to the BAM driver.
> > /NOTE
> > 
> > Currently the QCE crypto driver accesses the crypto engine registers
> > directly via CPU. Trust Zone may perform crypto operations simultaneously
> > resulting in a race condition. To remedy that, let's introduce support
> > for BAM locking/unlocking to the driver. The BAM driver will now wrap
> > any existing issued descriptor chains with additional descriptors
> > performing the locking when the client starts the transaction
> > (dmaengine_issue_pending()). The client wanting to profit from locking
> > needs to switch to performing register I/O over DMA and communicate the
> > address to which to perform the dummy writes via a call to
> > dmaengine_slave_config().
> > 
> 
> Thanks for moving the LOCK/UNLOCK bits out of client to the BAM driver. It looks
> neat now. I understand the limitation that for LOCK/UNLOCK, BAM needs to perform
> a dummy write to an address in the client register space. So in this case, you
> can also use the previous metadata approach to pass the scratchpad register to
> the BAM driver from clients. The BAM driver can use this register to perform
> LOCK/UNLOCK.
> 
> It may sound like I'm suggesting a part of your previous design, but it fits the
> design more cleanly IMO. The BAM performs LOCK/UNLOCK on its own, but it gets
> the scratchpad register address from the clients through the metadata once.
> 
> It is very unfortunate that the IP doesn't accept '0' address for LOCK/UNLOCK or
> some of them cannot append LOCK/UNLOCK to the actual CMD descriptors passed from
> the clients. These would've made the code/design even more cleaner.
> 

I was staring at the downstream drivers for QCE (qce50.c?) [1] for a bit
and my impression is that they manage to get along without dummy writes.
It's a big mess, but it looks like they always have some commands
(depending on the crypto operation) that they are sending anyway and
they just assign the LOCK/UNLOCK flag to the command descriptor of that.

It is similar for the second relevant user of the LOCK/UNLOCK flags, the
QPIC NAND driver (msm_qpic_nand.c in downstream [2], qcom_nandc.c in
mainline), it is assigned as part of the register programming sequence
instead of using a dummy write. In addition, the UNLOCK flag is
sometimes assigned to a READ command descriptor rather than a WRITE.

@Bartosz: Can we get by without doing any dummy writes?
If not, would a dummy read perhaps be less intrusive than a dummy write?

Thanks,
Stephan

[1]: https://git.codelinaro.org/clo/la/platform/vendor/qcom/opensource/securemsm-kernel/-/blob/sec-kernel.lnx.14.16.r4-rel/crypto-qti/qce50.c
[2]: https://git.codelinaro.org/clo/la/kernel/msm-5.15/-/blob/kernel.lnx.5.15.r46-rel/drivers/mtd/devices/msm_qpic_nand.c#L542-562

Re: [PATCH RFC v11 00/12] crypto/dmaengine: qce: introduce BAM locking and use DMA for register I/O

Posted by Bartosz Golaszewski 1 month ago

On Thu, Mar 5, 2026 at 1:00 PM Stephan Gerhold
<stephan.gerhold@linaro.org> wrote:
>
> On Tue, Mar 03, 2026 at 06:13:56PM +0530, Manivannan Sadhasivam wrote:
> > On Mon, Mar 02, 2026 at 04:57:13PM +0100, Bartosz Golaszewski wrote:
> > > NOTE: Please note that even though this is version 11, I changed the
> > > prefix to RFC as this is an entirely new approach resulting from
> > > discussions under v9. I AM AWARE of the existing memory leaks in the
> > > last patch of this series - I'm sending it because I want to first
> > > discuss the approach and get a green light from Vinod as well as Mani
> > > and Bjorn. Especially when it comes to communicating the address for the
> > > dummy rights from the client to the BAM driver.
> > > /NOTE
> > >
> > > Currently the QCE crypto driver accesses the crypto engine registers
> > > directly via CPU. Trust Zone may perform crypto operations simultaneously
> > > resulting in a race condition. To remedy that, let's introduce support
> > > for BAM locking/unlocking to the driver. The BAM driver will now wrap
> > > any existing issued descriptor chains with additional descriptors
> > > performing the locking when the client starts the transaction
> > > (dmaengine_issue_pending()). The client wanting to profit from locking
> > > needs to switch to performing register I/O over DMA and communicate the
> > > address to which to perform the dummy writes via a call to
> > > dmaengine_slave_config().
> > >
> >
> > Thanks for moving the LOCK/UNLOCK bits out of client to the BAM driver. It looks
> > neat now. I understand the limitation that for LOCK/UNLOCK, BAM needs to perform
> > a dummy write to an address in the client register space. So in this case, you
> > can also use the previous metadata approach to pass the scratchpad register to
> > the BAM driver from clients. The BAM driver can use this register to perform
> > LOCK/UNLOCK.
> >
> > It may sound like I'm suggesting a part of your previous design, but it fits the
> > design more cleanly IMO. The BAM performs LOCK/UNLOCK on its own, but it gets
> > the scratchpad register address from the clients through the metadata once.
> >
> > It is very unfortunate that the IP doesn't accept '0' address for LOCK/UNLOCK or
> > some of them cannot append LOCK/UNLOCK to the actual CMD descriptors passed from
> > the clients. These would've made the code/design even more cleaner.
> >
>
> I was staring at the downstream drivers for QCE (qce50.c?) [1] for a bit
> and my impression is that they manage to get along without dummy writes.
> It's a big mess, but it looks like they always have some commands
> (depending on the crypto operation) that they are sending anyway and
> they just assign the LOCK/UNLOCK flag to the command descriptor of that.
>
> It is similar for the second relevant user of the LOCK/UNLOCK flags, the
> QPIC NAND driver (msm_qpic_nand.c in downstream [2], qcom_nandc.c in
> mainline), it is assigned as part of the register programming sequence
> instead of using a dummy write. In addition, the UNLOCK flag is
> sometimes assigned to a READ command descriptor rather than a WRITE.
>
> @Bartosz: Can we get by without doing any dummy writes?
> If not, would a dummy read perhaps be less intrusive than a dummy write?
>

The HPG says that the LOCK/UNLOCK flag *must* be set on a command
descriptor, not a data descriptor. For a simple encryption we will
typically have a data descriptor and a command descriptor with
register writes. So we need a command descriptor in front of the data
and - while we could technically set the UNLOCK bit on the subsequent
command descriptor - it's unclear from the HPG whether it will unlock
before or after processing the command descriptor with the UNLOCK bit
set. Hence the additional command descriptor at the end.

The HPG also only mentions a write command and says nothing about a
read. In any case: that's the least of the problems as switching to
read doesn't solve the issue of passing the address of the scratchpad
register.

So while some of this *may* just work, I would prefer to stick to what
documentation says *will* work. :)

Bartosz

Re: [PATCH RFC v11 00/12] crypto/dmaengine: qce: introduce BAM locking and use DMA for register I/O

Posted by Stephan Gerhold 1 month ago

On Thu, Mar 05, 2026 at 02:10:55PM +0100, Bartosz Golaszewski wrote:
> On Thu, Mar 5, 2026 at 1:00 PM Stephan Gerhold
> <stephan.gerhold@linaro.org> wrote:
> >
> > On Tue, Mar 03, 2026 at 06:13:56PM +0530, Manivannan Sadhasivam wrote:
> > > On Mon, Mar 02, 2026 at 04:57:13PM +0100, Bartosz Golaszewski wrote:
> > > > NOTE: Please note that even though this is version 11, I changed the
> > > > prefix to RFC as this is an entirely new approach resulting from
> > > > discussions under v9. I AM AWARE of the existing memory leaks in the
> > > > last patch of this series - I'm sending it because I want to first
> > > > discuss the approach and get a green light from Vinod as well as Mani
> > > > and Bjorn. Especially when it comes to communicating the address for the
> > > > dummy rights from the client to the BAM driver.
> > > > /NOTE
> > > >
> > > > Currently the QCE crypto driver accesses the crypto engine registers
> > > > directly via CPU. Trust Zone may perform crypto operations simultaneously
> > > > resulting in a race condition. To remedy that, let's introduce support
> > > > for BAM locking/unlocking to the driver. The BAM driver will now wrap
> > > > any existing issued descriptor chains with additional descriptors
> > > > performing the locking when the client starts the transaction
> > > > (dmaengine_issue_pending()). The client wanting to profit from locking
> > > > needs to switch to performing register I/O over DMA and communicate the
> > > > address to which to perform the dummy writes via a call to
> > > > dmaengine_slave_config().
> > > >
> > >
> > > Thanks for moving the LOCK/UNLOCK bits out of client to the BAM driver. It looks
> > > neat now. I understand the limitation that for LOCK/UNLOCK, BAM needs to perform
> > > a dummy write to an address in the client register space. So in this case, you
> > > can also use the previous metadata approach to pass the scratchpad register to
> > > the BAM driver from clients. The BAM driver can use this register to perform
> > > LOCK/UNLOCK.
> > >
> > > It may sound like I'm suggesting a part of your previous design, but it fits the
> > > design more cleanly IMO. The BAM performs LOCK/UNLOCK on its own, but it gets
> > > the scratchpad register address from the clients through the metadata once.
> > >
> > > It is very unfortunate that the IP doesn't accept '0' address for LOCK/UNLOCK or
> > > some of them cannot append LOCK/UNLOCK to the actual CMD descriptors passed from
> > > the clients. These would've made the code/design even more cleaner.
> > >
> >
> > I was staring at the downstream drivers for QCE (qce50.c?) [1] for a bit
> > and my impression is that they manage to get along without dummy writes.
> > It's a big mess, but it looks like they always have some commands
> > (depending on the crypto operation) that they are sending anyway and
> > they just assign the LOCK/UNLOCK flag to the command descriptor of that.
> >
> > It is similar for the second relevant user of the LOCK/UNLOCK flags, the
> > QPIC NAND driver (msm_qpic_nand.c in downstream [2], qcom_nandc.c in
> > mainline), it is assigned as part of the register programming sequence
> > instead of using a dummy write. In addition, the UNLOCK flag is
> > sometimes assigned to a READ command descriptor rather than a WRITE.
> >
> > @Bartosz: Can we get by without doing any dummy writes?
> > If not, would a dummy read perhaps be less intrusive than a dummy write?
> >
> 
> The HPG says that the LOCK/UNLOCK flag *must* be set on a command
> descriptor, not a data descriptor. For a simple encryption we will
> typically have a data descriptor and a command descriptor with
> register writes. So we need a command descriptor in front of the data
> and - while we could technically set the UNLOCK bit on the subsequent
> command descriptor - it's unclear from the HPG whether it will unlock
> before or after processing the command descriptor with the UNLOCK bit
> set. Hence the additional command descriptor at the end.
>

I won't pretend that I actually understand what the downstream QCE
driver is doing, but e.g. qce_ablk_cipher_req() in the qce50.c I linked
looks like they just put the command descriptor with all the register
writes first and then the data second (followed by another command
descriptor for cleanup/unlocking). Is it actually required to put the
data first?

> The HPG also only mentions a write command and says nothing about a
> read. In any case: that's the least of the problems as switching to
> read doesn't solve the issue of passing the address of the scratchpad
> register.

True.

> 
> So while some of this *may* just work, I would prefer to stick to what
> documentation says *will* work. :)
> 

Well, the question is if there is always a dummy register that can be
safely written (without causing any side effects). This will be always
just based on experiments, since the concept of a dummy write doesn't
seem to exist downstream (and I assume the documentation doesn't suggest
a specific register to use either).

NAND_VERSION (0xf08) might work for qcom_nandc.c (which might be the
only other relevant user of the BAM locking functionality...).

Thanks,
Stephan

Re: [PATCH RFC v11 00/12] crypto/dmaengine: qce: introduce BAM locking and use DMA for register I/O

Posted by Bartosz Golaszewski 1 month ago

On Thu, Mar 5, 2026 at 2:43 PM Stephan Gerhold
<stephan.gerhold@linaro.org> wrote:
>
> On Thu, Mar 05, 2026 at 02:10:55PM +0100, Bartosz Golaszewski wrote:
> > On Thu, Mar 5, 2026 at 1:00 PM Stephan Gerhold
> > <stephan.gerhold@linaro.org> wrote:
> > >
> > > On Tue, Mar 03, 2026 at 06:13:56PM +0530, Manivannan Sadhasivam wrote:
> > > > On Mon, Mar 02, 2026 at 04:57:13PM +0100, Bartosz Golaszewski wrote:
> > > > > NOTE: Please note that even though this is version 11, I changed the
> > > > > prefix to RFC as this is an entirely new approach resulting from
> > > > > discussions under v9. I AM AWARE of the existing memory leaks in the
> > > > > last patch of this series - I'm sending it because I want to first
> > > > > discuss the approach and get a green light from Vinod as well as Mani
> > > > > and Bjorn. Especially when it comes to communicating the address for the
> > > > > dummy rights from the client to the BAM driver.
> > > > > /NOTE
> > > > >
> > > > > Currently the QCE crypto driver accesses the crypto engine registers
> > > > > directly via CPU. Trust Zone may perform crypto operations simultaneously
> > > > > resulting in a race condition. To remedy that, let's introduce support
> > > > > for BAM locking/unlocking to the driver. The BAM driver will now wrap
> > > > > any existing issued descriptor chains with additional descriptors
> > > > > performing the locking when the client starts the transaction
> > > > > (dmaengine_issue_pending()). The client wanting to profit from locking
> > > > > needs to switch to performing register I/O over DMA and communicate the
> > > > > address to which to perform the dummy writes via a call to
> > > > > dmaengine_slave_config().
> > > > >
> > > >
> > > > Thanks for moving the LOCK/UNLOCK bits out of client to the BAM driver. It looks
> > > > neat now. I understand the limitation that for LOCK/UNLOCK, BAM needs to perform
> > > > a dummy write to an address in the client register space. So in this case, you
> > > > can also use the previous metadata approach to pass the scratchpad register to
> > > > the BAM driver from clients. The BAM driver can use this register to perform
> > > > LOCK/UNLOCK.
> > > >
> > > > It may sound like I'm suggesting a part of your previous design, but it fits the
> > > > design more cleanly IMO. The BAM performs LOCK/UNLOCK on its own, but it gets
> > > > the scratchpad register address from the clients through the metadata once.
> > > >
> > > > It is very unfortunate that the IP doesn't accept '0' address for LOCK/UNLOCK or
> > > > some of them cannot append LOCK/UNLOCK to the actual CMD descriptors passed from
> > > > the clients. These would've made the code/design even more cleaner.
> > > >
> > >
> > > I was staring at the downstream drivers for QCE (qce50.c?) [1] for a bit
> > > and my impression is that they manage to get along without dummy writes.
> > > It's a big mess, but it looks like they always have some commands
> > > (depending on the crypto operation) that they are sending anyway and
> > > they just assign the LOCK/UNLOCK flag to the command descriptor of that.
> > >
> > > It is similar for the second relevant user of the LOCK/UNLOCK flags, the
> > > QPIC NAND driver (msm_qpic_nand.c in downstream [2], qcom_nandc.c in
> > > mainline), it is assigned as part of the register programming sequence
> > > instead of using a dummy write. In addition, the UNLOCK flag is
> > > sometimes assigned to a READ command descriptor rather than a WRITE.
> > >
> > > @Bartosz: Can we get by without doing any dummy writes?
> > > If not, would a dummy read perhaps be less intrusive than a dummy write?
> > >
> >
> > The HPG says that the LOCK/UNLOCK flag *must* be set on a command
> > descriptor, not a data descriptor. For a simple encryption we will
> > typically have a data descriptor and a command descriptor with
> > register writes. So we need a command descriptor in front of the data
> > and - while we could technically set the UNLOCK bit on the subsequent
> > command descriptor - it's unclear from the HPG whether it will unlock
> > before or after processing the command descriptor with the UNLOCK bit
> > set. Hence the additional command descriptor at the end.
> >
>
> I won't pretend that I actually understand what the downstream QCE
> driver is doing, but e.g. qce_ablk_cipher_req() in the qce50.c I linked
> looks like they just put the command descriptor with all the register
> writes first and then the data second (followed by another command
> descriptor for cleanup/unlocking). Is it actually required to put the
> data first?
>

Well, now you're getting into the philosophical issue of imposing
requirements on the client which seemed to be the main point of
contention in earlier versions. If you start requiring the client to
put the DMA operations in a certain order (and it's not based on any
HW requirement but rather on how the DMA driver is implemented) then
how is it better than having the client just drive the locking
altogether like pre v11? We won't get away without at least some
requirements - like the client doing register I/O over DMA or
providing the scratchpad address - but I think just wrapping the
existing queue with additional descriptors in a way transparent to
consumers is better in this case. And as I said: the HPG doesn't
explicitly say that it unlocks the pipe *after* the descriptor with
the unlock bit is processed. Doesn't even hint at what real the
ordering is.

> > The HPG also only mentions a write command and says nothing about a
> > read. In any case: that's the least of the problems as switching to
> > read doesn't solve the issue of passing the address of the scratchpad
> > register.
>
> True.
>
> >
> > So while some of this *may* just work, I would prefer to stick to what
> > documentation says *will* work. :)
> >
>
> Well, the question is if there is always a dummy register that can be
> safely written (without causing any side effects). This will be always
> just based on experiments, since the concept of a dummy write doesn't
> seem to exist downstream (and I assume the documentation doesn't suggest
> a specific register to use either).
>

You'd think so but the HPG actually does use the word "dummy" to
describe the write operation with lock/unlock bits set. Though it does
not recommend any particular register to do it.

> NAND_VERSION (0xf08) might work for qcom_nandc.c (which might be the
> only other relevant user of the BAM locking functionality...).
>

Yeah, I do the same for QCE, write to the version register.

Bart

Re: [PATCH RFC v11 00/12] crypto/dmaengine: qce: introduce BAM locking and use DMA for register I/O

Posted by Stephan Gerhold 1 month ago

On Thu, Mar 05, 2026 at 02:54:02PM +0100, Bartosz Golaszewski wrote:
> On Thu, Mar 5, 2026 at 2:43 PM Stephan Gerhold
> <stephan.gerhold@linaro.org> wrote:
> >
> > On Thu, Mar 05, 2026 at 02:10:55PM +0100, Bartosz Golaszewski wrote:
> > > On Thu, Mar 5, 2026 at 1:00 PM Stephan Gerhold
> > > <stephan.gerhold@linaro.org> wrote:
> > > >
> > > > On Tue, Mar 03, 2026 at 06:13:56PM +0530, Manivannan Sadhasivam wrote:
> > > > > On Mon, Mar 02, 2026 at 04:57:13PM +0100, Bartosz Golaszewski wrote:
> > > > > > NOTE: Please note that even though this is version 11, I changed the
> > > > > > prefix to RFC as this is an entirely new approach resulting from
> > > > > > discussions under v9. I AM AWARE of the existing memory leaks in the
> > > > > > last patch of this series - I'm sending it because I want to first
> > > > > > discuss the approach and get a green light from Vinod as well as Mani
> > > > > > and Bjorn. Especially when it comes to communicating the address for the
> > > > > > dummy rights from the client to the BAM driver.
> > > > > > /NOTE
> > > > > >
> > > > > > Currently the QCE crypto driver accesses the crypto engine registers
> > > > > > directly via CPU. Trust Zone may perform crypto operations simultaneously
> > > > > > resulting in a race condition. To remedy that, let's introduce support
> > > > > > for BAM locking/unlocking to the driver. The BAM driver will now wrap
> > > > > > any existing issued descriptor chains with additional descriptors
> > > > > > performing the locking when the client starts the transaction
> > > > > > (dmaengine_issue_pending()). The client wanting to profit from locking
> > > > > > needs to switch to performing register I/O over DMA and communicate the
> > > > > > address to which to perform the dummy writes via a call to
> > > > > > dmaengine_slave_config().
> > > > > >
> > > > >
> > > > > Thanks for moving the LOCK/UNLOCK bits out of client to the BAM driver. It looks
> > > > > neat now. I understand the limitation that for LOCK/UNLOCK, BAM needs to perform
> > > > > a dummy write to an address in the client register space. So in this case, you
> > > > > can also use the previous metadata approach to pass the scratchpad register to
> > > > > the BAM driver from clients. The BAM driver can use this register to perform
> > > > > LOCK/UNLOCK.
> > > > >
> > > > > It may sound like I'm suggesting a part of your previous design, but it fits the
> > > > > design more cleanly IMO. The BAM performs LOCK/UNLOCK on its own, but it gets
> > > > > the scratchpad register address from the clients through the metadata once.
> > > > >
> > > > > It is very unfortunate that the IP doesn't accept '0' address for LOCK/UNLOCK or
> > > > > some of them cannot append LOCK/UNLOCK to the actual CMD descriptors passed from
> > > > > the clients. These would've made the code/design even more cleaner.
> > > > >
> > > >
> > > > I was staring at the downstream drivers for QCE (qce50.c?) [1] for a bit
> > > > and my impression is that they manage to get along without dummy writes.
> > > > It's a big mess, but it looks like they always have some commands
> > > > (depending on the crypto operation) that they are sending anyway and
> > > > they just assign the LOCK/UNLOCK flag to the command descriptor of that.
> > > >
> > > > It is similar for the second relevant user of the LOCK/UNLOCK flags, the
> > > > QPIC NAND driver (msm_qpic_nand.c in downstream [2], qcom_nandc.c in
> > > > mainline), it is assigned as part of the register programming sequence
> > > > instead of using a dummy write. In addition, the UNLOCK flag is
> > > > sometimes assigned to a READ command descriptor rather than a WRITE.
> > > >
> > > > @Bartosz: Can we get by without doing any dummy writes?
> > > > If not, would a dummy read perhaps be less intrusive than a dummy write?
> > > >
> > >
> > > The HPG says that the LOCK/UNLOCK flag *must* be set on a command
> > > descriptor, not a data descriptor. For a simple encryption we will
> > > typically have a data descriptor and a command descriptor with
> > > register writes. So we need a command descriptor in front of the data
> > > and - while we could technically set the UNLOCK bit on the subsequent
> > > command descriptor - it's unclear from the HPG whether it will unlock
> > > before or after processing the command descriptor with the UNLOCK bit
> > > set. Hence the additional command descriptor at the end.
> > >
> >
> > I won't pretend that I actually understand what the downstream QCE
> > driver is doing, but e.g. qce_ablk_cipher_req() in the qce50.c I linked
> > looks like they just put the command descriptor with all the register
> > writes first and then the data second (followed by another command
> > descriptor for cleanup/unlocking). Is it actually required to put the
> > data first?
> >
> 
> Well, now you're getting into the philosophical issue of imposing
> requirements on the client which seemed to be the main point of
> contention in earlier versions. If you start requiring the client to
> put the DMA operations in a certain order (and it's not based on any
> HW requirement but rather on how the DMA driver is implemented) then
> how is it better than having the client just drive the locking
> altogether like pre v11? We won't get away without at least some
> requirements - like the client doing register I/O over DMA or
> providing the scratchpad address - but I think just wrapping the
> existing queue with additional descriptors in a way transparent to
> consumers is better in this case. And as I said: the HPG doesn't
> explicitly say that it unlocks the pipe *after* the descriptor with
> the unlock bit is processed. Doesn't even hint at what real the
> ordering is.
> 

Yes, I think the transparent approach here is reasonable given the
design of the Linux DMA engine API. Since Mani commented "It is very
unfortunate that the IP doesn't [...]" I mainly wanted to point out that
this is probably because the main use cases the HW designers had in mind
don't strictly require use of a dummy descriptor. The comment about the
dummy descriptors might be more of a side note than an actual
recommendation to implement it that way (otherwise, the downstream
drivers would likely use the dummy descriptor approach as well).

> > > The HPG also only mentions a write command and says nothing about a
> > > read. In any case: that's the least of the problems as switching to
> > > read doesn't solve the issue of passing the address of the scratchpad
> > > register.
> >
> > True.
> >
> > >
> > > So while some of this *may* just work, I would prefer to stick to what
> > > documentation says *will* work. :)
> > >
> >
> > Well, the question is if there is always a dummy register that can be
> > safely written (without causing any side effects). This will be always
> > just based on experiments, since the concept of a dummy write doesn't
> > seem to exist downstream (and I assume the documentation doesn't suggest
> > a specific register to use either).
> >
> 
> You'd think so but the HPG actually does use the word "dummy" to
> describe the write operation with lock/unlock bits set. Though it does
> not recommend any particular register to do it.
> 

I guess the documentation I'm looking at (8.7.3.4 BAM operation in the
public APQ8016E TRM) might be an excerpt from some older version of the
BAM HPG. Is also has a note about "dummy" command descriptors:

  "NOTE: Pipe locking and unlocking should appear only in
   command-descriptor. In case a lock is required on a data descriptor
   this can be implemented by a dummy command descriptor with
   lock/unlock bit asserted preceding/following the data descriptor."

This one doesn't make any difference between READ and WRITE command
descriptors (and both are documented in the chapter).

Personally, I would prefer using a read over a write if possible. Unless
you can confirm that the register used for the dummy write is actually
read-only *and* write-ignore, writing to the register is essentially
undefined behavior. It will probably do the right thing on most
platforms, but there could also be one out there where writing to the
register triggers an error or potentially even silently ends up writing
into another register. Register logic can be fun in practice, commit
e9a48ea4d90b ("irqchip/qcom-pdc: Workaround hardware register bug on
X1E80100") [1] is a good example of that. :')

Thanks,
Stephan

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e9a48ea4d90be251e0d057d41665745caccb0351

Re: [PATCH RFC v11 00/12] crypto/dmaengine: qce: introduce BAM locking and use DMA for register I/O

Posted by Bartosz Golaszewski 1 month ago

On Thu, Mar 5, 2026 at 5:16 PM Stephan Gerhold
<stephan.gerhold@linaro.org> wrote:
>
> >
> > You'd think so but the HPG actually does use the word "dummy" to
> > describe the write operation with lock/unlock bits set. Though it does
> > not recommend any particular register to do it.
> >
>
> I guess the documentation I'm looking at (8.7.3.4 BAM operation in the
> public APQ8016E TRM) might be an excerpt from some older version of the
> BAM HPG. Is also has a note about "dummy" command descriptors:
>
>   "NOTE: Pipe locking and unlocking should appear only in
>    command-descriptor. In case a lock is required on a data descriptor
>    this can be implemented by a dummy command descriptor with
>    lock/unlock bit asserted preceding/following the data descriptor."
>
> This one doesn't make any difference between READ and WRITE command
> descriptors (and both are documented in the chapter).
>
> Personally, I would prefer using a read over a write if possible. Unless
> you can confirm that the register used for the dummy write is actually
> read-only *and* write-ignore, writing to the register is essentially
> undefined behavior. It will probably do the right thing on most
> platforms, but there could also be one out there where writing to the
> register triggers an error or potentially even silently ends up writing
> into another register. Register logic can be fun in practice, commit
> e9a48ea4d90b ("irqchip/qcom-pdc: Workaround hardware register bug on
> X1E80100") [1] is a good example of that. :')
>
> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e9a48ea4d90be251e0d057d41665745caccb0351

I agree in general but I also learned from the QCE team at Qualcomm
that apparently there were some issues with register reads over DMA,
which makes writes preferable. While the VERSION register is
officially read-only, I've been told writing to it is safe. So it's a
choice between doing a READ that may not work on some platforms and
doing a WRITE that may theoretically not work on some platforms.

I'll send v12 which will be a proper series with using register
metadata and correctly freeing resources and we can rediscuss. Doing
one or the other is actually a minor details of the whole thing after
all.

Bart

Re: [PATCH RFC v11 00/12] crypto/dmaengine: qce: introduce BAM locking and use DMA for register I/O

Posted by Bartosz Golaszewski 1 month ago

On Tue, Mar 3, 2026 at 1:44 PM Manivannan Sadhasivam <mani@kernel.org> wrote:
>
> On Mon, Mar 02, 2026 at 04:57:13PM +0100, Bartosz Golaszewski wrote:
> > NOTE: Please note that even though this is version 11, I changed the
> > prefix to RFC as this is an entirely new approach resulting from
> > discussions under v9. I AM AWARE of the existing memory leaks in the
> > last patch of this series - I'm sending it because I want to first
> > discuss the approach and get a green light from Vinod as well as Mani
> > and Bjorn. Especially when it comes to communicating the address for the
> > dummy rights from the client to the BAM driver.
> > /NOTE
> >
> > Currently the QCE crypto driver accesses the crypto engine registers
> > directly via CPU. Trust Zone may perform crypto operations simultaneously
> > resulting in a race condition. To remedy that, let's introduce support
> > for BAM locking/unlocking to the driver. The BAM driver will now wrap
> > any existing issued descriptor chains with additional descriptors
> > performing the locking when the client starts the transaction
> > (dmaengine_issue_pending()). The client wanting to profit from locking
> > needs to switch to performing register I/O over DMA and communicate the
> > address to which to perform the dummy writes via a call to
> > dmaengine_slave_config().
> >
>
> Thanks for moving the LOCK/UNLOCK bits out of client to the BAM driver. It looks
> neat now. I understand the limitation that for LOCK/UNLOCK, BAM needs to perform
> a dummy write to an address in the client register space. So in this case, you
> can also use the previous metadata approach to pass the scratchpad register to
> the BAM driver from clients. The BAM driver can use this register to perform
> LOCK/UNLOCK.
>

I thought about reusing descriptor metadata but this is attached to
every descriptor created with dmaengine_prep_slave_sg() (as opposed to
doing it once with dmaengine_slave_config()). We would also still need
a custom struct in the existing BAM DMA header.

I'm fine with that. Vinod: could you please say if that's sound for
you and if so - I'll rework it and also fix the memory leaks by
reworking the cleanup path.

> It may sound like I'm suggesting a part of your previous design, but it fits the
> design more cleanly IMO. The BAM performs LOCK/UNLOCK on its own, but it gets
> the scratchpad register address from the clients through the metadata once.
>

If that gets it upstream, then it's fine with me.

> It is very unfortunate that the IP doesn't accept '0' address for LOCK/UNLOCK or
> some of them cannot append LOCK/UNLOCK to the actual CMD descriptors passed from
> the clients. These would've made the code/design even more cleaner.
>

For sure but I double-checked this. :(

Bart