Resending the patches by Zhen Lei <thunder.leizhen@huawei.com> that add
support for the SMMU ECMDQ feature.

Tested this feature on a Marvell SoC by implementing a smmu-test driver.
This test driver spawns a thread per CPU, and each thread keeps sending
map, table-walk and unmap requests for a fixed duration.

Using this test driver, we compared ECMDQ against the SMMU with software
batching support and saw a ~5% improvement with ECMDQ. Performance
numbers are mentioned below:

                     Total Requests   Average Requests   Difference
                                      Per CPU            wrt ECMDQ
-----------------------------------------------------------------
ECMDQ                239286381        2991079
CMDQ Batch Size 1    228232187        2852902            -4.62%
CMDQ Batch Size 32   233465784        2918322            -2.43%
CMDQ Batch Size 64   231679588        2895994            -3.18%
CMDQ Batch Size 128  233189030        2914862            -2.55%
CMDQ Batch Size 256  230965773        2887072            -3.48%

v2 --> v3:
1. Rebased on linux 6.9-rc5

v1 --> v2:
1. Drop patch "iommu/arm-smmu-v3: Add arm_smmu_ecmdq_issue_cmdlist() for
   non-shared ECMDQ" in v1
2. Drop patch "iommu/arm-smmu-v3: Add support for less than one ECMDQ
   per core" in v1
3. Replace the rwlock with an IPI to provide lockless protection between
   the write to the 'ERRACK' bit during error handling and the read of
   the 'ERRACK' bit during command insertion.
4. Standardize variable names.
   - struct arm_smmu_ecmdq *__percpu *ecmdq;
   + struct arm_smmu_ecmdq *__percpu *ecmdqs;

5. Add member 'iobase' to struct arm_smmu_device to record the start
   physical address of the SMMU, replacing the translation operation
   (vmalloc_to_pfn(smmu->base) << PAGE_SHIFT):
   + phys_addr_t iobase;
   - smmu_dma_base = (vmalloc_to_pfn(smmu->base) << PAGE_SHIFT);

6. Remove the union below. Whether ECMDQ is enabled is now determined
   solely by 'ecmdq_enabled'.
   - union {
   -         u32 nr_ecmdq;
   -         u32 ecmdq_enabled;
   - };
   + u32 nr_ecmdq;
   + bool ecmdq_enabled;

7. Eliminate some sparse check warnings. For example:
   - struct arm_smmu_ecmdq *ecmdq;
   + struct arm_smmu_ecmdq __percpu *ecmdq;

Zhen Lei (2):
  iommu/arm-smmu-v3: Add support for ECMDQ register mode
  iommu/arm-smmu-v3: Ensure that a set of associated commands are
    inserted in the same ECMDQ

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 261 +++++++++++++++++++-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  33 +++
 2 files changed, 286 insertions(+), 8 deletions(-)
--
2.34.1
On Thu, Apr 25, 2024 at 07:41:50AM -0700, Tanmay Jagdale wrote:
> Resending the patches by Zhen Lei <thunder.leizhen@huawei.com> that add
> support for SMMU ECMDQ feature.
>
> Tested this feature on a Marvell SoC by implementing a smmu-test driver.
> This test driver spawns a thread per CPU and each thread keeps sending
> map, table-walk and unmap requests for a fixed duration.

So this is not just measuring invalidation performance but basically the
DMA API performance to do map/unmap operations?

What is "batch size"?

Does this HW support range invalidation? How many invalidation commands
does each test cycle generate?

>                      Total Requests   Average Requests   Difference
>                                       Per CPU            wrt ECMDQ
> -----------------------------------------------------------------
> ECMDQ                239286381        2991079
> CMDQ Batch Size 1    228232187        2852902            -4.62%
> CMDQ Batch Size 32   233465784        2918322            -2.43%
> CMDQ Batch Size 64   231679588        2895994            -3.18%
> CMDQ Batch Size 128  233189030        2914862            -2.55%
> CMDQ Batch Size 256  230965773        2887072            -3.48%

If this is really 5% for a typical DMA API map/unmap cycle then that
seems interesting to me. If it is 5% for just the invalidation command
then it is harder to say.

I'd suggest presenting your results in terms of the latency of a DMA API
map/unmap cycle, and showing how the results scale as you add more
threads. Do even 2 threads start to show a 4-5% gain?

Also, I'm wondering how ATS would interact; I think we have a
head-of-line blocking issue with ATS. Allowing ATS to progress
concurrently with unrelated parallel invalidations may also be
interesting, especially for multi-SVA workloads.

Jason
On Thu, Apr 25, 2024 at 07:41:50AM -0700, Tanmay Jagdale wrote:
> Resending the patches by Zhen Lei <thunder.leizhen@huawei.com> that add
> support for SMMU ECMDQ feature.
>
> Tested this feature on a Marvell SoC by implementing a smmu-test driver.
> This test driver spawns a thread per CPU and each thread keeps sending
> map, table-walk and unmap requests for a fixed duration.
>
> Using this test driver, we compared ECMDQ vs SMMU with software batching
> support and saw ~5% improvement with ECMDQ. Performance numbers are
> mentioned below:
>
>                      Total Requests   Average Requests   Difference
>                                       Per CPU            wrt ECMDQ
> -----------------------------------------------------------------
> ECMDQ                239286381        2991079
> CMDQ Batch Size 1    228232187        2852902            -4.62%
> CMDQ Batch Size 32   233465784        2918322            -2.43%
> CMDQ Batch Size 64   231679588        2895994            -3.18%
> CMDQ Batch Size 128  233189030        2914862            -2.55%
> CMDQ Batch Size 256  230965773        2887072            -3.48%

These are pretty small improvements in a targetted micro-benchmark. Do
you have any real-world numbers showing that this is worthwhile? For
example, running something like netperf.

Will
On 2024/4/25 22:41, Tanmay Jagdale wrote:
> Resending the patches by Zhen Lei <thunder.leizhen@huawei.com> that add
> support for SMMU ECMDQ feature.
>
> Tested this feature on a Marvell SoC by implementing a smmu-test driver.
> This test driver spawns a thread per CPU and each thread keeps sending
> map, table-walk and unmap requests for a fixed duration.
>
> Using this test driver, we compared ECMDQ vs SMMU with software batching
> support and saw ~5% improvement with ECMDQ. Performance numbers are

This data is similar to the performance simulated by our EMU.

> mentioned below:
>
>                      Total Requests   Average Requests   Difference
>                                       Per CPU            wrt ECMDQ
> -----------------------------------------------------------------
> ECMDQ                239286381        2991079
> CMDQ Batch Size 1    228232187        2852902            -4.62%
> CMDQ Batch Size 32   233465784        2918322            -2.43%
> CMDQ Batch Size 64   231679588        2895994            -3.18%
> CMDQ Batch Size 128  233189030        2914862            -2.55%
> CMDQ Batch Size 256  230965773        2887072            -3.48%
>
> v2 --> v3:
> 1. Rebased on linux 6.9-rc5
>
> v1 --> v2:

Thanks.

> 1. Drop patch "iommu/arm-smmu-v3: Add arm_smmu_ecmdq_issue_cmdlist() for
>    non-shared ECMDQ" in v1
> 2. Drop patch "iommu/arm-smmu-v3: Add support for less than one ECMDQ
>    per core" in v1
> 3. Replace rwlock with IPI to support lockless protection against the
>    write operation to bit 'ERRACK' during error handling and the read
>    operation to bit 'ERRACK' during command insertion.
> 4. Standardize variable names.
>    - struct arm_smmu_ecmdq *__percpu *ecmdq;
>    + struct arm_smmu_ecmdq *__percpu *ecmdqs;
>
> 5. Add member 'iobase' to struct arm_smmu_device to record the start
>    physical address of the SMMU, to replace translation operation
>    (vmalloc_to_pfn(smmu->base) << PAGE_SHIFT)
>    + phys_addr_t iobase;
>    - smmu_dma_base = (vmalloc_to_pfn(smmu->base) << PAGE_SHIFT);
>
> 6. Cancel below union. Whether ECMDQ is enabled is determined only based
>    on 'ecmdq_enabled'.
>    - union {
>    -         u32 nr_ecmdq;
>    -         u32 ecmdq_enabled;
>    - };
>    + u32 nr_ecmdq;
>    + bool ecmdq_enabled;
>
> 7. Eliminate some sparse check warnings. For example.
>    - struct arm_smmu_ecmdq *ecmdq;
>    + struct arm_smmu_ecmdq __percpu *ecmdq;
>
> Zhen Lei (2):
>   iommu/arm-smmu-v3: Add support for ECMDQ register mode
>   iommu/arm-smmu-v3: Ensure that a set of associated commands are
>     inserted in the same ECMDQ
>
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 261 +++++++++++++++++++-
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  33 +++
>  2 files changed, 286 insertions(+), 8 deletions(-)

--
Regards,
Zhen Lei