[PATCH V1 0/3] Enable UFS MCQ support for SM8650 and SM8750

Ram Kumar Dwivedi posted 3 patches 2 months, 1 week ago
There is a newer version of this series
[PATCH V1 0/3] Enable UFS MCQ support for SM8650 and SM8750
Posted by Ram Kumar Dwivedi 2 months, 1 week ago
This patch series enables Multi-Circular Queue (MCQ) support for the UFS
host controller on Qualcomm SM8650 and SM8750 platforms. MCQ is a modern
queuing model that improves performance and scalability by distributing
I/O across multiple hardware submission/completion queues.

Although MCQ support has been present in the UFS driver for several years,
this is the first time it is being enabled via Device Tree for these
platforms.

Patch 1 updates the device tree bindings to allow the additional register
regions and reg-names required for MCQ operation.

Patches 2 and 3 update the device trees for SM8650 and SM8750 respectively
to enable MCQ by adding the necessary register mappings and MSI parent.
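
For illustration, the node change described above can be sketched roughly
as below. The addresses, region sizes, and the ITS phandle/identifier are
placeholders for illustration only, not values taken from this series; the
reg-names follow the qcom,ufs.yaml binding convention:

```dts
/* Sketch only -- addresses and the msi-parent cell are illustrative */
&ufs_mem_hc {
	reg = <0x0 0x01d84000 0x0 0x3000>,	/* "std": standard UFSHCI registers */
	      <0x0 0x01d87000 0x0 0x1000>;	/* "mcq": MCQ vendor-specific registers */
	reg-names = "std", "mcq";
	msi-parent = <&gic_its 0x340>;		/* MSIs (ESI) for per-queue interrupts */
};
```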

Tested on internal hardware for both platforms.

Palash Kambar (1):
  arm64: dts: qcom: sm8750: Enable MCQ support for UFS controller

Ram Kumar Dwivedi (2):
  dt-bindings: ufs: qcom: Add MCQ support to reg and reg-names
  arm64: dts: qcom: sm8650: Enable MCQ support for UFS controller

 .../devicetree/bindings/ufs/qcom,ufs.yaml     | 21 ++++++++++++-------
 arch/arm64/boot/dts/qcom/sm8650.dtsi          |  9 +++++++-
 arch/arm64/boot/dts/qcom/sm8750.dtsi          | 10 +++++++--
 3 files changed, 29 insertions(+), 11 deletions(-)

-- 
2.50.1
Re: [PATCH V1 0/3] Enable UFS MCQ support for SM8650 and SM8750
Posted by neil.armstrong@linaro.org 2 months ago
Hi,

On 30/07/2025 10:22, Ram Kumar Dwivedi wrote:
> This patch series enables Multi-Circular Queue (MCQ) support for the UFS
> host controller on Qualcomm SM8650 and SM8750 platforms. MCQ is a modern
> queuing model that improves performance and scalability by allowing
> multiple hardware queues.
> 
> Although MCQ support has been present in the UFS driver for several years,
> this is the first time it is being enabled via Device Tree for these
> platforms.
> 
> Patch 1 updates the device tree bindings to allow the additional register
> regions and reg-names required for MCQ operation.
> 
> Patches 2 and 3 update the device trees for SM8650 and SM8750 respectively
> to enable MCQ by adding the necessary register mappings and MSI parent.
> 
> Tested on internal hardware for both platforms.
> 
> Palash Kambar (1):
>    arm64: dts: qcom: sm8750: Enable MCQ support for UFS controller
> 
> Ram Kumar Dwivedi (2):
>    dt-bindings: ufs: qcom: Add MCQ support to reg and reg-names
>    arm64: dts: qcom: sm8650: Enable MCQ support for UFS controller
> 
>   .../devicetree/bindings/ufs/qcom,ufs.yaml     | 21 ++++++++++++-------
>   arch/arm64/boot/dts/qcom/sm8650.dtsi          |  9 +++++++-
>   arch/arm64/boot/dts/qcom/sm8750.dtsi          | 10 +++++++--
>   3 files changed, 29 insertions(+), 11 deletions(-)
> 

I ran some tests on the SM8650-QRD, and it works, so please add my:
Tested-by: Neil Armstrong <neil.armstrong@linaro.org> # on SM8650-QRD

I ran some fio tests comparing v6.15, v6.16 (with threaded IRQs),
and next + MCQ support; here's my analysis of the results:

- Significant performance gains in write operations with multiple jobs:
  the "mcq" change shows a substantial improvement in both IOPS and
  bandwidth for write operations with 8 jobs.
- Moderate improvement in single-job operations (read and write):
  for single-job operations, the "mcq" change generally leads to positive,
  albeit less dramatic, improvements in IOPS and bandwidth.
- Slight decrease in read operations with multiple jobs:
  interestingly, for read operations with 8 jobs, there's a slight decrease
  in both IOPS and bandwidth with the "mcq" kernel.

The raw results are:
Board: sm8650-qrd

read / 1 job
                v6.15     v6.16  next+mcq
iops (min)  3,996.00  5,921.60  4,661.20
iops (max)  4,772.80  6,491.20  5,027.60
iops (avg)  4,526.25  6,295.31  4,979.81
cpu % usr       4.62      2.96      5.68
cpu % sys      21.45     17.88     25.58
bw (MB/s)      18.54     25.78     20.40

read / 8 job
                 v6.15      v6.16   next+mcq
iops (min)  51,867.60  51,575.40  56,818.40
iops (max)  67,513.60  64,456.40  65,379.60
iops (avg)  64,314.80  62,136.76  63,016.07
cpu % usr        3.98       3.72       3.85
cpu % sys       16.70      17.16      14.87
bw (MB/s)      263.60     254.40     258.20

write / 1 job
                v6.15     v6.16  next+mcq
iops (min)  5,654.80  8,060.00  7,117.20
iops (max)  6,720.40  8,852.00  7,706.80
iops (avg)  6,576.91  8,579.81  7,459.97
cpu % usr       7.48      3.79      6.73
cpu % sys      41.09     23.27     30.66
bw (MB/s)      26.96     35.16     30.56

write / 8 job
                  v6.15       v6.16    next+mcq
iops (min)   84,687.80   95,043.40  114,054.00
iops (max)  107,620.80  113,572.00  164,526.00
iops (avg)   97,910.86  105,927.38  149,071.43
cpu % usr         5.43        4.38        2.88
cpu % sys        21.73       20.29       16.09
bw (MB/s)       400.80      433.80      610.40
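
As a quick sanity check on the tables above (my own sketch, not part of
the test suite): the reported bandwidth is just the average IOPS times the
4 KiB block size used by the fio jobs, within rounding:

```python
# Bandwidth sanity check: bw (MB/s) = avg IOPS * 4 KiB per I/O
BLOCK_SIZE = 4096  # matches --bs=4k on the fio command line

def bw_mb_per_s(iops_avg: float) -> float:
    """Convert average IOPS at 4 KiB per I/O into decimal MB/s."""
    return iops_avg * BLOCK_SIZE / 1e6

# Agrees with the tables within rounding:
print(round(bw_mb_per_s(4979.81), 2))    # read / 1 job:  ~20.40 MB/s
print(round(bw_mb_per_s(149071.43), 2))  # write / 8 job: ~610.6 MB/s
```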

The test suite is:
for rw in read write ; do
     echo "rw: ${rw}"
     for jobs in 1 8 ; do
         echo "jobs: ${jobs}"
         for it in $(seq 1 5) ; do
             fio --name=rand${rw} --rw=rand${rw} \
                 --ioengine=libaio --direct=1 \
                 --bs=4k --numjobs=${jobs} --size=32m \
                 --runtime=30 --time_based --end_fsync=1 \
                 --group_reporting --filename=/dev/disk/by-partlabel/super \
             | grep -E '(iops|sys=|READ:|WRITE:)'
             sleep 5
         done
     done
done

Thanks,
Neil
Re: [PATCH V1 0/3] Enable UFS MCQ support for SM8650 and SM8750
Posted by Manivannan Sadhasivam 2 months ago
On Thu, Jul 31, 2025 at 10:50:21AM GMT, neil.armstrong@linaro.org wrote:
> Hi,
> 
> On 30/07/2025 10:22, Ram Kumar Dwivedi wrote:
> > This patch series enables Multi-Circular Queue (MCQ) support for the UFS
> > host controller on Qualcomm SM8650 and SM8750 platforms. MCQ is a modern
> > queuing model that improves performance and scalability by allowing
> > multiple hardware queues.
> > 
> > Although MCQ support has been present in the UFS driver for several years,
> > this is the first time it is being enabled via Device Tree for these
> > platforms.
> > 
> > Patch 1 updates the device tree bindings to allow the additional register
> > regions and reg-names required for MCQ operation.
> > 
> > Patches 2 and 3 update the device trees for SM8650 and SM8750 respectively
> > to enable MCQ by adding the necessary register mappings and MSI parent.
> > 
> > Tested on internal hardware for both platforms.
> > 
> > Palash Kambar (1):
> >    arm64: dts: qcom: sm8750: Enable MCQ support for UFS controller
> > 
> > Ram Kumar Dwivedi (2):
> >    dt-bindings: ufs: qcom: Add MCQ support to reg and reg-names
> >    arm64: dts: qcom: sm8650: Enable MCQ support for UFS controller
> > 
> >   .../devicetree/bindings/ufs/qcom,ufs.yaml     | 21 ++++++++++++-------
> >   arch/arm64/boot/dts/qcom/sm8650.dtsi          |  9 +++++++-
> >   arch/arm64/boot/dts/qcom/sm8750.dtsi          | 10 +++++++--
> >   3 files changed, 29 insertions(+), 11 deletions(-)
> > 
> 
> I ran some tests on the SM8650-QRD, and it works so please add my:
> Tested-by: Neil Armstrong <neil.armstrong@linaro.org> # on SM8650-QRD
> 

Thanks Neil for testing it out!

> I ran some fio tests, comparing the v6.15, v6.16 (with threaded irqs)
> and next + mcq support, and here's the analysis on the results:
> 
> Significant Performance Gains in Write Operations with Multiple Jobs:
> The "mcq" change shows a substantial improvement in both IOPS and bandwidth for write operations with 8 jobs.
> Moderate Improvement in Single Job Operations (Read and Write):
> For single job operations (read and write), the "mcq" change generally leads to positive, albeit less dramatic, improvements in IOPS and bandwidth.
> Slight Decrease in Read Operations with Multiple Jobs:
> Interestingly, for read operations with 8 jobs, there's a slight decrease in both IOPS and bandwidth with the "mcq" kernel.
> 
> The raw results are:
> Board: sm8650-qrd
> 
> read / 1 job
>                v6.15     v6.16  next+mcq
> iops (min)  3,996.00  5,921.60  4,661.20
> iops (max)  4,772.80  6,491.20  5,027.60
> iops (avg)  4,526.25  6,295.31  4,979.81
> cpu % usr       4.62      2.96      5.68
> cpu % sys      21.45     17.88     25.58
> bw (MB/s)      18.54     25.78     20.40
> 

It is interesting to note the percentage of CPU time spent with MCQ in the
1-job case; it looks like it is spending more time there. I'm wondering if
it is the ESI limitation/overhead.

- Mani

> read / 8 job
>                 v6.15      v6.16   next+mcq
> iops (min)  51,867.60  51,575.40  56,818.40
> iops (max)  67,513.60  64,456.40  65,379.60
> iops (avg)  64,314.80  62,136.76  63,016.07
> cpu % usr        3.98       3.72       3.85
> cpu % sys       16.70      17.16      14.87
> bw (MB/s)      263.60     254.40     258.20
> 
> write / 1 job
>                v6.15     v6.16  next+mcq
> iops (min)  5,654.80  8,060.00  7,117.20
> iops (max)  6,720.40  8,852.00  7,706.80
> iops (avg)  6,576.91  8,579.81  7,459.97
> cpu % usr       7.48      3.79      6.73
> cpu % sys      41.09     23.27     30.66
> bw (MB/s)      26.96     35.16     30.56
> 
> write / 8 job
>                  v6.15       v6.16    next+mcq
> iops (min)   84,687.80   95,043.40  114,054.00
> iops (max)  107,620.80  113,572.00  164,526.00
> iops (avg)   97,910.86  105,927.38  149,071.43
> cpu % usr         5.43        4.38        2.88
> cpu % sys        21.73       20.29       16.09
> bw (MB/s)       400.80      433.80      610.40
> 
> The test suite is:
> for rw in read write ; do
>     echo "rw: ${rw}"
>     for jobs in 1 8 ; do
>         echo "jobs: ${jobs}"
>         for it in $(seq 1 5) ; do
>             fio --name=rand${rw} --rw=rand${rw} \
>                 --ioengine=libaio --direct=1 \
>                 --bs=4k --numjobs=${jobs} --size=32m \
>                 --runtime=30 --time_based --end_fsync=1 \
>                 --group_reporting --filename=/dev/disk/by-partlabel/super \
>             | grep -E '(iops|sys=|READ:|WRITE:)'
>             sleep 5
>         done
>     done
> done
> 
> Thanks,
> Neil

-- 
மணிவண்ணன் சதாசிவம்