[RFC] Scheduler: DMA Engine regression because of sched/fair changes

Alexander Fomichev posted 1 patch 4 years, 5 months ago
[RFC] Scheduler: DMA Engine regression because of sched/fair changes
Posted by Alexander Fomichev 4 years, 5 months ago
CC: Mel Gorman <mgorman@suse.de>
CC: linux@yadro.com

Hi all,

I found a large regression in Intel Xeon DMA Engine performance between
v4.14 LTS and current kernels. In certain circumstances the throughput
reported by dmatest is more than 6 times lower.

	- Hardware -
I did testing on 2 systems:
1) Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz (Supermicro X11DAi-N)
2) Intel(R) Xeon(R) Bronze 3204 CPU @ 1.90GHz (YADRO Vegman S220)

	- Measurement -
The throughput reported by dmatest drops with almost any test settings,
but the impact is most significant with 64K transfers. The following
parameters were used:

modprobe dmatest iterations=1000 timeout=2000 test_buf_size=0x100000 transfer_size=0x10000 norandom=1
echo "dma0chan0" > /sys/module/dmatest/parameters/channel
echo 1 > /sys/module/dmatest/parameters/run

Every test case was performed at least 3 times. All detailed results are
below.

	- Analysis -
Bisecting revealed 2 different bad commits for those 2 systems, but both
change the same function/condition in the same file.
For the system (1) the bad commit is:
[7332dec055f2457c386032f7e9b2991eb05c2a0a] sched/fair: Only immediately migrate tasks due to interrupts if prev and target CPUs share cache
For the system (2) the bad commit is:
[806486c377e33ab662de6d47902e9e2a32b79368] sched/fair: Do not migrate if the prev_cpu is idle

	- Additional check -
To check this, I also tested a dirty patch that reverts the changes above
on the (current) kernel v5.16.0-rc5:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6f16dfb74246..0a58cc00b1b8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5931,8 +5931,8 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
         * a cpufreq perspective, it's better to have higher utilisation
         * on one CPU.
         */
-       if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
-               return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
+       if (available_idle_cpu(this_cpu))
+               return this_cpu;

        if (sync && cpu_rq(this_cpu)->nr_running == 1)
                return this_cpu;

Please take a look and see whether this makes sense. With this patch
applied, DMA Engine performance is restored.
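
To make the effect of the dirty patch easier to see, here is a minimal
userspace sketch of the first condition in wake_affine_idle() before and
after the change. It is not kernel code: struct wake_state, NO_DECISION and
the two function names are hypothetical stand-ins for
available_idle_cpu(), cpus_share_cache() and the remainder of the
function, which the diff does not show.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical marker for "fall through to the checks that follow". */
#define NO_DECISION (-1)

/* Hypothetical snapshot of what the kernel helpers would report. */
struct wake_state {
	bool this_cpu_idle;	/* models available_idle_cpu(this_cpu) */
	bool prev_cpu_idle;	/* models available_idle_cpu(prev_cpu) */
	bool share_cache;	/* models cpus_share_cache(this_cpu, prev_cpu) */
};

/* First condition as in v5.16-rc5 (the lines removed by the patch). */
int idle_check_current(const struct wake_state *s, int this_cpu, int prev_cpu)
{
	assert(s != NULL);
	if (s->this_cpu_idle && s->share_cache)
		return s->prev_cpu_idle ? prev_cpu : this_cpu;
	return NO_DECISION;
}

/* First condition with the dirty patch applied (the lines added). */
int idle_check_patched(const struct wake_state *s, int this_cpu, int prev_cpu)
{
	assert(s != NULL);
	(void)prev_cpu;		/* the patched check ignores prev_cpu */
	if (s->this_cpu_idle)
		return this_cpu;
	return NO_DECISION;
}
```

With both CPUs idle and a shared cache, the current check keeps the task on
prev_cpu while the patched check pulls it to this_cpu. When the CPUs do not
share a cache, the current check makes no decision at this point, whereas
the patched check still returns this_cpu.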

	- Dmatest results TL;DR -

System (1) before bad commit:
---------------------
[  519.894642] dmatest: Added 1 threads using dma0chan0
[  525.383021] dmatest: Started 1 threads using dma0chan0
[  528.521915] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 98367.10 iops 6295494 KB/s (0)
[  544.851751] dmatest: Added 1 threads using dma0chan0
[  546.460064] dmatest: Started 1 threads using dma0chan0
[  549.609504] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 100310.96 iops 6419901 KB/s (0)
[  562.178365] dmatest: Added 1 threads using dma0chan0
[  563.852534] dmatest: Started 1 threads using dma0chan0
[  567.004898] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 98580.44 iops 6309148 KB/s (0)
---------------------

System (1) on HEAD=bad commit:
---------------------
[  149.555401] dmatest: Added 1 threads using dma0chan0
[  154.162444] dmatest: Started 1 threads using dma0chan0
[  157.490868] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 26653.87 iops 1705847 KB/s (0)
[  176.783450] dmatest: Added 1 threads using dma0chan0
[  178.428518] dmatest: Started 1 threads using dma0chan0
[  181.606531] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 14194.86 iops 908471 KB/s (0)
[  192.125218] dmatest: Added 1 threads using dma0chan0
[  194.060029] dmatest: Started 1 threads using dma0chan0
[  197.235265] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 14757.09 iops 944454 KB/s (0)
---------------------

System (1) on v5.16.0-rc5:
---------------------
[ 1430.860170] dmatest: Added 1 threads using dma0chan0
[ 1437.367447] dmatest: Started 1 threads using dma0chan0
[ 1442.756660] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 24837.31 iops 1589588 KB/s (0)
[ 1561.614191] dmatest: Added 1 threads using dma0chan0
[ 1562.816375] dmatest: Started 1 threads using dma0chan0
[ 1566.619614] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 13666.05 iops 874627 KB/s (0)
[ 1585.019601] dmatest: Added 1 threads using dma0chan0
[ 1587.585741] dmatest: Started 1 threads using dma0chan0
[ 1591.386816] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 13521.91 iops 865402 KB/s (0)
---------------------

System (1) on v5.16.0-rc5 with dirty patch:
---------------------
[  733.571508] dmatest: Added 1 threads using dma0chan0
[  746.050800] dmatest: Started 1 threads using dma0chan0
[  749.765600] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 87260.03 iops 5584642 KB/s (0)
[  915.051955] dmatest: Added 1 threads using dma0chan0
[  916.550732] dmatest: Started 1 threads using dma0chan0
[  920.267525] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 88464.25 iops 5661712 KB/s (0)
[  936.781273] dmatest: Added 1 threads using dma0chan0
[  939.528616] dmatest: Started 1 threads using dma0chan0
[  943.247694] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 88833.61 iops 5685351 KB/s (0)
---------------------

System (2) before bad commit:
---------------------
[  481.309411] dmatest: Added 1 threads using dma0chan0
[  491.197425] dmatest: Started 1 threads using dma0chan0
[  497.047315] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 78988.94 iops 5055292 KB/s (0)
[  506.057101] dmatest: Added 1 threads using dma0chan0
[  508.939426] dmatest: Started 1 threads using dma0chan0
[  514.788823] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 77754.44 iops 4976284 KB/s (0)
[  531.894587] dmatest: Added 1 threads using dma0chan0
[  534.053360] dmatest: Started 1 threads using dma0chan0
[  539.906424] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 76988.21 iops 4927246 KB/s (0)
---------------------

System (2) on HEAD=bad commit:
---------------------
[44522.892995] dmatest: Added 1 threads using dma0chan0
[44526.193331] dmatest: Started 1 threads using dma0chan0
[44532.043932] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 80360.01 iops 5143040 KB/s (0)
[44561.121118] dmatest: Added 1 threads using dma0chan0
[44562.868428] dmatest: Started 1 threads using dma0chan0
[44568.808577] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 16080.53 iops 1029154 KB/s (0)
[44728.597409] dmatest: Added 1 threads using dma0chan0
[44730.301566] dmatest: Started 1 threads using dma0chan0
[44736.259009] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 16091.91 iops 1029882 KB/s (0)
---------------------
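
As a rough cross-check of the "more than 6 times lower" figure, the KB/s
values from the System (1) logs above can be averaged and compared. The
sketch below is only illustrative arithmetic; the function and array names
are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n throughput samples (KB/s). */
double mean_kbps(const double *samples, size_t n)
{
	double sum = 0.0;

	assert(samples != NULL && n > 0);
	for (size_t i = 0; i < n; i++)
		sum += samples[i];
	return sum / (double)n;
}

/* How many times lower after_kbps is than before_kbps. */
double slowdown(double before_kbps, double after_kbps)
{
	assert(after_kbps > 0.0);
	return before_kbps / after_kbps;
}

/* KB/s figures copied from the System (1) logs above. */
const double sys1_before[] = { 6295494, 6419901, 6309148 };
const double sys1_rc5[]    = { 1589588, 874627, 865402 };
```

Comparing mean_kbps(sys1_before, 3) against the slowest v5.16.0-rc5 run
(865402 KB/s) gives a factor of roughly 7.3. The first run in each slow set
is noticeably faster than the following two, so a ratio of the two means
comes out lower than that.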

Thanks for reading.

-- 
Regards,
  Alexander Fomichev
Re: [RFC] Scheduler: DMA Engine regression because of sched/fair changes
Posted by Mel Gorman 4 years, 5 months ago
On Wed, Jan 12, 2022 at 06:26:09PM +0300, Alexander Fomichev wrote:
> [...]
> 
> 	- Additional check -
> To check this, I also tested a dirty patch that reverts the changes above
> on the (current) kernel v5.16.0-rc5:
> 

The consequence of the patch is that interrupts are allowed to migrate
tasks away from potentially cache hot data -- incurring L1 misses if the
two CPUs share LLC, or remote memory accesses if migrating cross-node. The
secondary concern is that excessive migration from interrupts that
round-robin CPUs will mean that the CPU does not increase frequency. At a
minimum, the RFC patch introduces regressions of its own. The comments
cover the two scenarios of interest:

+        * If this_cpu is idle, it implies the wakeup is from interrupt
+        * context. Only allow the move if cache is shared. Otherwise an
+        * interrupt intensive workload could force all tasks onto one
+        * node depending on the IO topology or IRQ affinity settings.

(This one causes remote memory accesses and potentially overutilisation
of a subset of nodes)

+        * If the prev_cpu is idle and cache affine then avoid a migration.
+        * There is no guarantee that the cache hot data from an interrupt
+        * is more important than cache hot data on the prev_cpu and from
+        * a cpufreq perspective, it's better to have higher utilisation
+        * on one CPU.

(This one incurs L1/L2 misses due to a migration even though LLC may be
shared)

The tests don't say, but what CPUs do the dmatest interrupts get
delivered to? dmatest appears to be an exception in that the *only* hot
data of concern is also related to the interrupt, as the DMA operation is
validated.

However, given that the point of a DMA engine is to transfer data without
the host CPU being involved and the interrupt is delivered on completion,
how realistic is it that the DMA data is immediately accessed on completion
by normal workloads that happen to use the DMA engine? What impact does
it have on the test if noverify or polling is used?

-- 
Mel Gorman
SUSE Labs
Re: [RFC] Scheduler: DMA Engine regression because of sched/fair changes
Posted by Alexander Fomichev 4 years, 5 months ago
On Wed, Jan 12, 2022 at 05:05:12PM +0000, Mel Gorman wrote:
> [...]
>
> The tests don't say, but what CPUs do the dmatest interrupts get
> delivered to? dmatest appears to be an exception in that the *only* hot
> data of concern is also related to the interrupt, as the DMA operation is
> validated.
> 
> However, given that the point of a DMA engine is to transfer data without
> the host CPU being involved and the interrupt is delivered on completion,
> how realistic is it that the DMA data is immediately accessed on completion
> by normal workloads that happen to use the DMA engine? What impact does
> it have on the test if noverify or polling is used?

Thanks for the comment. Some additional notes regarding the issue.

1) You're right. When the options "noverify=1" and "polling=1" are used,
no performance reduction occurs.

2) The DMA Engine on certain devices, e.g. Switchtec DMA and AMD PTDMA, is
used in particular for off-CPU data transfer via the device's NTB to a
remote host. In the NTRDMA project, which I'm involved in, the DMA Engine
sends data to a remote ring buffer, and on data arrival the CPU processes
local ring buffers.

3) I checked dmatest with noverify=0 on the PTDMA driver: AMD EPYC 7313
16-Core Processor/ASRock ROMED8-2T. The regression occurs on this hardware
too.

4) Do you mean that with noverify=N and the dirty patch, data verification
is performed on cached data and thus the measured performance is fake?

5) What design pattern should DMA Engine enabled drivers (and dmatest) use
to conform to the migration/cache behavior? Does the scheduler optimisation
conflict with DMA Engine performance in general?

6) I didn't suggest the RFC patch for real-world usage. It was just a test
case to find the cause of the low speed.

Comments/answers/suggestions are welcome.

-- 
Regards,
  Alexander
Re: [RFC] Scheduler: DMA Engine regression because of sched/fair changes
Posted by Mel Gorman 4 years, 5 months ago
On Mon, Jan 17, 2022 at 11:19:05AM +0300, Alexander Fomichev wrote:
> Thanks for the comment. Some additional notes regarding the issue.
> 
> 1) You're right. When the options "noverify=1" and "polling=1" are used,
> no performance reduction occurs.
> 

How about just noverify=1 on its own? It's a stronger indicator that
cache hotness is a factor.

> 2) The DMA Engine on certain devices, e.g. Switchtec DMA and AMD PTDMA, is
> used in particular for off-CPU data transfer via the device's NTB to a
> remote host. In the NTRDMA project, which I'm involved in, the DMA Engine
> sends data to a remote ring buffer, and on data arrival the CPU processes
> local ring buffers.
> 

Is there any impact of the patch in this case? Given that it's a remote
host, the data is likely cache cold anyway.

> 3) I checked dmatest with noverify=0 on the PTDMA driver: AMD EPYC 7313
> 16-Core Processor/ASRock ROMED8-2T. The regression occurs on this hardware
> too.
> 

I expect it would be for the same reason: the data is cache cold for the
CPU.

> 4) Do you mean that with noverify=N and the dirty patch, data verification
> is performed on cached data and thus the measured performance is fake?
> 

I think it's the data verification going slower because the tasks are
not aggressively migrating on interrupt. The flip side is other
interrupts such as IO completion should not migrate the tasks given that
the interrupt is not necessarily correlated with data hotness.

> 5) What design pattern should DMA Engine enabled drivers (and dmatest) use
> to conform to the migration/cache behavior? Does the scheduler optimisation
> conflict with DMA Engine performance in general?
> 

I'm not familiar with DMA engine drivers but if they use wake_up
interfaces then passing WF_SYNC or calling the wake_up_*_sync helpers
may force the migration.
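
For reference, the sync path is the context line visible in the diff
earlier in the thread. A minimal userspace model of that single check is
sketched below; sync_check(), this_nr_running and NO_DECISION are
hypothetical names standing in for the kernel's sync argument,
cpu_rq(this_cpu)->nr_running and the rest of wake_affine_idle().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical marker for "fall through to the checks that follow". */
#define NO_DECISION (-1)

/*
 * Models the context line from the diff:
 *     if (sync && cpu_rq(this_cpu)->nr_running == 1)
 *             return this_cpu;
 * this_nr_running stands in for cpu_rq(this_cpu)->nr_running.
 */
int sync_check(bool sync, unsigned int this_nr_running, int this_cpu)
{
	assert(this_cpu >= 0);
	if (sync && this_nr_running == 1)
		return this_cpu;
	return NO_DECISION;
}
```

In this model a sync wakeup selects this_cpu only when the waking CPU has
exactly one runnable task; without the sync flag, or with more tasks
queued, the check makes no decision. Whether a given DMA engine driver can
issue its wakeups this way is not something the sketch answers.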

> 6) I didn't suggest the RFC patch for real-world usage. It was just a test
> case to find the cause of the low speed.
> 

I understand, but I'm reluctant to go with the dirty patch on the
grounds that it would reintroduce another class of regressions.

-- 
Mel Gorman
SUSE Labs
Re: [RFC] Scheduler: DMA Engine regression because of sched/fair changes
Posted by Alexander Fomichev 4 years, 5 months ago
On Mon, Jan 17, 2022 at 10:27:01AM +0000, Mel Gorman wrote:
> > 1) You're right. When the options "noverify=1" and "polling=1" are used,
> > no performance reduction occurs.
> 
> How about just noverify=1 on its own? It's a stronger indicator that
> cache hotness is a factor.
> 

With "noverify=1 polled=0" the performance reduction is only 10-20%,
but it still exists.

-----< v5.15.8-vanilla >-----
[17057.866760] dmatest: Added 1 threads using dma0chan0
[17060.133880] dmatest: Started 1 threads using dma0chan0
[17060.154343] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 49338.85 iops 3157686 KB/s (0)
[17063.737887] dmatest: Added 1 threads using dma0chan0
[17065.113838] dmatest: Started 1 threads using dma0chan0
[17065.137659] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 42183.41 iops 2699738 KB/s (0)
[17100.339989] dmatest: Added 1 threads using dma0chan0
[17102.190764] dmatest: Started 1 threads using dma0chan0
[17102.214285] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 42844.89 iops 2742073 KB/s (0)
-----< end >-----

-----< 5.15.8-ioat-ptdma-dirty-fix+ >-----
[ 6183.356549] dmatest: Added 1 threads using dma0chan0
[ 6187.868237] dmatest: Started 1 threads using dma0chan0
[ 6187.887389] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 52753.74 iops 3376239 KB/s (0)
[ 6201.913154] dmatest: Added 1 threads using dma0chan0
[ 6204.701340] dmatest: Started 1 threads using dma0chan0
[ 6204.720490] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 52614.96 iops 3367357 KB/s (0)
[ 6285.114603] dmatest: Added 1 threads using dma0chan0
[ 6287.031875] dmatest: Started 1 threads using dma0chan0
[ 6287.050278] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 54939.01 iops 3516097 KB/s (0)
-----< end >-----

> > 2) The DMA Engine on certain devices, e.g. Switchtec DMA and AMD PTDMA, is
> > used in particular for off-CPU data transfer via the device's NTB to a
> > remote host. In the NTRDMA project, which I'm involved in, the DMA Engine
> > sends data to a remote ring buffer, and on data arrival the CPU processes
> > local ring buffers.
> > 
> 
> Is there any impact of the patch in this case? Given that it's a remote
> host, the data is likely cache cold anyway.
> 

It's complicated. Currently we have a bunch of problems with the
project, so we are decomposing them and trying to solve them separately.
That is how we ran into the DMA Engine issue.

> > 4) Do you mean that with noverify=N and the dirty patch, data verification
> > is performed on cached data and thus the measured performance is fake?
> > 
> 
> I think it's the data verification going slower because the tasks are
> not aggressively migrating on interrupt. The flip side is other
> interrupts such as IO completion should not migrate the tasks given that
> the interrupt is not necessarily correlated with data hotness.
> 

It's quite strange, because dmatest subtracts the verification time from
the overall test time. I suspect the measurement may be inaccurate.

> > 5) What design pattern should DMA Engine enabled drivers (and dmatest) use
> > to conform to the migration/cache behavior? Does the scheduler optimisation
> > conflict with DMA Engine performance in general?
> > 
> 
> I'm not familiar with DMA engine drivers but if they use wake_up
> interfaces then passing WF_SYNC or calling the wake_up_*_sync helpers
> may force the migration.
> 

Thanks for the advice. I'll try to check if this is a solution.


-- 
Regards,
  Alexander
Re: [RFC] Scheduler: DMA Engine regression because of sched/fair changes
Posted by Thorsten Leemhuis 4 years, 5 months ago
[TLDR: I'm adding this regression to regzbot, the Linux kernel
regression tracking bot; most text you find below is compiled from a few
template paragraphs some of you might have seen already.]

Hi, this is your Linux kernel regression tracker speaking.

Thanks for the report.

Adding the regression mailing list to the list of recipients, as it
should be in the loop for all regressions, as explained here:
https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html

To be sure this issue doesn't fall through the cracks unnoticed, I'm
adding it to regzbot, my Linux kernel regression tracking bot:

On 12.01.22 16:26, Alexander Fomichev wrote:
> [...]
>
> 	- Analysis -
> Bisecting revealed 2 different bad commits for those 2 systems, but both
> change the same function/condition in the same file.
> For the system (1) the bad commit is:
> [7332dec055f2457c386032f7e9b2991eb05c2a0a] sched/fair: Only immediately migrate tasks due to interrupts if prev and target CPUs share cache
> For the system (2) the bad commit is:
> [806486c377e33ab662de6d47902e9e2a32b79368] sched/fair: Do not migrate if the prev_cpu is idle

Uhh, regzbot is not prepared for this, hence I'll simply pick
7332dec055f2457c386032f7e9b2991eb05c2a0a

#regzbot ^introduced 7332dec055f2457c386032f7e9b2991eb05c2a0a
#regzbot title sched: DMA Engine regression because of sched/fair changes
#regzbot ignore-activity

Reminder: when fixing the issue, please add a 'Link:' tag with the URL
to the report (the parent of this mail) using the kernel.org redirector,
as explained in 'Documentation/process/submitting-patches.rst'. Regzbot
then will automatically mark the regression as resolved once the fix
lands in the appropriate tree. For more details about regzbot see footer.

Sending this to everyone that got the initial report, to make all aware
of the tracking. I also hope that messages like this motivate people to
directly get at least the regression mailing list and ideally even
regzbot involved when dealing with regressions, as messages like this
wouldn't be needed then.

Don't worry, I'll send further messages wrt this regression just to
the lists (with a tag in the subject so people can filter them away), as
long as they are intended just for regzbot. With a bit of luck no such
messages will be needed anyway.

Ciao, Thorsten (wearing his 'Linux kernel regression tracker' hat)

P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
on my table. I can only look briefly into most of them. Unfortunately
therefore I sometimes will get things wrong or miss something important.
I hope that's not the case here; if you think it is, don't hesitate to
tell me about it in a public reply, that's in everyone's interest.

BTW, I have no personal interest in this issue, which is tracked using
regzbot, my Linux kernel regression tracking bot
(https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
this mail to get things rolling again and hence don't need to be CCed on
all further activities wrt this regression.

---
Additional information about regzbot:

If you want to know more about regzbot, check out its web-interface, the
getting started guide, and/or the reference documentation:

https://linux-regtracking.leemhuis.info/regzbot/
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md

The last two documents will explain how you can interact with regzbot
yourself if you want to.

Hint for reporters: when reporting a regression it's in your interest to
tell #regzbot about it in the report, as that will ensure the regression
gets on the radar of regzbot and the regression tracker. That's in your
interest, as they will make sure the report won't fall through the
cracks unnoticed.

Hint for developers: you normally don't need to care about regzbot once
it's involved. Fix the issue as you normally would, just remember to
include a 'Link:' tag to the report in the commit message, as explained
in Documentation/process/submitting-patches.rst.
That aspect was recently made more explicit in commit 1f57bd42b77c:
https://git.kernel.org/linus/1f57bd42b77c