[PATCH 0/5] sched: Lazy preemption muck

Posted by Peter Zijlstra 1 month, 3 weeks ago
Hi!

During LPC Thomas reminded me that the lazy preemption stuff was not there yet.

So here goes, robot says it builds, and I checked both a regular and PREEMPT_RT
build boots and can change the mode.

Please have a poke.
Re: [PATCH 0/5] sched: Lazy preemption muck
Posted by Mike Galbraith 1 month, 1 week ago
On Mon, 2024-10-07 at 09:46 +0200, Peter Zijlstra wrote:
> Hi!
>
> During LPC Thomas reminded me that the lazy preemption stuff was not there yet.
>
> So here goes, robot says it builds, and I checked both a regular and PREEMPT_RT
> build boots and can change the mode.
>
> Please have a poke.

Lazy seems to be the least sleeper-friendly, and voluntary the most
switch-happy of the three models tested (a bit counter-intuitive).  Not
sure I should post these results, nobody will ever notice these deltas
in their desktop, but they are carefully gathered cold hard numbers, so
what the heck, deleting them takes only a moment.

All data was collected against virgin master with only this patch set
applied, i.e. there are zero deltas other than the lazy set and model
selection.  The load profiled is my favorite completely hands-off mixed
compute vs chrome desktop load: 3 runs of lazy and full compared to 2
runs of voluntary.  I called that good enough after looking, not being
willing to do any more of what I've done oodles of in the past days.

I found the sum delay deltas interesting.. and sensible after pondering;
nothing being free, delaying preemption of wider-bodied compute tasks
has to impact somewhere.  Should desktop components become very busy,
they'll be doing the impacting.

Full per run summaries attached, high speed scroll version below.

desktop util 18.1% static voluntary - virgin source
desktop util 18.3% static voluntary - +lazy patches
desktop util 17.5% lazy             - ditto...
desktop util 17.0% lazy
desktop util 16.8% lazy
desktop util 17.8% full
desktop util 17.8% full
desktop util 17.8% full

Profile summary bottom lines, same order (ignore the bottom-line delay max; it includes nice tasks)
 ----------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Sum delay ms     |
 ----------------------------------------------------------------------------------------------------------
voluntary
  TOTAL:                |2281560.794 ms | 18514617 |                 |       60.836 ms |   1101651.105 ms |
  TOTAL:                |2280493.511 ms | 18679820 |                 |       49.278 ms |   1117078.210 ms |
lazy                                                                                   avg 1109364.657       1.000
  TOTAL:                |2302856.995 ms | 15141600 |                 |       68.303 ms |   1316229.527 ms |
  TOTAL:                |2299911.697 ms | 15979830 |                 |       29.361 ms |   1276128.457 ms |
  TOTAL:                |2298180.079 ms | 16665916 |                 |       34.918 ms |   1271137.849 ms |
full                                                                                   avg 1287831.944       1.160
  TOTAL:                |2279751.696 ms | 16628663 |                 |       36.929 ms |    977433.245 ms |
  TOTAL:                |2278306.187 ms | 15846728 |                 |       79.954 ms |    974772.318 ms |
  TOTAL:                |2282262.311 ms | 17152068 |                 |       28.004 ms |    990068.590 ms |
                                                                                       avg  980758.051        .884

Detail extract for the 3 top players:
voluntary
  massive_intr:(9)      |1869421.285 ms |  5752848 | avg:   0.056 ms | max:  23.502 ms | sum:324608.328 ms |
  massive_intr:(9)      |1864362.822 ms |  5606194 | avg:   0.059 ms | max:  20.343 ms | sum:328920.012 ms |
                                                                                       avg   326764.170      1.000
  dav1d-worker:(8)      | 155700.590 ms |  1091524 | avg:   0.300 ms | max:  25.215 ms | sum:327735.383 ms |
  dav1d-worker:(8)      | 160730.867 ms |  1123134 | avg:   0.299 ms | max:  26.783 ms | sum:335553.750 ms |
                                                                                       avg   331644.566      1.000
  X:2534                |  51690.650 ms |   223470 | avg:   0.122 ms | max:  10.367 ms | sum:27371.213 ms |
  X:2522                |  51355.479 ms |   232615 | avg:   0.121 ms | max:  14.223 ms | sum:28243.127 ms |
                                                                                       avg   27807.170       1.000
lazy
  massive_intr:(9)      |1900362.729 ms |  4443987 | avg:   0.066 ms | max:  23.997 ms | sum:294806.423 ms |
  massive_intr:(9)      |1909494.321 ms |  4624111 | avg:   0.061 ms | max:  20.215 ms | sum:282640.873 ms |
  massive_intr:(9)      |1913482.381 ms |  4896729 | avg:   0.057 ms | max:  19.473 ms | sum:277651.582 ms |
                                                                                       avg   285032.959       .872
  dav1d-worker:(8)      | 164649.511 ms |   849691 | avg:   0.531 ms | max:  21.537 ms | sum:451592.116 ms |
  dav1d-worker:(8)      | 151579.745 ms |   850584 | avg:   0.505 ms | max:  29.119 ms | sum:429540.789 ms |
  dav1d-worker:(8)      | 145472.762 ms |   860714 | avg:   0.489 ms | max:  22.005 ms | sum:420927.393 ms |
                                                                                       avg   434020.099      1.308
  X:2508                |  53119.214 ms |   317117 | avg:   0.135 ms | max:  11.995 ms | sum:42917.248 ms |
  X:2508                |  53047.311 ms |   311231 | avg:   0.137 ms | max:  17.974 ms | sum:42601.534 ms |
  X:2508                |  48692.053 ms |   333137 | avg:   0.129 ms | max:  13.263 ms | sum:43140.962 ms |
                                                                                       avg   42886.581       1.542
full
  massive_intr:(9)      |1875426.524 ms |  5559171 | avg:   0.057 ms | max:  19.817 ms | sum:315854.813 ms |
  massive_intr:(9)      |1873997.628 ms |  4945112 | avg:   0.064 ms | max:  18.035 ms | sum:315257.596 ms |
  massive_intr:(9)      |1876339.600 ms |  5569489 | avg:   0.057 ms | max:  20.473 ms | sum:317771.140 ms |
                                                                                       avg   316294.516       .967
  dav1d-worker:(8)      | 157496.413 ms |   914670 | avg:   0.333 ms | max:  24.006 ms | sum:304944.125 ms |
  dav1d-worker:(8)      | 161723.573 ms |   879090 | avg:   0.354 ms | max:  26.070 ms | sum:310806.145 ms |
  dav1d-worker:(8)      | 157653.936 ms |   929556 | avg:   0.331 ms | max:  21.664 ms | sum:307315.109 ms |
                                                                                       avg   307688.459       .927
  X:2502                |  52845.691 ms |   131220 | avg:   0.135 ms | max:  11.479 ms | sum:17656.654 ms |
  X:2502                |  47884.248 ms |   134000 | avg:   0.148 ms | max:  12.068 ms | sum:19768.297 ms |
  X:2502                |  51731.371 ms |   145167 | avg:   0.132 ms | max:  15.995 ms | sum:19180.803 ms |
                                                                                       avg   18868.584        .678



key: + postfix to name means lazy patch set applied, +lazy means dynamic with lazy default.
     No postfix or only + postfix means static voluntary (my chosen default).
     End-of-line postfix in parens = model during the test run

perf sched lat -i perf.data.full.6.12.0.g6485cf5-master-virgin --sort=runtime -S 15 -T (static voluntary)

 ----------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Sum delay ms     |
 ----------------------------------------------------------------------------------------------------------
  massive_intr:(9)      |1869421.285 ms |  5752848 | avg:   0.056 ms | max:  23.502 ms | sum:324608.328 ms |
  dav1d-worker:(8)      | 155700.590 ms |  1091524 | avg:   0.300 ms | max:  25.215 ms | sum:327735.383 ms |
  X:2534                |  51690.650 ms |   223470 | avg:   0.122 ms | max:  10.367 ms | sum:27371.213 ms |
  VizCompositorTh:4732  |  37826.030 ms |   176382 | avg:   0.183 ms | max:  11.527 ms | sum:32269.244 ms |
  ThreadPoolForeg:(50)  |  23708.799 ms |   162325 | avg:   0.276 ms | max:  22.754 ms | sum:44853.075 ms |
  kworker/u32:3-e:67    |  21922.222 ms |  3950152 | avg:   0.013 ms | max:  12.980 ms | sum:49901.452 ms |
  kwin_x11:2927         |  18956.612 ms |   197986 | avg:   0.089 ms | max:  14.978 ms | sum:17669.582 ms |
  chrome:(9)            |  18146.220 ms |   162581 | avg:   0.181 ms | max:  18.931 ms | sum:29448.275 ms |
  VideoFrameCompo:4740  |  16994.546 ms |   206270 | avg:   0.159 ms | max:  12.552 ms | sum:32785.727 ms |
  kworker/u32:2-e:66    |  12923.440 ms |  2337049 | avg:   0.013 ms | max:  12.206 ms | sum:29439.455 ms |
  kthreadd:(4)          |   8651.856 ms |  1545898 | avg:   0.013 ms | max:  13.244 ms | sum:20645.416 ms |
  Compositor:(2)        |   7434.109 ms |   121329 | avg:   0.154 ms | max:  10.758 ms | sum:18628.354 ms |
  Chrome_ChildIOT:(6)   |   7164.010 ms |   242003 | avg:   0.167 ms | max:  12.135 ms | sum:40383.425 ms |
  Media:4739            |   6188.846 ms |   100660 | avg:   0.173 ms | max:  16.322 ms | sum:17384.775 ms |
  kworker/u32:5+e:69    |   4557.938 ms |   812666 | avg:   0.012 ms | max:  11.581 ms | sum:10014.729 ms |
 ----------------------------------------------------------------------------------------------------------
  TOTAL:                |2281560.794 ms | 18514617 |                 |       60.836 ms |   1101651.105 ms |
 ----------------------------------------------------------------------------------------------------------
desktop util 18.1%

perf sched lat -i perf.data.full.6.12.0.g6485cf5-master-virgin+ --sort=runtime -S 15 -T (static voluntary)

 ----------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Sum delay ms     |
 ----------------------------------------------------------------------------------------------------------
  massive_intr:(9)      |1864362.822 ms |  5606194 | avg:   0.059 ms | max:  20.343 ms | sum:328920.012 ms |
  dav1d-worker:(8)      | 160730.867 ms |  1123134 | avg:   0.299 ms | max:  26.783 ms | sum:335553.750 ms |
  X:2522                |  51355.479 ms |   232615 | avg:   0.121 ms | max:  14.223 ms | sum:28243.127 ms |
  VizCompositorTh:4708  |  37705.826 ms |   175678 | avg:   0.186 ms | max:  12.171 ms | sum:32642.549 ms |
  kworker/u32:1-e:65    |  23712.990 ms |  4302933 | avg:   0.012 ms | max:  14.379 ms | sum:50073.553 ms |
  ThreadPoolForeg:(51)  |  23354.248 ms |   146424 | avg:   0.305 ms | max:  32.721 ms | sum:44677.022 ms |
  chrome:(9)            |  17611.783 ms |   144228 | avg:   0.199 ms | max:  22.482 ms | sum:28732.916 ms |
  VideoFrameCompo:4716  |  17198.595 ms |   183878 | avg:   0.178 ms | max:  15.286 ms | sum:32792.967 ms |
  kwin_x11:2923         |  17183.572 ms |   190690 | avg:   0.105 ms | max:  11.328 ms | sum:20051.466 ms |
  kworker/u32:3-e:67    |  16584.163 ms |  3004657 | avg:   0.012 ms | max:  12.297 ms | sum:36189.183 ms |
  kworker/u32:6+e:70    |   8787.851 ms |  1589820 | avg:   0.012 ms | max:   9.698 ms | sum:19232.965 ms |
  Chrome_ChildIOT:(7)   |   7652.436 ms |   240093 | avg:   0.172 ms | max:  15.259 ms | sum:41196.085 ms |
  Compositor:(2)        |   7504.380 ms |   110731 | avg:   0.176 ms | max:  15.048 ms | sum:19475.037 ms |
  Media:4715            |   6080.907 ms |    97641 | avg:   0.180 ms | max:  16.142 ms | sum:17589.459 ms |
  kworker/u32:4-e:68    |   3700.463 ms |   665402 | avg:   0.011 ms | max:   9.711 ms | sum: 7460.158 ms |
 ----------------------------------------------------------------------------------------------------------
  TOTAL:                |2280493.511 ms | 18679820 |                 |       49.278 ms |   1117078.210 ms |
 ----------------------------------------------------------------------------------------------------------
desktop util 18.3%

perf sched lat -i perf.data.full.6.12.0.g6485cf5-master-virgin+lazy --sort=runtime -S 15 -T (lazy)

 ----------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Sum delay ms     |
 ----------------------------------------------------------------------------------------------------------
  massive_intr:(9)      |1900362.729 ms |  4443987 | avg:   0.066 ms | max:  23.997 ms | sum:294806.423 ms |
  dav1d-worker:(8)      | 164649.511 ms |   849691 | avg:   0.531 ms | max:  21.537 ms | sum:451592.116 ms |
  X:2508                |  53119.214 ms |   317117 | avg:   0.135 ms | max:  11.995 ms | sum:42917.248 ms |
  VizCompositorTh:4721  |  36975.869 ms |   177119 | avg:   0.214 ms | max:  14.383 ms | sum:37942.002 ms |
  ThreadPoolForeg:(50)  |  21374.995 ms |   161261 | avg:   0.333 ms | max:  21.738 ms | sum:53733.127 ms |
  chrome:(9)            |  17705.822 ms |   148144 | avg:   0.212 ms | max:  13.684 ms | sum:31335.048 ms |
  kwin_x11:2923         |  16293.433 ms |   195305 | avg:   0.095 ms | max:  11.748 ms | sum:18627.124 ms |
  VideoFrameCompo:4729  |  15607.441 ms |   192906 | avg:   0.198 ms | max:  26.420 ms | sum:38273.853 ms |
  kworker/u32:2+e:67    |  14288.685 ms |  2697907 | avg:   0.016 ms | max:  11.137 ms | sum:42158.772 ms |
  kworker/u32:4-e:69    |  10809.304 ms |  2030937 | avg:   0.014 ms | max:  12.982 ms | sum:29118.478 ms |
  kworker/u32:0-e:11    |   9670.564 ms |  1804924 | avg:   0.015 ms | max:  12.053 ms | sum:27919.959 ms |
  Compositor:(2)        |   7720.155 ms |   117533 | avg:   0.227 ms | max:  13.540 ms | sum:26654.650 ms |
  Chrome_ChildIOT:(8)   |   7168.602 ms |   267822 | avg:   0.237 ms | max:  15.399 ms | sum:63353.885 ms |
  Media:4728            |   6501.220 ms |   111442 | avg:   0.278 ms | max:  15.456 ms | sum:31018.087 ms |
  kworker/u32:1-e:66    |   2912.893 ms |   542561 | avg:   0.014 ms | max:   8.670 ms | sum: 7829.852 ms |
 ----------------------------------------------------------------------------------------------------------
  TOTAL:                |2302856.995 ms | 15141600 |                 |       68.303 ms |   1316229.527 ms |
 ----------------------------------------------------------------------------------------------------------
desktop util 17.5%

perf sched lat -i perf.data.full.6.12.0.g6485cf5-master-virgin+lazy --sort=runtime -S 15 -T (lazy)

 ----------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Sum delay ms     |
 ----------------------------------------------------------------------------------------------------------
  massive_intr:(9)      |1909494.321 ms |  4624111 | avg:   0.061 ms | max:  20.215 ms | sum:282640.873 ms |
  dav1d-worker:(8)      | 151579.745 ms |   850584 | avg:   0.505 ms | max:  29.119 ms | sum:429540.789 ms |
  X:2508                |  53047.311 ms |   311231 | avg:   0.137 ms | max:  17.974 ms | sum:42601.534 ms |
  VizCompositorTh:6893  |  37686.973 ms |   170197 | avg:   0.216 ms | max:  11.311 ms | sum:36749.812 ms |
  ThreadPoolForeg:(44)  |  21710.844 ms |   146025 | avg:   0.362 ms | max:  28.152 ms | sum:52931.194 ms |
  chrome:(9)            |  16248.165 ms |   131728 | avg:   0.210 ms | max:  16.033 ms | sum:27695.200 ms |
  VideoFrameCompo:6900  |  15484.140 ms |   169752 | avg:   0.223 ms | max:  13.208 ms | sum:37797.691 ms |
  kwin_x11:2923         |  15121.315 ms |   197942 | avg:   0.100 ms | max:  11.027 ms | sum:19749.568 ms |
  kthreadd:(5)          |  13906.868 ms |  2631171 | avg:   0.013 ms | max:  10.195 ms | sum:34262.373 ms |
  kworker/u32:1:6943    |  10539.864 ms |  1973511 | avg:   0.015 ms | max:  11.001 ms | sum:28668.999 ms |
  kworker/u32:10-:443   |  10211.276 ms |  1901469 | avg:   0.014 ms | max:  10.742 ms | sum:26474.182 ms |
  Compositor:(2)        |   7353.627 ms |   117889 | avg:   0.215 ms | max:  23.917 ms | sum:25324.895 ms |
  Chrome_ChildIOT:(6)   |   7005.489 ms |   256715 | avg:   0.243 ms | max:  17.614 ms | sum:62365.700 ms |
  Media:6899            |   6223.856 ms |   107444 | avg:   0.287 ms | max:  13.516 ms | sum:30849.560 ms |
  kworker/u32:0+e:5784  |   4113.435 ms |   783320 | avg:   0.015 ms | max:  10.960 ms | sum:11400.850 ms |
 ----------------------------------------------------------------------------------------------------------
  TOTAL:                |2299911.697 ms | 15979830 |                 |       29.361 ms |   1276128.457 ms |
 ----------------------------------------------------------------------------------------------------------
desktop util 17.0%

perf sched lat -i perf.data.full.6.12.0.g6485cf5-master-virgin+lazy --sort=runtime -S 15 -T (lazy)

 ----------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Sum delay ms     |
 ----------------------------------------------------------------------------------------------------------
  massive_intr:(9)      |1913482.381 ms |  4896729 | avg:   0.057 ms | max:  19.473 ms | sum:277651.582 ms |
  dav1d-worker:(8)      | 145472.762 ms |   860714 | avg:   0.489 ms | max:  22.005 ms | sum:420927.393 ms |
  X:2508                |  48692.053 ms |   333137 | avg:   0.129 ms | max:  13.263 ms | sum:43140.962 ms |
  VizCompositorTh:7462  |  38247.597 ms |   190153 | avg:   0.194 ms | max:  15.145 ms | sum:36874.818 ms |
  ThreadPoolForeg:(48)  |  20907.127 ms |   154823 | avg:   0.337 ms | max:  22.605 ms | sum:52112.821 ms |
  chrome:(9)            |  17173.249 ms |   145326 | avg:   0.206 ms | max:  15.988 ms | sum:29981.372 ms |
  kwin_x11:2923         |  16655.041 ms |   206479 | avg:   0.096 ms | max:  20.335 ms | sum:19726.071 ms |
  kworker/u32:3-e:6967  |  16211.545 ms |  3036606 | avg:   0.013 ms | max:  14.750 ms | sum:39934.280 ms |
  VideoFrameCompo:7469  |  15487.829 ms |   183878 | avg:   0.205 ms | max:  11.149 ms | sum:37702.223 ms |
  kworker/u32:10+:443   |  12143.699 ms |  2296001 | avg:   0.015 ms | max:  12.346 ms | sum:35025.772 ms |
  kthreadd:(3)          |   8806.543 ms |  1663465 | avg:   0.013 ms | max:  10.732 ms | sum:22443.117 ms |
  Compositor:(2)        |   7433.129 ms |   118757 | avg:   0.215 ms | max:  13.616 ms | sum:25591.282 ms |
  Chrome_ChildIOT:(7)   |   7305.424 ms |   270553 | avg:   0.230 ms | max:  15.037 ms | sum:62314.818 ms |
  kworker/u32:0-e:5784  |   7032.254 ms |  1305707 | avg:   0.016 ms | max:  10.448 ms | sum:21170.660 ms |
  Media:7468            |   6246.502 ms |   114745 | avg:   0.269 ms | max:  14.205 ms | sum:30891.031 ms |
 ----------------------------------------------------------------------------------------------------------
  TOTAL:                |2298180.079 ms | 16665916 |                 |       34.918 ms |   1271137.849 ms |
 ----------------------------------------------------------------------------------------------------------
desktop util 16.8%

perf sched lat -i perf.data.full.6.12.0.g6485cf5-master-virgin+lazy --sort=runtime -S 15 -T (full)

 ----------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Sum delay ms     |
 ----------------------------------------------------------------------------------------------------------
  massive_intr:(9)      |1875426.524 ms |  5559171 | avg:   0.057 ms | max:  19.817 ms | sum:315854.813 ms |
  dav1d-worker:(8)      | 157496.413 ms |   914670 | avg:   0.333 ms | max:  24.006 ms | sum:304944.125 ms |
  X:2502                |  52845.691 ms |   131220 | avg:   0.135 ms | max:  11.479 ms | sum:17656.654 ms |
  VizCompositorTh:4696  |  38197.336 ms |   141737 | avg:   0.137 ms | max:  18.307 ms | sum:19416.930 ms |
  ThreadPoolForeg:(51)  |  23793.225 ms |   134975 | avg:   0.261 ms | max:  31.737 ms | sum:35236.105 ms |
  kworker/u32:0+e:11    |  17946.233 ms |  3215726 | avg:   0.013 ms | max:  10.694 ms | sum:43005.547 ms |
  chrome:(9)            |  17507.091 ms |   126012 | avg:   0.193 ms | max:  23.634 ms | sum:24345.823 ms |
  VideoFrameCompo:4704  |  16601.850 ms |   182901 | avg:   0.113 ms | max:  19.033 ms | sum:20589.840 ms |
  kwin_x11:2893         |  15846.601 ms |   127128 | avg:   0.062 ms | max:  12.846 ms | sum: 7940.111 ms |
  kworker/u32:1-e:66    |  12786.558 ms |  2287904 | avg:   0.014 ms | max:  11.131 ms | sum:31205.819 ms |
  kworker/u32:2-e:67    |  10965.214 ms |  1980760 | avg:   0.014 ms | max:   9.275 ms | sum:27031.219 ms |
  Compositor:(2)        |   7367.684 ms |   110725 | avg:   0.122 ms | max:  13.756 ms | sum:13491.791 ms |
  Chrome_ChildIOT:(6)   |   7104.071 ms |   201664 | avg:   0.147 ms | max:  12.163 ms | sum:29735.926 ms |
  Media:4703            |   6053.286 ms |    98687 | avg:   0.150 ms | max:  12.463 ms | sum:14815.330 ms |
  kworker/u32:9-e:441   |   3229.031 ms |   573689 | avg:   0.014 ms | max:   9.088 ms | sum: 7951.808 ms |
 ----------------------------------------------------------------------------------------------------------
  TOTAL:                |2279751.696 ms | 16628663 |                 |       36.929 ms |    977433.245 ms |
 ----------------------------------------------------------------------------------------------------------
desktop util 17.8%

perf sched lat -i perf.data.full.6.12.0.g6485cf5-master-virgin+lazy --sort=runtime -S 15 -T (full)

 ----------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Sum delay ms     |
 ----------------------------------------------------------------------------------------------------------
  massive_intr:(9)      |1873997.628 ms |  4945112 | avg:   0.064 ms | max:  18.035 ms | sum:315257.596 ms |
  dav1d-worker:(8)      | 161723.573 ms |   879090 | avg:   0.354 ms | max:  26.070 ms | sum:310806.145 ms |
  X:2502                |  47884.248 ms |   134000 | avg:   0.148 ms | max:  12.068 ms | sum:19768.297 ms |
  VizCompositorTh:5049  |  37631.978 ms |   134948 | avg:   0.139 ms | max:  20.005 ms | sum:18802.998 ms |
  ThreadPoolForeg:(48)  |  23020.803 ms |   116519 | avg:   0.291 ms | max:  29.007 ms | sum:33854.527 ms |
  chrome:(9)            |  17637.863 ms |   119032 | avg:   0.204 ms | max:  19.511 ms | sum:24251.886 ms |
  VideoFrameCompo:5056  |  17097.638 ms |   147969 | avg:   0.140 ms | max:  11.943 ms | sum:20646.996 ms |
  kwin_x11:2893         |  16276.749 ms |   129768 | avg:   0.083 ms | max:  12.346 ms | sum:10783.203 ms |
  kworker/u32:0+e:11    |  13930.578 ms |  2506472 | avg:   0.012 ms | max:  10.309 ms | sum:30183.580 ms |
  kworker/u32:1-e:66    |  12053.560 ms |  2168793 | avg:   0.012 ms | max:   9.961 ms | sum:26421.505 ms |
  kthreadd:(4)          |   8379.998 ms |  1491424 | avg:   0.014 ms | max:   9.089 ms | sum:20605.324 ms |
  Compositor:(2)        |   7525.028 ms |   103715 | avg:   0.136 ms | max:  13.257 ms | sum:14070.263 ms |
  Chrome_ChildIOT:(8)   |   7029.210 ms |   190254 | avg:   0.155 ms | max:  20.680 ms | sum:29474.689 ms |
  Media:5055            |   6172.148 ms |    90196 | avg:   0.166 ms | max:  11.774 ms | sum:15011.825 ms |
  kworker/u32:9-i:441   |   5586.635 ms |  1007957 | avg:   0.011 ms | max:  14.529 ms | sum:11402.402 ms |
 ----------------------------------------------------------------------------------------------------------
  TOTAL:                |2278306.187 ms | 15846728 |                 |       79.954 ms |    974772.318 ms |
 ----------------------------------------------------------------------------------------------------------
desktop util 17.8%

perf sched lat -i perf.data.full.6.12.0.g6485cf5-master-virgin+lazy --sort=runtime -S 15 -T (full)

 ----------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Sum delay ms     |
 ----------------------------------------------------------------------------------------------------------
  massive_intr:(9)      |1876339.600 ms |  5569489 | avg:   0.057 ms | max:  20.473 ms | sum:317771.140 ms |
  dav1d-worker:(8)      | 157653.936 ms |   929556 | avg:   0.331 ms | max:  21.664 ms | sum:307315.109 ms |
  X:2502                |  51731.371 ms |   145167 | avg:   0.132 ms | max:  15.995 ms | sum:19180.803 ms |
  VizCompositorTh:5879  |  37636.433 ms |   145822 | avg:   0.128 ms | max:  13.675 ms | sum:18675.236 ms |
  ThreadPoolForeg:(50)  |  22727.468 ms |   118237 | avg:   0.297 ms | max:  28.004 ms | sum:35105.904 ms |
  kworker/u32:1+e:66    |  18944.682 ms |  3398565 | avg:   0.013 ms | max:  12.287 ms | sum:44070.394 ms |
  chrome:(9)            |  17697.852 ms |   128113 | avg:   0.193 ms | max:  19.843 ms | sum:24676.480 ms |
  kthreadd:(3)          |  17129.315 ms |  3074837 | avg:   0.013 ms | max:  14.654 ms | sum:41107.807 ms |
  VideoFrameCompo:5886  |  16956.260 ms |   167159 | avg:   0.125 ms | max:  12.120 ms | sum:20955.174 ms |
  kwin_x11:2893         |  15744.805 ms |   128937 | avg:   0.076 ms | max:  15.872 ms | sum: 9811.977 ms |
  kworker/u32:3-e:5390  |   7597.088 ms |  1370365 | avg:   0.013 ms | max:   9.726 ms | sum:18185.745 ms |
  Compositor:(2)        |   7591.781 ms |   110688 | avg:   0.127 ms | max:  13.707 ms | sum:14065.114 ms |
  Chrome_ChildIOT:(6)   |   6916.739 ms |   198787 | avg:   0.149 ms | max:  12.443 ms | sum:29695.890 ms |
  Media:5885            |   6161.402 ms |    95843 | avg:   0.158 ms | max:  13.595 ms | sum:15189.381 ms |
  kworker/u32:0-e:11    |   3038.438 ms |   520012 | avg:   0.014 ms | max:   9.409 ms | sum: 7259.283 ms |
 ----------------------------------------------------------------------------------------------------------
  TOTAL:                |2282262.311 ms | 17152068 |                 |       28.004 ms |    990068.590 ms |
 ----------------------------------------------------------------------------------------------------------
desktop util 17.8%

Re: [PATCH 0/5] sched: Lazy preemption muck
Posted by Thomas Meyer 2 weeks, 6 days ago
Mike Galbraith <efault@gmx.de> writes:
> Full per run summaries attached, high speed scroll version below.
>
> desktop util 18.1% static voluntary - virgin source
> desktop util 18.3% static voluntary - +lazy patches
> desktop util 17.5% lazy             - ditto...
> desktop util 17.0% lazy
> desktop util 16.8% lazy
> desktop util 17.8% full
> desktop util 17.8% full
> desktop util 17.8% full

Can you please elaborate a bit more on where those values, e.g. 18.1%, come from?
How do I get those? I couldn't find a connection to your raw data.

Sorry for asking this probably stupid question,

mfg
thomas
Re: [PATCH 0/5] sched: Lazy preemption muck
Posted by Mike Galbraith 2 weeks, 5 days ago
On Thu, 2024-11-07 at 18:21 +0100, Thomas Meyer wrote:
>
> Mike Galbraith <efault@gmx.de> writes:
> > Full per run summaries attached, high speed scroll version below.
> >
> > desktop util 18.1% static voluntary - virgin source
> > desktop util 18.3% static voluntary - +lazy patches
> > desktop util 17.5% lazy             - ditto...
> > desktop util 17.0% lazy
> > desktop util 16.8% lazy
> > desktop util 17.8% full
> > desktop util 17.8% full
> > desktop util 17.8% full
>
> Can you please elaborate a bit more were those values, e.g. 18,1%, come from?
> How to get those? I couldn't find a connection to your raw data.

I use a slightly twiddled perf to profile, a la perf sched record
<whatever load>, then perf sched lat to emit the profile, then do the
total runtime accumulation vs competitor runtime arithmetic.  Both
competitor and desktop load are selected for maximal hands-off
consistency: poke start, watch yet another 5 minutes of BigBuckBunny to
make sure the internet doesn't hiccup during recording.. and done.
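
For example, the numbers above are consistent with desktop util being
the share of total runtime not eaten by the compute competitor, i.e.
(TOTAL - massive_intr) / TOTAL: (2281560.794 - 1869421.285) /
2281560.794 ~= 18.1% for the first voluntary run.  A minimal sketch of
that arithmetic (illustrative only, not the actual script):

	#include <stdio.h>

	/* desktop util = share of total runtime not consumed by the
	 * compute competitor (massive_intr in these runs) */
	static double desktop_util(double total_ms, double competitor_ms)
	{
		return 100.0 * (total_ms - competitor_ms) / total_ms;
	}

	int main(void)
	{
		/* first static voluntary run from the tables above */
		printf("desktop util %.1f%%\n",
		       desktop_util(2281560.794, 1869421.285)); /* 18.1% */
		return 0;
	}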

> Sorry for asking this probably stupid question,

Nah, the only silly question is one not pondered first.

	-Mike
Re: [PATCH 0/5] sched: Lazy preemption muck
Posted by Sebastian Andrzej Siewior 1 month, 2 weeks ago
On 2024-10-07 09:46:09 [+0200], Peter Zijlstra wrote:
> Hi!
> 
> During LPC Thomas reminded me that the lazy preemption stuff was not there yet.
> 
> So here goes, robot says it builds, and I checked both a regular and PREEMPT_RT
> build boots and can change the mode.
> 
> Please have a poke.

I poked. Very happy. Can change mode as well.

Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Sebastian
Re: [PATCH 0/5] sched: Lazy preemption muck
Posted by Sebastian Andrzej Siewior 1 month, 2 weeks ago
On 2024-10-07 09:46:09 [+0200], Peter Zijlstra wrote:
> Hi!
> 
> During LPC Thomas reminded me that the lazy preemption stuff was not there yet.
> 
> So here goes, robot says it builds, and I checked both a regular and PREEMPT_RT
> build boots and can change the mode.
> 
> Please have a poke.

While comparing this vs what I have:
- need_resched()
  It checked both (tif_need_resched_lazy() || tif_need_resched()) while
  now it only looks at tif_need_resched().
  Also ensured that raw_irqentry_exit_cond_resched() does not trigger on
  lazy. 
  I guess you can argue both ways what makes sense, just noting…

- __account_cfs_rq_runtime() and hrtick_start_fair()
  Both have a resched_curr() instead of resched_curr_lazy(). Is this on
  purpose?

This is actually the main difference (ignoring moving the RT bits and
dynamic-sched). The lazy-resched is slightly different but it should
do the same thing.
I also have tracing and riscv bits which I'll port tomorrow, test and
add to your pile.

Sebastian
Re: [PATCH 0/5] sched: Lazy preemption muck
Posted by Peter Zijlstra 1 month, 2 weeks ago
On Tue, Oct 08, 2024 at 05:32:32PM +0200, Sebastian Andrzej Siewior wrote:
> On 2024-10-07 09:46:09 [+0200], Peter Zijlstra wrote:
> > Hi!
> > 
> > During LPC Thomas reminded me that the lazy preemption stuff was not there yet.
> > 
> > So here goes, robot says it builds, and I checked both a regular and PREEMPT_RT
> > build boots and can change the mode.
> > 
> > Please have a poke.
> 
> While comparing this vs what I have:
> - need_resched()
>   It checked both (tif_need_resched_lazy() || tif_need_resched()) while
>   now it only looks at tif_need_resched().
>   Also ensured that raw_irqentry_exit_cond_resched() does not trigger on
>   lazy. 
>   I guess you can argue both ways what makes sense, just noting…

Right, changing need_resched() would mean auditing all existing users,
and there are *MANY* of them.

> - __account_cfs_rq_runtime() and hrtick_start_fair()
>   Both have a resched_curr() instead of resched_curr_lazy(). Is this on
>   purpose?

Sorta, so __account_cfs_rq_runtime() is bandwidth enforcement, and I
figured that since bandwidth is like resource isolation it should be as
accurate as possible.

hrtick is disabled by default (because expensive) and so it doesn't
matter much, but its purpose is to increase accuracy and hence I left
it untouched for now.
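
Roughly, the distinction at play (a simplified sketch, not the real
implementations, which also handle the local-CPU case, preempt counts
and polling idle):

	static void resched_curr_sketch(struct rq *rq)
	{
		/* demand preemption at the next preemption point ... */
		set_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED);
		/* ... and kick the CPU so it notices right away */
		smp_send_reschedule(cpu_of(rq));
	}

	static void resched_curr_lazy_sketch(struct rq *rq)
	{
		/* only ask: picked up at the tick / return to user */
		set_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY);
	}

So for bandwidth enforcement the hard variant keeps the isolation tight.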
Re: [PATCH 0/5] sched: Lazy preemption muck
Posted by Ankur Arora 1 month, 2 weeks ago
Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> On 2024-10-07 09:46:09 [+0200], Peter Zijlstra wrote:
>> Hi!
>>
>> During LPC Thomas reminded me that the lazy preemption stuff was not there yet.
>>
>> So here goes, robot says it builds, and I checked both a regular and PREEMPT_RT
>> build boots and can change the mode.
>>
>> Please have a poke.
>
> I have also tracing and riscv bits which I port tomorrow, test and add
> to your pile.

I also have a set of RCU, powerpc, and tracing bits. Since you are
working on the tracing part as well, maybe we can just unify our
patches before sending to Peter?

My patches are at:

 git://github.com/terminus/linux/ sched/lazy

--
ankur
Re: [PATCH 0/5] sched: Lazy preemption muck
Posted by Ankur Arora 1 month, 2 weeks ago
Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> On 2024-10-07 09:46:09 [+0200], Peter Zijlstra wrote:
>> Hi!
>>
>> During LPC Thomas reminded me that the lazy preemption stuff was not there yet.
>>
>> So here goes, robot says it builds, and I checked both a regular and PREEMPT_RT
>> build boots and can change the mode.
>>
>> Please have a poke.
>
> While comparing this vs what I have:
> - need_resched()
>   It checked both (tif_need_resched_lazy() || tif_need_resched()) while
>   now it only looks at tif_need_resched().
>   Also ensured that raw_irqentry_exit_cond_resched() does not trigger on
>   lazy.
>   I guess you can argue both ways what makes sense, just noting…

I think we want need_resched() to be only tif_need_resched(). That way
preemption in lazy mode *only* happens at the user mode boundary.

If the scheduler wants to preempt imminently, it just sets (or upgrades to)
TIF_NEED_RESCHED.
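
In code terms, something like this (a sketch using the names from this
discussion, not necessarily the exact patch; TIF_NEED_RESCHED_LAZY is
the new bit from this series):

	/* sketch: need_resched() only reacts to the immediate bit */
	static __always_inline bool need_resched(void)
	{
		return unlikely(test_thread_flag(TIF_NEED_RESCHED));
	}

	/* hypothetical helper for callers that do want to see lazy too */
	static __always_inline bool need_resched_or_lazy(void)
	{
		return unlikely(test_thread_flag(TIF_NEED_RESCHED) ||
				test_thread_flag(TIF_NEED_RESCHED_LAZY));
	}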

> - __account_cfs_rq_runtime() and hrtick_start_fair()
>   Both have a resched_curr() instead of resched_curr_lazy(). Is this on
>   purpose?
>
> This is actually the main difference (ignoring the moving the RT bits
> and dynamic-sched). The lazy-resched is slightly different but it should
> do the same thing.
> I have also tracing and riscv bits which I port tomorrow, test and add
> to your pile.
>
> Sebastian


--
ankur
Re: [PATCH 0/5] sched: Lazy preemption muck
Posted by Sebastian Andrzej Siewior 1 month, 2 weeks ago
On 2024-10-08 21:40:05 [-0700], Ankur Arora wrote:
> > While comparing this vs what I have:
> > - need_resched()
> >   It checked both (tif_need_resched_lazy() || tif_need_resched()) while
> >   now it only looks at tif_need_resched().
> >   Also ensured that raw_irqentry_exit_cond_resched() does not trigger on
> >   lazy.
> >   I guess you can argue both ways what makes sense, just noting…
> 
> I think we want need_resched() to be only tif_need_resched(). That way
> preemption in lazy mode *only* happens at the user mode boundary.

There are places such as __clear_extent_bit() or select_collect() where
need_resched() is checked and, if it returns 0, they loop again. For
these kinds of users it would probably make sense to allow them to
preempt themselves.
We could also add a new function which checks both, and audit all users
to check what would make sense based on $criteria.
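
The users I mean are loops of roughly this shape (just a sketch,
more_work() / do_one_chunk() are placeholders):

	spin_lock(&lock);
	while (more_work()) {
		do_one_chunk();
		if (need_resched())	/* with lazy-only semantics this no   */
			break;		/* longer fires on the lazy bit alone */
	}
	spin_unlock(&lock);
	cond_resched();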

Sebastian
Re: [PATCH 0/5] sched: Lazy preemption muck
Posted by Peter Zijlstra 1 month, 2 weeks ago
On Wed, Oct 09, 2024 at 08:20:19AM +0200, Sebastian Andrzej Siewior wrote:
> On 2024-10-08 21:40:05 [-0700], Ankur Arora wrote:
> > > While comparing this vs what I have:
> > > - need_resched()
> > >   It checked both (tif_need_resched_lazy() || tif_need_resched()) while
> > >   now it only looks at tif_need_resched().
> > >   Also ensured that raw_irqentry_exit_cond_resched() does not trigger on
> > >   lazy.
> > >   I guess you can argue both ways what makes sense, just noting…
> > 
> > I think we want need_resched() to be only tif_need_resched(). That way
> > preemption in lazy mode *only* happens at the user mode boundary.
> 
> There are places such as __clear_extent_bit() or select_collect() where
> need_resched() is checked and if 0 they loop again. For these kind of
> users it would probably make sense to allow them to preempt themself.
> We could also add a new function which checks both and audit all users
> and check what would make sense base on $criteria.

Do we really need this -- wasn't the idea to have things 'delay' until
the actual NEED_RESCHED bit gets set?

Re: [PATCH 0/5] sched: Lazy preemption muck
Posted by Steven Rostedt 1 month, 2 weeks ago
On Wed, 9 Oct 2024 10:02:02 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Oct 09, 2024 at 08:20:19AM +0200, Sebastian Andrzej Siewior wrote:
> > On 2024-10-08 21:40:05 [-0700], Ankur Arora wrote:  
> > > > While comparing this vs what I have:
> > > > - need_resched()
> > > >   It checked both (tif_need_resched_lazy() || tif_need_resched()) while
> > > >   now it only looks at tif_need_resched().
> > > >   Also ensured that raw_irqentry_exit_cond_resched() does not trigger on
> > > >   lazy.
> > > >   I guess you can argue both ways what makes sense, just noting…  
> > > 
> > > I think we want need_resched() to be only tif_need_resched(). That way
> > > preemption in lazy mode *only* happens at the user mode boundary.  
> > 
> > There are places such as __clear_extent_bit() or select_collect() where
> > need_resched() is checked and if 0 they loop again. For these kind of
> > users it would probably make sense to allow them to preempt themself.
> > We could also add a new function which checks both and audit all users
> > and check what would make sense base on $criteria.  
> 
> Do we really need this -- wasn't the idea to have thing 'delay' until
> the actual NEED_RESCHED bit gets set?

If we think about it as what would happen with the current PREEMPT_NONE,
wouldn't need_resched() return true as soon as NEED_RESCHED is set? That
means, with PREEMPT_AUTO, it should return true if LAZY_NEED_RESCHED is
set, right? That would make it act the same in both cases.

-- Steve
Re: [PATCH 0/5] sched: Lazy preemption muck
Posted by Thomas Gleixner 1 month, 2 weeks ago
On Wed, Oct 09 2024 at 10:01, Steven Rostedt wrote:
> On Wed, 9 Oct 2024 10:02:02 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Wed, Oct 09, 2024 at 08:20:19AM +0200, Sebastian Andrzej Siewior wrote:
>> > On 2024-10-08 21:40:05 [-0700], Ankur Arora wrote:  
>> > > > While comparing this vs what I have:
>> > > > - need_resched()
>> > > >   It checked both (tif_need_resched_lazy() || tif_need_resched()) while
>> > > >   now it only looks at tif_need_resched().
>> > > >   Also ensured that raw_irqentry_exit_cond_resched() does not trigger on
>> > > >   lazy.
>> > > >   I guess you can argue both ways what makes sense, just noting…  
>> > > 
>> > > I think we want need_resched() to be only tif_need_resched(). That way
>> > > preemption in lazy mode *only* happens at the user mode boundary.  
>> > 
>> > There are places such as __clear_extent_bit() or select_collect() where
>> > need_resched() is checked and if 0 they loop again. For these kind of
>> > users it would probably make sense to allow them to preempt themself.
>> > We could also add a new function which checks both and audit all users
>> > and check what would make sense base on $criteria.  
>> 
>> Do we really need this -- wasn't the idea to have thing 'delay' until
>> the actual NEED_RESCHED bit gets set?
>
> If we think about it as what would happen with the current PREEMPT_NONE,
> wouldn't need_resched() return true as soon as NEED_RESCHED is set? That
> means, with PREEMPT_AUTO, it should return true if LAZY_NEED_RESCHED is
> set, right? That would make it act the same in both cases.

I don't think so. Quite some of these need_resched() things have been
sprinkled around to address the issues with PREEMPT_NONE.

We need to look at those places and figure out whether they need it when
LAZY is enabled. There might be a few which want to look at both flags,
but my expectation is that those are less than 5% of the overall usage.

Peter's choice is the right one. That forces us to look at all of them
and figure out whether they need to be extended to include the lazy bit
or not. Those which do not need it can be eliminated when LAZY is in
effect because that will preempt on the next possible preemption point
once the non-lazy bit is set in the tick.

Remember, the goal is to eliminate all except LAZY (RT is a different
scope) and make the kernel behave correctly to the expectation of LAZY
(somewhere between NONE and VOLUNTARY) and non-LAZY (equivalent to
FULL).

The reason why this works is that preempt_enable() actually has a
meaning while it does not with NONE.
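
That meaning is roughly this (the standard preemptible definition,
slightly abbreviated):

	#define preempt_enable()				\
	do {							\
		barrier();					\
		if (unlikely(preempt_count_dec_and_test()))	\
			__preempt_schedule();			\
	} while (0)

i.e. dropping the last preempt count is itself a preemption point,
whereas with NONE it is effectively just a barrier().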

Thanks,

        tglx
Re: [PATCH 0/5] sched: Lazy preemption muck
Posted by Steven Rostedt 1 month, 2 weeks ago
On Wed, 09 Oct 2024 22:13:51 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> > If we think about it as what would happen with the current PREEMPT_NONE,
> > wouldn't need_resched() return true as soon as NEED_RESCHED is set? That
> > means, with PREEMPT_AUTO, it should return true if LAZY_NEED_RESCHED is
> > set, right? That would make it act the same in both cases.  
> 
> I don't think so. Quite some of these need_resched() things have been
> sprinkled around to address the issues with PREEMPT_NONE.

My point is that the need_resched() checks that are sprinkled around are
testing whether a schedule has been requested or not. From that point
alone, LAZY_NEED_RESCHED is something that requested a schedule.

> 
> We need to look at those places and figure out whether they need it when
> LAZY is enabled. There might be a few which want to look at both flags,
> but my expectation is that those are less than 5% of the overall usage.
> 
> Peter's choice is the right one. That forces us to look at all of them
> and figure out whether they need to be extended to include the lazy bit
> or not. Those which do not need it can be eliminated when LAZY is in
> effect because that will preempt on the next possible preemption point
> once the non-lazy bit is set in the tick.
> 
> Remember, the goal is to eliminate all except LAZY (RT is a different
> scope) and make the kernel behave correctly to the expectation of LAZY
> (somewhere between NONE and VOLUNTARY) and non-LAZY (equivalent to
> FULL).
> 
> The reason why this works is that preempt_enable() actually has a
> meaning while it does not with NONE.

Looking around, I see the pattern of checking need_resched() from within a
loop where a spinlock is held. Then after the break of the loop and release
of the spinlock, cond_resched() is checked, and the loop is entered again.

Thus, I guess this is the reason you are saying that it should just check
NEED_RESCHED and not the LAZY variant. Because if we remove that
cond_resched() then it would just re-enter the loop again with the LAZY
being set.

Hmm, but then again...

Perhaps these cond_resched() calls are proper? That is, the need_resched() /
cond_resched() is not something that is being done only for PREEMPT_NONE, but
for preempt/voluntary kernels too. Maybe these cond_resched() calls should stay?
If we spin in the loop for one more tick, that is actually changing the
behavior of PREEMPT_NONE and PREEMPT_VOLUNTARY, as the need_resched()/cond_resched()
helps with latency. If we just wait for the next tick, these loops (and
there's a lot of them) will all now run for one tick longer than they would if
PREEMPT_NONE or PREEMPT_VOLUNTARY were set today.


-- Steve
Re: [PATCH 0/5] sched: Lazy preemption muck
Posted by Tianchen Ding 1 month, 2 weeks ago
On 2024/10/10 04:43, Steven Rostedt wrote:

[...]

> Hmm, but then again...
> 
> Perhaps these cond_resched() is proper? That is, the need_resched() /
> cond_resched() is not something that is being done for PREEMPT_NONE, but
> for preempt/voluntary kernels too. Maybe these cond_resched() should stay?
> If we spin in the loop for one more tick, that is actually changing the
> behavior of PREEMPT_NONE and PREEMPT_VOLUNTARY, as the need_resched()/cond_resched()
> helps with latency. If we just wait for the next tick, these loops (and
> there's a lot of them) will all now run for one tick longer than if
> PREEMPT_NONE or PREEMPT_VOLUNTARY were set today.
> 

Agree.

And for PREEMPT_LAZIEST, this becomes worse. The fair_class tasks will be 
delayed more than 1 tick. They may be starved until a non-fair class task comes 
to "save" them.

cond_resched() is designed for NONE/VOLUNTARY to avoid spinning in the kernel
and to prevent softlockups. However, it is a nop in PREEMPT_LAZIEST, and things
may be broken...
Re: [PATCH 0/5] sched: Lazy preemption muck
Posted by Thomas Gleixner 1 month, 2 weeks ago
On Thu, Oct 10 2024 at 11:12, Tianchen Ding wrote:
> On 2024/10/10 04:43, Steven Rostedt wrote:
>> Perhaps these cond_resched() is proper? That is, the need_resched() /
>> cond_resched() is not something that is being done for PREEMPT_NONE, but
>> for preempt/voluntary kernels too. Maybe these cond_resched() should stay?
>> If we spin in the loop for one more tick, that is actually changing the
>> behavior of PREEMPT_NONE and PREEMPT_VOLUNTARY, as the need_resched()/cond_resched()
>> helps with latency. If we just wait for the next tick, these loops (and
>> there's a lot of them) will all now run for one tick longer than if
>> PREEMPT_NONE or PREEMPT_VOLUNTARY were set today.
>> 
>
> Agree.
>
> And for PREEMPT_LAZIEST, this becomes worse. The fair_class tasks will be 
> delayed more than 1 tick. They may be starved until a non-fair class task comes 
> to "save" them.

Everybody agreed already that PREEMPT_LAZIEST is silly and not going to
happen. Nothing to see here.

> cond_resched() is designed for NONE/VOLUNTARY to avoid spinning in kernel and 
> prevent softlockup. However, it is a nop in PREEMPT_LAZIEST, and things may be 
> broken...

cond_resched() is not designed. It's an ill-defined bandaid and the
purpose of LAZY is to remove it completely along with the preemption
models which depend on it.

Thanks,

        tglx
Re: [PATCH 0/5] sched: Lazy preemption muck
Posted by Thomas Gleixner 1 month, 2 weeks ago
On Wed, Oct 09 2024 at 16:43, Steven Rostedt wrote:
> On Wed, 09 Oct 2024 22:13:51 +0200
> Thomas Gleixner <tglx@linutronix.de> wrote:
>> The reason why this works is that preempt_enable() actually has a
>> meaning while it does not with NONE.
>
> Looking around, I see the pattern of checking need_resched() from within a
> loop where a spinlock is held. Then after the break of the loop and release
> of the spinlock, cond_resched() is checked, and the loop is entered again.
>
> Thus, I guess this is the reason you are saying that it should just check
> NEED_RESCHED and not the LAZY variant. Because if we remove that
> cond_resched() then it would just re-enter the loop again with the LAZY
> being set.
>
> Hmm, but then again...
>
> Perhaps these cond_resched() is proper? That is, the need_resched() /
> cond_resched() is not something that is being done for PREEMPT_NONE, but
> for preempt/voluntary kernels too. Maybe these cond_resched() should stay?
> If we spin in the loop for one more tick, that is actually changing the
> behavior of PREEMPT_NONE and PREEMPT_VOLUNTARY, as the need_resched()/cond_resched()
> helps with latency. If we just wait for the next tick, these loops (and
> there's a lot of them) will all now run for one tick longer than if
> PREEMPT_NONE or PREEMPT_VOLUNTARY were set today.

You are looking at it from the wrong perspective. You are trying to
preserve the status quo. I know that's the path of least effort but it's
the fundamentally wrong approach.

    spin_lock(L);
    while ($cond) {
          do_stuff();
          if (need_resched()) {
          	spin_unlock(L);
                resched();
                spin_lock(L);
          }
    }
    spin_unlock(L);

is the bogus pattern which was introduced to deal with the NONE
shortcomings. That's what we want to get rid of and not proliferate.

For the transition phase we obviously need to do:

    while ($cond) {
          spin_lock(L);
          do_stuff();
          spin_unlock(L);
          cond_resched();
    }

And once all the problems with LAZY are sorted then this cond_resched()
line just goes away and the loop looks like this:

    while ($cond) {
          spin_lock(L);
          do_stuff();
          spin_unlock(L);
    }

There is absolutely no argument that the spinlock-held section needs to
span the loop. We fixed up several of these constructs over the years
and none of them caused a performance regression. Quite the contrary,
quite a few of them improved performance because dropping the lock lets
other CPUs interleave.

Seriously, this crap preserving mindset has to stop. If we go new ways
then we go them all the way.

There is a cost involved for cleaning the existing crap up, but that
cost is well worth it, because anything else is just adding to the ever
increasing accumulation of technical debt. We have enough of that
already.

Thanks,

        tglx
RE: [PATCH 0/5] sched: Lazy preemption muck
Posted by David Laight 1 month, 2 weeks ago
...
> And once all the problems with LAZY are sorted then this cond_resched()
> line just goes away and the loop looks like this:
> 
>     while ($cond) {
>           spin_lock(L);
>           do_stuff();
>           spin_unlock(L);
>     }

The problem with that pattern is the cost of the atomics.
They can easily be significant, especially if there are
a lot of iterations and do_stuff() is cheap.
If $cond needs the lock, the code is really:
	spin_lock(L);
	while ($cond) {
		do_stuff();
		spin_unlock(L);
		spin_lock(L);
	}
	spin_unlock(L);
which makes it even more obvious that you need a cheap
test to optimise away the unlock/lock pair.

Perhaps it could be wrapped inside a spin_relax(L)?
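
Something along these lines, say (spin_relax() is hypothetical, it does
not exist today):

	static inline void spin_relax(spinlock_t *lock)
	{
		if (need_resched() || spin_is_contended(lock)) {
			spin_unlock(lock);
			/* preemption / waiters can get in here */
			spin_lock(lock);
		}
	}

so the common uncontended iteration pays only a cheap test.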

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
RE: [PATCH 0/5] sched: Lazy preemption muck
Posted by Thomas Gleixner 1 month, 2 weeks ago
On Thu, Oct 10 2024 at 10:23, David Laight wrote:
> ...
>> And once all the problems with LAZY are sorted then this cond_resched()
>> line just goes away and the loop looks like this:
>> 
>>     while ($cond) {
>>           spin_lock(L);
>>           do_stuff();
>>           spin_unlock(L);
>>     }
>
> The problem with that pattern is the cost of the atomics.
> Thay can easily be significant especially if there are
> a lot of iterations and do_stuff() is cheap;
>
> If $cond needs the lock, the code is really:
> 	spin_lock(L);
> 	while ($cond) {
> 		do_stuff();
> 		spin_unlock(L);
> 		spin_lock(L);
> 	}
> 	spin_unlock(L);
>
> which make it even more obvious that you need a cheap
> test to optimise away the unlock/lock pair.

You cannot optimize the unlock/lock pair away for a large number of
iterations because then you bring back the problem of extended
latencies.

It does not matter whether $cond is cheap and do_stuff() is cheap. If
you have enough iterations then even a cheap do_stuff() causes massive
latencies, unless you keep the horrible cond_resched() mess, which we
are trying to remove.

What you are proposing is a programming antipattern and the lock/unlock
around do_stuff() in the clean loop I outlined is mostly free when there
is no contention, unless you use a pointless micro benchmark which has
an empty (or almost empty) do_stuff() implementation. We are not
optimizing for completely irrelevant theoretical nonsense.

Thanks,

        tglx
RE: [PATCH 0/5] sched: Lazy preemption muck
Posted by David Laight 1 month, 2 weeks ago
From: Thomas Gleixner
> Sent: 13 October 2024 20:02
> 
> On Thu, Oct 10 2024 at 10:23, David Laight wrote:
> > ...
> >> And once all the problems with LAZY are sorted then this cond_resched()
> >> line just goes away and the loop looks like this:
> >>
> >>     while ($cond) {
> >>           spin_lock(L);
> >>           do_stuff();
> >>           spin_unlock(L);
> >>     }
> >
> > The problem with that pattern is the cost of the atomics.
> > Thay can easily be significant especially if there are
> > a lot of iterations and do_stuff() is cheap;
> >
> > If $cond needs the lock, the code is really:
> > 	spin_lock(L);
> > 	while ($cond) {
> > 		do_stuff();
> > 		spin_unlock(L);
> > 		spin_lock(L);
> > 	}
> > 	spin_unlock(L);
> >
> > which make it even more obvious that you need a cheap
> > test to optimise away the unlock/lock pair.
> 
> You cannot optimize the unlock/lock pair away for a large number of
> iterations because then you bring back the problem of extended
> latencies.
> 
> It does not matter whether $cond is cheap and do_stuff() is cheap. If
> you have enough iterations then even a cheap do_stuff() causes massive
> latencies, unless you keep the horrible cond_resched() mess, which we
> are trying to remove.

While cond_resched() can probably go, you need a cheap need_resched()
so the loop above can contain:
		if (need_resched()) {
			spin_unlock(L);
			spin_lock(L);
		}
to avoid the atomics when both $cond and do_stuff() are cheap
but there are a lot of iterations.

There will also be cases where it isn't anywhere near as simple
as unlock/lock (e.g. traversing a linked list) because additional
code is needed to ensure the loop can be continued.

> What you are proposing is a programming antipattern and the lock/unlock
> around do_stuff() in the clean loop I outlined is mostly free when there
> is no contention, unless you use a pointless micro benchmark which has
> an empty (or almost empty) do_stuff() implementation. We are not
> optimizing for completely irrelevant theoretical nonsense.

Aren't you adding an extra pair of atomics on every iteration?
That is going to be noticeable.
Never mind the cases where it isn't that simple.

	David

> 
> Thanks,
> 
>         tglx
> 

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Re: [PATCH 0/5] sched: Lazy preemption muck
Posted by Steven Rostedt 1 month, 2 weeks ago
On Wed, 09 Oct 2024 23:06:00 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> > Perhaps these cond_resched() is proper? That is, the need_resched() /
> > cond_resched() is not something that is being done for PREEMPT_NONE, but
> > for preempt/voluntary kernels too. Maybe these cond_resched() should stay?
> > If we spin in the loop for one more tick, that is actually changing the
> > behavior of PREEMPT_NONE and PREEMPT_VOLUNTARY, as the need_resched()/cond_resched()
> > helps with latency. If we just wait for the next tick, these loops (and
> > there's a lot of them) will all now run for one tick longer than if
> > PREEMPT_NONE or PREEMPT_VOLUNTARY were set today.  
> 
> You are looking at it from the wrong perspective. You are trying to
> preserve the status quo. I know that's the path of least effort but it's
> the fundamentally wrong approach.
> 
>     spin_lock(L);
>     while ($cond) {
>           do_stuff();
>           if (need_resched()) {
>           	spin_unlock(L);
>                 resched();
>                 spin_lock(L);
>           }
>     }
>     spin_unlock(L);
> 
> is the bogus pattern which was introduced to deal with the NONE
> shortcomings. That's what we want to get rid of and not proliferate.

So this is actually a different issue that you are trying to address. But I
don't see how it had to deal with NONE, as even PREEMPT would suffer from
that loop, right? As you can't preempt while holding the spinlock.
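
(For reference, on a !PREEMPT_RT kernel spin_lock() boils down to roughly
this, which is why nothing preempts inside the section:

	static __always_inline void spin_lock_sketch(raw_spinlock_t *lock)
	{
		preempt_disable();	/* no preemption until the unlock */
		/* ... arch acquisition of the lock ... */
	}

just a sketch, the real thing also does lockdep annotations etc.)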

> 
> For the transition phase we obviously need to do:
> 
>     while ($cond) {
>           spin_lock(L);
>           do_stuff();
>           spin_unlock(L);
>           cond_resched();
>     }

But if $cond needs to be protected by spin_lock(), what then?

	spin_lock();
	while ($cond) {
		do_stuff();
		spin_unlock();
		spin_lock();
	}
	spin_unlock();

?

> 
> And once all the problems with LAZY are sorted then this cond_resched()
> line just goes away and the loop looks like this:
> 
>     while ($cond) {
>           spin_lock(L);
>           do_stuff();
>           spin_unlock(L);
>     }
> 
> There is absolutely no argument that the spinlock held section needs to
> spawn the loop. We fixed up several of these constructs over the years
> and none of them caused a performance regression. Quite the contrary
> quite some of them improved performance because dropping the lock lets
> other CPUs interleave.
> 
> Seriously, this crap preserving mindset has to stop. If we go new ways
> then we go them all the way.

It's not about "crap preserving" but more of taking smaller steps. Then we
can see where a regression happened if one does come up. Kind of like how
you did the x86 64bit/32bit merge. Do steps that keep things as close to
what they were at the start and slowly move toward your goals.

-- Steve
Re: [PATCH 0/5] sched: Lazy preemption muck
Posted by Thomas Gleixner 1 month, 2 weeks ago
On Wed, Oct 09 2024 at 17:19, Steven Rostedt wrote:
> On Wed, 09 Oct 2024 23:06:00 +0200
> Thomas Gleixner <tglx@linutronix.de> wrote:
>> For the transition phase we obviously need to do:
>> 
>>     while ($cond) {
>>           spin_lock(L);
>>           do_stuff();
>>           spin_unlock(L);
>>           cond_resched();
>>     }
>
> But if $cond needs to be protected by spin_lock(), what then?
>
> 	spin_lock();
> 	while ($cond) {
> 		do_stuff();
> 		spin_unlock();
> 		spin_lock();
> 	}
> 	spin_unlock();
>

Seriously? The proper pattern for this is to do either:

	while (READ_ONCE($cond)) {
		scoped_guard(spinlock)(L) {
                	if ($cond)
                		break;
			do_stuff();
                }
		cond_resched(); // To be removed
	}

or in case that $cond is more complex and needs lock protection:

	while (true) {
		scoped_guard(spinlock)(L) {
                	if ($cond)
                		break;
			do_stuff();
                }
		cond_resched(); // To be removed
	}

You get the idea, no?
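
For a concrete (if contrived) illustration of the first variant, here is a
minimal sketch of a queue-draining loop written against the kernel's
scoped_guard() helper from <linux/cleanup.h> (the spinlock guard comes from
<linux/spinlock.h>); the queue, the item type and the kfree() standing in
for do_stuff() are made up, this is not code from the series:

	#include <linux/cleanup.h>
	#include <linux/list.h>
	#include <linux/sched.h>
	#include <linux/slab.h>
	#include <linux/spinlock.h>

	struct work_item {
		struct list_head node;
		/* payload ... */
	};

	static DEFINE_SPINLOCK(queue_lock);
	static LIST_HEAD(queue);

	/* Drain the queue one item per lock hold, so the lock is dropped
	 * between iterations and other CPUs (or a pending reschedule) can
	 * get in. */
	static void drain_queue(void)
	{
		while (!list_empty_careful(&queue)) {
			scoped_guard(spinlock, &queue_lock) {
				struct work_item *item;

				/* Re-check under the lock; break only leaves
				 * the guard scope, the outer while () then
				 * terminates the loop. */
				item = list_first_entry_or_null(&queue,
						struct work_item, node);
				if (!item)
					break;
				list_del(&item->node);
				kfree(item);	/* stands in for do_stuff() */
			}
			cond_resched();		/* To be removed */
		}
	}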

>> Seriously, this crap preserving mindset has to stop. If we go new ways
>> then we go them all the way.
>
> It's not about "crap preserving" but more about taking smaller steps. Then we
> can see where a regression happened if one does come up. Kind of like how
> you did the x86 64bit/32bit merge: take steps that keep things as close to
> what they were at the start and slowly move toward your goals.

Correct. But if you want to do small steps then you have to look at all
the usage sites first and prepare them _before_ introducing the new
model. That's what I have done for the past 20 years.

The comparison to the 32/64bit merge is completely bogus because that
was just the purely mechanical collocation of the files to make it easy
to consolidate them afterwards. The consolidation was the real effort.

If you want a proper example then look at the CPU hotplug cleanup. There
was a large pile of preparatory patches before we even started to
convert to the state machine concept.

Look at all the other things we've done in the past 20 years of
refactoring to make RT possible. They all follow the same scheme:

   1) Analyze _all_ usage sites, i.e. make an inventory

   2) Define a migration path, i.e. come up with proper abstractions

   3) Convert the usage sites over to the new abstractions

   4) Replace the mechanics in the new abstractions

I certainly tried to short circuit in the early days, but I learned
instantaneously that the short circuit approach is doomed, especially
when you deal with changing the most fundamental parts of an OS.

Your idea of taking smaller steps is fundamentally flawed as it fails
to look at the bigger picture first and just tries to emulate the status
quo without doing the preparatory steps upfront.

Peter's approach is perfectly fine because it provides an opportunity
to do the analysis (#1) not only by inspection, but also by
observation, without being disruptive.

That seems to be the more painful approach, but I can assure you it's
less painful than the 'emulate crap just to make progress' approach.

Why?

It forces people to actually analyze the problems and not work around
them with yet another magic duct tape solution which is just as
ill-defined as cond_resched() or the hideous might_sleep() hack.

The goal is to reduce technical debt and not to replace it with a
different variant.

Thanks,

        tglx
Re: [PATCH 0/5] sched: Lazy preemption muck
Posted by Steven Rostedt 1 month, 2 weeks ago
On Thu, 10 Oct 2024 01:16:27 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> On Wed, Oct 09 2024 at 17:19, Steven Rostedt wrote:
> > On Wed, 09 Oct 2024 23:06:00 +0200
> > Thomas Gleixner <tglx@linutronix.de> wrote:  
> >> For the transition phase we obviously need to do:
> >> 
> >>     while ($cond) {
> >>           spin_lock(L);
> >>           do_stuff();
> >>           spin_unlock(L);
> >>           cond_resched();
> >>     }  
> >
> > But if $cond needs to be protected by spin_lock(), what then?
> >
> > 	spin_lock();
> > 	while ($cond) {
> > 		do_stuff();
> > 		spin_unlock();
> > 		spin_lock();
> > 	}
> > 	spin_unlock();
> >  
> 
> Seriously? The proper pattern for this is to do either:
> 
> 	while (READ_ONCE($cond)) {
> 		scoped_guard(spinlock)(L) {
>                 	if ($cond)
>                 		break;
> 			do_stuff();
>                 }
> 		cond_resched(); // To be removed
> 	}
> 
> or in case that $cond is more complex and needs lock protection:
> 
> 	while (true) {
> 		scoped_guard(spinlock)(L) {
>                 	if ($cond)
>                 		break;
> 			do_stuff();
>                 }
> 		cond_resched(); // To be removed
> 	}
> 
> You get the idea, no?

Basically still the same logic, just a different syntax. Speaking of which...

> 
> >> Seriously, this crap preserving mindset has to stop. If we go new ways
> >> then we go them all the way.  
> >
> > It's not about "crap preserving" but more about taking smaller steps. Then we
> > can see where a regression happened if one does come up. Kind of like how
> > you did the x86 64bit/32bit merge: take steps that keep things as close to
> > what they were at the start and slowly move toward your goals.
> 
> Correct. But if you want to do small steps then you have to look at all
> the usage sites first and prepare them _before_ introducing the new
> model. That's what I have done for the past 20 years.
> 
> The comparison to the 32/64bit merge is completely bogus because that
> was just the purely mechanical collocation of the files to make it easy
> to consolidate them afterwards. The consolidation was the real effort.
> 
> If you want a proper example then look at the CPU hotplug cleanup. There
> was a large pile of preparatory patches before we even started to
> convert to the state machine concept.
> 
> Look at all the other things we've done in the past 20 years of
> refactoring to make RT possible. They all follow the same scheme:
> 
>    1) Analyze _all_ usage sites, i.e. make an inventory
> 
>    2) Define a migration path, i.e. come up with proper abstractions
> 
>    3) Convert the usage sites over to the new abstractions
> 
>    4) Replace the mechanics in the new abstractions
> 
> I certainly tried to short circuit in the early days, but I learned
> instantaneously that the short circuit approach is doomed, especially
> when you deal with changing the most fundamental parts of an OS.
> 
> Your idea of taking smaller steps is fundamentally flawed as it fails
> to look at the bigger picture first and just tries to emulate the status
> quo without doing the preparatory steps upfront.
> 
> Peter's approach is perfectly fine because it provides an opportunity
> to do the analysis (#1) not only by inspection, but also by
> observation, without being disruptive.
> 
> That seems to be the more painful approach, but I can assure you it's
> less painful than the 'emulate crap just to make progress' approach.
> 
> Why?
> 
> It forces people to actually analyze the problems and not work around
> them with yet another magic duct tape solution which is just as
> ill-defined as cond_resched() or the hideous might_sleep() hack.
> 
> The goal is to reduce technical debt and not to replace it with a
> different variant.
> 

The above definitely sounds like a rationale to switch everything over to Rust!

  /me runs!!!!

-- Steve
Re: [PATCH 0/5] sched: Lazy preemption muck
Posted by Thomas Gleixner 1 month, 2 weeks ago
On Wed, Oct 09 2024 at 19:29, Steven Rostedt wrote:
> On Thu, 10 Oct 2024 01:16:27 +0200
> Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> On Wed, Oct 09 2024 at 17:19, Steven Rostedt wrote:
>> > On Wed, 09 Oct 2024 23:06:00 +0200
>> > Thomas Gleixner <tglx@linutronix.de> wrote:  
>> >> For the transition phase we obviously need to do:
>> >> 
>> >>     while ($cond) {
>> >>           spin_lock(L);
>> >>           do_stuff();
>> >>           spin_unlock(L);
>> >>           cond_resched();
>> >>     }  
>> >
>> > But if $cond needs to be protected by spin_lock(), what then?
>> >
>> > 	spin_lock();
>> > 	while ($cond) {
>> > 		do_stuff();
>> > 		spin_unlock();
>> > 		spin_lock();
>> > 	}
>> > 	spin_unlock();
>> >  
>> 
>> Seriously? The proper pattern for this is to do either:
>> 
>> 	while (READ_ONCE($cond)) {
>> 		scoped_guard(spinlock)(L) {
>>                 	if ($cond)
>>                 		break;
>> 			do_stuff();
>>                 }
>> 		cond_resched(); // To be removed
>> 	}
>> 
>> or in case that $cond is more complex and needs lock protection:
>> 
>> 	while (true) {
>> 		scoped_guard(spinlock)(L) {
>>                 	if ($cond)
>>                 		break;
>> 			do_stuff();
>>                 }
>> 		cond_resched(); // To be removed
>> 	}
>> 
>> You get the idea, no?
>
> Basically still the same logic, just a different syntax. Speaking of which...

Of course it's the same logic at the very end, but it's logic which is
understandable, makes sense and allows the underlying
infrastructure to be changed without disruption.

Refactoring is an art and if you get it wrong you actually make it
worse. Been there, done that.

>> It forces people to actually analyze the problems and not work around
>> them with yet another magic duct tape solution which is just as
>> ill-defined as cond_resched() or the hideous might_sleep() hack.
>> 
>> The goal is to reduce technical debt and not to replace it with a
>> different variant.
>> 
>
> The above definitely sounds like a rationale to switch everything over to Rust!
>
>   /me runs!!!!

Why are you running away?

I'm all for moving the kernel to a less error-prone programming
language. I've done ADA programming in my former professional life and
I'm well aware of the advantages of a "safe" programming language.

For me Rust is definitely the right way to go, aside from my personal
distaste for the syntax and the insanity of abandoning an existing
universe of ADA tools just because of 'not invented here', but that's a
different discussion to be had.

That said, converging to Rust requires a massive effort to be
disciplined on the C side in the first place. Otherwise the Rust
abstractions will just end up being a mostly useless wrapper around the
existing C insanities. Which in turn will neither reduce technical debt,
nor live up to the promise of squashing a whole class of bugs.

Quite the contrary, it will exacerbate the problems we already have
instead of reducing them, while everyone can pretend that we are so much
better off.

TBH, converging to Rust is a real chance to move away from the "features
first, correctness later" approach, but the way our community is
approaching it right now is just following the old scheme by making Rust
compatible with the "features first, correctness later" mantra.

If you are still telling me that:

>> > 	spin_lock();
>> > 	while ($cond) {
>> > 		do_stuff();
>> > 		spin_unlock();
>> > 		spin_lock();
>> > 	}
>> > 	spin_unlock();

is just different syntax than

>> 	while (true) {
>> 		scoped_guard(spinlock)(L) {
>>                 	if ($cond)
>>                 		break;
>> 			do_stuff();
>>             }
>> 	}

then please explain to me how you map your proposed scheme to a language
which is scope based by design.

Thanks,

        tglx
Re: [PATCH 0/5] sched: Lazy preemption muck
Posted by Sebastian Andrzej Siewior 1 month, 2 weeks ago
On 2024-10-09 10:02:02 [+0200], Peter Zijlstra wrote:
> > There are places such as __clear_extent_bit() or select_collect() where
> > need_resched() is checked and, if 0, they loop again. For these kinds of
> > users it would probably make sense to allow them to preempt themselves.
> > We could also add a new function which checks both, audit all users
> > and check what would make sense based on $criteria.
> 
> Do we really need this -- wasn't the idea to have things 'delay' until
> the actual NEED_RESCHED bit gets set?

Not sure. They asked for it and have a lock acquired. But I guess it
could make sense to do as much work as possible until they finally have
to go.
This is what we used to have in the RT tree; I'm not complaining, just
comparing notes.
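
For reference, the shape of such a need_resched()-checking user is roughly
the following -- a sketch with made-up names, not the actual
__clear_extent_bit()/select_collect() code:

	#include <linux/sched.h>
	#include <linux/spinlock.h>

	static DEFINE_SPINLOCK(scan_lock);
	static unsigned long remaining;		/* hypothetical work left */

	static void do_one_chunk(void)
	{
		remaining--;
	}

	/* Batch as much work as possible under the lock and only bail out
	 * when a reschedule is due; the caller is expected to retry later.
	 * With the lazy model as posted, need_resched() only reports the
	 * immediate TIF_NEED_RESCHED bit, so a lazy request will not
	 * interrupt this loop until it gets upgraded -- which is exactly
	 * the behaviour being discussed here. */
	static bool scan_some(void)
	{
		bool done;

		spin_lock(&scan_lock);
		while (remaining) {
			do_one_chunk();
			if (need_resched())
				break;
		}
		done = (remaining == 0);
		spin_unlock(&scan_lock);

		return done;
	}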

Is there any punishment from the scheduler if they get preempted vs.
leaving early? Or is it just that they use up their time slice and become
less eligible next time?

Let me do some more testing, this looks good so far.

Sebastian
Re: [PATCH 0/5] sched: Lazy preemption muck
Posted by Ankur Arora 1 month, 2 weeks ago
Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> On 2024-10-08 21:40:05 [-0700], Ankur Arora wrote:
>> > While comparing this vs what I have:
>> > - need_resched()
>> >   It checked both (tif_need_resched_lazy() || tif_need_resched()) while
>> >   now it only looks at tif_need_resched().
>> >   Also ensured that raw_irqentry_exit_cond_resched() does not trigger on
>> >   lazy.
>> >   I guess you can argue both ways what makes sense, just noting…
>>
>> I think we want need_resched() to be only tif_need_resched(). That way
>> preemption in lazy mode *only* happens at the user mode boundary.
>
> There are places such as __clear_extent_bit() or select_collect() where
> need_resched() is checked and, if 0, they loop again. For these kinds of
> users it would probably make sense to allow them to preempt themselves.
> We could also add a new function which checks both, audit all users
> and check what would make sense based on $criteria.

Yeah, I remember having the same thought. But the problem is that the
need_resched() checks are all over the kernel. And figuring out a good
criterion for each of them seems like it might be similar to the
placement problem for cond_resched() -- both being workload dependent.

And, given that the maximum time in the lazy state is limited, it seems
like it'll be simplest to just circumscribe the time spent in the lazy
state by upgrading to TIF_NEED_RESCHED based on some time limit.

That seems to do the job quite well, as Thomas' hog example showed:
 https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
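
A rough sketch of that idea -- not the actual patch, the helper name is made
up and TIF_NEED_RESCHED_LAZY is the flag introduced by this series -- would
be to promote a still-pending lazy request to a full reschedule from the
tick, bounding the extra latency to roughly one tick:

	#include <linux/sched.h>

	/* Hypothetical helper called from the scheduler tick for the current
	 * task: if it was asked to reschedule lazily and has not done so yet,
	 * upgrade the request to an immediate TIF_NEED_RESCHED so it cannot
	 * run much longer than one tick past the lazy request. */
	static void upgrade_lazy_resched(struct task_struct *curr)
	{
		if (test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY) &&
		    !test_tsk_thread_flag(curr, TIF_NEED_RESCHED))
			set_tsk_need_resched(curr);
	}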

--
ankur
Re: [PATCH 0/5] sched: Lazy preemption muck
Posted by Mike Galbraith 1 month, 2 weeks ago
On Mon, 2024-10-07 at 09:46 +0200, Peter Zijlstra wrote:
> Hi!
>
> During LPC Thomas reminded me that the lazy preemption stuff was not there yet.
>
> So here goes, robot says it builds, and I checked both a regular and PREEMPT_RT
> build boots and can change the mode.
>
> Please have a poke.

My box seems content.

I'm gonna miss VOLATILE (voluntary;) when it's eventually whacked, but
not a lot.. or at all in the general case; security and whatnot have, over
time, beaten up high-end switchers far worse.

tbench 8 30s - single run, box=i7-4790

master static
  voluntary     3620.45 MB/sec   1.000   mean3 3613.45  stddev3 14.08
  voluntary     2706.54 MB/sec    .747   +cpu_mitigations
  voluntary     4028.72 MB/sec   1.112   nostalgia (4.4.231)
  preempt       3449.35 MB/sec    .952
  none          3631.99 MB/sec   1.003

master dynamic
  none          3548.19 MB/sec    .980
  voluntary     3495.77 MB/sec    .965
  full          3476.50 MB/sec    .960
  lazy          3473.95 MB/sec    .959
  laziest       3492.09 MB/sec    .964

master dynamic rt
  full          2352.58 MB/sec    .649
  lazy          2986.28 MB/sec    .824
  laziest       2977.63 MB/sec    .822

distro kernel dynamic
  none          1744.51 MB/sec    .481
  none          2189.74 MB/sec    .604  mitigations=off
Re: [PATCH 0/5] sched: Lazy preemption muck
Posted by Sebastian Andrzej Siewior 1 month, 3 weeks ago
On 2024-10-07 09:46:09 [+0200], Peter Zijlstra wrote:
> Hi!
Hi,

> During LPC Thomas reminded me that the lazy preemption stuff was not there yet.
Oh, thank you.
I still have an older version in my tree. I will take a look and get back
to you.

Sebastian