[PATCH 13/15] x86/kconfig/64: Enable popular scheduler, cgroups and namespaces options in the defconfig

Ingo Molnar posted 15 patches 7 months, 2 weeks ago
Posted by Ingo Molnar 7 months, 2 weeks ago
Since the x86 defconfig aims to be a distro kernel work-alike with
fewer drivers and a shorter build time, enable a handful of
popular scheduler and cgroups options that are typically enabled
on major Linux distributions.

The options enabled are a superset of the latest Ubuntu and Fedora
kernel debugging configs, using Ubuntu's config-6.11.0-24-generic
file, Fedora's kernel-x86_64-fedora.config and RHEL's
kernel-x86_64-rhel.config from kernel-ark.git.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Michal Marek <michal.lkml@markovi.net>
---
 arch/x86/configs/defconfig.x86_64 | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/x86/configs/defconfig.x86_64 b/arch/x86/configs/defconfig.x86_64
index 3c4a03633328..225aed921e21 100644
--- a/arch/x86/configs/defconfig.x86_64
+++ b/arch/x86/configs/defconfig.x86_64
@@ -2,6 +2,7 @@ CONFIG_WERROR=y
 CONFIG_SYSVIPC=y
 CONFIG_POSIX_MQUEUE=y
 CONFIG_AUDIT=y
+# CONFIG_CONTEXT_TRACKING_USER_FORCE is not set
 CONFIG_NO_HZ=y
 CONFIG_HIGH_RES_TIMERS=y
 CONFIG_BPF_SYSCALL=y
@@ -11,26 +12,45 @@ CONFIG_BPF_PRELOAD=y
 CONFIG_BPF_PRELOAD_UMD=y
 CONFIG_BPF_LSM=y
 CONFIG_PREEMPT_VOLUNTARY=y
+CONFIG_SCHED_CORE=y
+CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
+CONFIG_IRQ_TIME_ACCOUNTING=y
 CONFIG_BSD_PROCESS_ACCT=y
+CONFIG_BSD_PROCESS_ACCT_V3=y
 CONFIG_TASKSTATS=y
 CONFIG_TASK_DELAY_ACCT=y
 CONFIG_TASK_XACCT=y
 CONFIG_TASK_IO_ACCOUNTING=y
+CONFIG_PSI=y
+CONFIG_PSI_DEFAULT_DISABLED=y
 CONFIG_LOG_BUF_SHIFT=18
-CONFIG_CGROUPS=y
+CONFIG_PRINTK_INDEX=y
+CONFIG_UCLAMP_TASK=y
+CONFIG_NUMA_BALANCING=y
+CONFIG_MEMCG=y
+CONFIG_MEMCG_V1=y
 CONFIG_BLK_CGROUP=y
-CONFIG_CGROUP_SCHED=y
+CONFIG_CFS_BANDWIDTH=y
+CONFIG_UCLAMP_TASK_GROUP=y
 CONFIG_CGROUP_PIDS=y
 CONFIG_CGROUP_RDMA=y
+CONFIG_CGROUP_DMEM=y
 CONFIG_CGROUP_FREEZER=y
 CONFIG_CGROUP_HUGETLB=y
 CONFIG_CPUSETS=y
+CONFIG_CPUSETS_V1=y
 CONFIG_CGROUP_DEVICE=y
 CONFIG_CGROUP_CPUACCT=y
 CONFIG_CGROUP_PERF=y
 CONFIG_CGROUP_BPF=y
 CONFIG_CGROUP_MISC=y
 CONFIG_CGROUP_DEBUG=y
+CONFIG_NAMESPACES=y
+CONFIG_USER_NS=y
+CONFIG_CHECKPOINT_RESTORE=y
+CONFIG_SCHED_AUTOGROUP=y
+CONFIG_SYSFS_SYSCALL=y
+CONFIG_EXPERT=y
 CONFIG_KALLSYMS_ALL=y
 CONFIG_PROFILING=y
 CONFIG_KEXEC=y
@@ -305,7 +325,6 @@ CONFIG_LIST_HARDENED=y
 CONFIG_PRINTK_TIME=y
 CONFIG_BOOT_PRINTK_DELAY=y
 CONFIG_DYNAMIC_DEBUG=y
-CONFIG_DEBUG_KERNEL=y
 CONFIG_STRIP_ASM_SYMS=y
 CONFIG_HEADERS_INSTALL=y
 CONFIG_DEBUG_SECTION_MISMATCH=y
-- 
2.45.2
Re: [PATCH 13/15] x86/kconfig/64: Enable popular scheduler, cgroups and namespaces options in the defconfig
Posted by Michal Koutný 6 months, 3 weeks ago
On Tue, May 06, 2025 at 07:09:22PM +0200, Ingo Molnar <mingo@kernel.org> wrote:
> +CONFIG_MEMCG_V1=y

Ugh.

> +CONFIG_CPUSETS_V1=y

Ugh.

Those config options were introduced to retire old code (their Kconfig
defaults are N).

I'd prefer if these defaults matched the Kconfig ones (and leave it up
to distros if they need them some time longer).

Thanks,
Michal
Re: [PATCH 13/15] x86/kconfig/64: Enable popular scheduler, cgroups and namespaces options in the defconfig
Posted by Arnd Bergmann 7 months, 2 weeks ago
On Tue, May 6, 2025, at 19:09, Ingo Molnar wrote:
> Since the x86 defconfig aims to be a distro kernel work-alike with
> fewer drivers and a shorter build time, enable a handful of
> popular scheduler and cgroups options that are typically enabled
> on major Linux distributions.
>
> The options enabled are a superset of the latest Ubuntu and Fedora
> kernel debugging configs, using Ubuntu's config-6.11.0-24-generic
> file, Fedora's kernel-x86_64-fedora.config and RHEL's
> kernel-x86_64-rhel.config from kernel-ark.git.

I think having a way to get something close to a distro config
is super useful for common options like this, but I wonder if
we could turn this into a kernel/configs/*.config fragment
instead that gets shared across architectures.
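
As a rough illustration of the fragment idea, the sketch below overlays a
shared fragment on top of an architecture defconfig, with later entries
winning. It is only a model of the override step: the file contents are
hypothetical, the helper names are made up for this example, and it does no
dependency resolution (in a real tree the merged result would still have to
go through Kconfig, for example via scripts/kconfig/merge_config.sh).

```python
import re

# Matches "CONFIG_FOO=value" and "# CONFIG_FOO is not set" lines.
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Return {symbol: value}; an explicit 'is not set' line maps to 'n'."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        m = _SET.match(line)
        if m:
            result[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            result[m.group(1)] = "n"
    return result


def merge_fragments(base, *fragments):
    """Overlay fragments on a base config; later entries win."""
    merged = dict(parse_config(base))
    for fragment in fragments:
        merged.update(parse_config(fragment))
    return merged


# Hypothetical contents, loosely based on options from this patch.
arch_defconfig = (
    "CONFIG_SYSVIPC=y\n"
    "CONFIG_LOG_BUF_SHIFT=18\n"
    "# CONFIG_PSI is not set\n"
)
distro_fragment = (
    "CONFIG_PSI=y\n"
    "CONFIG_PSI_DEFAULT_DISABLED=y\n"
    "CONFIG_NAMESPACES=y\n"
)

merged = merge_fragments(arch_defconfig, distro_fragment)
```

With this shape, the scheduler, cgroups and namespaces options would live in
one fragment and each architecture defconfig would only carry what is
specific to it.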

> +CONFIG_SYSFS_SYSCALL=y
> +CONFIG_EXPERT=y
>  CONFIG_KALLSYMS_ALL=y
>  CONFIG_PROFILING=y

I really don't like enabling CONFIG_EXPERT=y in a generic
defconfig. What changes if you turn this off?

Based on the help text for CONFIG_EXPERT, nothing we
consider the default should ever be guarded by it. If there
is something that distros commonly enable that is prevented by
EXPERT=n, it would be better to relax the dependency on that
particular thing.

    Arnd
Re: [PATCH 13/15] x86/kconfig/64: Enable popular scheduler, cgroups and namespaces options in the defconfig
Posted by Ingo Molnar 7 months, 2 weeks ago
* Arnd Bergmann <arnd@arndb.de> wrote:

> > +CONFIG_SYSFS_SYSCALL=y
> > +CONFIG_EXPERT=y
> >  CONFIG_KALLSYMS_ALL=y
> >  CONFIG_PROFILING=y
> 
> I really don't like enabling CONFIG_EXPERT=y in a generic
> defconfig. What changes if you turn this off?

That's a good question.

Disabling it gives me material changes for 4 options:

	--- .config.before
	+++ .config.after
	-CONFIG_EXPERT=y
	-CONFIG_ARCH_HAS_ZONE_DMA_SET=y
	+CONFIG_RFKILL_INPUT=y
	-CONFIG_PCIE_BUS_DEFAULT=y
	+CONFIG_DEBUG_MEMORY_INIT=y
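
A comparison like the one above can be reproduced with a few lines of
scripting. The sketch below is only an illustration: the helper names are
made up, the sample data is the four options from the listing plus one
unchanged line, and it only looks at "CONFIG_FOO=value" lines. The kernel
tree also ships scripts/diffconfig for the same purpose.

```python
import re

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")


def enabled_symbols(text):
    """Map each 'CONFIG_FOO=value' line to {CONFIG_FOO: value}."""
    symbols = {}
    for line in text.splitlines():
        m = _SET.match(line.strip())
        if m:
            symbols[m.group(1)] = m.group(2)
    return symbols


def config_delta(before_text, after_text):
    """Return (removed, added, changed) symbol lists between two configs."""
    before = enabled_symbols(before_text)
    after = enabled_symbols(after_text)
    removed = sorted(set(before) - set(after))
    added = sorted(set(after) - set(before))
    changed = sorted(
        name for name in set(before) & set(after)
        if before[name] != after[name]
    )
    return removed, added, changed


# Sample data only: the options from the listing plus one unchanged line.
before = """CONFIG_EXPERT=y
CONFIG_ARCH_HAS_ZONE_DMA_SET=y
CONFIG_PCIE_BUS_DEFAULT=y
CONFIG_PCI=y
"""
after = """CONFIG_RFKILL_INPUT=y
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_PCI=y
"""

removed, added, changed = config_delta(before, after)
```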

1) CONFIG_DEBUG_MEMORY_INIT

The CONFIG_DEBUG_MEMORY_INIT default is super weird:

  config DEBUG_MEMORY_INIT
        bool "Debug memory initialisation" if EXPERT
        default !EXPERT

I think this might in fact be a bug, and Ubuntu might have fallen victim to 
it:

  .config.fedora: CONFIG_DEBUG_MEMORY_INIT=y
  .config.ubuntu: # CONFIG_DEBUG_MEMORY_INIT is not set

I believe this should be 'default y', or 'default n'.
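
To make the effect of the 'if EXPERT' prompt combined with 'default !EXPERT'
concrete, here is a deliberately simplified model of how such a bool symbol
gets its value. It is a sketch under assumptions, not the Kconfig
implementation: it ignores 'select', dependencies and tristates, and the
function names are invented for this example.

```python
def resolve_bool(prompt_visible, default, user_value=None):
    """Simplified model of how a bool symbol gets its value.

    A user-supplied value only counts while the prompt is visible;
    otherwise the default expression decides.
    """
    if prompt_visible and user_value is not None:
        return user_value
    return default


def debug_memory_init(expert, user_value=None):
    # bool "Debug memory initialisation" if EXPERT -> prompt visible iff EXPERT
    # default !EXPERT                              -> default is the inverse
    return resolve_bool(
        prompt_visible=expert,
        default=not expert,
        user_value=user_value,
    )


outcomes = {
    "EXPERT=n, nothing chosen": debug_memory_init(expert=False),
    "EXPERT=y, nothing chosen": debug_memory_init(expert=True),
    "EXPERT=y, user says y": debug_memory_init(expert=True, user_value=True),
}
```

In this model, turning EXPERT on without answering the prompt flips the
option from y to n, which would be consistent with a config that has
EXPERT=y but leaves DEBUG_MEMORY_INIT unset.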

2) CONFIG_ARCH_HAS_ZONE_DMA_SET

This one is an interim Kconfig helper flag, and it's a bit weird as 
well:

  arch/x86/Kconfig:       select ARCH_HAS_ZONE_DMA_SET if EXPERT

I *think* the intent here is to make configurability of ZONE_DMA and 
ZONE_DMA32 dependent on EXPERT, while still giving architectures an 
opt-in as well:

 config ZONE_DMA
        bool "Support DMA zone" if ARCH_HAS_ZONE_DMA_SET
        default y if ARM64 || X86

 config ZONE_DMA32
        bool "Support DMA32 zone" if ARCH_HAS_ZONE_DMA_SET
        depends on !X86_32
        default y if ARM64

I think the better approach would be to apply the EXPERT policy at the 
ZONE_DMA and ZONE_DMA32 level:

        bool "Support DMA zone" if ARCH_HAS_ZONE_DMA_SET && EXPERT

but it should be functionally equivalent.

3) RFKILL_INPUT

I think this one's a bug too:

 config RFKILL_INPUT
        bool "RF switch input support" if EXPERT
        depends on RFKILL
        depends on INPUT = y || RFKILL = INPUT
        default y if !EXPERT

Basically if you turn on EXPERT, the default changes from Y to N.

I think this should be a plain 'default y'.

4) CONFIG_PCIE_BUS_DEFAULT

I think this is quite confusing code as well:

  choice
        prompt "PCI Express hierarchy optimization setting"
        default PCIE_BUS_DEFAULT
        depends on PCI && EXPERT
        help
  ...

  config PCIE_BUS_DEFAULT
        bool "Default"
        depends on PCI
        help
          Default choice; ensure that the MPS matches upstream bridge.

  ...
  endchoice

So the intent here is clearly to steer users towards picking 
PCIE_BUS_DEFAULT.

But the 'depends' line turns off the option entirely on !EXPERT.

Which happens to work due to how the config options are used by the PCI 
code:

  #ifdef CONFIG_PCIE_BUS_TUNE_OFF
  enum pcie_bus_config_types pcie_bus_config = PCIE_BUS_TUNE_OFF;
  #elif defined CONFIG_PCIE_BUS_SAFE
  enum pcie_bus_config_types pcie_bus_config = PCIE_BUS_SAFE;
  #elif defined CONFIG_PCIE_BUS_PERFORMANCE
  enum pcie_bus_config_types pcie_bus_config = PCIE_BUS_PERFORMANCE;
  #elif defined CONFIG_PCIE_BUS_PEER2PEER
  enum pcie_bus_config_types pcie_bus_config = PCIE_BUS_PEER2PEER;
  #else
  enum pcie_bus_config_types pcie_bus_config = PCIE_BUS_DEFAULT;
  #endif

But this is highly unintuitive IMO. A cleaner implementation would be 
to always have CONFIG_PCIE_BUS_DEFAULT enabled on !EXPERT, which can be 
done by making the configurability of the choice-list depend on EXPERT:

  choice
        prompt "PCI Express hierarchy optimization setting" if EXPERT
        default PCIE_BUS_DEFAULT
        depends on PCI

> Based on the help text for CONFIG_EXPERT, nothing we
> consider the default should ever be guarded by it. If there
> is something that distros commonly enable that is prevented by
> EXPERT=n, it would be better to relax the dependency on that
> particular thing.

I think distro kernel maintainers mainly inherited their old configs 
and aren't afraid of CONFIG_EXPERT.

Thus *all* major distros I checked have CONFIG_EXPERT enabled: Ubuntu, 
Fedora, Debian, you name it. So literally over 99% of our users use a 
kernel that has CONFIG_EXPERT=y in it. Which is perfectly fine, distro 
kernel maintainers *are* the ultimate experts in this matter - but 
their choices inevitably make it to users configuring their own 
kernels: if users type 'make localmodconfig' they'll have 
CONFIG_EXPERT=y.

So I don't think we should ostracize CONFIG_EXPERT too much. :)

Otherwise I think you were right: 2 out of 4 of the configuration 
settings that change due to EXPERT are outright bugs IMO, the other 2 
are weird code that could be done in a more standard fashion, resulting 
in an invariant .config when EXPERT is toggled on/off.

Also, I kinda don't mind having CONFIG_EXPERT=y in the kernel 
defconfig: it's a helper config for *kernel developers* who want to 
have finegrained control over debug facilities and other details, it's 
not something for users - the resulting kernels won't result in a fully 
working system on modern x86 systems.

Thanks,

	Ingo
Re: [PATCH 13/15] x86/kconfig/64: Enable popular scheduler, cgroups and namespaces options in the defconfig
Posted by Yafang Shao 7 months, 2 weeks ago
Hello Mingo,

 > +CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
 > +CONFIG_IRQ_TIME_ACCOUNTING=y

Enabling CONFIG_IRQ_TIME_ACCOUNTING=y can lead to user-visible 
behavioral changes. For more context, please refer to the related 
discussion here: 
https://lore.kernel.org/all/20241222024734.63894-1-laoar.shao@gmail.com/ .

If we decide to enable it by default, we should clearly document this 
behavior change. Below is the patch I wrote earlier but haven’t sent out 
for review yet.

----

Subject: [PATCH] init/Kconfig: document behavior change when enabling
  IRQ_TIME_ACCOUNTING

After we enabled CONFIG_IRQ_TIME_ACCOUNTING, we noticed that the IRQ
usage is not accounted to the tasks and thus not accounted to the CPU
cgroup either. This behavior change resulted in issues [0] on our
production servers, and we eventually had to revert it.

We'd better clearly document this behavior change in case it might
matter to the user.

Link: 
https://lore.kernel.org/all/20241222024734.63894-1-laoar.shao@gmail.com/ [0]
Suggested-by: "Michal Koutný" <mkoutny@suse.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
---
  init/Kconfig | 8 ++++++++
  1 file changed, 8 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index a20e6efd3f0f..191df0b5cf1c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -563,6 +563,14 @@ config IRQ_TIME_ACCOUNTING
        transitions between softirq and hardirq state, so there can be a
        small performance impact.

+      Enabling IRQ_TIME_ACCOUNTING excludes IRQ usage from the CPU usage
+      statistics of individual tasks and, consequently, it is not accounted
+      for in CPU cgroups. As a result, a task's CPU usage will accurately
+      reflect only its user time and system time. IRQ usage is instead
+      attributed at the global level and can be observed in metrics such as
+      /proc/stat or, potentially, at the cgroup level in files like
+      irq.pressure.
+
        If in doubt, say N here.

  config HAVE_SCHED_AVG_IRQ
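
For reference, the global IRQ and softirq counters that the help text points
to can be read from the aggregate "cpu" line of /proc/stat. The following is
a minimal sketch that parses a made-up sample line rather than the live file;
the function names are hypothetical, and the values are in USER_HZ ticks.

```python
# Field order of the "cpu" lines in /proc/stat.
CPU_FIELDS = (
    "user", "nice", "system", "idle", "iowait",
    "irq", "softirq", "steal", "guest", "guest_nice",
)


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line into a {field: ticks} dict."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:]]
    return dict(zip(CPU_FIELDS, values))


def irq_ticks(sample):
    """Ticks spent in hardirq plus softirq context, system-wide."""
    return sample["irq"] + sample["softirq"]


# Made-up sample; on a live system this would be the first line of /proc/stat.
sample_line = "cpu  4705 356 584 3699176 23060 12 34 0 0 0"
sample = parse_cpu_line(sample_line)
total_irq = irq_ticks(sample)
```

Tooling that used to infer IRQ load from per-task or per-cgroup CPU time
would need to sample counters like these instead once IRQ_TIME_ACCOUNTING is
enabled.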
Re: [PATCH 13/15] x86/kconfig/64: Enable popular scheduler, cgroups and namespaces options in the defconfig
Posted by Ingo Molnar 7 months, 2 weeks ago
* Yafang Shao <laoar.shao@gmail.com> wrote:

> Hello Mingo,
> 
> > +CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
> > +CONFIG_IRQ_TIME_ACCOUNTING=y
> 
> Enabling CONFIG_IRQ_TIME_ACCOUNTING=y can lead to user-visible behavioral
> changes. For more context, please refer to the related discussion here:
> https://lore.kernel.org/all/20241222024734.63894-1-laoar.shao@gmail.com/ .

Yeah. I actually agree with your series. It (re-)includes IRQ/softirq 
time in task CPU usage statistics even under IRQ_TIME_ACCOUNTING=y, 
while still keeping the finegrained IRQ/softirq statistics as well, 
correct?

The Kconfig option is also arguably rather misleading:

config IRQ_TIME_ACCOUNTING
        bool "Fine granularity task level IRQ time accounting"
        depends on HAVE_IRQ_TIME_ACCOUNTING && !VIRT_CPU_ACCOUNTING_NATIVE
        help
          Select this option to enable fine granularity task irq time
          accounting. This is done by reading a timestamp on each
          transitions between softirq and hardirq state, so there can be a
          small performance impact.

It only warns about a small performance impact, but doesn't warn that 
CPU accounting is changed in an incompatible fashion that surprises 
tooling...

But I think we should probably treat this as a bug, not as lack of 
documentation. Peter, do you concur?

> If we decide to enable it by default, we should clearly document this 
> behavior change. Below is the patch I wrote earlier but haven’t sent 
> out for review yet.

Note that it's not enabled by default - this patch is just about the 
x86 defconfig.

Thanks,

	Ingo
Re: [PATCH 13/15] x86/kconfig/64: Enable popular scheduler, cgroups and namespaces options in the defconfig
Posted by Yafang Shao 7 months, 2 weeks ago
On Wed, May 7, 2025 at 3:06 PM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Yafang Shao <laoar.shao@gmail.com> wrote:
>
> > Hello Mingo,
> >
> > > +CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
> > > +CONFIG_IRQ_TIME_ACCOUNTING=y
> >
> > Enabling CONFIG_IRQ_TIME_ACCOUNTING=y can lead to user-visible behavioral
> > changes. For more context, please refer to the related discussion here:
> > https://lore.kernel.org/all/20241222024734.63894-1-laoar.shao@gmail.com/ .
>
> Yeah. I actually agree with your series. It (re-)includes IRQ/softirq
> time in task CPU usage statistics even under IRQ_TIME_ACCOUNTING=y,
> while still keeping the finegrained IRQ/softirq statistics as well,
> correct?

Correct.

>
> The Kconfig option is also arguably rather misleading:
>
> config IRQ_TIME_ACCOUNTING
>         bool "Fine granularity task level IRQ time accounting"
>         depends on HAVE_IRQ_TIME_ACCOUNTING && !VIRT_CPU_ACCOUNTING_NATIVE
>         help
>           Select this option to enable fine granularity task irq time
>           accounting. This is done by reading a timestamp on each
>           transitions between softirq and hardirq state, so there can be a
>           small performance impact.
>
> It only warns about a small performance impact, but doesn't warn that
> CPU accounting is changed in an incompatible fashion that surprises
> tooling...

Yes, this breaks our userspace tools.


>
> But I think we should probably treat this as a bug, not as lack of
> documentation. Peter, do you concur?
>
> > If we decide to enable it by default, we should clearly document this
> > behavior change. Below is the patch I wrote earlier but haven’t sent
> > out for review yet.
>
> Note that it's not enabled by default - this patch is just about the
> x86 defconfig.
>
> Thanks,
>
>         Ingo



--
Regards
Yafang
Re: [PATCH 13/15] x86/kconfig/64: Enable popular scheduler, cgroups and namespaces options in the defconfig
Posted by Ingo Molnar 7 months, 2 weeks ago
* Yafang Shao <laoar.shao@gmail.com> wrote:

> On Wed, May 7, 2025 at 3:06 PM Ingo Molnar <mingo@kernel.org> wrote:
> >
> >
> > * Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > > Hello Mingo,
> > >
> > > > +CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
> > > > +CONFIG_IRQ_TIME_ACCOUNTING=y
> > >
> > > Enabling CONFIG_IRQ_TIME_ACCOUNTING=y can lead to user-visible behavioral
> > > changes. For more context, please refer to the related discussion here:
> > > https://lore.kernel.org/all/20241222024734.63894-1-laoar.shao@gmail.com/ .
> >
> > Yeah. I actually agree with your series. It (re-)includes IRQ/softirq
> > time in task CPU usage statistics even under IRQ_TIME_ACCOUNTING=y,
> > while still keeping the finegrained IRQ/softirq statistics as well,
> > correct?
> 
> Correct.
> 
> >
> > The Kconfig option is also arguably rather misleading:
> >
> > config IRQ_TIME_ACCOUNTING
> >         bool "Fine granularity task level IRQ time accounting"
> >         depends on HAVE_IRQ_TIME_ACCOUNTING && !VIRT_CPU_ACCOUNTING_NATIVE
> >         help
> >           Select this option to enable fine granularity task irq time
> >           accounting. This is done by reading a timestamp on each
> >           transitions between softirq and hardirq state, so there can be a
> >           small performance impact.
> >
> > It only warns about a small performance impact, but doesn't warn that
> > CPU accounting is changed in an incompatible fashion that surprises
> > tooling...
> 
> Yes, this breaks our userspace tools.

Okay, so 2 out of your 3 fixes are upstream already:

  763a744e24a8 ("sched: Don't account irq time if sched_clock_irqtime is disabled")
  a6fd16148fdd ("sched, psi: Don't account irq time if sched_clock_irqtime is disabled")

But we don't have this one yet:

  [PATCH v8 4/4] sched: Fix cgroup irq time for CONFIG_IRQ_TIME_ACCOUNTING

  https://lore.kernel.org/r/20250103022409.2544-5-laoar.shao@gmail.com

which is also essential to fully fix the tooling regression, right?

I think this last patch fell between the cracks, I didn't see any 
fundamental objections against the fix.

Since the patch does not apply cleanly anymore, mind sending a fresh 
-v9 version against v6.15-rc5 or so?

Thanks,

	Ingo
Re: [PATCH 13/15] x86/kconfig/64: Enable popular scheduler, cgroups and namespaces options in the defconfig
Posted by Yafang Shao 7 months ago
On Thu, May 8, 2025 at 12:23 AM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Yafang Shao <laoar.shao@gmail.com> wrote:
>
> > On Wed, May 7, 2025 at 3:06 PM Ingo Molnar <mingo@kernel.org> wrote:
> > >
> > >
> > > * Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > > Hello Mingo,
> > > >
> > > > > +CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
> > > > > +CONFIG_IRQ_TIME_ACCOUNTING=y
> > > >
> > > > Enabling CONFIG_IRQ_TIME_ACCOUNTING=y can lead to user-visible behavioral
> > > > changes. For more context, please refer to the related discussion here:
> > > > https://lore.kernel.org/all/20241222024734.63894-1-laoar.shao@gmail.com/ .
> > >
> > > Yeah. I actually agree with your series. It (re-)includes IRQ/softirq
> > > time in task CPU usage statistics even under IRQ_TIME_ACCOUNTING=y,
> > > while still keeping the finegrained IRQ/softirq statistics as well,
> > > correct?
> >
> > Correct.
> >
> > >
> > > The Kconfig option is also arguably rather misleading:
> > >
> > > config IRQ_TIME_ACCOUNTING
> > >         bool "Fine granularity task level IRQ time accounting"
> > >         depends on HAVE_IRQ_TIME_ACCOUNTING && !VIRT_CPU_ACCOUNTING_NATIVE
> > >         help
> > >           Select this option to enable fine granularity task irq time
> > >           accounting. This is done by reading a timestamp on each
> > >           transitions between softirq and hardirq state, so there can be a
> > >           small performance impact.
> > >
> > > It only warns about a small performance impact, but doesn't warn that
> > > CPU accounting is changed in an incompatible fashion that surprises
> > > tooling...
> >
> > Yes, this breaks our userspace tools.
>
> Okay, so 2 out of your 3 fixes are upstream already:
>
>   763a744e24a8 ("sched: Don't account irq time if sched_clock_irqtime is disabled")
>   a6fd16148fdd ("sched, psi: Don't account irq time if sched_clock_irqtime is disabled")
>
> But we don't have this one yet:
>
>   [PATCH v8 4/4] sched: Fix cgroup irq time for CONFIG_IRQ_TIME_ACCOUNTING
>
>   https://lore.kernel.org/r/20250103022409.2544-5-laoar.shao@gmail.com
>
> which is also essential to fully fix the tooling regression, right?
>
> I think this last patch fell between the cracks, I didn't see any
> fundamental objections against the fix.
>
> Since the patch does not apply cleanly anymore, mind sending a fresh
> -v9 version against v6.15-rc5 or so?

Hello Ingo,

I have sent the v9:
https://lore.kernel.org/all/20250511030800.1900-1-laoar.shao@gmail.com/

Could you please help review this? I’d appreciate your feedback.

-- 
Regards
Yafang
Re: [PATCH 13/15] x86/kconfig/64: Enable popular scheduler, cgroups and namespaces options in the defconfig
Posted by Yafang Shao 7 months, 2 weeks ago
On Thu, May 8, 2025 at 12:23 AM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Yafang Shao <laoar.shao@gmail.com> wrote:
>
> > On Wed, May 7, 2025 at 3:06 PM Ingo Molnar <mingo@kernel.org> wrote:
> > >
> > >
> > > * Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > > Hello Mingo,
> > > >
> > > > > +CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
> > > > > +CONFIG_IRQ_TIME_ACCOUNTING=y
> > > >
> > > > Enabling CONFIG_IRQ_TIME_ACCOUNTING=y can lead to user-visible behavioral
> > > > changes. For more context, please refer to the related discussion here:
> > > > https://lore.kernel.org/all/20241222024734.63894-1-laoar.shao@gmail.com/ .
> > >
> > > Yeah. I actually agree with your series. It (re-)includes IRQ/softirq
> > > time in task CPU usage statistics even under IRQ_TIME_ACCOUNTING=y,
> > > while still keeping the finegrained IRQ/softirq statistics as well,
> > > correct?
> >
> > Correct.
> >
> > >
> > > The Kconfig option is also arguably rather misleading:
> > >
> > > config IRQ_TIME_ACCOUNTING
> > >         bool "Fine granularity task level IRQ time accounting"
> > >         depends on HAVE_IRQ_TIME_ACCOUNTING && !VIRT_CPU_ACCOUNTING_NATIVE
> > >         help
> > >           Select this option to enable fine granularity task irq time
> > >           accounting. This is done by reading a timestamp on each
> > >           transitions between softirq and hardirq state, so there can be a
> > >           small performance impact.
> > >
> > > It only warns about a small performance impact, but doesn't warn that
> > > CPU accounting is changed in an incompatible fashion that surprises
> > > tooling...
> >
> > Yes, this breaks our userspace tools.
>
> Okay, so 2 out of your 3 fixes are upstream already:
>
>   763a744e24a8 ("sched: Don't account irq time if sched_clock_irqtime is disabled")
>   a6fd16148fdd ("sched, psi: Don't account irq time if sched_clock_irqtime is disabled")

Right.

>
> But we don't have this one yet:
>
>   [PATCH v8 4/4] sched: Fix cgroup irq time for CONFIG_IRQ_TIME_ACCOUNTING
>
>   https://lore.kernel.org/r/20250103022409.2544-5-laoar.shao@gmail.com
>
> which is also essential to fully fix the tooling regression, right?

This patch resolves the container tooling regression but does not
address tools depending on getrusage() for CPU measurement. The
getrusage() fix will be implemented in a subsequent patch.

>
> I think this last patch fell between the cracks, I didn't see any
> fundamental objections against the fix.
>
> Since the patch does not apply cleanly anymore, mind sending a fresh
> -v9 version against v6.15-rc5 or so?

I will send a new version.

-- 
Regards
Yafang