[PATCH v3 2/2] debugobjects: Allow object pool refill mostly in non-atomic context

Waiman Long posted 2 patches 3 months, 3 weeks ago
[PATCH v3 2/2] debugobjects: Allow object pool refill mostly in non-atomic context
Posted by Waiman Long 3 months, 3 weeks ago
With PREEMPT_RT kernel, object pool refill can only happen in preemptible
context. For other !PREEMPT_RT kernels, pool refill can happen in any
context. This can sometimes lead to problem like the following circular
locking dependency shown below.

  -> #3 (&zone->lock){-.-.}-{2:2}:
  -> #2 (&base->lock){-.-.}-{2:2}:
  -> #1 (&console_sch_key){-.-.}-{2:2}:
  -> #0 (console_owner){..-.}-{0:0}:

The "console_owner" is from calling printk() from the following call
chain:

  rmqueue_bulk() => expand() => __warn_printk() => _printk()

This is due to the invocation of the VM_WARN_ONCE() macro in
__add_to_free_list().

The "base->lock" is from lock_timer_base() and "zone->lock" is due to
calling add_timer_on() leading to debug_object_activate() doing actual
memory allocation in pool refill acquiring the zone lock.

The "console_sch_key" comes from a s390 console driver in
driver/s390/cio.  The console_sch_key -> timer dependency happens
because the console driver is setting a timeout value while holding
its lock. Apparently it is pretty common for a console driver to use
timer for timeout or other timing purposes. So this may happen to other
console drivers as well.

There are three debug objects functions that will invoke
debug_objects_fill_pool() for pool refill - __debug_object_init(),
debug_object_activate() & debug_object_assert_init().  Thomas suggested
that we may enforce the pool refill only in the init function and
remove the debug_objects_fill_pool() call from the other two to avoid
the kind of circular locking problem shown above. It is because the init
function can be called in a cluster with many debug objects initialized
consecutively which can lead to exhaustion of the global object pool
if we disable the init function from doing pool refill. See [1] for
such an example. The call patterns of the other two are typically more
spread out.  Of the three, the activate function is called at least an
order of magnitude more than the other two.

Removing the pool refill call from the other two may make pool
exhaustion happen more easily leading to the disabling of the debug
object tracking. As a middle ground, we will allow pool refill from the
activate and assert_init functions if they are called from a non-atomic
context which is roughly half of the times depending on the workloads.

As in_atomic() may not know preemption has been disabled, when
a spinlock has been acquired for example, if CONFIG_PREEMPT_COUNT
hasn't been set. So make DEBUG_OBJECTS select PREEMPT_COUNT to make
sure that the preemption state is properly captured. The overhead of
adding PREEMPT_COUNT should be insignificant compared with the overhead
imposed by enabling the debug object tracking code itself.

[1] https://lore.kernel.org/lkml/202506121115.b69b8c2-lkp@intel.com/

Signed-off-by: Waiman Long <longman@redhat.com>
---
 lib/Kconfig.debug  |  1 +
 lib/debugobjects.c | 15 +++++++++++----
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ebe33181b6e6..854a2f12a64b 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -723,6 +723,7 @@ source "mm/Kconfig.debug"
 config DEBUG_OBJECTS
 	bool "Debug object operations"
 	depends on DEBUG_KERNEL
+	select PREEMPT_COUNT
 	help
 	  If you say Y here, additional code will be inserted into the
 	  kernel to track the life time of various objects and validate
diff --git a/lib/debugobjects.c b/lib/debugobjects.c
index 5598105ecf0d..d85f87f359d2 100644
--- a/lib/debugobjects.c
+++ b/lib/debugobjects.c
@@ -700,11 +700,18 @@ static struct debug_obj *lookup_object_or_alloc(void *addr, struct debug_bucket
 	return NULL;
 }
 
-static void debug_objects_fill_pool(void)
+static void debug_objects_fill_pool(bool init)
 {
 	if (!static_branch_likely(&obj_cache_enabled))
 		return;
 
+	/*
+	 * Attempt to fill the pool only if called from debug_objects_init()
+	 * or not in atomic context.
+	 */
+	if (!init && in_atomic())
+		return;
+
 	if (likely(!pool_should_refill(&pool_global)))
 		return;
 
@@ -740,7 +747,7 @@ __debug_object_init(void *addr, const struct debug_obj_descr *descr, int onstack
 	struct debug_bucket *db;
 	unsigned long flags;
 
-	debug_objects_fill_pool();
+	debug_objects_fill_pool(true);
 
 	db = get_bucket((unsigned long) addr);
 
@@ -817,7 +824,7 @@ int debug_object_activate(void *addr, const struct debug_obj_descr *descr)
 	if (!debug_objects_enabled)
 		return 0;
 
-	debug_objects_fill_pool();
+	debug_objects_fill_pool(false);
 
 	db = get_bucket((unsigned long) addr);
 
@@ -1006,7 +1013,7 @@ void debug_object_assert_init(void *addr, const struct debug_obj_descr *descr)
 	if (!debug_objects_enabled)
 		return;
 
-	debug_objects_fill_pool();
+	debug_objects_fill_pool(false);
 
 	db = get_bucket((unsigned long) addr);
 
-- 
2.49.0
Re: [PATCH v3 2/2] debugobjects: Allow object pool refill mostly in non-atomic context
Posted by Thomas Gleixner 3 months, 2 weeks ago
On Tue, Jun 17 2025 at 01:35, Waiman Long wrote:
> Removing the pool refill call from the other two may make pool
> exhaustion happen more easily leading to the disabling of the debug
> object tracking.

May is not a technical argument. Why would that happen?

As we established before the only reason why activate() or assert() can
consume an object from the pool is when the object was statically
initialized.

Are there really enough statically initialized objects to actually lead
to pool depletion without that activate/assert hackery?

> As a middle ground, we will allow pool refill from the
> activate and assert_init functions if they are called from a non-atomic
> context which is roughly half of the times depending on the workloads.

Which brings you back to the square one on RT enabled kernels because
the held locks in this circular lock chain scenario do _NOT_ disable
preemption, so in_atomic() evaluates to false ....

Thanks,

        tglx
Re: [PATCH v3 2/2] debugobjects: Allow object pool refill mostly in non-atomic context
Posted by kernel test robot 3 months, 2 weeks ago

Hello,

we don't have enough knowledge to understand why this patch causes seemingly
auto-reboot in this trinity tests. but the issue is quite persistent.

94d15b49922495cd 81c49454d188c6eb34bb736f432
---------------- ---------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
           :50         100%          50:50    last_state.is_incomplete_run
           :50          98%          50:50    last_state.running

we don't see hint from dmesg uploaded to [1], so we also attached one dmesg
from parent FYI in case we miss anything.


kernel test robot noticed "last_state.running" on:

commit: 81c49454d188c6eb34bb736f432e4b1ef623d70f ("[PATCH v3 2/2] debugobjects: Allow object pool refill mostly in non-atomic context")
url: https://github.com/intel-lab-lkp/linux/commits/Waiman-Long/debugobjects-Show-the-state-of-debug_objects_enabled/20250617-133635
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-nonmm-unstable
patch link: https://lore.kernel.org/all/20250617053527.1223411-3-longman@redhat.com/
patch subject: [PATCH v3 2/2] debugobjects: Allow object pool refill mostly in non-atomic context

in testcase: trinity
version: 
with following parameters:

	runtime: 300s
	group: group-01
	nr_groups: 5



config: i386-randconfig-004-20250618
compiler: gcc-12
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202506201449.f3047c4-lkp@intel.com


LKP: ttyS0: 246: Kernel tests: Boot OK!
LKP: ttyS0: 246: HOSTNAME vm-snb, MAC 52:54:00:12:34:56, kernel 6.16.0-rc2-00044-g81c49454d188 1

Poky (Yocto Project Reference Distro) 2.7+snapshot vm-snb /dev/ttyS0

vm-snb login: LKP: ttyS0: 246:  /lkp/lkp/src/bin/run-lkp /lkp/jobs/scheduled/vm-meta-243/trinity-group-01-5-300s-yocto-i386-minimal-20190520.cgz-i386-randconfig-004-20250618-81c49454d188-20250619-2383740-f5ihl3-14.yaml
[   13.087284][  T290] mkdir: can't create directory '/sys/kernel/debug': Operation not permitted
[   13.087284][  T290] mount: mounting debug on /sys/kernel/debug failed: No such file or directory
[   34.877921][ T3450] UDPLite: UDP-Lite is deprecated and scheduled to be removed in 2025, please contact the netdev mailing list
[   34.879165][ T3450] trinity-main uses obsolete (PF_INET,SOCK_PACKET)
[   34.931165][ T3450] UDPLite6: UDP-Lite is deprecated and scheduled to be removed in 2025, please contact the netdev mailing list
[   35.302518][ T3582] VFS: Warning: trinity-c0 using old stat() call. Recompile your binary.
[   35.303816][ T3582] VFS: Warning: trinity-c0 using old stat() call. Recompile your binary.
[   35.307560][ T3585] VFS: Warning: trinity-c3 using old stat() call. Recompile your binary.
[   35.308370][ T3585] VFS: Warning: trinity-c3 using old stat() call. Recompile your binary.
[   35.314526][ T3582] VFS: Warning: trinity-c0 using old stat() call. Recompile your binary.



[1]
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250620/202506201449.f3047c4-lkp@intel.com


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki