Hi Linus and folks,
I've been developing a tool for detecting deadlock possibilities by
tracking wait/event rather than lock(?) acquisition order to try to
cover all synchronization mechanisms. It's based on the v5.18-rc3 tag.
https://github.com/lgebyungchulpark/linux-dept/commits/dept1.20_on_v5.18-rc3
Benefit:
0. Works with all lock primitives.
1. Works with wait_for_completion()/complete().
2. Works with 'wait' on PG_locked.
3. Works with 'wait' on PG_writeback.
4. Works with swait/wakeup.
5. Works with waitqueue.
6. Multiple reports are allowed.
7. Deduplication control on multiple reports.
8. Withstands false positives thanks to 6.
9. Easy to tag any wait/event.
Future work:
0. To make it more stable.
1. To separate Dept from Lockdep.
2. To improve performance in terms of time and space.
3. To use Dept as a dependency engine for Lockdep.
4. To add any missing tags of wait/event in the kernel.
5. To deduplicate stack trace.
How to interpret reports:
1. The E(event) in each context cannot be triggered because of the
W(wait) that cannot be woken.
2. The stack trace that helps find the problematic code is located
in each context's detail.
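To make the report structure concrete, here is a toy model of the kind of circular wait/event dependency the report above describes. This is NOT DEPT's actual implementation (DEPT is written in C inside the kernel); it is just a sketch: each edge says "this event cannot fire until that other event fires", and a cycle among events means no context can ever make progress.

```python
from collections import defaultdict

def find_cycle(edges):
    """Return the events on a dependency cycle, or None (plain DFS)."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)

    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)
    path = []

    def dfs(node):
        color[node] = GRAY
        path.append(node)
        for nxt in graph[node]:
            if color[nxt] == GRAY:            # back edge -> cycle found
                return path[path.index(nxt):]
            if color[nxt] == WHITE:
                cyc = dfs(nxt)
                if cyc:
                    return cyc
        path.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            cyc = dfs(node)
            if cyc:
                return cyc
    return None

# Two contexts as in a DEPT report: context A's event (its [E]) cannot
# fire until its [W] is woken by context B's event, and vice versa.
edges = [
    ("up_write(&ei->i_data_sem)", "folio_wake_bit(pglocked)"),
    ("folio_wake_bit(pglocked)", "up_write(&ei->i_data_sem)"),
]
print(find_cycle(edges))
```

With these two edges the DFS reports the cycle between the two events; with any acyclic edge set it returns None.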
Thanks,
Byungchul
---
Changes from v5:
1. Use just pr_warn_once() rather than WARN_ONCE() on the lack
of internal resources because WARN_*(), which prints a stacktrace,
is too much just for reporting the shortage. (feedback from Ted, Hyeonggon)
2. Fix trivial bugs like missing initialization of a struct before
using it.
3. Assign a different class per task when handling onstack
variables for waitqueue or the like, which makes Dept
distinguish between onstack variables of different tasks so
as to prevent false positives. (reported by Hyeonggon)
4. Make Dept aware of even raw_local_irq_*() to prevent false
positives. (reported by Hyeonggon)
5. Don't consider dependencies between the events that might be
triggered within __schedule() and the waits that require
__schedule() as real ones. (reported by Hyeonggon)
6. Unstage the staged wait that has prepare_to_wait_event()'ed
*and* yet to get to __schedule(), if we encounter __schedule()
in-between for another sleep, which is possible if e.g. a
mutex_lock() exists in 'condition' of ___wait_event().
7. Turn on CONFIG_PROVE_LOCKING when CONFIG_DEPT is on, to rely
on the hardirq and softirq entrance tracing to make Dept more
portable for now.
Changes from v4:
1. Fix some bugs that produce false alarms.
2. Distinguish each syscall context from another *for arm64*.
3. Make it just print a message rather than warn in case the
Dept ring buffer gets exhausted. (feedback from Hyeonggon)
4. Explicitly describe "EXPERIMENTAL" and "Dept might produce
false positive reports" in Kconfig. (feedback from Ted)
Changes from v3:
1. Dept shouldn't create dependencies between different depths
of a class that are indicated by *_lock_nested(). Dept
normally doesn't, but it did once another lock class came
in. So fixed it. (feedback from Hyeonggon)
2. Dept considered a wait a real wait once getting to
__schedule() even if it had been set to TASK_RUNNING by wake-up
sources in advance. Fixed it so that Dept doesn't consider
that case a real wait. (feedback from Jan Kara)
3. Stop tracking dependencies with a map once the event
associated with the map has been handled. Dept will start to
work with the map again, on the next sleep.
Changes from v2:
1. Disable Dept on bit_wait_table[] in sched/wait_bit.c, which
was reporting a lot of false positives (my fault). Wait/event
for bit_wait_table[] should've been tagged in a higher layer
to work better, which is future work.
(feedback from Jan Kara)
2. Disable Dept on crypto_larval's completion to prevent a false
positive.
Changes from v1:
1. Fix coding style and typo. (feedback from Steven)
2. Distinguish each work context from another in workqueue.
3. Skip checking lock acquisition with nest_lock, which is about
correct lock usage that should be checked by Lockdep.
Changes from RFC(v0):
1. Add the wait tag at __schedule() rather than at prepare_to_wait().
(feedback from Linus and Matthew)
2. Use try version at lockdep_acquire_cpus_lock() annotation.
3. Distinguish each syscall context from another.
Byungchul Park (21):
llist: Move llist_{head,node} definition to types.h
dept: Implement Dept(Dependency Tracker)
dept: Apply Dept to spinlock
dept: Apply Dept to mutex families
dept: Apply Dept to rwlock
dept: Apply Dept to wait_for_completion()/complete()
dept: Apply Dept to seqlock
dept: Apply Dept to rwsem
dept: Add proc knobs to show stats and dependency graph
dept: Introduce split map concept and new APIs for them
dept: Apply Dept to wait/event of PG_{locked,writeback}
dept: Apply SDT to swait
dept: Apply SDT to wait(waitqueue)
locking/lockdep, cpu/hotplug: Use a weaker annotation in AP thread
dept: Distinguish each syscall context from another
dept: Distinguish each work from another
dept: Disable Dept within the wait_bit layer by default
dept: Disable Dept on struct crypto_larval's completion for now
dept: Differentiate onstack maps from others of different tasks in
class
dept: Do not add dependencies between events within scheduler and
sleeps
dept: Unstage wait when tagging a normal sleep wait
arch/arm64/kernel/syscall.c | 2 +
arch/x86/entry/common.c | 4 +
crypto/api.c | 7 +-
include/linux/completion.h | 44 +-
include/linux/dept.h | 596 ++++++++
include/linux/dept_page.h | 78 +
include/linux/dept_sdt.h | 67 +
include/linux/hardirq.h | 3 +
include/linux/irqflags.h | 71 +-
include/linux/llist.h | 8 -
include/linux/lockdep.h | 186 ++-
include/linux/lockdep_types.h | 3 +
include/linux/mutex.h | 22 +
include/linux/page-flags.h | 45 +-
include/linux/pagemap.h | 7 +-
include/linux/percpu-rwsem.h | 4 +-
include/linux/rtmutex.h | 1 +
include/linux/rwlock.h | 42 +
include/linux/rwlock_api_smp.h | 8 +-
include/linux/rwlock_types.h | 1 +
include/linux/rwsem.h | 22 +
include/linux/sched.h | 7 +
include/linux/seqlock.h | 60 +-
include/linux/spinlock.h | 21 +
include/linux/spinlock_types_raw.h | 3 +
include/linux/swait.h | 4 +
include/linux/types.h | 8 +
include/linux/wait.h | 6 +-
init/init_task.c | 2 +
init/main.c | 4 +
kernel/Makefile | 1 +
kernel/cpu.c | 2 +-
kernel/dependency/Makefile | 4 +
kernel/dependency/dept.c | 2938 ++++++++++++++++++++++++++++++++++++
kernel/dependency/dept_hash.h | 10 +
kernel/dependency/dept_internal.h | 26 +
kernel/dependency/dept_object.h | 13 +
kernel/dependency/dept_proc.c | 92 ++
kernel/exit.c | 7 +
kernel/fork.c | 2 +
kernel/locking/lockdep.c | 28 +-
kernel/locking/spinlock_rt.c | 24 +-
kernel/module.c | 2 +
kernel/sched/completion.c | 12 +-
kernel/sched/core.c | 10 +
kernel/sched/swait.c | 10 +
kernel/sched/wait.c | 16 +
kernel/sched/wait_bit.c | 5 +-
kernel/workqueue.c | 3 +
lib/Kconfig.debug | 28 +
mm/filemap.c | 68 +
mm/page_ext.c | 5 +
52 files changed, 4558 insertions(+), 84 deletions(-)
create mode 100644 include/linux/dept.h
create mode 100644 include/linux/dept_page.h
create mode 100644 include/linux/dept_sdt.h
create mode 100644 kernel/dependency/Makefile
create mode 100644 kernel/dependency/dept.c
create mode 100644 kernel/dependency/dept_hash.h
create mode 100644 kernel/dependency/dept_internal.h
create mode 100644 kernel/dependency/dept_object.h
create mode 100644 kernel/dependency/dept_proc.c
--
1.9.1
I tried DEPT-v6 applied against 5.18-rc5, and it reported the following
false positive. The reason why it's nonsense is that in context A's [W]
wait:

[ 1538.545054] [W] folio_wait_bit_common(pglocked:0):
[ 1538.545370] [<ffffffff81259944>] __filemap_get_folio+0x3e4/0x420
[ 1538.545763] stacktrace:
[ 1538.545928] folio_wait_bit_common+0x2fa/0x460
[ 1538.546248] __filemap_get_folio+0x3e4/0x420
[ 1538.546558] pagecache_get_page+0x11/0x40
[ 1538.546852] ext4_mb_init_group+0x80/0x2e0
[ 1538.547152] ext4_mb_good_group_nolock+0x2a3/0x2d0
...

we're reading the block allocation bitmap into the page cache. This does
not correspond to a real inode, and so we don't actually take
ei->i_data_sem on the pseudo-inode used. In contrast, in context B's [W]
and [E] stack traces, folio_wait_bit is clearly associated with a page
which is mapped to a real inode:

[ 1538.553656] [W] down_write(&ei->i_data_sem:0):
[ 1538.553948] [<ffffffff8141c01b>] ext4_map_blocks+0x17b/0x680
[ 1538.554320] stacktrace:
[ 1538.554485] ext4_map_blocks+0x17b/0x680
[ 1538.554772] mpage_map_and_submit_extent+0xef/0x530
[ 1538.555122] ext4_writepages+0x798/0x990
[ 1538.555409] do_writepages+0xcf/0x1c0
[ 1538.555682] __writeback_single_inode+0x58/0x3f0
[ 1538.556014] writeback_sb_inodes+0x210/0x540
...

[ 1538.558621] [E] folio_wake_bit(pglocked:0):
[ 1538.558896] [<ffffffff814418c0>] ext4_bio_write_page+0x400/0x560
[ 1538.559290] stacktrace:
[ 1538.559455] ext4_bio_write_page+0x400/0x560
[ 1538.559765] mpage_submit_page+0x5c/0x80
[ 1538.560051] mpage_map_and_submit_buffers+0x15a/0x250
[ 1538.560409] mpage_map_and_submit_extent+0x134/0x530
[ 1538.560764] ext4_writepages+0x798/0x990
[ 1538.561057] do_writepages+0xcf/0x1c0
[ 1538.561329] __writeback_single_inode+0x58/0x3f0
...
In any case, this will ***never*** deadlock. It's due to DEPT
fundamentally not understanding that waits on different pages may
involve pages belonging to completely different inodes, so there is zero
chance this would ever deadlock. I suspect there will be similar false
positives for tests (or userspace) that use the copy_file_range(2) or
sendfile(2) system calls.

I've included the full DEPT log report below.

- Ted

generic/011 [20:11:16]
[ 1533.411773] run fstests generic/011 at 2022-05-07 20:11:16
[ 1533.509603] DEPT_INFO_ONCE: Need to expand the ring buffer.
[ 1536.910044] DEPT_INFO_ONCE: Pool(wait) is empty.
[ 1538.533315] ===================================================
[ 1538.533793] DEPT: Circular dependency has been detected.
[ 1538.534199] 5.18.0-rc5-xfstests-dept-00021-g8d3d751c9964 #571 Not tainted
[ 1538.534645] ---------------------------------------------------
[ 1538.535035] summary
[ 1538.535177] ---------------------------------------------------
[ 1538.535567] *** DEADLOCK ***
[ 1538.535567]
[ 1538.535854] context A
[ 1538.536008] [S] down_write(&ei->i_data_sem:0)
[ 1538.536323] [W] folio_wait_bit_common(pglocked:0)
[ 1538.536655] [E] up_write(&ei->i_data_sem:0)
[ 1538.536958]
[ 1538.537063] context B
[ 1538.537216] [S] (unknown)(pglocked:0)
[ 1538.537480] [W] down_write(&ei->i_data_sem:0)
[ 1538.537789] [E] folio_wake_bit(pglocked:0)
[ 1538.538082]
[ 1538.538184] [S]: start of the event context
[ 1538.538460] [W]: the wait blocked
[ 1538.538680] [E]: the event not reachable
[ 1538.538939] ---------------------------------------------------
[ 1538.539327] context A's detail
[ 1538.539530] ---------------------------------------------------
[ 1538.539918] context A
[ 1538.540072] [S] down_write(&ei->i_data_sem:0)
[ 1538.540382] [W] folio_wait_bit_common(pglocked:0)
[ 1538.540712] [E] up_write(&ei->i_data_sem:0)
[ 1538.541015]
[ 1538.541119] [S] down_write(&ei->i_data_sem:0):
[ 1538.541410] [<ffffffff8141c01b>] ext4_map_blocks+0x17b/0x680
[ 1538.541782] stacktrace:
[ 1538.541946] ext4_map_blocks+0x17b/0x680
[ 1538.542234] ext4_getblk+0x5f/0x1f0
[ 1538.542493] ext4_bread+0xc/0x70
[ 1538.542736] ext4_append+0x48/0xf0
[ 1538.542991] ext4_init_new_dir+0xc8/0x160
[ 1538.543284] ext4_mkdir+0x19a/0x320
[ 1538.543542] vfs_mkdir+0x83/0xe0
[ 1538.543788] do_mkdirat+0x8c/0x130
[ 1538.544042] __x64_sys_mkdir+0x29/0x30
[ 1538.544319] do_syscall_64+0x40/0x90
[ 1538.544584] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1538.544949]
[ 1538.545054] [W] folio_wait_bit_common(pglocked:0):
[ 1538.545370] [<ffffffff81259944>] __filemap_get_folio+0x3e4/0x420
[ 1538.545763] stacktrace:
[ 1538.545928] folio_wait_bit_common+0x2fa/0x460
[ 1538.546248] __filemap_get_folio+0x3e4/0x420
[ 1538.546558] pagecache_get_page+0x11/0x40
[ 1538.546852] ext4_mb_init_group+0x80/0x2e0
[ 1538.547152] ext4_mb_good_group_nolock+0x2a3/0x2d0
[ 1538.547496] ext4_mb_regular_allocator+0x391/0x780
[ 1538.547840] ext4_mb_new_blocks+0x44e/0x720
[ 1538.548145] ext4_ext_map_blocks+0x7f1/0xd00
[ 1538.548455] ext4_map_blocks+0x19e/0x680
[ 1538.548743] ext4_getblk+0x5f/0x1f0
[ 1538.549006] ext4_bread+0xc/0x70
[ 1538.549250] ext4_append+0x48/0xf0
[ 1538.549505] ext4_init_new_dir+0xc8/0x160
[ 1538.549798] ext4_mkdir+0x19a/0x320
[ 1538.550058] vfs_mkdir+0x83/0xe0
[ 1538.550302] do_mkdirat+0x8c/0x130
[ 1538.550557]
[ 1538.550660] [E] up_write(&ei->i_data_sem:0):
[ 1538.550940] (N/A)
[ 1538.551071] ---------------------------------------------------
[ 1538.551459] context B's detail
[ 1538.551662] ---------------------------------------------------
[ 1538.552047] context B
[ 1538.552202] [S] (unknown)(pglocked:0)
[ 1538.552466] [W] down_write(&ei->i_data_sem:0)
[ 1538.552775] [E] folio_wake_bit(pglocked:0)
[ 1538.553071]
[ 1538.553174] [S] (unknown)(pglocked:0):
[ 1538.553422] (N/A)
[ 1538.553553]
[ 1538.553656] [W] down_write(&ei->i_data_sem:0):
[ 1538.553948] [<ffffffff8141c01b>] ext4_map_blocks+0x17b/0x680
[ 1538.554320] stacktrace:
[ 1538.554485] ext4_map_blocks+0x17b/0x680
[ 1538.554772] mpage_map_and_submit_extent+0xef/0x530
[ 1538.555122] ext4_writepages+0x798/0x990
[ 1538.555409] do_writepages+0xcf/0x1c0
[ 1538.555682] __writeback_single_inode+0x58/0x3f0
[ 1538.556014] writeback_sb_inodes+0x210/0x540
[ 1538.556324] __writeback_inodes_wb+0x4c/0xe0
[ 1538.556635] wb_writeback+0x298/0x450
[ 1538.556911] wb_do_writeback+0x29e/0x320
[ 1538.557199] wb_workfn+0x6a/0x2c0
[ 1538.557447] process_one_work+0x302/0x650
[ 1538.557743] worker_thread+0x55/0x400
[ 1538.558013] kthread+0xf0/0x120
[ 1538.558251] ret_from_fork+0x1f/0x30
[ 1538.558518]
[ 1538.558621] [E] folio_wake_bit(pglocked:0):
[ 1538.558896] [<ffffffff814418c0>] ext4_bio_write_page+0x400/0x560
[ 1538.559290] stacktrace:
[ 1538.559455] ext4_bio_write_page+0x400/0x560
[ 1538.559765] mpage_submit_page+0x5c/0x80
[ 1538.560051] mpage_map_and_submit_buffers+0x15a/0x250
[ 1538.560409] mpage_map_and_submit_extent+0x134/0x530
[ 1538.560764] ext4_writepages+0x798/0x990
[ 1538.561057] do_writepages+0xcf/0x1c0
[ 1538.561329] __writeback_single_inode+0x58/0x3f0
[ 1538.561662] writeback_sb_inodes+0x210/0x540
[ 1538.561973] __writeback_inodes_wb+0x4c/0xe0
[ 1538.562283] wb_writeback+0x298/0x450
[ 1538.562555] wb_do_writeback+0x29e/0x320
[ 1538.562842] wb_workfn+0x6a/0x2c0
[ 1538.563095] process_one_work+0x302/0x650
[ 1538.563387] worker_thread+0x55/0x400
[ 1538.563658] kthread+0xf0/0x120
[ 1538.563895] ret_from_fork+0x1f/0x30
[ 1538.564161] ---------------------------------------------------
[ 1538.564548] information that might be helpful
[ 1538.564832] ---------------------------------------------------
[ 1538.565223] CPU: 1 PID: 46539 Comm: dirstress Not tainted 5.18.0-rc5-xfstests-dept-00021-g8d3d751c9964 #571
[ 1538.565854] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[ 1538.566394] Call Trace:
[ 1538.566559] <TASK>
[ 1538.566701] dump_stack_lvl+0x4f/0x68
[ 1538.566945] print_circle.cold+0x15b/0x169
[ 1538.567218] ? print_circle+0xe0/0xe0
[ 1538.567461] cb_check_dl+0x55/0x60
[ 1538.567687] bfs+0xd5/0x1b0
[ 1538.567874] add_dep+0xd3/0x1a0
[ 1538.568083] ? __filemap_get_folio+0x3e4/0x420
[ 1538.568374] add_wait+0xe3/0x250
[ 1538.568590] ? __filemap_get_folio+0x3e4/0x420
[ 1538.568886] dept_wait_split_map+0xb1/0x130
[ 1538.569163] folio_wait_bit_common+0x2fa/0x460
[ 1538.569456] ? lock_is_held_type+0xfc/0x130
[ 1538.569733] __filemap_get_folio+0x3e4/0x420
[ 1538.570013] ? __lock_release+0x1b2/0x2c0
[ 1538.570278] pagecache_get_page+0x11/0x40
[ 1538.570543] ext4_mb_init_group+0x80/0x2e0
[ 1538.570813] ? ext4_get_group_desc+0xb2/0x200
[ 1538.571102] ext4_mb_good_group_nolock+0x2a3/0x2d0
[ 1538.571418] ext4_mb_regular_allocator+0x391/0x780
[ 1538.571733] ? rcu_read_lock_sched_held+0x3f/0x70
[ 1538.572044] ? trace_kmem_cache_alloc+0x2c/0xd0
[ 1538.572343] ? kmem_cache_alloc+0x1f7/0x3f0
[ 1538.572618] ext4_mb_new_blocks+0x44e/0x720
[ 1538.572896] ext4_ext_map_blocks+0x7f1/0xd00
[ 1538.573179] ? find_held_lock+0x2b/0x80
[ 1538.573434] ext4_map_blocks+0x19e/0x680
[ 1538.573693] ext4_getblk+0x5f/0x1f0
[ 1538.573927] ext4_bread+0xc/0x70
[ 1538.574141] ext4_append+0x48/0xf0
[ 1538.574369] ext4_init_new_dir+0xc8/0x160
[ 1538.574634] ext4_mkdir+0x19a/0x320
[ 1538.574866] vfs_mkdir+0x83/0xe0
[ 1538.575082] do_mkdirat+0x8c/0x130
[ 1538.575308] __x64_sys_mkdir+0x29/0x30
[ 1538.575557] do_syscall_64+0x40/0x90
[ 1538.575795] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1538.576128] RIP: 0033:0x7f0960466b07
[ 1538.576367] Code: 1f 40 00 48 8b 05 89 f3 0c 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 f3 0c 00 f7 d8 64 89 01 48
[ 1538.577576] RSP: 002b:00007ffd0fa955a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[ 1538.578069] RAX: ffffffffffffffda RBX: 0000000000000239 RCX: 00007f0960466b07
[ 1538.578533] RDX: 0000000000000000 RSI: 00000000000001ff RDI: 00007ffd0fa955d0
[ 1538.578995] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000010
[ 1538.579458] R10: 00007ffd0fa95345 R11: 0000000000000246 R12: 00000000000003e8
[ 1538.579923] R13: 0000000000000000 R14: 00007ffd0fa955d0 R15: 00007ffd0fa95dd0
[ 1538.580389] </TASK>
[ 1540.581382] EXT4-fs (vdb): mounted filesystem with ordered data mode. Quota mode: none.
[20:11:24] 8s

P.S. Later on the console, the test ground to a halt because DEPT
started WARNING over and over and over again....

[ 3129.686102] DEPT_WARN_ON: dt->ecxt_held_pos == DEPT_MAX_ECXT_HELD
[ 3129.686396] ? __might_fault+0x32/0x80
[ 3129.686660] WARNING: CPU: 1 PID: 107320 at kernel/dependency/dept.c:1537 add_ecxt+0x1c0/0x1d0
[ 3129.687040] ? __might_fault+0x32/0x80
[ 3129.687282] CPU: 1 PID: 107320 Comm: aio-stress Tainted: G W 5.18.0-rc5-xfstests-dept-00021-g8d3d751c9964 #571

with multiple CPUs completely spamming the serial console.
This should probably be a WARN_ON_ONCE, or something that disables DEPT
entirely, since apparently no useful DEPT reports (or any useful kernel
work, for that matter) are going to be happening after this.
Oh, one other problem with DEPT --- it's SLOW --- the overhead is
enormous. Using kvm-xfstests[1] running "kvm-xfstests smoke", here
are some sample times:

                        LOCKDEP      DEPT
Time to first test      49 seconds   602 seconds
ext4/001                2 s          22 s
ext4/003                2 s          8 s
ext4/005                0 s          7 s
ext4/020                1 s          8 s
ext4/021                11 s         17 s
ext4/023                0 s          83 s
generic/001             4 s          76 s
generic/002             0 s          11 s
generic/003             10 s         19 s

There are some large variations; in some cases, some xfstests take 10x
as much time or more to run. In fact, when I first started the
kvm-xfstests run with DEPT, I thought something had hung and that tests
would never start. (In fact, with gce-xfstests the default watchdog
"something has gone terribly wrong with the kexec" had fired, and I
didn't get any test results using gce-xfstests at all. If DEPT goes in
without any optimizations, I'm going to have to adjust the watchdog
timers for gce-xfstests.)

The bottom line is that at the moment, between the false positives and
the significant overhead imposed by DEPT, I would suggest that if DEPT
ever does go in, it should be possible to disable DEPT and only use the
existing CONFIG_PROVE_LOCKING version of LOCKDEP, just because DEPT is
S - L - O - W.

[1] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md

- Ted

P.S. Darrick and I have both disabled LOCKDEP by default because it
slows down ext4 -g auto testing by a factor of 2, and xfs -g auto
testing by a factor of 3. So the fact that DEPT is a factor of 2x to
10x or more slower than LOCKDEP when running various xfstests tests
should be a real concern.
On Mon, May 09, 2022 at 06:28:17PM -0400, Theodore Ts'o wrote:
> Oh, one other problem with DEPT --- it's SLOW --- the overhead is
> enormous. Using kvm-xfstests[1] running "kvm-xfstests smoke", here
> are some sample times:

Yes, right. DEPT has never been optimized. It rather turns on
CONFIG_LOCKDEP and even CONFIG_PROVE_LOCKING when CONFIG_DEPT gets
enabled, because of a porting issue. I have no choice but to rely on
those to develop DEPT out of tree. Of course, that's what I don't like.

Plus, for now, I'm focusing on removing false positives. Once it's
considered settled down, I will work on performance optimization. But it
should still keep relying on the Lockdep CONFIGs and adding additional
overhead on top of them until DEPT can be developed in the tree.

>                         LOCKDEP      DEPT
> Time to first test      49 seconds   602 seconds
> ext4/001                2 s          22 s
> ext4/003                2 s          8 s
> ext4/005                0 s          7 s
> ext4/020                1 s          8 s
> ext4/021                11 s         17 s
> ext4/023                0 s          83 s
> generic/001             4 s          76 s
> generic/002             0 s          11 s
> generic/003             10 s         19 s
>
> There are some large variations; in some cases, some xfstests take 10x
> as much time or more to run. In fact, when I first started the
> kvm-xfstests run with DEPT, I thought something had hung and that
> tests would never start. (In fact, with gce-xfstests the default
> watchdog "something has gone terribly wrong with the kexec" had fired,
> and I didn't get any test results using gce-xfstests at all. If DEPT
> goes in without any optimizations, I'm going to have to adjust the
> watchdog timers for gce-xfstests.)

Thank you for informing me. I will go for the optimization as well.

> The bottom line is that at the moment, between the false positives,
> and the significant overhead imposed by DEPT, I would suggest that if
> DEPT ever does go in, that it should be possible to disable DEPT and
> only use the existing CONFIG_PROVE_LOCKING version of LOCKDEP, just
> because DEPT is S - L - O - W.
>
> [1] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md
>
> - Ted
>
> P.S. Darrick and I both have disabled using LOCKDEP by default
> because it slows down ext4 -g auto testing by a factor of 2, and xfs
> -g auto testing by a factor of 3. So the fact that DEPT is a factor
> of 2x to 10x or more slower than LOCKDEP when running various
> xfstests tests should be a real concern.

DEPT is tracking way more objects than Lockdep, so it's inevitably
slower, but let me try to make it perform similarly to Lockdep.

Byungchul
On Tue, May 10, 2022 at 09:32:13AM +0900, Byungchul Park wrote:
> Yes, right. DEPT has never been optimized. It rather turns on
> CONFIG_LOCKDEP and even CONFIG_PROVE_LOCKING when CONFIG_DEPT gets on
> because of porting issue. I have no choice but to rely on those to
> develop DEPT out of tree. Of course, that's what I don't like.

Sure, but blaming the overhead on unnecessary CONFIG_PROVE_LOCKING
overhead can explain only a tiny fraction of the slowdown. Consider:
if time to first test (time to boot the kernel, set up the test
environment, figure out which tests to run, etc.) is 12 seconds w/o
LOCKDEP, 49 seconds with LOCKDEP/PROVE_LOCKING, and 602 seconds with
DEPT, you can really only blame 37 seconds out of the 602 seconds of
DEPT on unnecessary PROVE_LOCKING overhead.

So let's assume we can get rid of all of the PROVE_LOCKING overhead.
We're still talking about 12 seconds for time-to-first-test without
any lock debugging, versus ** 565 ** seconds for time-to-first-test
with DEPT. That's a factor of 47x for DEPT sans LOCKDEP overhead,
compared to a 4x overhead for PROVE_LOCKING.

> Plus, for now, I'm focusing on removing false positives. Once it's
> considered settled down, I will work on performance optimizaition. But
> it should still keep relying on Lockdep CONFIGs and adding additional
> overhead on it until DEPT can be developed in the tree.

Well, please take a look at the false positive which I reported. I
suspect that in order to fix that particular false positive, we'll
either need to have a way to disable DEPT on waiting on all page/folio
dirty bits, or it will need to treat pages from different inodes
and/or address spaces as being entirely separate classes, instead of
collapsing all inode dirty bits, and all of the various inodes' mutexes
(such as ext4's i_data_sem), as being part of a single object class.

> DEPT is tracking way more objects than Lockdep so it's inevitable to be
> slower, but let me try to make it have the similar performance to
> Lockdep.

In order to eliminate some of these false positives, I suspect it's
going to increase the number of object classes that DEPT will need to
track even *more*. At which point, the cost/benefit of DEPT may get
called into question, especially if all of the false positives can't
be suppressed.

- Ted
Ted wrote:
> On Tue, May 10, 2022 at 09:32:13AM +0900, Byungchul Park wrote:
> > Yes, right. DEPT has never been optimized. It rather turns on
> > CONFIG_LOCKDEP and even CONFIG_PROVE_LOCKING when CONFIG_DEPT gets on
> > because of porting issue. I have no choice but to rely on those to
> > develop DEPT out of tree. Of course, that's what I don't like.
>
> Sure, but blaming the overhead on unnecessary CONFIG_PROVE_LOCKING
> overhead can explain only a tiny fraction of the slowdown. Consider:
> if time to first test (time to boot the kernel, setup the test
> environment, figure out which tests to run, etc.) is 12 seconds w/o
> LOCKDEP, 49 seconds with LOCKDEP/PROVE_LOCKING and 602 seconds with
> DEPT, you can really only blame 37 seconds out of the 602 seconds of
> DEPT on unnecessary PROVE_LOCKING overhead.
>
> So let's assume we can get rid of all of the PROVE_LOCKING overhead.
> We're still talking about 12 seconds for time-to-first test without
> any lock debugging, versus ** 565 ** seconds for time-to-first test
> with DEPT. That's a factor of 47x for DEPT sans LOCKDEP overhead,
> compared to a 4x overhead for PROVE_LOCKING.

Okay. I will work on it.

> > Plus, for now, I'm focusing on removing false positives. Once it's
> > considered settled down, I will work on performance optimization. But
> > it should still keep relying on Lockdep CONFIGs and adding additional
> > overhead on it until DEPT can be developed in the tree.
>
> Well, please take a look at the false positive which I reported. I
> suspect that in order to fix that particular false positive, we'll
> either need to have a way to disable DEPT on waiting on all page/folio
> dirty bits, or it will need to treat pages from different inodes
> and/or address spaces as being entirely separate classes, instead of
> collapsing all inode dirty bits, and all of various inode's mutexes
> (such as ext4's i_data_sem) as being part of a single object class.

I'd rather solve it by assigning different classes to different types of
inode. That is the right way.

> > DEPT is tracking way more objects than Lockdep so it's inevitable to be
> > slower, but let me try to make it have the similar performance to
> > Lockdep.
>
> In order to eliminate some of these false positives, I suspect it's
> going to increase the number of object classes that DEPT will need to
> track even *more*. At which point, the cost/benefit of DEPT may get
> called into question, especially if all of the false positives can't
> be suppressed.

Look. Let's talk in general terms. There's no way to get rid of false
positives all the way. It's a decision issue of *balancing* between
considering potential cases and only real ones. Definitely, potential is
not real. The more potential things we consider, the higher the chances
are that false positives appear.

But yes. The advantage we'd take by detecting potential ones should be
higher than the risk of being bothered by false ones. Do you think a
tool is useless if it produces a few false positives? Of course, it'd
be a problem if there are too many, but otherwise, I think it'd be a
great tool if the advantage > the risk.

Don't get me wrong here. It doesn't mean DEPT is perfect for now. The
performance should be improved and false alarms that appear should be
removed, of course. I'm talking about the direction.

For now, there's no tool to track wait/event itself in the Linux kernel,
though a subset of the functionality exists. DEPT is the first try for
that purpose and can be a useful tool with the right direction.

I know what you are concerned about. I bet it's the false positives that
are going to bother you once merged. I'll insist that DEPT shouldn't be
used as a mandatory testing tool until it's considered stable enough.
But what about the ones who would take the advantage and use DEPT? Why
don't you think of the folks who will take the advantage from the hints
about the dependencies of synchronization, esp. when their subsystem
requires very complicated synchronization? Should a tool be useful only
in a final testing stage? What about its usefulness during the
development stage?

It's worth noting DEPT works with any wait/event, so any lockups, e.g.
even by a HW-SW interface, retry logic or the like, can be detected by
DEPT once all waits and events are tagged properly. I believe the
advantage of that is much higher than the bad side of facing false
alarms. It's just my opinion. I'm going to respect the majority opinion.

Byungchul

> - Ted
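To make the class-granularity argument above concrete, here is a toy illustration in Python. It is emphatically NOT DEPT's real code, and all the names are made up: with one collapsed "pglocked" class, the bitmap pseudo-inode's page and a real inode's page look identical to the checker, so the two contexts in Ted's report close a cycle; with per-inode classes they do not.

```python
# Build the two dependency edges from Ted's report, with the page-lock
# class either collapsed into one ("pglocked") or split per inode.
def build_edges(per_inode_classes):
    def pg(inode):
        return f"pglocked:{inode}" if per_inode_classes else "pglocked"
    return [
        # Context A: inside i_data_sem, waits on the bitmap pseudo-inode's page
        ("up_write(&ei->i_data_sem)", f"wake({pg('bitmap-pseudo-inode')})"),
        # Context B: holds a real inode's page lock, waits on i_data_sem
        (f"wake({pg('real-inode')})", "up_write(&ei->i_data_sem)"),
    ]

def has_cycle(edges):
    graph = {}
    for s, d in edges:
        graph.setdefault(s, []).append(d)

    def reachable(frm, to, seen):
        if frm == to:
            return True
        if frm in seen:
            return False
        seen.add(frm)
        return any(reachable(n, to, seen) for n in graph.get(frm, []))

    # a cycle exists iff some edge's destination can reach its source
    return any(reachable(d, s, set()) for s, d in edges)

print(has_cycle(build_edges(per_inode_classes=False)))  # collapsed: spurious cycle
print(has_cycle(build_edges(per_inode_classes=True)))   # per-inode: no cycle
```

The trade-off Ted points at is visible here too: splitting classes removes the false positive, at the cost of more classes to track.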
On Tue, May 10, 2022 at 02:37:40PM +0900, Byungchul Park wrote:
> Ted wrote:
> > On Tue, May 10, 2022 at 09:32:13AM +0900, Byungchul Park wrote:
> > > DEPT is tracking way more objects than Lockdep so it's inevitable to be
> > > slower, but let me try to make it have the similar performance to
> > > Lockdep.
> >
> > In order to eliminate some of these false positives, I suspect it's
> > going to increase the number of object classes that DEPT will need to
> > track even *more*. At which point, the cost/benefit of DEPT may get
> > called into question, especially if all of the false positives can't
> > be suppressed.
>
> Look. Let's talk in general terms. There's no way to get rid of the
> false positives all the way. It's a decision issue for *balancing*
> between considering potential cases and only real ones. Definitely,
> potential is not real. The more potential things we consider, the
> higher the chances are, that false positives appear.
>
> But yes. The advantage we'd take by detecting potential ones should be
> higher than the risk of being bothered by false ones. Do you think a
> tool is useless if it produces a few false positives? Of course, it'd
> be a problem if it's too many, but otherwise, I think it'd be a great
> tool if the advantage > the risk.
>
> Don't get me wrong here. It doesn't mean DEPT is perfect for now. The
> performance should be improved and false alarms that appear should be
> removed, of course. I'm talking about the direction.
>
> For now, there's no tool to track wait/event itself in Linux kernel -
> a subset of the functionality exists tho. DEPT is the 1st try for that
> purpose and can be a useful tool by the right direction.
>
> I know what you are concerning about. I bet it's false positives that
> are going to bother you once merged. I'll insist that DEPT shouldn't be
> used as a mandatory testing tool until considered stable enough. But
> what about ones who would take the advantage use DEPT. Why don't you
> think of folks who will take the advantage from the hints about
> dependency of synchronization esp. when their subsystem requires very
> complicated synchronization? Should a tool be useful only in a final
> testing stage? What about the usefulness during development stage?
>
> It's worth noting DEPT works with any wait/event so any lockups e.g.
> even by HW-SW interface, retry logic or the like can be detected by
> DEPT once all waits and events are tagged properly. I believe the
> advantage by that is much higher than the bad side facing false
> alarms. It's just my opinion. I'm going to respect the majority
> opinion.

s/take advantage/have the benefit/g

Byungchul
On Wed, May 4, 2022 at 1:19 AM Byungchul Park <byungchul.park@lge.com> wrote:
>
> Hi Linus and folks,
>
> I've been developing a tool for detecting deadlock possibilities by
> tracking wait/event rather than lock(?) acquisition order to try to
> cover all synchronization mechanisms.
So what is the actual status of reports these days?
Last time I looked at some reports, it gave a lot of false positives
due to mis-understanding prepare_to_sleep().
For this all to make sense, it would need to not have false positives
(or at least a very small number of them together with a way to sanely
get rid of them), and also have a track record of finding things that
lockdep doesn't.
Maybe such reports have been sent out with the current situation, and
I haven't seen them.
Linus
On Wed, May 04, 2022 at 11:17:02AM -0700, Linus Torvalds wrote:
> On Wed, May 4, 2022 at 1:19 AM Byungchul Park <byungchul.park@lge.com> wrote:
> >
> > Hi Linus and folks,
> >
> > I've been developing a tool for detecting deadlock possibilities by
> > tracking wait/event rather than lock(?) acquisition order to try to
> > cover all synchronization mechanisms.
> 
> So what is the actual status of reports these days?

I'd like to mention one important thing here. Reportability will get
stronger as more wait-event pairs get tagged, everywhere DEPT can work.

Anything, e.g. a HW-SW interface, any retry logic and so on, can be a
wait-event pair if it performs a wait or an event anyway. For example,
polling on an IO-mapped read register and initiating the HW to go for
the event can also be a pair. Definitely those make DEPT more useful.

---

The way to use the APIs:

1. Define an SDT (Simple Dependency Tracker)

	DEFINE_DEPT_SDT(my_hw_event);			<- add this

2. Tag on the waits

	sdt_wait(&my_hw_event);				<- add this
	... retry logic until my hw work done ...	<- the original code

3. Tag on the events

	sdt_event(&my_hw_event);			<- add this
	run_my_hw();					<- the original code

---

That is all we should do. I believe DEPT would be a very useful tool
once all wait-event pairs get tagged by the developers in all subsystems
and device drivers.

	Byungchul

> Last time I looked at some reports, it gave a lot of false positives
> due to mis-understanding prepare_to_sleep().
> 
> For this all to make sense, it would need to not have false positives
> (or at least a very small number of them together with a way to sanely
> get rid of them), and also have a track record of finding things that
> lockdep doesn't.
> 
> Maybe such reports have been sent out with the current situation, and
> I haven't seen them.
> 
>                  Linus
Linus wrote:
> 
> On Wed, May 4, 2022 at 1:19 AM Byungchul Park <byungchul.park@lge.com> wrote:
> >
> > Hi Linus and folks,
> >
> > I've been developing a tool for detecting deadlock possibilities by
> > tracking wait/event rather than lock(?) acquisition order to try to
> > cover all synchronization mechanisms.
> 
> So what is the actual status of reports these days?
> 
> Last time I looked at some reports, it gave a lot of false positives
> due to mis-understanding prepare_to_sleep().

Yes, it was. I handled the case in the following way:

1. Stage the wait at prepare_to_sleep(), which might be used at commit.
   It has yet to become an actual wait that Dept considers.
2. If the condition for sleep is true, the wait is committed at
   __schedule(). The wait becomes an actual one that Dept considers.
3. If the condition is false and the task gets back to TASK_RUNNING,
   clean(=reset) the staged wait.

That way, Dept only works with what actually hits __schedule() for the
waits through sleep.

> For this all to make sense, it would need to not have false positives
> (or at least a very small number of them together with a way to sanely

Yes, I agree with you. I got rid of them in the way I described above.

> get rid of them), and also have a track record of finding things that
> lockdep doesn't.

I have some reports in which wait_for_completion or a waitqueue is
involved. It's worth noting those are not tracked by Lockdep. I'm
checking whether they are true positives or not. I will share those
reports once I get more convinced about that.

> Maybe such reports have been sent out with the current situation, and
> I haven't seen them.

Dept reports usually have been sent to me privately, not on LKML. As I
told you, I'm planning to share them.

	Byungchul

> 
>                  Linus
> 
On Fri, May 06, 2022 at 09:11:35AM +0900, Byungchul Park wrote:
> Linus wrote:
[...]
> > Last time I looked at some reports, it gave a lot of false positives
> > due to mis-understanding prepare_to_sleep().
> 
> Yes, it was. I handled the case in the following way:
> 
> 1. Stage the wait at prepare_to_sleep(), which might be used at commit.
>    Which has yet to be an actual wait that Dept considers.
> 2. If the condition for sleep is true, the wait will be committed at
>    __schedule(). The wait becomes an actual one that Dept considers.
> 3. If the condition is false and the task gets back to TASK_RUNNING,
>    clean(=reset) the staged wait.
> 
> That way, Dept only works with what actually hits __schedule() for the
> waits through sleep.
> 
> > For this all to make sense, it would need to not have false positives
> > (or at least a very small number of them together with a way to sanely
> 
> Yes. I agree with you. I got rid of them that way I described above.

IMHO DEPT should not report what lockdep allows (not talking about wait
events). I mean, lockdep allows some kinds of nested locks, but DEPT
reports them. When I was collecting reports from DEPT on various
configurations, most of them were reports of down_write_nested(), which
is allowed by lockdep.

DEPT should not report at least what we know is not a real deadlock.
Otherwise there will be reports that are never fixed, which is quite
unpleasant, and reporters cannot examine all of them to see whether they
are real deadlocks or not.

> > get rid of them), and also have a track record of finding things that
> > lockdep doesn't.
> 
> I have some reports that wait_for_completion or waitqueue is involved.
> It's worth noting those are not tracked by Lockdep. I'm checking if
> those are true positive or not. I will share those reports once I get
> more convinced for that.
> 
> > Maybe such reports have been sent out with the current situation, and
> > I haven't seen them.
> 
> Dept reports usually have been sent to me privately, not in LKML. As I
> told you, I'm planning to share them.
> 
> Byungchul
> 
> > 
> > Linus
> > 

-- 
Thanks,
Hyeonggon
On Sat, May 07, 2022 at 04:20:50PM +0900, Hyeonggon Yoo wrote:
> On Fri, May 06, 2022 at 09:11:35AM +0900, Byungchul Park wrote:
[...]
> IMHO DEPT should not report what lockdep allows (Not talking about

No.

> wait events). I mean lockdep allows some kind of nested locks but
> DEPT reports them.

You have already asked exactly the same question in another thread on
LKML. That time I answered it, but let me explain it again.

---

CASE 1.

	lock L with depth n
	lock_nested L' with depth n + 1
	...
	unlock L'
	unlock L

This case is allowed by Lockdep.
This case is allowed by DEPT cuz it's not a deadlock.

CASE 2.

	lock L with depth n
	lock A
	lock_nested L' with depth n + 1
	...
	unlock L'
	unlock A
	unlock L

This case is allowed by Lockdep.
This case is *NOT* allowed by DEPT cuz it's a *DEADLOCK*.

---

The following scenario would explain why CASE 2 is problematic.

	THREAD X				THREAD Y

	lock L with depth n
						lock L' with depth n
	lock A
						lock A
	lock_nested L' with depth n + 1
						lock_nested L'' with depth n + 1
	...					...
	unlock L'				unlock L''
	unlock A				unlock A
	unlock L				unlock L'

Yes, I need to check whether the report you shared with me is a true
one, but it's not because DEPT doesn't work with *_nested() APIs.

	Byungchul
On Mon, May 09, 2022 at 09:16:37AM +0900, Byungchul Park wrote:
[...]
> CASE 1.
> 
> 	lock L with depth n
> 	lock_nested L' with depth n + 1
> 	...
> 	unlock L'
> 	unlock L
> 
> This case is allowed by Lockdep.
> This case is allowed by DEPT cuz it's not a deadlock.
> 
> CASE 2.
> 
> 	lock L with depth n
> 	lock A
> 	lock_nested L' with depth n + 1
> 	...
> 	unlock L'
> 	unlock A
> 	unlock L
> 
> This case is allowed by Lockdep.
> This case is *NOT* allowed by DEPT cuz it's a *DEADLOCK*.

Yeah, in previous threads we discussed this [1]

And the case was:
	scan_mutex -> object_lock -> kmemleak_lock -> object_lock
And dept reported:
	object_lock -> kmemleak_lock, kmemleak_lock -> object_lock
as deadlock.

But IIUC - what DEPT reported happens only under scan_mutex, and it is
not simple just not to take them, because the object can be removed
from the list and freed while scanning via kmemleak_free() without
kmemleak_lock and object_lock.

Just I'm still not sure that someone will fix the warning in the future
- even if the locking rule is not good - if it will not cause a real
deadlock.

> ---
> 
> The following scenario would explain why CASE 2 is problematic.
> 
> 	THREAD X				THREAD Y
> 
> 	lock L with depth n
> 						lock L' with depth n
> 	lock A
> 						lock A
> 	lock_nested L' with depth n + 1
> 						lock_nested L'' with depth n + 1
> 	...					...
> 	unlock L'				unlock L''
> 	unlock A				unlock A
> 	unlock L				unlock L'
> 
> Yes. I need to check if the report you shared with me is a true one, but
> it's not because DEPT doesn't work with *_nested() APIs.

Sorry, it was not right just to say DEPT doesn't work with _nested()
APIs.

> Byungchul

[1] https://lore.kernel.org/lkml/20220304002809.GA6112@X58A-UD3R/

-- 
Thanks,
Hyeonggon
On Tue, May 10, 2022 at 08:18:12PM +0900, Hyeonggon Yoo wrote:
> On Mon, May 09, 2022 at 09:16:37AM +0900, Byungchul Park wrote:
[...]
> Yeah, in previous threads we discussed this [1]
> 
> And the case was:
> 	scan_mutex -> object_lock -> kmemleak_lock -> object_lock
> And dept reported:
> 	object_lock -> kmemleak_lock, kmemleak_lock -> object_lock
> as deadlock.
> 
> But IIUC - What DEPT reported happens only under scan_mutex and
> It is not simple just not to take them because the object can be
> removed from the list and freed while scanning via kmemleak_free()
> without kmemleak_lock and object_lock.

That should be one of the following orders:

	1. kmemleak_lock -> object_lock -> object_lock(nested)
	2. object_lock -> object_lock(nested) -> kmemleak_lock

> Just I'm still not sure that someone will fix the warning in the future
> - even if the locking rule is not good - if it will not cause a real
> deadlock.

There are more important things than making the code just work for now:
for example, maintenance, and communication via code between the current
developers and potential newcomers in the future.

At least, a comment describing why the wrong order in the code is safe
should be added. I wouldn't allow the current order in the code if I
were the maintainer.

	Byungchul

> > ---
> > 
> > The following scenario would explain why CASE 2 is problematic.
[...]
> [1] https://lore.kernel.org/lkml/20220304002809.GA6112@X58A-UD3R/
> 
> -- 
> Thanks,
> Hyeonggon
On Wed, May 11, 2022 at 08:39:29AM +0900, Byungchul Park wrote:
> On Tue, May 10, 2022 at 08:18:12PM +0900, Hyeonggon Yoo wrote:
[...]
> > Just I'm still not sure that someone will fix the warning in the future
> > - even if the locking rule is not good - if it will not cause a real
> > deadlock.
> 
> There are more important things than making the code just work for now:
> for example, maintenance, and communication via code between the current
> developers and potential newcomers in the future.

Then we will get the same reports from DEPT until the already existing
bad code (even if it does not cause a deadlock) is reworked. If you
think that is the right thing to do, okay.

> At least, a comment describing why the wrong order in the code is safe
> should be added.

AFAIK the comment is already there in mm/kmemleak.c.

> I wouldn't allow the current order in the code if I
> were the maintainer.

[+Cc Catalin]

He may have an opinion.

Thanks,
Hyeonggon

> 	Byungchul
[...]

-- 
Thanks,
Hyeonggon
On Wed, May 11, 2022 at 07:04:51PM +0900, Hyeonggon Yoo wrote:
> On Wed, May 11, 2022 at 08:39:29AM +0900, Byungchul Park wrote:
> > On Tue, May 10, 2022 at 08:18:12PM +0900, Hyeonggon Yoo wrote:
> > > On Mon, May 09, 2022 at 09:16:37AM +0900, Byungchul Park wrote:
[...]
> > > And the case was:
> > > 	scan_mutex -> object_lock -> kmemleak_lock -> object_lock
> > > And dept reported:
> > > 	object_lock -> kmemleak_lock, kmemleak_lock -> object_lock
> > > as deadlock.
> > > 
> > > But IIUC - What DEPT reported happens only under scan_mutex and it
> > > is not simple just not to take them because the object can be
> > > removed from the list and freed while scanning via kmemleak_free()
> > > without kmemleak_lock and object_lock.

The above kmemleak sequence shouldn't deadlock since those locks, even
if taken in a different order, are serialised by scan_mutex. For various
reasons, trying to reduce the latency, I ended up with some
fine-grained, per-object locking.

For object allocation (rbtree modification) and tree search, we use
kmemleak_lock. During scanning (which can take minutes under
scan_mutex), we want to prevent (a) long latencies and (b) freeing the
object being scanned. We release the locks regularly for (a) and hold
the object->lock for (b).

In another thread Byungchul mentioned:

| context X			context Y
|
| lock mutex A			lock mutex A
| lock B			lock C
| lock C			lock B
| unlock C			unlock B
| unlock B			unlock C
| unlock mutex A		unlock mutex A
|
| In my opinion, lock B and lock C are unnecessary if they are always
| along with lock mutex A. Or we should keep correct lock order across all
| the code.

If these are the only two places, yes, locks B and C would be
unnecessary. But we have those locks acquired (not nested) on the
allocation path (kmemleak_lock) and freeing path (object->lock). We
don't want to block those paths while scan_mutex is held.

That said, we may be able to use a single kmemleak_lock for everything.
The object freeing path may be affected slightly during scanning but the
code does release it every MAX_SCAN_SIZE bytes. It may even get slightly
faster as we'd hammer a single lock (I'll do some benchmarks).

But from a correctness perspective, I think the DEPT tool should be
improved a bit to detect when such out-of-order locking is serialised by
an enclosing lock/mutex.

-- 
Catalin
On Thu, May 19, 2022 at 11:11:10AM +0100, Catalin Marinas wrote:
> On Wed, May 11, 2022 at 07:04:51PM +0900, Hyeonggon Yoo wrote:
[...]
> The above kmemleak sequence shouldn't deadlock since those locks, even
> if taken in a different order, are serialised by scan_mutex. For various
> reasons, trying to reduce the latency, I ended up with some
> fine-grained, per-object locking.

I understand why you introduced the fine-grained locks. However, the
different order should be avoided anyway. As Steven said, Lockdep also
should have detected this case; that is, this would have been detected
if Lockdep worked correctly.

It's not a technical issue to make a tool skip the reversed order when
it's already protected by another lock. But because each lock has its
own purpose, as you explained, nobody knows whether cases might arise
someday that use kmemleak_lock and object_lock only, without holding
scan_mutex.

I'm wondering how other folks think this case should be handled tho.

> For object allocation (rbtree modification) and tree search, we use
> kmemleak_lock. During scanning (which can take minutes under
> scan_mutex), we want to prevent (a) long latencies and (b) freeing the
> object being scanned. We release the locks regularly for (a) and hold
> the object->lock for (b).
[...]
> But from a correctness perspective, I think the DEPT tool should be
> improved a bit to detect when such out of order locking is serialised by
> an enclosing lock/mutex.

Again, I don't think this is a technical issue.

	Byungchul

> 
> -- 
> Catalin
On Mon, 9 May 2022 09:16:37 +0900
Byungchul Park <byungchul.park@lge.com> wrote:

> CASE 2.
> 
> 	lock L with depth n
> 	lock A
> 	lock_nested L' with depth n + 1
> 	...
> 	unlock L'
> 	unlock A
> 	unlock L
> 
> This case is allowed by Lockdep.
> This case is *NOT* allowed by DEPT cuz it's a *DEADLOCK*.
> 
> ---
> 
> The following scenario would explain why CASE 2 is problematic.
> 
> 	THREAD X				THREAD Y
> 
> 	lock L with depth n
> 						lock L' with depth n
> 	lock A
> 						lock A
> 	lock_nested L' with depth n + 1

I'm confused by what exactly you are saying is a deadlock above.

Are you saying that lock A and L' are inversed? If so, lockdep had
better detect that regardless of L. A nested lock associates the
nesting with the same type of lock. That is, in lockdep, nested tells
lockdep not to trigger on L and L', but it will not ignore that A was
taken.

-- Steve

> 						lock_nested L'' with depth n + 1
> 	...					...
> 	unlock L'				unlock L''
> 	unlock A				unlock A
> 	unlock L				unlock L'
On Mon, May 09, 2022 at 04:47:12PM -0400, Steven Rostedt wrote:
> On Mon, 9 May 2022 09:16:37 +0900
> Byungchul Park <byungchul.park@lge.com> wrote:
[...]
> I'm confused by what exactly you are saying is a deadlock above.
> 
> Are you saying that lock A and L' are inversed? If so, lockdep had better

Hi Steven,

Yes, I was talking about A and L'.

> detect that regardless of L. A nested lock associates the nesting with

When I checked the Lockdep code, L' with depth n + 1 and L' with depth n
have different classes in Lockdep. That's why I said Lockdep cannot
detect it. By any chance, has it changed so as to consider this case?
Or am I missing something?

> the same type of lock. That is, in lockdep nested tells lockdep not to
> trigger on the L and L' but it will not ignore that A was taken.

It will not ignore A, but it would work like this:

	THREAD X			THREAD Y

	lock Ln				lock Ln
	lock A				lock A
	lock_nested Lm			lock_nested Lm

So, Lockdep considers this case safe, which it actually is not.

	Byungchul

> 
> -- Steve
On Tue, 10 May 2022 08:38:38 +0900
Byungchul Park <byungchul.park@lge.com> wrote:

> Yes, I was talking about A and L'.
>
> > detect that regardless of L. A nested lock associates the nesting with
>
> When I checked Lockdep code, L' with depth n + 1 and L' with depth n
> have different classes in Lockdep.

If that's the case, then that's a bug in lockdep.

> That's why I said Lockdep cannot detect it. By any chance, has it
> changed so as to consider this case? Or am I missing something?

No, it's not that lockdep cannot detect it, it should detect it. If it is
not detecting it, then we need to fix that.

-- Steve
On Tue, May 10, 2022 at 10:12:54AM -0400, Steven Rostedt wrote:
> On Tue, 10 May 2022 08:38:38 +0900
> Byungchul Park <byungchul.park@lge.com> wrote:
>
> > Yes, I was talking about A and L'.
> >
> > > detect that regardless of L. A nested lock associates the nesting with
> >
> > When I checked Lockdep code, L' with depth n + 1 and L' with depth n
> > have different classes in Lockdep.
>
> If that's the case, then that's a bug in lockdep.

Yes, agree. I should've said 'Lockdep doesn't detect it currently.'
rather than 'Lockdep can't detect it.'. I also think we can make it
work for this case by fixing the bug in Lockdep.

> > That's why I said Lockdep cannot detect it. By any chance, has it
> > changed so as to consider this case? Or am I missing something?
>
> No, it's not that lockdep cannot detect it, it should detect it. If it is
> not detecting it, then we need to fix that.

Yes.

Byungchul

> -- Steve
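[Editor's note: the effect discussed above can be illustrated with a toy
model. This sketch is *not* lockdep's actual implementation; it only
builds a "held lock -> acquired lock" edge set from per-thread
acquisition sequences and checks for mutual A->B / B->A edges. It shows
that the A <-> L' inversion in CASE 2 is visible when nested instances
share one class, and disappears when each nesting depth is keyed as a
separate class, as described in the thread.]

```python
from collections import defaultdict

def find_inversion(threads, classify):
    """Build 'held -> acquired' edges from per-thread lock sequences and
    report any pair with edges in both directions (an ABBA candidate).
    `classify` maps a raw lock tag to the class a checker would use."""
    edges = defaultdict(set)
    for seq in threads:
        held = []
        for op, lock in seq:
            if op == "lock":
                c = classify(lock)
                for h in held:            # every held class precedes c
                    edges[h].add(c)
                held.append(c)
            else:                         # "unlock"
                held.remove(classify(lock))
    return sorted((a, b) for a in edges for b in edges[a]
                  if a < b and a in edges.get(b, ()))

# CASE 2: thread X takes L, then A, then L' (nested, depth n + 1);
# thread Y takes L' (depth n), then A, then L'' (nested).  Raw tags
# carry (lock name, nesting depth); L, L', L'' are the same lock class.
x = [("lock", ("L", 0)), ("lock", ("A", 0)), ("lock", ("L", 1))]
y = [("lock", ("L", 0)), ("lock", ("A", 0)), ("lock", ("L", 1))]

# One class per lock name, regardless of depth: the inversion shows up.
print(find_inversion([x, y], classify=lambda t: t[0]))  # [('A', 'L')]

# One class per (name, depth), as the thread says lockdep behaves today:
# the A <-> nested-L inversion is no longer visible.
print(find_inversion([x, y], classify=lambda t: t))     # []
```

In other words, collapsing the nested instances into the class they
belong to is what exposes the ordering conflict with A.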
+cc mcgrof@kernel.org (firmware)
+cc holt@sgi.com (syscall reboot)

Hi Luis, Robin and folks,

I'm developing a tool for lockup detection, DEPT (Dependency Tracker).
I got a DEPT report from Hyeonggon - Thanks, Hyeonggon!

It doesn't mean the code *definitely* has a deadlock. However, it looks
problematic to me. So I'd like to ask a few things to see if it actually
is. Because Hyeonggon didn't run decode_stacktrace.sh before sending it
to me, I don't have a report with better debugging information. But I
can explain it in this mail. The problematic scenario looks like:

   PROCESS A              PROCESS B              WORKER C

   __do_sys_reboot()
                          __do_sys_reboot()
   mutex_lock(&system_transition_mutex)
   ...                    mutex_lock(&system_transition_mutex) <- stuck
   ...
                                                 request_firmware_work_func()
                                                 _request_firmware()
                                                 firmware_fallback_sysfs()
                                                 usermodehelper_read_lock_wait()
                                                 down_read(&umhelper_sem)
   ...
                                                 fw_load_sysfs_fallback()
                                                 fw_sysfs_wait_timeout()
                                                 wait_for_completion_killable_timeout(&fw_st->completion) <- stuck
   kernel_halt()
   __usermodehelper_disable()
   down_write(&umhelper_sem) <- stuck

   --------------------------------------------------------
   All the 3 contexts are stuck at this point.
   --------------------------------------------------------

   PROCESS A              PROCESS B              WORKER C

   ...
   up_write(&umhelper_sem)
   ...
   mutex_unlock(&system_transition_mutex) <- cannot wake up B

                          ...
                          kernel_halt()
                          notifier_call_chain()
                          fw_shutdown_notify()
                          kill_pending_fw_fallback_reqs()
                          __fw_load_abort()
                          complete_all(&fw_st->completion) <- cannot wake up C

                                                 ...
                                                 usermodehelper_read_unlock()
                                                 up_read(&umhelper_sem) <- cannot wake up A

So I think this scenario is problematic. Or am I missing something here?
Or do you think it's okay because the wait_for_completion_*() has a
timeout? AFAIK, timeouts are not supposed to fire in normal cases.

It'd be appreciated if you share your opinion on the report.

Byungchul

---

[   18.136012][    T1] ===================================================
[   18.136419][    T1] DEPT: Circular dependency has been detected.
[   18.136782][    T1] 5.18.0-rc3-57979-gc2b89afca919 #2374 Tainted: G    B
[   18.137249][    T1] ---------------------------------------------------
[   18.137649][    T1] summary
[   18.137823][    T1] ---------------------------------------------------
[   18.138222][    T1] *** DEADLOCK ***
[   18.138222][    T1]
[   18.138569][    T1] context A
[   18.138754][    T1]    [S] __mutex_lock_common(system_transition_mutex:0)
[   18.139170][    T1]    [W] down_write(umhelper_sem:0)
[   18.139482][    T1]    [E] mutex_unlock(system_transition_mutex:0)
[   18.139865][    T1]
[   18.140004][    T1] context B
[   18.140189][    T1]    [S] (unknown)(&fw_st->completion:0)
[   18.140527][    T1]    [W] __mutex_lock_common(system_transition_mutex:0)
[   18.140942][    T1]    [E] complete_all(&fw_st->completion:0)
[   18.141295][    T1]
[   18.141434][    T1] context C
[   18.141618][    T1]    [S] down_read(umhelper_sem:0)
[   18.141926][    T1]    [W] wait_for_completion_killable_timeout(&fw_st->completion:0)
[   18.142402][    T1]    [E] up_read(umhelper_sem:0)
[   18.142699][    T1]
[   18.142837][    T1] [S]: start of the event context
[   18.143134][    T1] [W]: the wait blocked
[   18.143379][    T1] [E]: the event not reachable
[   18.143661][    T1] ---------------------------------------------------
[   18.144063][    T1] context A's detail
[   18.144293][    T1] ---------------------------------------------------
[   18.144691][    T1] context A
[   18.144875][    T1]    [S] __mutex_lock_common(system_transition_mutex:0)
[   18.145290][    T1]    [W] down_write(umhelper_sem:0)
[   18.145602][    T1]    [E] mutex_unlock(system_transition_mutex:0)
[   18.145982][    T1]
[   18.146120][    T1] [S] __mutex_lock_common(system_transition_mutex:0):
[   18.146519][    T1] [<ffffffff810ee14c>] __do_sys_reboot+0x11f/0x24f
[   18.146907][    T1] stacktrace:
[   18.147101][    T1]    __mutex_lock+0x1f3/0x3f3
[   18.147396][    T1]    __do_sys_reboot+0x11f/0x24f
[   18.147706][    T1]    do_syscall_64+0xd4/0xfb
[   18.148001][    T1]    entry_SYSCALL_64_after_hwframe+0x44/0xae
[   18.148379][    T1]
[   18.148517][    T1] [W] down_write(umhelper_sem:0):
[   18.148815][    T1] [<ffffffff810d9c14>] __usermodehelper_disable+0x80/0x17f
[   18.149243][    T1] stacktrace:
[   18.149438][    T1]    __dept_wait+0x115/0x15b
[   18.149726][    T1]    dept_wait+0xcd/0xf3
[   18.149993][    T1]    down_write+0x4e/0x82
[   18.150266][    T1]    __usermodehelper_disable+0x80/0x17f
[   18.150615][    T1]    kernel_halt+0x33/0x5d
[   18.150893][    T1]    __do_sys_reboot+0x197/0x24f
[   18.151201][    T1]    do_syscall_64+0xd4/0xfb
[   18.151489][    T1]    entry_SYSCALL_64_after_hwframe+0x44/0xae
[   18.151866][    T1]
[   18.152004][    T1] [E] mutex_unlock(system_transition_mutex:0):
[   18.152368][    T1] (N/A)
[   18.152532][    T1] ---------------------------------------------------
[   18.152931][    T1] context B's detail
[   18.153161][    T1] ---------------------------------------------------
[   18.153559][    T1] context B
[   18.153743][    T1]    [S] (unknown)(&fw_st->completion:0)
[   18.154082][    T1]    [W] __mutex_lock_common(system_transition_mutex:0)
[   18.154496][    T1]    [E] complete_all(&fw_st->completion:0)
[   18.154848][    T1]
[   18.154987][    T1] [S] (unknown)(&fw_st->completion:0):
[   18.155310][    T1] (N/A)
[   18.155473][    T1]
[   18.155612][    T1] [W] __mutex_lock_common(system_transition_mutex:0):
[   18.156014][    T1] [<ffffffff810ee14c>] __do_sys_reboot+0x11f/0x24f
[   18.156400][    T1] stacktrace:
[   18.156594][    T1]    __mutex_lock+0x1ce/0x3f3
[   18.156887][    T1]    __do_sys_reboot+0x11f/0x24f
[   18.157196][    T1]    do_syscall_64+0xd4/0xfb
[   18.157482][    T1]    entry_SYSCALL_64_after_hwframe+0x44/0xae
[   18.157856][    T1]
[   18.157993][    T1] [E] complete_all(&fw_st->completion:0):
[   18.158330][    T1] [<ffffffff81c04230>] kill_pending_fw_fallback_reqs+0x66/0x95
[   18.158778][    T1] stacktrace:
[   18.158973][    T1]    complete_all+0x28/0x58
[   18.159256][    T1]    kill_pending_fw_fallback_reqs+0x66/0x95
[   18.159624][    T1]    fw_shutdown_notify+0x7/0xa
[   18.159929][    T1]    notifier_call_chain+0x4f/0x81
[   18.160246][    T1]    blocking_notifier_call_chain+0x4c/0x64
[   18.160611][    T1]    kernel_halt+0x13/0x5d
[   18.160889][    T1]    __do_sys_reboot+0x197/0x24f
[   18.161197][    T1]    do_syscall_64+0xd4/0xfb
[   18.161485][    T1]    entry_SYSCALL_64_after_hwframe+0x44/0xae
[   18.161861][    T1] ---------------------------------------------------
[   18.162260][    T1] context C's detail
[   18.162490][    T1] ---------------------------------------------------
[   18.162889][    T1] context C
[   18.163073][    T1]    [S] down_read(umhelper_sem:0)
[   18.163379][    T1]    [W] wait_for_completion_killable_timeout(&fw_st->completion:0)
[   18.163855][    T1]    [E] up_read(umhelper_sem:0)
[   18.164150][    T1]
[   18.164288][    T1] [S] down_read(umhelper_sem:0):
[   18.164578][    T1] [<ffffffff810d8f99>] usermodehelper_read_lock_wait+0xad/0x139
[   18.165027][    T1] stacktrace:
[   18.165220][    T1]    down_read+0x74/0x85
[   18.165487][    T1]    usermodehelper_read_lock_wait+0xad/0x139
[   18.165860][    T1]    firmware_fallback_sysfs+0x118/0x521
[   18.166207][    T1]    _request_firmware+0x7ef/0x91b
[   18.166525][    T1]    request_firmware_work_func+0xb1/0x13b
[   18.166884][    T1]    process_one_work+0x4c3/0x771
[   18.167199][    T1]    worker_thread+0x37f/0x49e
[   18.167497][    T1]    kthread+0x122/0x131
[   18.167768][    T1]    ret_from_fork+0x1f/0x30
[   18.168055][    T1]
[   18.168192][    T1] [W] wait_for_completion_killable_timeout(&fw_st->completion:0):
[   18.168650][    T1] [<ffffffff81c046b7>] firmware_fallback_sysfs+0x42a/0x521
[   18.169076][    T1] stacktrace:
[   18.169270][    T1]    wait_for_completion_killable_timeout+0x3c/0x58
[   18.169676][    T1]    firmware_fallback_sysfs+0x42a/0x521
[   18.170024][    T1]    _request_firmware+0x7ef/0x91b
[   18.170341][    T1]    request_firmware_work_func+0xb1/0x13b
[   18.170699][    T1]    process_one_work+0x4c3/0x771
[   18.171012][    T1]    worker_thread+0x37f/0x49e
[   18.171309][    T1]    kthread+0x122/0x131
[   18.171575][    T1]    ret_from_fork+0x1f/0x30
[   18.171863][    T1]
[   18.172001][    T1] [E] up_read(umhelper_sem:0):
[   18.172281][    T1] (N/A)
[   18.172445][    T1] ---------------------------------------------------
[   18.172844][    T1] information that might be helpful
[   18.173151][    T1] ---------------------------------------------------
[   18.173549][    T1] CPU: 0 PID: 1 Comm: init Tainted: G    B 5.18.0-rc3-57979-gc2b89afca919 #2374
[   18.174144][    T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[   18.174687][    T1] Call Trace:
[   18.174882][    T1]  <TASK>
[   18.175056][    T1]  print_circle+0x6a2/0x6f9
[   18.175326][    T1]  ? extend_queue+0xf4/0xf4
[   18.175594][    T1]  ? __list_add_valid+0xce/0xf6
[   18.175886][    T1]  ? __list_add+0x45/0x4e
[   18.176144][    T1]  ? print_circle+0x6f9/0x6f9
[   18.176422][    T1]  cb_check_dl+0xc0/0xc7
[   18.176675][    T1]  bfs+0x1c8/0x27b
[   18.176897][    T1]  ? disconnect_class+0x24c/0x24c
[   18.177195][    T1]  ? __list_add+0x45/0x4e
[   18.177454][    T1]  add_dep+0x140/0x217
[   18.177697][    T1]  ? add_ecxt+0x33b/0x33b
[   18.177955][    T1]  ? llist_del_first+0xc/0x46
[   18.178234][    T1]  add_wait+0x58a/0x723
[   18.178482][    T1]  ? __thaw_task+0x3e/0x3e
[   18.178745][    T1]  ? add_dep+0x217/0x217
[   18.178998][    T1]  ? llist_add_batch+0x33/0x4c
[   18.179281][    T1]  ? check_new_class+0x139/0x30d
[   18.179574][    T1]  __dept_wait+0x115/0x15b
[   18.179837][    T1]  ? __usermodehelper_disable+0x80/0x17f
[   18.180170][    T1]  ? add_wait+0x723/0x723
[   18.180426][    T1]  ? lock_release+0x338/0x338
[   18.180704][    T1]  ? __usermodehelper_disable+0x80/0x17f
[   18.181037][    T1]  dept_wait+0xcd/0xf3
[   18.181280][    T1]  down_write+0x4e/0x82
[   18.181527][    T1]  ? __usermodehelper_disable+0x80/0x17f
[   18.181861][    T1]  __usermodehelper_disable+0x80/0x17f
[   18.182184][    T1]  ? __usermodehelper_set_disable_depth+0x3a/0x3a
[   18.182565][    T1]  ? dept_ecxt_exit+0x1c9/0x1f7
[   18.182854][    T1]  ? blocking_notifier_call_chain+0x57/0x64
[   18.183205][    T1]  kernel_halt+0x33/0x5d
[   18.183458][    T1]  __do_sys_reboot+0x197/0x24f
[   18.183742][    T1]  ? kernel_power_off+0x6b/0x6b
[   18.184033][    T1]  ? dept_on+0x27/0x33
[   18.184275][    T1]  ? dept_exit+0x38/0x42
[   18.184526][    T1]  ? dept_on+0x27/0x33
[   18.184767][    T1]  ? dept_on+0x27/0x33
[   18.185008][    T1]  ? dept_exit+0x38/0x42
[   18.185260][    T1]  ? dept_enirq_transition+0x25a/0x295
[   18.185582][    T1]  ? syscall_enter_from_user_mode+0x47/0x71
[   18.185930][    T1]  ? dept_aware_softirqs_disable+0x1e/0x1e
[   18.186274][    T1]  ? syscall_enter_from_user_mode+0x47/0x71
[   18.186622][    T1]  do_syscall_64+0xd4/0xfb
[   18.186883][    T1]  ? asm_exc_page_fault+0x1e/0x30
[   18.187180][    T1]  ? dept_aware_softirqs_disable+0x1e/0x1e
[   18.187526][    T1]  ? dept_kernel_enter+0x15/0x1e
[   18.187821][    T1]  ? do_syscall_64+0x13/0xfb
[   18.188094][    T1]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   18.188443][    T1] RIP: 0033:0x4485f7
[   18.188672][    T1] Code: 00 75 05 48 83 c4 28 c3 e8 26 17 00 00 66 0f 1f 44 00 00 f3 0f 1e fa 89 fa be 69 19 12 28 bf ad de e1 fe b8 a9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 c7 c2 b8 ff ff ff f7 d8 64 89 02 b8
[   18.189822][    T1] RSP: 002b:00007ffc92206f28 EFLAGS: 00000202 ORIG_RAX: 00000000000000a9
[   18.190316][    T1] RAX: ffffffffffffffda RBX: 00007ffc92207118 RCX: 00000000004485f7
[   18.190784][    T1] RDX: 000000004321fedc RSI: 0000000028121969 RDI: 00000000fee1dead
[   18.191254][    T1] RBP: 00007ffc92206f30 R08: 00000000016376a0 R09: 0000000000000000
[   18.191722][    T1] R10: 00000000004c3820 R11: 0000000000000202 R12: 0000000000000001
[   18.192194][    T1] R13: 00007ffc92207108 R14: 00000000004bf8d0 R15: 0000000000000001
[   18.192667][    T1] </TASK>
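[Editor's note: the report's summary section reduces to three "context X
waits for an event that only context Y can deliver" edges. The sketch
below is an illustrative toy, not DEPT's algorithm: it encodes those
edges from the report and follows them until a context repeats, which
recovers the circular dependency the report claims.]

```python
# Each context's blocking wait, and which context is the (only assumed)
# provider of each awaited event, taken from the report's summary.
waits_on = {"A": "umhelper_sem",              # down_write() in A
            "B": "system_transition_mutex",   # mutex_lock() in B
            "C": "fw_st->completion"}         # wait_for_completion...() in C
provides = {"system_transition_mutex": "A",   # A's mutex_unlock()
            "fw_st->completion": "B",         # B's complete_all()
            "umhelper_sem": "C"}              # C's up_read()

def blocked_cycle(start):
    """Follow 'X waits for an event only Y can deliver' edges until a
    context repeats, and return the cycle found."""
    seen, ctx = [], start
    while ctx not in seen:
        seen.append(ctx)
        ctx = provides[waits_on[ctx]]
    return seen[seen.index(ctx):]

print(blocked_cycle("A"))  # ['A', 'C', 'B']: A waits on C, C on B, B on A
```

This is exactly the shape of the later discussion: the cycle only closes
if each event really has a single provider, which is what Tejun's reply
below questions for the completion with a timeout.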
Hello,

Just took a look out of curiosity.

On Thu, May 12, 2022 at 02:25:57PM +0900, Byungchul Park wrote:
>    PROCESS A              PROCESS B              WORKER C
>
>    __do_sys_reboot()
>                           __do_sys_reboot()
>    mutex_lock(&system_transition_mutex)
>    ...                    mutex_lock(&system_transition_mutex) <- stuck
>    ...
>                                                  request_firmware_work_func()
>                                                  _request_firmware()
>                                                  firmware_fallback_sysfs()
>                                                  usermodehelper_read_lock_wait()
>                                                  down_read(&umhelper_sem)
>    ...
>                                                  fw_load_sysfs_fallback()
>                                                  fw_sysfs_wait_timeout()
>                                                  wait_for_completion_killable_timeout(&fw_st->completion) <- stuck
>    kernel_halt()
>    __usermodehelper_disable()
>    down_write(&umhelper_sem) <- stuck
>
>    --------------------------------------------------------
>    All the 3 contexts are stuck at this point.
>    --------------------------------------------------------
>
>    PROCESS A              PROCESS B              WORKER C
>
>    ...
>    up_write(&umhelper_sem)
>    ...
>    mutex_unlock(&system_transition_mutex) <- cannot wake up B
>
>                           ...
>                           kernel_halt()
>                           notifier_call_chain()
>                           fw_shutdown_notify()
>                           kill_pending_fw_fallback_reqs()
>                           __fw_load_abort()
>                           complete_all(&fw_st->completion) <- cannot wake up C
>
>                                                  ...
>                                                  usermodehelper_read_unlock()
>                                                  up_read(&umhelper_sem) <- cannot wake up A

I'm not sure I'm reading it correctly but it looks like the "process B"
column is superfluous given that it's waiting on the same lock to do the
same thing that A is already doing (besides, you can't really halt the
machine twice).

What it's reporting seems to be an ABBA deadlock between A waiting on
umhelper_sem and C waiting on fw_st->completion. The report seems spurious:

1. wait_for_completion_killable_timeout() doesn't need someone to wake it up
   to make forward progress because it will unstick itself after the timeout
   expires.

2. complete_all() from __fw_load_abort() isn't the only source of wakeup.
   The fw loader can be, and mainly should be, woken up by firmware loading
   actually completing instead of being aborted.

I guess the reason why B shows up there is that the operation order is
such that, just between A and C, the complete_all() takes place before
__usermodehelper_disable(), so the whole thing kinda doesn't make sense
as you can't block a past operation by a future one. Inserting process B
introduces the reverse ordering.

Thanks.

--
tejun
Tejun wrote:
> Hello,

Hello,

> I'm not sure I'm reading it correctly but it looks like the "process B"
> column

I think you're interpreting the report correctly.

> is superfluous given that it's waiting on the same lock to do the same
> thing that A is already doing (besides, you can't really halt the
> machine twice).

Indeed! I've been in a daze. I thought kernel_halt() could be called
twice for two different purposes. Sorry for the noise.

> What it's reporting seems to be an ABBA deadlock between A waiting on
> umhelper_sem and C waiting on fw_st->completion. The report seems spurious:
>
> 1. wait_for_completion_killable_timeout() doesn't need someone to wake it up
>    to make forward progress because it will unstick itself after the timeout
>    expires.

I have a question about this one. Yes, it would never get stuck thanks
to the timeout. However, IIUC, timeouts are not supposed to expire in
normal cases. So I thought a timeout expiration means an abnormal case
that should be reported in terms of dependency, so as to prevent further
expirations. That's why I have been trying to track even timeout'ed
APIs.

Do you think DEPT shouldn't track timeout APIs? If I was wrong, I
shouldn't track the timeout APIs any more.

> 2. complete_all() from __fw_load_abort() isn't the only source of wakeup.
>    The fw loader can be, and mainly should be, woken up by firmware loading
>    actually completing instead of being aborted.

This is the point I'd like to ask about. In normal cases, fw_load_done()
might happen, of course, if the loading gets completed. However, I was
wondering if the kernel ensures either fw_load_done() or fw_load_abort()
to be called by *another* context during kernel_halt().

> Thanks.

Thank you very much!

Byungchul

> --
> tejun
Hello,

On Thu, May 12, 2022 at 08:18:24PM +0900, Byungchul Park wrote:
> > 1. wait_for_completion_killable_timeout() doesn't need someone to wake it up
> >    to make forward progress because it will unstick itself after the timeout
> >    expires.
>
> I have a question about this one. Yes, it would never get stuck thanks
> to the timeout. However, IIUC, timeouts are not supposed to expire in
> normal cases. So I thought a timeout expiration means an abnormal case
> that should be reported in terms of dependency, so as to prevent further
> expirations. That's why I have been trying to track even timeout'ed
> APIs.
>
> Do you think DEPT shouldn't track timeout APIs? If I was wrong, I
> shouldn't track the timeout APIs any more.

Without actually surveying the use cases, I can't say for sure, but my
experience has been that we often get pretty creative with timeouts and
that it's something people actively think about and monitor (and it's
usually not subtle). Given that, I'm skeptical about how much value it
would add for a dependency checker to warn about timeouts. It might be
a net negative rather than the other way around.

> > 2. complete_all() from __fw_load_abort() isn't the only source of wakeup.
> >    The fw loader can be, and mainly should be, woken up by firmware loading
> >    actually completing instead of being aborted.
>
> This is the point I'd like to ask about. In normal cases, fw_load_done()
> might happen, of course, if the loading gets completed. However, I was
> wondering if the kernel ensures either fw_load_done() or fw_load_abort()
> to be called by *another* context during kernel_halt().

We'll have to walk through the code to tell that. On a cursory look,
though, up until that point (just before shutting down the usermode
helper), I don't see anything which would actively block firmware
loading.

Thanks.

--
tejun
On Thu, May 12, 2022 at 08:18:24PM +0900, Byungchul Park wrote:
> I have a question about this one. Yes, it would never been stuck thanks
> to timeout. However, IIUC, timeouts are not supposed to expire in normal
> cases. So I thought a timeout expiration means not a normal case so need
> to inform it in terms of dependency so as to prevent further expiraton.
> That's why I have been trying to track even timeout'ed APIs.
As I believe I've already pointed out to you previously in ext4 and
ocfs2, the jbd2 timeout every five seconds happens **all** the time
while the file system is mounted. Committing more frequently than every
five seconds is the exception case, at least for desktop/laptop
workloads.
The only time we *don't* get to the timeout is when a userspace
process calls fsync(2), or if the journal was incorrectly sized by the
system administrator so that it's too small and the workload has so
many file system mutations that we have to close the transaction
prematurely, ahead of the 5 second timeout.
> Do you think DEPT shouldn't track timeout APIs? If I was wrong, I
> shouldn't track the timeout APIs any more.
DEPT tracking timeouts will cause false positives in at least some
cases. At the very least, there needs to be an easy way to suppress
these false positives on a per wait/mutex/spinlock basis.
- Ted
On Thu, May 12, 2022 at 09:56:46AM -0400, Theodore Ts'o wrote:
> On Thu, May 12, 2022 at 08:18:24PM +0900, Byungchul Park wrote:
> > I have a question about this one. Yes, it would never get stuck thanks
> > to the timeout. However, IIUC, timeouts are not supposed to expire in
> > normal cases. So I thought a timeout expiration means an abnormal case
> > that should be reported in terms of dependency, so as to prevent further
> > expirations. That's why I have been trying to track even timeout'ed
> > APIs.
>
> As I believe I've already pointed out to you previously in ext4 and
> ocfs2, the jbd2 timeout every five seconds happens **all** the time
> while the file system is mounted. Committing more frequently than every
> five seconds is the exception case, at least for desktop/laptop
> workloads.

Thanks, Ted. It's easy to stop tracking APIs with timeouts. I've just
been afraid that the cases we want to suppress anyway would be skipped.
However, I should stop it if it produces too many false alarms.

> The only time we *don't* get to the timeout is when a userspace
> process calls fsync(2), or if the journal was incorrectly sized by the
> system administrator so that it's too small and the workload has so
> many file system mutations that we have to close the transaction
> prematurely, ahead of the 5 second timeout.

Yeah... That's how journaling works. Thanks.

> > Do you think DEPT shouldn't track timeout APIs? If I was wrong, I
> > shouldn't track the timeout APIs any more.
>
> DEPT tracking timeouts will cause false positives in at least some
> cases. At the very least, there needs to be an easy way to suppress
> these false positives on a per wait/mutex/spinlock basis.

The easy way is to stop tracking waits that come with a timeout, until
DEPT can consider timed waits/events as a functionality of its own.

Thanks.

Byungchul

> - Ted
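[Editor's note: the jbd2 pattern Ted describes can be modeled with a
short, hedged sketch in userspace Python (threading.Event standing in
for a kernel wait; names like `committer` are illustrative, not kernel
code). The wait's timeout path is the *common* case, so the thread
makes progress even though nothing ever wakes it, which is why a
dependency checker that treats a timed wait as "must be woken by
another context" would report false positives here.]

```python
import threading

def committer(stop, commits, interval=0.01):
    """Model a jbd2-style commit thread: wake up when kicked *or* when
    the commit interval expires.  Here nobody ever kicks it, so every
    round is driven by the timeout, and that is normal operation."""
    kicked = threading.Event()
    while not stop.is_set():
        woken = kicked.wait(timeout=interval)  # usually returns False
        kicked.clear()
        commits.append("kicked" if woken else "timer")
        # A checker that treats this wait as a hard dependency on a
        # wakeup from another context would flag every 'timer' round
        # as a missing-wakeup candidate, i.e. a false positive.

stop = threading.Event()
commits = []
t = threading.Thread(target=committer, args=(stop, commits))
t.start()
threading.Event().wait(0.06)   # let a few commit intervals elapse
stop.set()
t.join()
print(len(commits) > 0 and all(c == "timer" for c in commits))
```

Every commit happens via the timeout path, yet the thread is perfectly
healthy, which matches Ted's point that timeout expiry is not, by
itself, evidence of a stuck dependency.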