include/linux/completion.h | 42 +- include/linux/dept.h | 523 +++++++ include/linux/dept_page.h | 78 ++ include/linux/dept_sdt.h | 62 + include/linux/hardirq.h | 3 + include/linux/irqflags.h | 33 +- include/linux/llist.h | 8 - include/linux/lockdep.h | 156 ++- include/linux/lockdep_types.h | 3 + include/linux/mutex.h | 31 + include/linux/page-flags.h | 45 +- include/linux/pagemap.h | 7 +- include/linux/percpu-rwsem.h | 10 +- include/linux/rtmutex.h | 7 + include/linux/rwlock.h | 48 + include/linux/rwlock_api_smp.h | 8 +- include/linux/rwlock_types.h | 7 + include/linux/rwsem.h | 31 + include/linux/sched.h | 7 + include/linux/seqlock.h | 59 +- include/linux/spinlock.h | 24 + include/linux/spinlock_types_raw.h | 13 + include/linux/swait.h | 4 + include/linux/types.h | 8 + include/linux/wait.h | 6 +- init/init_task.c | 2 + init/main.c | 4 + kernel/Makefile | 1 + kernel/cpu.c | 2 +- kernel/dependency/Makefile | 5 + kernel/dependency/dept.c | 2702 ++++++++++++++++++++++++++++++++++++ kernel/dependency/dept_hash.h | 10 + kernel/dependency/dept_internal.h | 26 + kernel/dependency/dept_object.h | 13 + kernel/dependency/dept_proc.c | 93 ++ kernel/entry/common.c | 3 + kernel/exit.c | 1 + kernel/fork.c | 2 + kernel/locking/lockdep.c | 12 +- kernel/module.c | 2 + kernel/sched/completion.c | 12 +- kernel/sched/core.c | 3 + kernel/sched/swait.c | 10 + kernel/sched/wait.c | 16 + kernel/softirq.c | 6 +- kernel/trace/trace_preemptirq.c | 19 +- lib/Kconfig.debug | 21 + mm/filemap.c | 68 + mm/page_ext.c | 5 + 49 files changed, 4204 insertions(+), 57 deletions(-) create mode 100644 include/linux/dept.h create mode 100644 include/linux/dept_page.h create mode 100644 include/linux/dept_sdt.h create mode 100644 kernel/dependency/Makefile create mode 100644 kernel/dependency/dept.c create mode 100644 kernel/dependency/dept_hash.h create mode 100644 kernel/dependency/dept_internal.h create mode 100644 kernel/dependency/dept_object.h create mode 100644 kernel/dependency/dept_proc.c
Hi Linus and folks,
I've been developing a tool for detecting deadlock possibilities by
tracking wait/event rather than lock(?) acquisition order to try to
cover all synchonization machanisms. It's done on v5.17-rc1 tag.
https://github.com/lgebyungchulpark/linux-dept/commits/dept1.12_on_v5.17-rc1
Benifit:
0. Works with all lock primitives.
1. Works with wait_for_completion()/complete().
2. Works with 'wait' on PG_locked.
3. Works with 'wait' on PG_writeback.
4. Works with swait/wakeup.
5. Works with waitqueue.
6. Multiple reports are allowed.
7. Deduplication control on multiple reports.
8. Withstand false positives thanks to 6.
9. Easy to tag any wait/event.
Future work:
0. To make it more stable.
1. To separates Dept from Lockdep.
2. To improves performance in terms of time and space.
3. To use Dept as a dependency engine for Lockdep.
4. To add any missing tags of wait/event in the kernel.
5. To deduplicate stack trace.
I've got several reports from the tool. Some of them look like false
alarms and some others look like real deadlock possibility. Because of
my unfamiliarity of the domain, it's hard to confirm if it's a real one.
Let me add the reports on this email thread.
How to interpret the report is:
1. E(event) in each context cannot be triggered because of the
W(wait) that cannot be woken.
2. The stack trace helping find the problematic code is located
in each conext's detail.
Changes from RFC:
1. Prevent adding a wait tag at prepare_to_wait() but __schedule().
2. Use try version at lockdep_acquire_cpus_lock() annotation.
3. Distinguish each syscall context from another.
Thanks,
Byungchul
Byungchul Park (16):
llist: Move llist_{head,node} definition to types.h
dept: Implement Dept(Dependency Tracker)
dept: Embed Dept data in Lockdep
dept: Apply Dept to spinlock
dept: Apply Dept to mutex families
dept: Apply Dept to rwlock
dept: Apply Dept to wait_for_completion()/complete()
dept: Apply Dept to seqlock
dept: Apply Dept to rwsem
dept: Add proc knobs to show stats and dependency graph
dept: Introduce split map concept and new APIs for them
dept: Apply Dept to wait/event of PG_{locked,writeback}
dept: Apply SDT to swait
dept: Apply SDT to wait(waitqueue)
locking/lockdep, cpu/hotplus: Use a weaker annotation in AP thread
dept: Distinguish each syscall context from another
include/linux/completion.h | 42 +-
include/linux/dept.h | 523 +++++++
include/linux/dept_page.h | 78 ++
include/linux/dept_sdt.h | 62 +
include/linux/hardirq.h | 3 +
include/linux/irqflags.h | 33 +-
include/linux/llist.h | 8 -
include/linux/lockdep.h | 156 ++-
include/linux/lockdep_types.h | 3 +
include/linux/mutex.h | 31 +
include/linux/page-flags.h | 45 +-
include/linux/pagemap.h | 7 +-
include/linux/percpu-rwsem.h | 10 +-
include/linux/rtmutex.h | 7 +
include/linux/rwlock.h | 48 +
include/linux/rwlock_api_smp.h | 8 +-
include/linux/rwlock_types.h | 7 +
include/linux/rwsem.h | 31 +
include/linux/sched.h | 7 +
include/linux/seqlock.h | 59 +-
include/linux/spinlock.h | 24 +
include/linux/spinlock_types_raw.h | 13 +
include/linux/swait.h | 4 +
include/linux/types.h | 8 +
include/linux/wait.h | 6 +-
init/init_task.c | 2 +
init/main.c | 4 +
kernel/Makefile | 1 +
kernel/cpu.c | 2 +-
kernel/dependency/Makefile | 5 +
kernel/dependency/dept.c | 2702 ++++++++++++++++++++++++++++++++++++
kernel/dependency/dept_hash.h | 10 +
kernel/dependency/dept_internal.h | 26 +
kernel/dependency/dept_object.h | 13 +
kernel/dependency/dept_proc.c | 93 ++
kernel/entry/common.c | 3 +
kernel/exit.c | 1 +
kernel/fork.c | 2 +
kernel/locking/lockdep.c | 12 +-
kernel/module.c | 2 +
kernel/sched/completion.c | 12 +-
kernel/sched/core.c | 3 +
kernel/sched/swait.c | 10 +
kernel/sched/wait.c | 16 +
kernel/softirq.c | 6 +-
kernel/trace/trace_preemptirq.c | 19 +-
lib/Kconfig.debug | 21 +
mm/filemap.c | 68 +
mm/page_ext.c | 5 +
49 files changed, 4204 insertions(+), 57 deletions(-)
create mode 100644 include/linux/dept.h
create mode 100644 include/linux/dept_page.h
create mode 100644 include/linux/dept_sdt.h
create mode 100644 kernel/dependency/Makefile
create mode 100644 kernel/dependency/dept.c
create mode 100644 kernel/dependency/dept_hash.h
create mode 100644 kernel/dependency/dept_internal.h
create mode 100644 kernel/dependency/dept_object.h
create mode 100644 kernel/dependency/dept_proc.c
--
1.9.1
On Thu, Feb 17, 2022 at 07:57:36PM +0900, Byungchul Park wrote: > > I've got several reports from the tool. Some of them look like false > alarms and some others look like real deadlock possibility. Because of > my unfamiliarity of the domain, it's hard to confirm if it's a real one. > Let me add the reports on this email thread. The problem is we have so many potentially invalid, or so-rare-as-to-be-not-worth-the-time-to-investigate-in-the- grand-scheme-of-all-of-the-fires-burning-on-maintainers laps that it's really not reasonable to ask maintainers to determine whether something is a false alarm or not. If I want more of these unreliable potential bug reports to investigate, there is a huge backlog in Syzkaller. :-) Looking at the second ext4 report, it doesn't make any sense. Context A is the kjournald thread. We don't do a commit until (a) the timeout expires, or (b) someone explicitly requests that a commit happen waking up j_wait_commit. I'm guessing that complaint here is that DEPT thinks nothing is explicitly requesting a wake up. But note that after 5 seconds (or whatever journal->j_commit_interval) is configured to be we *will* always start a commit. So ergo, there can't be a deadlock. At a higher level of discussion, it's an unfair tax on maintainer's times to ask maintainers to help you debug DEPT for you. Tools like Syzkaller and DEPT are useful insofar as they save us time in making our subsystems better. But until you can prove that it's not going to be a massive denial of service attack on maintainer's time, at the *very* least keep an RFC on the patch, or add massive warnings that more often than not DEPT is going to be sending maintainers on a wild goose chase. If you know there there "appear to be false positives", you need to make sure you've tracked them all down before trying to ask that this be merged. You may also want to add some documentation about why we should trust this; in particular for wait channels, when a process calls schedule() there may be multiple reasons why the thread will wake up --- in the worst case, such as in the select(2) or epoll(2) system call, there may be literally thousands of reasons (one for every file desriptor the select is waiting on) --- why the process will wake up and thus resolve the potential "deadlock" that DEPT is worrying about. How is DEPT going to handle those cases? If the answer is that things need to be tagged, then at least disclose potential reasons why DEPT might be untrustworthy to save your reviewers time. I know that you're trying to help us, but this tool needs to be far better than Lockdep before we should think about merging it. Even if it finds 5% more potential deadlocks, if it creates 95% more false positive reports --- and the ones it finds are crazy things that rarely actually happen in practice, are the costs worth the benefits? And who is bearing the costs, and who is receiving the benefits? Regards, - Ted
On Thu, 17 Feb 2022 10:51:09 -0500 "Theodore Ts'o" <tytso@mit.edu> wrote: > I know that you're trying to help us, but this tool needs to be far > better than Lockdep before we should think about merging it. Even if > it finds 5% more potential deadlocks, if it creates 95% more false > positive reports --- and the ones it finds are crazy things that > rarely actually happen in practice, are the costs worth the benefits? > And who is bearing the costs, and who is receiving the benefits? I personally believe that there's potential that this can be helpful and we will want to merge it. But, what I believe Ted is trying to say is, if you do not know if the report is a bug or not, please do not ask the maintainers to determine it for you. This is a good opportunity for you to look to see why your tool reported an issue, and learn that subsystem. Look at if this is really a bug or not, and investigate why. The likely/unlikely tracing I do finds issues all over the kernel. But before I report anything, I look at the subsystem and determine *why* it's reporting what it does. In some cases, it's just a config issue. Where, I may submit a patch saying "this is 100% wrong in X config, and we should just remove the "unlikely". But I did the due diligence to find out exactly what the issue is, and why the tooling reported what it reported. I want to stress that your Dept tooling looks to have the potential of being something that will be worth while including. But the false positives needs to be down to the rate of lockdep false positives. As Ted said, if it's reporting 95% false positives, nobody is going to look at the 5% of real bugs that it finds. -- Steve
On Thu, Feb 17, 2022 at 12:00:05PM -0500, Steven Rostedt wrote: > On Thu, 17 Feb 2022 10:51:09 -0500 > "Theodore Ts'o" <tytso@mit.edu> wrote: > > > I know that you're trying to help us, but this tool needs to be far > > better than Lockdep before we should think about merging it. Even if > > it finds 5% more potential deadlocks, if it creates 95% more false > > positive reports --- and the ones it finds are crazy things that > > rarely actually happen in practice, are the costs worth the benefits? > > And who is bearing the costs, and who is receiving the benefits? > > I personally believe that there's potential that this can be helpful and we > will want to merge it. > > But, what I believe Ted is trying to say is, if you do not know if the > report is a bug or not, please do not ask the maintainers to determine it > for you. This is a good opportunity for you to look to see why your tool > reported an issue, and learn that subsystem. Look at if this is really a > bug or not, and investigate why. Appreciate your feedback. I'll be more careful in reporting things, and I think I need to make it more conservative... > The likely/unlikely tracing I do finds issues all over the kernel. But > before I report anything, I look at the subsystem and determine *why* it's > reporting what it does. In some cases, it's just a config issue. Where, I > may submit a patch saying "this is 100% wrong in X config, and we should > just remove the "unlikely". But I did the due diligence to find out exactly > what the issue is, and why the tooling reported what it reported. I'll try my best to do things that way. However, thing is that there's few reports with my system... That's why I shared Dept in LKML space. > I want to stress that your Dept tooling looks to have the potential of > being something that will be worth while including. But the false positives > needs to be down to the rate of lockdep false positives. As Ted said, if > it's reporting 95% false positives, nobody is going to look at the 5% of > real bugs that it finds. Agree. Dept should not be merged if so. I'm not pushing ahead, but I'm convinced that Dept works what a dependency tracker should do. Let's see how valuable it is esp. in the middle of developing something in the kernel. Thanks, Byungchul
On Thu, Feb 17, 2022 at 10:51:09AM -0500, Theodore Ts'o wrote: > On Thu, Feb 17, 2022 at 07:57:36PM +0900, Byungchul Park wrote: > > > > I've got several reports from the tool. Some of them look like false > > alarms and some others look like real deadlock possibility. Because of > > my unfamiliarity of the domain, it's hard to confirm if it's a real one. > > Let me add the reports on this email thread. > > The problem is we have so many potentially invalid, or > so-rare-as-to-be-not-worth-the-time-to-investigate-in-the- > grand-scheme-of-all-of-the-fires-burning-on-maintainers laps that it's > really not reasonable to ask maintainers to determine whether Even though I might have been wrong and might be gonna be wrong, you look so arrogant. You were hasty to judge and trying to walk over me. I reported it because I thought it was a real problem but couldn't confirm it. For the other reports that I thought was not real, I didn't even mention it. If you are talking about the previous report, then I felt so sorry as I told you. I skimmed through the part of the waits... Basically, I respect you and appreciate your feedback. Hope you not get me wrong. > Looking at the second ext4 report, it doesn't make any sense. Context > A is the kjournald thread. We don't do a commit until (a) the timeout > expires, or (b) someone explicitly requests that a commit happen > waking up j_wait_commit. I'm guessing that complaint here is that > DEPT thinks nothing is explicitly requesting a wake up. But note that > after 5 seconds (or whatever journal->j_commit_interval) is configured > to be we *will* always start a commit. So ergo, there can't be a deadlock. Yeah, it might not be a *deadlock deadlock* because the wait will be anyway woken up by one of the wake up points you mentioned. However, the dependency looks problematic because the three contexts participating in the dependency chain would be stuck for a while until one eventually wakes it up. I bet it would not be what you meant. Again. It's not critical but problematic. Or am I missing something? > At a higher level of discussion, it's an unfair tax on maintainer's > times to ask maintainers to help you debug DEPT for you. Tools like > Syzkaller and DEPT are useful insofar as they save us time in making > our subsystems better. But until you can prove that it's not going to > be a massive denial of service attack on maintainer's time, at the Partially I agree. I would understand you even if you don't support Dept until you think it's valuable enough. However, let me keep asking things to fs folks, not you, even though I would cc you on it. > If you know there there "appear to be false positives", you need to > make sure you've tracked them all down before trying to ask that this > be merged. To track them all down, I need to ask LKML because Dept works perfectly with my system. I don't want it to be merged with a lot of false positive still in there, either. > You may also want to add some documentation about why we should trust > this; in particular for wait channels, when a process calls schedule() > there may be multiple reasons why the thread will wake up --- in the > worst case, such as in the select(2) or epoll(2) system call, there > may be literally thousands of reasons (one for every file desriptor > the select is waiting on) --- why the process will wake up and thus > resolve the potential "deadlock" that DEPT is worrying about. How is > DEPT going to handle those cases? If the answer is that things need Thank you for the information but I don't get it which case you are concerning. I'd like to ask you a specific senario of that so that we can discuss it more - maybe I guess I could answer to it tho, but I won't ask you. Just give me an instance only if you think it's worthy. You look like a guy who unconditionally blames on new things before understanding it rather than asking and discussing. Again. I also think anyone doesn't have to spend his or her time for what he or she think is not worthy enough. > I know that you're trying to help us, but this tool needs to be far > better than Lockdep before we should think about merging it. Even if > it finds 5% more potential deadlocks, if it creates 95% more false It should not get merged for sure if so, but it sounds too sarcastic. Let's see if it creates 95% false positives for real. If it's true and I can't control it, I will give up. That's what I should do. There are a lot of factors to judge how valuable Dept is. Dept would be useful especially in the middle of development, rather than in the final state in the tree. It'd be appreciated if you think that sides more, too. Thanks, Byungchul
[ 7.009608] =================================================== [ 7.009613] DEPT: Circular dependency has been detected. [ 7.009614] 5.17.0-rc1-00014-g8a599299c0cb-dirty #30 Tainted: G W [ 7.009616] --------------------------------------------------- [ 7.009617] summary [ 7.009618] --------------------------------------------------- [ 7.009618] *** DEADLOCK *** [ 7.009618] [ 7.009619] context A [ 7.009619] [S] (unknown)(&(bit_wait_table + i)->dmap:0) [ 7.009621] [W] down_write(&ei->i_data_sem:0) [ 7.009623] [E] event(&(bit_wait_table + i)->dmap:0) [ 7.009624] [ 7.009625] context B [ 7.009625] [S] down_read(&ei->i_data_sem:0) [ 7.009626] [W] wait(&(bit_wait_table + i)->dmap:0) [ 7.009627] [E] up_read(&ei->i_data_sem:0) [ 7.009628] [ 7.009629] [S]: start of the event context [ 7.009629] [W]: the wait blocked [ 7.009630] [E]: the event not reachable [ 7.009631] --------------------------------------------------- [ 7.009631] context A's detail [ 7.009632] --------------------------------------------------- [ 7.009632] context A [ 7.009633] [S] (unknown)(&(bit_wait_table + i)->dmap:0) [ 7.009634] [W] down_write(&ei->i_data_sem:0) [ 7.009635] [E] event(&(bit_wait_table + i)->dmap:0) [ 7.009636] [ 7.009636] [S] (unknown)(&(bit_wait_table + i)->dmap:0): [ 7.009638] (N/A) [ 7.009638] [ 7.009639] [W] down_write(&ei->i_data_sem:0): [ 7.009639] ext4_truncate (fs/ext4/inode.c:4187) [ 7.009645] stacktrace: [ 7.009646] down_write (kernel/locking/rwsem.c:1514) [ 7.009648] ext4_truncate (fs/ext4/inode.c:4187) [ 7.009650] ext4_da_write_begin (./include/linux/fs.h:827 fs/ext4/truncate.h:23 fs/ext4/inode.c:2963) [ 7.009652] generic_perform_write (mm/filemap.c:3784) [ 7.009654] ext4_buffered_write_iter (fs/ext4/file.c:269) [ 7.009657] ext4_file_write_iter (fs/ext4/file.c:677) [ 7.009659] new_sync_write (fs/read_write.c:504 (discriminator 1)) [ 7.009662] vfs_write (fs/read_write.c:590) [ 7.009663] ksys_write (fs/read_write.c:644) [ 7.009664] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) [ 7.009667] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) [ 7.009669] [ 7.009670] [E] event(&(bit_wait_table + i)->dmap:0): [ 7.009671] __wake_up_common (kernel/sched/wait.c:108) [ 7.009673] stacktrace: [ 7.009674] dept_event (kernel/dependency/dept.c:2337) [ 7.009677] __wake_up_common (kernel/sched/wait.c:109) [ 7.009678] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1)) [ 7.009679] __wake_up_bit (kernel/sched/wait_bit.c:127) [ 7.009681] ext4_orphan_del (fs/ext4/orphan.c:282) [ 7.009683] ext4_truncate (fs/ext4/inode.c:4212) [ 7.009685] ext4_da_write_begin (./include/linux/fs.h:827 fs/ext4/truncate.h:23 fs/ext4/inode.c:2963) [ 7.009687] generic_perform_write (mm/filemap.c:3784) [ 7.009688] ext4_buffered_write_iter (fs/ext4/file.c:269) [ 7.009690] ext4_file_write_iter (fs/ext4/file.c:677) [ 7.009692] new_sync_write (fs/read_write.c:504 (discriminator 1)) [ 7.009694] vfs_write (fs/read_write.c:590) [ 7.009695] ksys_write (fs/read_write.c:644) [ 7.009696] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) [ 7.009698] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) [ 7.009700] --------------------------------------------------- [ 7.009700] context B's detail [ 7.009701] --------------------------------------------------- [ 7.009702] context B [ 7.009702] [S] down_read(&ei->i_data_sem:0) [ 7.009703] [W] wait(&(bit_wait_table + i)->dmap:0) [ 7.009704] [E] up_read(&ei->i_data_sem:0) [ 7.009705] [ 7.009706] [S] down_read(&ei->i_data_sem:0): [ 7.009707] ext4_map_blocks (./arch/x86/include/asm/bitops.h:207 ./include/asm-generic/bitops/instrumented-non-atomic.h:135 fs/ext4/ext4.h:1918 fs/ext4/inode.c:562) [ 7.009709] stacktrace: [ 7.009709] down_read (kernel/locking/rwsem.c:1461) [ 7.009711] ext4_map_blocks (./arch/x86/include/asm/bitops.h:207 ./include/asm-generic/bitops/instrumented-non-atomic.h:135 fs/ext4/ext4.h:1918 fs/ext4/inode.c:562) [ 7.009712] ext4_getblk (fs/ext4/inode.c:851) [ 7.009714] ext4_bread (fs/ext4/inode.c:903) [ 7.009715] __ext4_read_dirblock (fs/ext4/namei.c:117) [ 7.009718] dx_probe (fs/ext4/namei.c:789) [ 7.009720] ext4_dx_find_entry (fs/ext4/namei.c:1721) [ 7.009722] __ext4_find_entry (fs/ext4/namei.c:1571) [ 7.009723] ext4_lookup (fs/ext4/namei.c:1770) [ 7.009725] lookup_open (./include/linux/dcache.h:361 fs/namei.c:3310) [ 7.009727] path_openat (fs/namei.c:3401 fs/namei.c:3605) [ 7.009729] do_filp_open (fs/namei.c:3637) [ 7.009731] do_sys_openat2 (fs/open.c:1215) [ 7.009732] do_sys_open (fs/open.c:1231) [ 7.009734] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) [ 7.009736] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) [ 7.009738] [ 7.009738] [W] wait(&(bit_wait_table + i)->dmap:0): [ 7.009739] prepare_to_wait (kernel/sched/wait.c:275) [ 7.009741] stacktrace: [ 7.009741] __schedule (kernel/sched/sched.h:1318 kernel/sched/sched.h:1616 kernel/sched/core.c:6213) [ 7.009743] schedule (kernel/sched/core.c:6373 (discriminator 1)) [ 7.009744] io_schedule (./arch/x86/include/asm/current.h:15 kernel/sched/core.c:8392 kernel/sched/core.c:8418) [ 7.009745] bit_wait_io (./arch/x86/include/asm/current.h:15 kernel/sched/wait_bit.c:210) [ 7.009746] __wait_on_bit (kernel/sched/wait_bit.c:49) [ 7.009748] out_of_line_wait_on_bit (kernel/sched/wait_bit.c:65) [ 7.009749] ext4_read_bh (./arch/x86/include/asm/bitops.h:207 ./include/asm-generic/bitops/instrumented-non-atomic.h:135 ./include/linux/buffer_head.h:120 fs/ext4/super.c:201) [ 7.009752] __read_extent_tree_block (fs/ext4/extents.c:545) [ 7.009754] ext4_find_extent (fs/ext4/extents.c:928) [ 7.009756] ext4_ext_map_blocks (fs/ext4/extents.c:4099) [ 7.009757] ext4_map_blocks (fs/ext4/inode.c:563) [ 7.009759] ext4_getblk (fs/ext4/inode.c:851) [ 7.009760] ext4_bread (fs/ext4/inode.c:903) [ 7.009762] __ext4_read_dirblock (fs/ext4/namei.c:117) [ 7.009764] dx_probe (fs/ext4/namei.c:789) [ 7.009765] ext4_dx_find_entry (fs/ext4/namei.c:1721) [ 7.009767] [ 7.009768] [E] up_read(&ei->i_data_sem:0): [ 7.009769] ext4_map_blocks (fs/ext4/inode.c:593) [ 7.009771] stacktrace: [ 7.009771] up_read (kernel/locking/rwsem.c:1556) [ 7.009774] ext4_map_blocks (fs/ext4/inode.c:593) [ 7.009775] ext4_getblk (fs/ext4/inode.c:851) [ 7.009777] ext4_bread (fs/ext4/inode.c:903) [ 7.009778] __ext4_read_dirblock (fs/ext4/namei.c:117) [ 7.009780] dx_probe (fs/ext4/namei.c:789) [ 7.009782] ext4_dx_find_entry (fs/ext4/namei.c:1721) [ 7.009784] __ext4_find_entry (fs/ext4/namei.c:1571) [ 7.009786] ext4_lookup (fs/ext4/namei.c:1770) [ 7.009788] lookup_open (./include/linux/dcache.h:361 fs/namei.c:3310) [ 7.009789] path_openat (fs/namei.c:3401 fs/namei.c:3605) [ 7.009791] do_filp_open (fs/namei.c:3637) [ 7.009792] do_sys_openat2 (fs/open.c:1215) [ 7.009794] do_sys_open (fs/open.c:1231) [ 7.009795] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) [ 7.009797] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) [ 7.009799] --------------------------------------------------- [ 7.009800] information that might be helpful [ 7.009800] --------------------------------------------------- [ 7.009801] CPU: 0 PID: 611 Comm: rs:main Q:Reg Tainted: G W 5.17.0-rc1-00014-g8a599299c0cb-dirty #30 [ 7.009804] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [ 7.009805] Call Trace: [ 7.009806] <TASK> [ 7.009807] dump_stack_lvl (lib/dump_stack.c:107) [ 7.009809] print_circle (./arch/x86/include/asm/atomic.h:108 ./include/linux/atomic/atomic-instrumented.h:258 kernel/dependency/dept.c:157 kernel/dependency/dept.c:762) [ 7.009812] ? print_circle (kernel/dependency/dept.c:1086) [ 7.009814] cb_check_dl (kernel/dependency/dept.c:1104) [ 7.009815] bfs (kernel/dependency/dept.c:860) [ 7.009818] add_dep (kernel/dependency/dept.c:1423) [ 7.009820] do_event.isra.25 (kernel/dependency/dept.c:1650) [ 7.009822] ? __wake_up_common (kernel/sched/wait.c:108) [ 7.009824] dept_event (kernel/dependency/dept.c:2337) [ 7.009826] __wake_up_common (kernel/sched/wait.c:109) [ 7.009828] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1)) [ 7.009830] __wake_up_bit (kernel/sched/wait_bit.c:127) [ 7.009832] ext4_orphan_del (fs/ext4/orphan.c:282) [ 7.009835] ? dept_ecxt_exit (./arch/x86/include/asm/current.h:15 kernel/dependency/dept.c:241 kernel/dependency/dept.c:999 kernel/dependency/dept.c:1043 kernel/dependency/dept.c:2478) [ 7.009837] ext4_truncate (fs/ext4/inode.c:4212) [ 7.009839] ext4_da_write_begin (./include/linux/fs.h:827 fs/ext4/truncate.h:23 fs/ext4/inode.c:2963) [ 7.009842] generic_perform_write (mm/filemap.c:3784) [ 7.009845] ext4_buffered_write_iter (fs/ext4/file.c:269) [ 7.009848] ext4_file_write_iter (fs/ext4/file.c:677) [ 7.009851] new_sync_write (fs/read_write.c:504 (discriminator 1)) [ 7.009854] vfs_write (fs/read_write.c:590) [ 7.009856] ksys_write (fs/read_write.c:644) [ 7.009857] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:65) [ 7.009860] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) [ 7.009862] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) [ 7.009865] RIP: 0033:0x7f3b160b335d [ 7.009867] Code: e1 20 00 00 75 10 b8 01 00 00 00 0f 05 48 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ce fa ff ff 48 89 04 24 b8 01 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 17 fb ff ff 48 89 d0 48 83 c4 08 48 3d 01 All code ======== 0: e1 20 loope 0x22 2: 00 00 add %al,(%rax) 4: 75 10 jne 0x16 6: b8 01 00 00 00 mov $0x1,%eax b: 0f 05 syscall d: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax 13: 73 31 jae 0x46 15: c3 retq 16: 48 83 ec 08 sub $0x8,%rsp 1a: e8 ce fa ff ff callq 0xfffffffffffffaed 1f: 48 89 04 24 mov %rax,(%rsp) 23: b8 01 00 00 00 mov $0x1,%eax 28: 0f 05 syscall 2a:* 48 8b 3c 24 mov (%rsp),%rdi <-- trapping instruction 2e: 48 89 c2 mov %rax,%rdx 31: e8 17 fb ff ff callq 0xfffffffffffffb4d 36: 48 89 d0 mov %rdx,%rax 39: 48 83 c4 08 add $0x8,%rsp 3d: 48 rex.W 3e: 3d .byte 0x3d 3f: 01 .byte 0x1 Code starting with the faulting instruction =========================================== 0: 48 8b 3c 24 mov (%rsp),%rdi 4: 48 89 c2 mov %rax,%rdx 7: e8 17 fb ff ff callq 0xfffffffffffffb23 c: 48 89 d0 mov %rdx,%rax f: 48 83 c4 08 add $0x8,%rsp 13: 48 rex.W 14: 3d .byte 0x3d 15: 01 .byte 0x1 [ 7.009869] RSP: 002b:00007f3b1340f180 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [ 7.009871] RAX: ffffffffffffffda RBX: 00007f3b040010a0 RCX: 00007f3b160b335d [ 7.009873] RDX: 0000000000000300 RSI: 00007f3b040010a0 RDI: 0000000000000001 [ 7.009874] RBP: 0000000000000000 R08: fffffffffffffa15 R09: fffffffffffffa05 [ 7.009875] R10: 0000000000000000 R11: 0000000000000293 R12: 00007f3b04000df0 [ 7.009876] R13: 00007f3b1340f1a0 R14: 0000000000000220 R15: 0000000000000300 [ 7.009879] </TASK>
[ 9.008161] =================================================== [ 9.008163] DEPT: Circular dependency has been detected. [ 9.008164] 5.17.0-rc1-00015-gb94f67143867-dirty #2 Tainted: G W [ 9.008166] --------------------------------------------------- [ 9.008167] summary [ 9.008167] --------------------------------------------------- [ 9.008168] *** DEADLOCK *** [ 9.008168] [ 9.008168] context A [ 9.008169] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0) [ 9.008171] [W] wait(&(&journal->j_wait_commit)->dmap:0) [ 9.008172] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0) [ 9.008173] [ 9.008173] context B [ 9.008174] [S] down_write(mapping.invalidate_lock:0) [ 9.008175] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0) [ 9.008176] [E] up_write(mapping.invalidate_lock:0) [ 9.008177] [ 9.008178] context C [ 9.008179] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0) [ 9.008180] [W] down_write(mapping.invalidate_lock:0) [ 9.008181] [E] event(&(&journal->j_wait_commit)->dmap:0) [ 9.008181] [ 9.008182] [S]: start of the event context [ 9.008183] [W]: the wait blocked [ 9.008183] [E]: the event not reachable [ 9.008184] --------------------------------------------------- [ 9.008184] context A's detail [ 9.008185] --------------------------------------------------- [ 9.008186] context A [ 9.008186] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0) [ 9.008187] [W] wait(&(&journal->j_wait_commit)->dmap:0) [ 9.008188] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0) [ 9.008189] [ 9.008190] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0): [ 9.008191] (N/A) [ 9.008191] [ 9.008192] [W] wait(&(&journal->j_wait_commit)->dmap:0): [ 9.008193] prepare_to_wait (kernel/sched/wait.c:275) [ 9.008197] stacktrace: [ 9.008198] __schedule (kernel/sched/sched.h:1318 kernel/sched/sched.h:1616 kernel/sched/core.c:6213) [ 9.008200] schedule (kernel/sched/core.c:6373 (discriminator 1)) [ 9.008201] kjournald2 (fs/jbd2/journal.c:250) [ 9.008203] kthread (kernel/kthread.c:377) [ 9.008206] ret_from_fork (arch/x86/entry/entry_64.S:301) [ 9.008209] [ 9.008209] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0): [ 9.008210] __wake_up_common (kernel/sched/wait.c:108) [ 9.008212] stacktrace: [ 9.008213] dept_event (kernel/dependency/dept.c:2337) [ 9.008215] __wake_up_common (kernel/sched/wait.c:109) [ 9.008217] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1)) [ 9.008218] jbd2_journal_commit_transaction (fs/jbd2/commit.c:583) [ 9.008221] kjournald2 (fs/jbd2/journal.c:214 (discriminator 3)) [ 9.008223] kthread (kernel/kthread.c:377) [ 9.008224] ret_from_fork (arch/x86/entry/entry_64.S:301) [ 9.008226] --------------------------------------------------- [ 9.008226] context B's detail [ 9.008227] --------------------------------------------------- [ 9.008228] context B [ 9.008228] [S] down_write(mapping.invalidate_lock:0) [ 9.008229] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0) [ 9.008230] [E] up_write(mapping.invalidate_lock:0) [ 9.008231] [ 9.008232] [S] down_write(mapping.invalidate_lock:0): [ 9.008233] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963) [ 9.008237] stacktrace: [ 9.008237] down_write (kernel/locking/rwsem.c:1514) [ 9.008239] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963) [ 9.008241] generic_perform_write (mm/filemap.c:3784) [ 9.008243] ext4_buffered_write_iter (fs/ext4/file.c:269) [ 9.008245] ext4_file_write_iter (fs/ext4/file.c:677) [ 9.008247] new_sync_write (fs/read_write.c:504 (discriminator 1)) [ 9.008250] vfs_write (fs/read_write.c:590) [ 9.008251] ksys_write (fs/read_write.c:644) [ 9.008253] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) [ 9.008255] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) [ 9.008258] [ 9.008258] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0): [ 9.008259] prepare_to_wait (kernel/sched/wait.c:275) [ 9.008261] stacktrace: [ 9.008261] __schedule (kernel/sched/sched.h:1318 kernel/sched/sched.h:1616 kernel/sched/core.c:6213) [ 9.008263] schedule (kernel/sched/core.c:6373 (discriminator 1)) [ 9.008264] wait_transaction_locked (fs/jbd2/transaction.c:184) [ 9.008266] add_transaction_credits (fs/jbd2/transaction.c:248 (discriminator 3)) [ 9.008267] start_this_handle (fs/jbd2/transaction.c:427) [ 9.008269] jbd2__journal_start (fs/jbd2/transaction.c:526) [ 9.008271] __ext4_journal_start_sb (fs/ext4/ext4_jbd2.c:105) [ 9.008273] ext4_truncate (fs/ext4/inode.c:4164) [ 9.008274] ext4_da_write_begin (./include/linux/fs.h:827 fs/ext4/truncate.h:23 fs/ext4/inode.c:2963) [ 9.008276] generic_perform_write (mm/filemap.c:3784) [ 9.008277] ext4_buffered_write_iter (fs/ext4/file.c:269) [ 9.008279] ext4_file_write_iter (fs/ext4/file.c:677) [ 9.008281] new_sync_write (fs/read_write.c:504 (discriminator 1)) [ 9.008283] vfs_write (fs/read_write.c:590) [ 9.008284] ksys_write (fs/read_write.c:644) [ 9.008285] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) [ 9.008287] [ 9.008288] [E] up_write(mapping.invalidate_lock:0): [ 9.008288] ext4_da_get_block_prep (fs/ext4/inode.c:1795 fs/ext4/inode.c:1829) [ 9.008291] --------------------------------------------------- [ 9.008291] context C's detail [ 9.008292] --------------------------------------------------- [ 9.008292] context C [ 9.008293] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0) [ 9.008294] [W] down_write(mapping.invalidate_lock:0) [ 9.008295] [E] event(&(&journal->j_wait_commit)->dmap:0) [ 9.008296] [ 9.008297] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0): [ 9.008298] (N/A) [ 9.008298] [ 9.008299] [W] down_write(mapping.invalidate_lock:0): [ 9.008299] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963) [ 9.008302] stacktrace: [ 9.008302] down_write (kernel/locking/rwsem.c:1514) [ 9.008304] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963) [ 9.008305] generic_perform_write (mm/filemap.c:3784) [ 9.008307] ext4_buffered_write_iter (fs/ext4/file.c:269) [ 9.008309] ext4_file_write_iter (fs/ext4/file.c:677) [ 9.008311] new_sync_write (fs/read_write.c:504 (discriminator 1)) [ 9.008312] vfs_write (fs/read_write.c:590) [ 9.008314] ksys_write (fs/read_write.c:644) [ 9.008315] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) [ 9.008316] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) [ 9.008318] [ 9.008319] [E] event(&(&journal->j_wait_commit)->dmap:0): [ 9.008320] __wake_up_common (kernel/sched/wait.c:108) [ 9.008321] stacktrace: [ 9.008322] __wake_up_common (kernel/sched/wait.c:109) [ 9.008323] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1)) [ 9.008324] __jbd2_log_start_commit (fs/jbd2/journal.c:508) [ 9.008326] jbd2_log_start_commit (fs/jbd2/journal.c:527) [ 9.008327] __jbd2_journal_force_commit (fs/jbd2/journal.c:560) [ 9.008329] jbd2_journal_force_commit_nested (fs/jbd2/journal.c:583) [ 9.008331] ext4_should_retry_alloc (fs/ext4/balloc.c:670 (discriminator 3)) [ 9.008332] ext4_da_write_begin (fs/ext4/inode.c:2965 (discriminator 1)) [ 9.008334] generic_perform_write (mm/filemap.c:3784) [ 9.008335] ext4_buffered_write_iter (fs/ext4/file.c:269) [ 9.008337] ext4_file_write_iter (fs/ext4/file.c:677) [ 9.008339] new_sync_write (fs/read_write.c:504 (discriminator 1)) [ 9.008341] vfs_write (fs/read_write.c:590) [ 9.008342] ksys_write (fs/read_write.c:644) [ 9.008343] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) [ 9.008345] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) [ 9.008347] --------------------------------------------------- [ 9.008348] information that might be helpful [ 9.008348] --------------------------------------------------- [ 9.008349] CPU: 0 PID: 89 Comm: jbd2/sda1-8 Tainted: G W 5.17.0-rc1-00015-gb94f67143867-dirty #2 [ 9.008352] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [ 9.008353] Call Trace: [ 9.008354] <TASK> [ 9.008355] dump_stack_lvl (lib/dump_stack.c:107) [ 9.008358] print_circle (./arch/x86/include/asm/atomic.h:108 ./include/linux/atomic/atomic-instrumented.h:258 kernel/dependency/dept.c:157 kernel/dependency/dept.c:762) [ 9.008360] ? print_circle (kernel/dependency/dept.c:1086) [ 9.008362] cb_check_dl (kernel/dependency/dept.c:1104) [ 9.008364] bfs (kernel/dependency/dept.c:860) [ 9.008366] add_dep (kernel/dependency/dept.c:1423) [ 9.008368] do_event.isra.25 (kernel/dependency/dept.c:1651) [ 9.008370] ? __wake_up_common (kernel/sched/wait.c:108) [ 9.008372] dept_event (kernel/dependency/dept.c:2337) [ 9.008374] __wake_up_common (kernel/sched/wait.c:109) [ 9.008376] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1)) [ 9.008379] jbd2_journal_commit_transaction (fs/jbd2/commit.c:583) [ 9.008381] ? arch_stack_walk (arch/x86/kernel/stacktrace.c:24) [ 9.008385] ? ret_from_fork (arch/x86/entry/entry_64.S:301) [ 9.008387] ? dept_enable_hardirq (./arch/x86/include/asm/current.h:15 kernel/dependency/dept.c:241 kernel/dependency/dept.c:999 kernel/dependency/dept.c:1043 kernel/dependency/dept.c:1843) [ 9.008389] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/irqflags.h:45 ./arch/x86/include/asm/irqflags.h:80 ./arch/x86/include/asm/irqflags.h:138 ./include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194) [ 9.008392] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/preempt.h:103 ./include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:194) [ 9.008394] ? try_to_del_timer_sync (kernel/time/timer.c:1239) [ 9.008396] kjournald2 (fs/jbd2/journal.c:214 (discriminator 3)) [ 9.008398] ? prepare_to_wait_exclusive (kernel/sched/wait.c:431) [ 9.008400] ? commit_timeout (fs/jbd2/journal.c:173) [ 9.008402] kthread (kernel/kthread.c:377) [ 9.008404] ? kthread_complete_and_exit (kernel/kthread.c:332) [ 9.008407] ret_from_fork (arch/x86/entry/entry_64.S:301) [ 9.008410] </TASK>
So I was trying to understand what this report is about for some time but honestly I have failed... On Thu 17-02-22 20:10:04, Byungchul Park wrote: > [ 9.008161] =================================================== > [ 9.008163] DEPT: Circular dependency has been detected. > [ 9.008164] 5.17.0-rc1-00015-gb94f67143867-dirty #2 Tainted: G W > [ 9.008166] --------------------------------------------------- > [ 9.008167] summary > [ 9.008167] --------------------------------------------------- > [ 9.008168] *** DEADLOCK *** > [ 9.008168] > [ 9.008168] context A > [ 9.008169] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0) > [ 9.008171] [W] wait(&(&journal->j_wait_commit)->dmap:0) > [ 9.008172] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0) > [ 9.008173] > [ 9.008173] context B > [ 9.008174] [S] down_write(mapping.invalidate_lock:0) > [ 9.008175] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0) > [ 9.008176] [E] up_write(mapping.invalidate_lock:0) > [ 9.008177] > [ 9.008178] context C > [ 9.008179] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0) > [ 9.008180] [W] down_write(mapping.invalidate_lock:0) > [ 9.008181] [E] event(&(&journal->j_wait_commit)->dmap:0) > [ 9.008181] > [ 9.008182] [S]: start of the event context > [ 9.008183] [W]: the wait blocked > [ 9.008183] [E]: the event not reachable So what situation is your tool complaining about here? Can you perhaps show it here in more common visualization like: TASK1 TASK2 does foo, grabs Z does X, grabs lock Y blocks on Z blocks on Y or something like that? Because I was not able to decipher this from the report even after trying for some time... Honza > [ 9.008184] --------------------------------------------------- > [ 9.008184] context A's detail > [ 9.008185] --------------------------------------------------- > [ 9.008186] context A > [ 9.008186] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0) > [ 9.008187] [W] wait(&(&journal->j_wait_commit)->dmap:0) > [ 9.008188] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0) > [ 9.008189] > [ 9.008190] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0): > [ 9.008191] (N/A) > [ 9.008191] > [ 9.008192] [W] wait(&(&journal->j_wait_commit)->dmap:0): > [ 9.008193] prepare_to_wait (kernel/sched/wait.c:275) > [ 9.008197] stacktrace: > [ 9.008198] __schedule (kernel/sched/sched.h:1318 kernel/sched/sched.h:1616 kernel/sched/core.c:6213) > [ 9.008200] schedule (kernel/sched/core.c:6373 (discriminator 1)) > [ 9.008201] kjournald2 (fs/jbd2/journal.c:250) > [ 9.008203] kthread (kernel/kthread.c:377) > [ 9.008206] ret_from_fork (arch/x86/entry/entry_64.S:301) > [ 9.008209] > [ 9.008209] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0): > [ 9.008210] __wake_up_common (kernel/sched/wait.c:108) > [ 9.008212] stacktrace: > [ 9.008213] dept_event (kernel/dependency/dept.c:2337) > [ 9.008215] __wake_up_common (kernel/sched/wait.c:109) > [ 9.008217] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1)) > [ 9.008218] jbd2_journal_commit_transaction (fs/jbd2/commit.c:583) > [ 9.008221] kjournald2 (fs/jbd2/journal.c:214 (discriminator 3)) > [ 9.008223] kthread (kernel/kthread.c:377) > [ 9.008224] ret_from_fork (arch/x86/entry/entry_64.S:301) > [ 9.008226] --------------------------------------------------- > [ 9.008226] context B's detail > [ 9.008227] --------------------------------------------------- > [ 9.008228] context B > [ 9.008228] [S] down_write(mapping.invalidate_lock:0) > [ 9.008229] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0) > [ 9.008230] [E] up_write(mapping.invalidate_lock:0) > [ 9.008231] > [ 9.008232] [S] down_write(mapping.invalidate_lock:0): > [ 9.008233] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963) > [ 9.008237] stacktrace: > [ 9.008237] down_write (kernel/locking/rwsem.c:1514) > [ 9.008239] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963) > [ 9.008241] generic_perform_write (mm/filemap.c:3784) > [ 9.008243] ext4_buffered_write_iter (fs/ext4/file.c:269) > [ 9.008245] ext4_file_write_iter (fs/ext4/file.c:677) > [ 9.008247] new_sync_write (fs/read_write.c:504 (discriminator 1)) > [ 9.008250] vfs_write (fs/read_write.c:590) > [ 9.008251] ksys_write (fs/read_write.c:644) > [ 9.008253] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) > [ 9.008255] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) > [ 9.008258] > [ 9.008258] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0): > [ 9.008259] prepare_to_wait (kernel/sched/wait.c:275) > [ 9.008261] stacktrace: > [ 9.008261] __schedule (kernel/sched/sched.h:1318 kernel/sched/sched.h:1616 kernel/sched/core.c:6213) > [ 9.008263] schedule (kernel/sched/core.c:6373 (discriminator 1)) > [ 9.008264] wait_transaction_locked (fs/jbd2/transaction.c:184) > [ 9.008266] add_transaction_credits (fs/jbd2/transaction.c:248 (discriminator 3)) > [ 9.008267] start_this_handle (fs/jbd2/transaction.c:427) > [ 9.008269] jbd2__journal_start (fs/jbd2/transaction.c:526) > [ 9.008271] __ext4_journal_start_sb (fs/ext4/ext4_jbd2.c:105) > [ 9.008273] ext4_truncate (fs/ext4/inode.c:4164) > [ 9.008274] ext4_da_write_begin (./include/linux/fs.h:827 fs/ext4/truncate.h:23 fs/ext4/inode.c:2963) > [ 9.008276] generic_perform_write (mm/filemap.c:3784) > [ 9.008277] ext4_buffered_write_iter (fs/ext4/file.c:269) > [ 9.008279] ext4_file_write_iter (fs/ext4/file.c:677) > [ 9.008281] new_sync_write (fs/read_write.c:504 (discriminator 1)) > [ 9.008283] vfs_write (fs/read_write.c:590) > [ 9.008284] ksys_write (fs/read_write.c:644) > [ 9.008285] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) > [ 9.008287] > [ 9.008288] [E] up_write(mapping.invalidate_lock:0): > [ 9.008288] ext4_da_get_block_prep (fs/ext4/inode.c:1795 fs/ext4/inode.c:1829) > [ 9.008291] --------------------------------------------------- > [ 9.008291] context C's detail > [ 9.008292] --------------------------------------------------- > [ 9.008292] context C > [ 9.008293] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0) > [ 9.008294] [W] down_write(mapping.invalidate_lock:0) > [ 9.008295] [E] event(&(&journal->j_wait_commit)->dmap:0) > [ 9.008296] > [ 9.008297] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0): > [ 9.008298] (N/A) > [ 9.008298] > [ 9.008299] [W] down_write(mapping.invalidate_lock:0): > [ 9.008299] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963) > [ 9.008302] stacktrace: > [ 9.008302] down_write (kernel/locking/rwsem.c:1514) > [ 9.008304] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963) > [ 9.008305] generic_perform_write (mm/filemap.c:3784) > [ 9.008307] ext4_buffered_write_iter (fs/ext4/file.c:269) > [ 9.008309] ext4_file_write_iter (fs/ext4/file.c:677) > [ 9.008311] new_sync_write (fs/read_write.c:504 (discriminator 1)) > [ 9.008312] vfs_write (fs/read_write.c:590) > [ 9.008314] ksys_write (fs/read_write.c:644) > [ 9.008315] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) > [ 9.008316] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) > [ 9.008318] > [ 9.008319] [E] event(&(&journal->j_wait_commit)->dmap:0): > [ 9.008320] __wake_up_common (kernel/sched/wait.c:108) > [ 9.008321] stacktrace: > [ 9.008322] __wake_up_common (kernel/sched/wait.c:109) > [ 9.008323] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1)) > [ 9.008324] __jbd2_log_start_commit (fs/jbd2/journal.c:508) > [ 9.008326] jbd2_log_start_commit (fs/jbd2/journal.c:527) > [ 9.008327] __jbd2_journal_force_commit (fs/jbd2/journal.c:560) > [ 9.008329] jbd2_journal_force_commit_nested (fs/jbd2/journal.c:583) > [ 9.008331] ext4_should_retry_alloc (fs/ext4/balloc.c:670 (discriminator 3)) > [ 9.008332] ext4_da_write_begin (fs/ext4/inode.c:2965 (discriminator 1)) > [ 9.008334] generic_perform_write (mm/filemap.c:3784) > [ 9.008335] ext4_buffered_write_iter (fs/ext4/file.c:269) > [ 9.008337] ext4_file_write_iter (fs/ext4/file.c:677) > [ 9.008339] new_sync_write (fs/read_write.c:504 (discriminator 1)) > [ 9.008341] vfs_write (fs/read_write.c:590) > [ 9.008342] ksys_write (fs/read_write.c:644) > [ 9.008343] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) > [ 9.008345] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) > [ 9.008347] --------------------------------------------------- > [ 9.008348] information that might be helpful > [ 9.008348] --------------------------------------------------- > [ 9.008349] CPU: 0 PID: 89 Comm: jbd2/sda1-8 Tainted: G W 5.17.0-rc1-00015-gb94f67143867-dirty #2 > [ 9.008352] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > [ 9.008353] Call Trace: > [ 9.008354] <TASK> > [ 9.008355] dump_stack_lvl (lib/dump_stack.c:107) > [ 9.008358] print_circle (./arch/x86/include/asm/atomic.h:108 ./include/linux/atomic/atomic-instrumented.h:258 kernel/dependency/dept.c:157 kernel/dependency/dept.c:762) > [ 9.008360] ? print_circle (kernel/dependency/dept.c:1086) > [ 9.008362] cb_check_dl (kernel/dependency/dept.c:1104) > [ 9.008364] bfs (kernel/dependency/dept.c:860) > [ 9.008366] add_dep (kernel/dependency/dept.c:1423) > [ 9.008368] do_event.isra.25 (kernel/dependency/dept.c:1651) > [ 9.008370] ? __wake_up_common (kernel/sched/wait.c:108) > [ 9.008372] dept_event (kernel/dependency/dept.c:2337) > [ 9.008374] __wake_up_common (kernel/sched/wait.c:109) > [ 9.008376] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1)) > [ 9.008379] jbd2_journal_commit_transaction (fs/jbd2/commit.c:583) > [ 9.008381] ? arch_stack_walk (arch/x86/kernel/stacktrace.c:24) > [ 9.008385] ? ret_from_fork (arch/x86/entry/entry_64.S:301) > [ 9.008387] ? dept_enable_hardirq (./arch/x86/include/asm/current.h:15 kernel/dependency/dept.c:241 kernel/dependency/dept.c:999 kernel/dependency/dept.c:1043 kernel/dependency/dept.c:1843) > [ 9.008389] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/irqflags.h:45 ./arch/x86/include/asm/irqflags.h:80 ./arch/x86/include/asm/irqflags.h:138 ./include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194) > [ 9.008392] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/preempt.h:103 ./include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:194) > [ 9.008394] ? try_to_del_timer_sync (kernel/time/timer.c:1239) > [ 9.008396] kjournald2 (fs/jbd2/journal.c:214 (discriminator 3)) > [ 9.008398] ? prepare_to_wait_exclusive (kernel/sched/wait.c:431) > [ 9.008400] ? commit_timeout (fs/jbd2/journal.c:173) > [ 9.008402] kthread (kernel/kthread.c:377) > [ 9.008404] ? kthread_complete_and_exit (kernel/kthread.c:332) > [ 9.008407] ret_from_fork (arch/x86/entry/entry_64.S:301) > [ 9.008410] </TASK> -- Jan Kara <jack@suse.com> SUSE Labs, CR
On Mon, Feb 21, 2022 at 08:02:04PM +0100, Jan Kara wrote: > On Thu 17-02-22 20:10:04, Byungchul Park wrote: > > [ 9.008161] =================================================== > > [ 9.008163] DEPT: Circular dependency has been detected. > > [ 9.008164] 5.17.0-rc1-00015-gb94f67143867-dirty #2 Tainted: G W > > [ 9.008166] --------------------------------------------------- > > [ 9.008167] summary > > [ 9.008167] --------------------------------------------------- > > [ 9.008168] *** DEADLOCK *** > > [ 9.008168] > > [ 9.008168] context A > > [ 9.008169] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0) > > [ 9.008171] [W] wait(&(&journal->j_wait_commit)->dmap:0) > > [ 9.008172] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0) > > [ 9.008173] > > [ 9.008173] context B > > [ 9.008174] [S] down_write(mapping.invalidate_lock:0) > > [ 9.008175] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0) > > [ 9.008176] [E] up_write(mapping.invalidate_lock:0) > > [ 9.008177] > > [ 9.008178] context C > > [ 9.008179] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0) > > [ 9.008180] [W] down_write(mapping.invalidate_lock:0) > > [ 9.008181] [E] event(&(&journal->j_wait_commit)->dmap:0) > > [ 9.008181] > > [ 9.008182] [S]: start of the event context > > [ 9.008183] [W]: the wait blocked > > [ 9.008183] [E]: the event not reachable > > So what situation is your tool complaining about here? Can you perhaps show > it here in more common visualization like: Sure. > TASK1 TASK2 > does foo, grabs Z > does X, grabs lock Y > blocks on Z > blocks on Y > > or something like that? Because I was not able to decipher this from the > report even after trying for some time... KJOURNALD2(kthread) TASK1(ksys_write) TASK2(ksys_write) wait A --- stuck wait B --- stuck wait C --- stuck wake up B wake up C wake up A where: A is a wait_queue, j_wait_commit B is a wait_queue, j_wait_transaction_locked C is a rwsem, mapping.invalidate_lock The above is the simplest form. And it's worth noting that Dept focuses on wait and event itself rather than grabing and releasing things like lock. The following is the more descriptive form of it. KJOURNALD2(kthread) TASK1(ksys_write) TASK2(ksys_write) wait @j_wait_commit ext4_truncate_failed_write() down_write(mapping.invalidate_lock) ext4_truncate() ... wait @j_wait_transaction_locked ext_truncate_failed_write() down_write(mapping.invalidate_lock) ext4_should_retry_alloc() ... __jbd2_log_start_commit() wake_up(j_wait_commit) jbd2_journal_commit_transaction() wake_up(j_wait_transaction_locked) up_write(mapping.invalidate_lock) I hope this would help you understand the report. Yeah... This is what Dept complained. And as Ted said, the kthread would be woken up by another wakeup. So it's not deadlock deadlock. However, these three threads and any other tasks waiting for any of the events A, B, C would be struck for a while until the wakeup eventually wakes up the kthread, kjournald2. I guess it's not what ext4 is meant to do. Honestly, ext4 and journal system look so complicated that I'm not convinced tho... Thanks, Byungchul > > Honza > > > > > [ 9.008184] --------------------------------------------------- > > [ 9.008184] context A's detail > > [ 9.008185] --------------------------------------------------- > > [ 9.008186] context A > > [ 9.008186] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0) > > [ 9.008187] [W] wait(&(&journal->j_wait_commit)->dmap:0) > > [ 9.008188] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0) > > [ 9.008189] > > [ 9.008190] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0): > > [ 9.008191] (N/A) > > [ 9.008191] > > [ 9.008192] [W] wait(&(&journal->j_wait_commit)->dmap:0): > > [ 9.008193] prepare_to_wait (kernel/sched/wait.c:275) > > [ 9.008197] stacktrace: > > [ 9.008198] __schedule (kernel/sched/sched.h:1318 kernel/sched/sched.h:1616 kernel/sched/core.c:6213) > > [ 9.008200] schedule (kernel/sched/core.c:6373 (discriminator 1)) > > [ 9.008201] kjournald2 (fs/jbd2/journal.c:250) > > [ 9.008203] kthread (kernel/kthread.c:377) > > [ 9.008206] ret_from_fork (arch/x86/entry/entry_64.S:301) > > [ 9.008209] > > [ 9.008209] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0): > > [ 9.008210] __wake_up_common (kernel/sched/wait.c:108) > > [ 9.008212] stacktrace: > > [ 9.008213] dept_event (kernel/dependency/dept.c:2337) > > [ 9.008215] __wake_up_common (kernel/sched/wait.c:109) > > [ 9.008217] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1)) > > [ 9.008218] jbd2_journal_commit_transaction (fs/jbd2/commit.c:583) > > [ 9.008221] kjournald2 (fs/jbd2/journal.c:214 (discriminator 3)) > > [ 9.008223] kthread (kernel/kthread.c:377) > > [ 9.008224] ret_from_fork (arch/x86/entry/entry_64.S:301) > > [ 9.008226] --------------------------------------------------- > > [ 9.008226] context B's detail > > [ 9.008227] --------------------------------------------------- > > [ 9.008228] context B > > [ 9.008228] [S] down_write(mapping.invalidate_lock:0) > > [ 9.008229] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0) > > [ 9.008230] [E] up_write(mapping.invalidate_lock:0) > > [ 9.008231] > > [ 9.008232] [S] down_write(mapping.invalidate_lock:0): > > [ 9.008233] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963) > > [ 9.008237] stacktrace: > > [ 9.008237] down_write (kernel/locking/rwsem.c:1514) > > [ 9.008239] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963) > > [ 9.008241] generic_perform_write (mm/filemap.c:3784) > > [ 9.008243] ext4_buffered_write_iter (fs/ext4/file.c:269) > > [ 9.008245] ext4_file_write_iter (fs/ext4/file.c:677) > > [ 9.008247] new_sync_write (fs/read_write.c:504 (discriminator 1)) > > [ 9.008250] vfs_write (fs/read_write.c:590) > > [ 9.008251] ksys_write (fs/read_write.c:644) > > [ 9.008253] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) > > [ 9.008255] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) > > [ 9.008258] > > [ 9.008258] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0): > > [ 9.008259] prepare_to_wait (kernel/sched/wait.c:275) > > [ 9.008261] stacktrace: > > [ 9.008261] __schedule (kernel/sched/sched.h:1318 kernel/sched/sched.h:1616 kernel/sched/core.c:6213) > > [ 9.008263] schedule (kernel/sched/core.c:6373 (discriminator 1)) > > [ 9.008264] wait_transaction_locked (fs/jbd2/transaction.c:184) > > [ 9.008266] add_transaction_credits (fs/jbd2/transaction.c:248 (discriminator 3)) > > [ 9.008267] start_this_handle (fs/jbd2/transaction.c:427) > > [ 9.008269] jbd2__journal_start (fs/jbd2/transaction.c:526) > > [ 9.008271] __ext4_journal_start_sb (fs/ext4/ext4_jbd2.c:105) > > [ 9.008273] ext4_truncate (fs/ext4/inode.c:4164) > > [ 9.008274] ext4_da_write_begin (./include/linux/fs.h:827 fs/ext4/truncate.h:23 fs/ext4/inode.c:2963) > > [ 9.008276] generic_perform_write (mm/filemap.c:3784) > > [ 9.008277] ext4_buffered_write_iter (fs/ext4/file.c:269) > > [ 9.008279] ext4_file_write_iter (fs/ext4/file.c:677) > > [ 9.008281] new_sync_write (fs/read_write.c:504 (discriminator 1)) > > [ 9.008283] vfs_write (fs/read_write.c:590) > > [ 9.008284] ksys_write (fs/read_write.c:644) > > [ 9.008285] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) > > [ 9.008287] > > [ 9.008288] [E] up_write(mapping.invalidate_lock:0): > > [ 9.008288] ext4_da_get_block_prep (fs/ext4/inode.c:1795 fs/ext4/inode.c:1829) > > [ 9.008291] --------------------------------------------------- > > [ 9.008291] context C's detail > > [ 9.008292] --------------------------------------------------- > > [ 9.008292] context C > > [ 9.008293] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0) > > [ 9.008294] [W] down_write(mapping.invalidate_lock:0) > > [ 9.008295] [E] event(&(&journal->j_wait_commit)->dmap:0) > > [ 9.008296] > > [ 9.008297] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0): > > [ 9.008298] (N/A) > > [ 9.008298] > > [ 9.008299] [W] down_write(mapping.invalidate_lock:0): > > [ 9.008299] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963) > > [ 9.008302] stacktrace: > > [ 9.008302] down_write (kernel/locking/rwsem.c:1514) > > [ 9.008304] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963) > > [ 9.008305] generic_perform_write (mm/filemap.c:3784) > > [ 9.008307] ext4_buffered_write_iter (fs/ext4/file.c:269) > > [ 9.008309] ext4_file_write_iter (fs/ext4/file.c:677) > > [ 9.008311] new_sync_write (fs/read_write.c:504 (discriminator 1)) > > [ 9.008312] vfs_write (fs/read_write.c:590) > > [ 9.008314] ksys_write (fs/read_write.c:644) > > [ 9.008315] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) > > [ 9.008316] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) > > [ 9.008318] > > [ 9.008319] [E] event(&(&journal->j_wait_commit)->dmap:0): > > [ 9.008320] __wake_up_common (kernel/sched/wait.c:108) > > [ 9.008321] stacktrace: > > [ 9.008322] __wake_up_common (kernel/sched/wait.c:109) > > [ 9.008323] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1)) > > [ 9.008324] __jbd2_log_start_commit (fs/jbd2/journal.c:508) > > [ 9.008326] jbd2_log_start_commit (fs/jbd2/journal.c:527) > > [ 9.008327] __jbd2_journal_force_commit (fs/jbd2/journal.c:560) > > [ 9.008329] jbd2_journal_force_commit_nested (fs/jbd2/journal.c:583) > > [ 9.008331] ext4_should_retry_alloc (fs/ext4/balloc.c:670 (discriminator 3)) > > [ 9.008332] ext4_da_write_begin (fs/ext4/inode.c:2965 (discriminator 1)) > > [ 9.008334] generic_perform_write (mm/filemap.c:3784) > > [ 9.008335] ext4_buffered_write_iter (fs/ext4/file.c:269) > > [ 9.008337] ext4_file_write_iter (fs/ext4/file.c:677) > > [ 9.008339] new_sync_write (fs/read_write.c:504 (discriminator 1)) > > [ 9.008341] vfs_write (fs/read_write.c:590) > > [ 9.008342] ksys_write (fs/read_write.c:644) > > [ 9.008343] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) > > [ 9.008345] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) > > [ 9.008347] --------------------------------------------------- > > [ 9.008348] information that might be helpful > > [ 9.008348] --------------------------------------------------- > > [ 9.008349] CPU: 0 PID: 89 Comm: jbd2/sda1-8 Tainted: G W 5.17.0-rc1-00015-gb94f67143867-dirty #2 > > [ 9.008352] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > > [ 9.008353] Call Trace: > > [ 9.008354] <TASK> > > [ 9.008355] dump_stack_lvl (lib/dump_stack.c:107) > > [ 9.008358] print_circle (./arch/x86/include/asm/atomic.h:108 ./include/linux/atomic/atomic-instrumented.h:258 kernel/dependency/dept.c:157 kernel/dependency/dept.c:762) > > [ 9.008360] ? print_circle (kernel/dependency/dept.c:1086) > > [ 9.008362] cb_check_dl (kernel/dependency/dept.c:1104) > > [ 9.008364] bfs (kernel/dependency/dept.c:860) > > [ 9.008366] add_dep (kernel/dependency/dept.c:1423) > > [ 9.008368] do_event.isra.25 (kernel/dependency/dept.c:1651) > > [ 9.008370] ? __wake_up_common (kernel/sched/wait.c:108) > > [ 9.008372] dept_event (kernel/dependency/dept.c:2337) > > [ 9.008374] __wake_up_common (kernel/sched/wait.c:109) > > [ 9.008376] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1)) > > [ 9.008379] jbd2_journal_commit_transaction (fs/jbd2/commit.c:583) > > [ 9.008381] ? arch_stack_walk (arch/x86/kernel/stacktrace.c:24) > > [ 9.008385] ? ret_from_fork (arch/x86/entry/entry_64.S:301) > > [ 9.008387] ? dept_enable_hardirq (./arch/x86/include/asm/current.h:15 kernel/dependency/dept.c:241 kernel/dependency/dept.c:999 kernel/dependency/dept.c:1043 kernel/dependency/dept.c:1843) > > [ 9.008389] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/irqflags.h:45 ./arch/x86/include/asm/irqflags.h:80 ./arch/x86/include/asm/irqflags.h:138 ./include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194) > > [ 9.008392] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/preempt.h:103 ./include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:194) > > [ 9.008394] ? try_to_del_timer_sync (kernel/time/timer.c:1239) > > [ 9.008396] kjournald2 (fs/jbd2/journal.c:214 (discriminator 3)) > > [ 9.008398] ? prepare_to_wait_exclusive (kernel/sched/wait.c:431) > > [ 9.008400] ? commit_timeout (fs/jbd2/journal.c:173) > > [ 9.008402] kthread (kernel/kthread.c:377) > > [ 9.008404] ? kthread_complete_and_exit (kernel/kthread.c:332) > > [ 9.008407] ret_from_fork (arch/x86/entry/entry_64.S:301) > > [ 9.008410] </TASK> > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR
On Wed 23-02-22 09:35:34, Byungchul Park wrote: > On Mon, Feb 21, 2022 at 08:02:04PM +0100, Jan Kara wrote: > > On Thu 17-02-22 20:10:04, Byungchul Park wrote: > > > [ 9.008161] =================================================== > > > [ 9.008163] DEPT: Circular dependency has been detected. > > > [ 9.008164] 5.17.0-rc1-00015-gb94f67143867-dirty #2 Tainted: G W > > > [ 9.008166] --------------------------------------------------- > > > [ 9.008167] summary > > > [ 9.008167] --------------------------------------------------- > > > [ 9.008168] *** DEADLOCK *** > > > [ 9.008168] > > > [ 9.008168] context A > > > [ 9.008169] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0) > > > [ 9.008171] [W] wait(&(&journal->j_wait_commit)->dmap:0) > > > [ 9.008172] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0) > > > [ 9.008173] > > > [ 9.008173] context B > > > [ 9.008174] [S] down_write(mapping.invalidate_lock:0) > > > [ 9.008175] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0) > > > [ 9.008176] [E] up_write(mapping.invalidate_lock:0) > > > [ 9.008177] > > > [ 9.008178] context C > > > [ 9.008179] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0) > > > [ 9.008180] [W] down_write(mapping.invalidate_lock:0) > > > [ 9.008181] [E] event(&(&journal->j_wait_commit)->dmap:0) > > > [ 9.008181] > > > [ 9.008182] [S]: start of the event context > > > [ 9.008183] [W]: the wait blocked > > > [ 9.008183] [E]: the event not reachable > > > > So what situation is your tool complaining about here? Can you perhaps show > > it here in more common visualization like: > > Sure. > > > TASK1 TASK2 > > does foo, grabs Z > > does X, grabs lock Y > > blocks on Z > > blocks on Y > > > > or something like that? Because I was not able to decipher this from the > > report even after trying for some time... > > KJOURNALD2(kthread) TASK1(ksys_write) TASK2(ksys_write) > > wait A > --- stuck > wait B > --- stuck > wait C > --- stuck > > wake up B wake up C wake up A > > where: > A is a wait_queue, j_wait_commit > B is a wait_queue, j_wait_transaction_locked > C is a rwsem, mapping.invalidate_lock I see. But a situation like this is not necessarily a guarantee of a deadlock, is it? I mean there can be task D that will eventually call say 'wake up B' and unblock everything and this is how things were designed to work? Multiple sources of wakeups are quite common I'd say... What does Dept do to prevent false reports in cases like this? > The above is the simplest form. And it's worth noting that Dept focuses > on wait and event itself rather than grabing and releasing things like > lock. The following is the more descriptive form of it. > > KJOURNALD2(kthread) TASK1(ksys_write) TASK2(ksys_write) > > wait @j_wait_commit > ext4_truncate_failed_write() > down_write(mapping.invalidate_lock) > > ext4_truncate() > ... > wait @j_wait_transaction_locked > > ext_truncate_failed_write() > down_write(mapping.invalidate_lock) > > ext4_should_retry_alloc() > ... > __jbd2_log_start_commit() > wake_up(j_wait_commit) > jbd2_journal_commit_transaction() > wake_up(j_wait_transaction_locked) > up_write(mapping.invalidate_lock) > > I hope this would help you understand the report. I see, thanks for explanation! So the above scenario is impossible because for anyone to block on @j_wait_transaction_locked the transaction must be committing, which is done only by kjournald2 kthread and so that thread cannot be waiting at @j_wait_commit. Essentially blocking on @j_wait_transaction_locked means @j_wait_commit wakeup was already done. I guess this shows there can be non-trivial dependencies between wait queues which are difficult to track in an automated way and without such tracking we are going to see false positives... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR
On Wed, Feb 23, 2022 at 03:48:59PM +0100, Jan Kara wrote: > On Wed 23-02-22 09:35:34, Byungchul Park wrote: > > On Mon, Feb 21, 2022 at 08:02:04PM +0100, Jan Kara wrote: > > > On Thu 17-02-22 20:10:04, Byungchul Park wrote: > > > > [ 9.008161] =================================================== > > > > [ 9.008163] DEPT: Circular dependency has been detected. > > > > [ 9.008164] 5.17.0-rc1-00015-gb94f67143867-dirty #2 Tainted: G W > > > > [ 9.008166] --------------------------------------------------- > > > > [ 9.008167] summary > > > > [ 9.008167] --------------------------------------------------- > > > > [ 9.008168] *** DEADLOCK *** > > > > [ 9.008168] > > > > [ 9.008168] context A > > > > [ 9.008169] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0) > > > > [ 9.008171] [W] wait(&(&journal->j_wait_commit)->dmap:0) > > > > [ 9.008172] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0) > > > > [ 9.008173] > > > > [ 9.008173] context B > > > > [ 9.008174] [S] down_write(mapping.invalidate_lock:0) > > > > [ 9.008175] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0) > > > > [ 9.008176] [E] up_write(mapping.invalidate_lock:0) > > > > [ 9.008177] > > > > [ 9.008178] context C > > > > [ 9.008179] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0) > > > > [ 9.008180] [W] down_write(mapping.invalidate_lock:0) > > > > [ 9.008181] [E] event(&(&journal->j_wait_commit)->dmap:0) > > > > [ 9.008181] > > > > [ 9.008182] [S]: start of the event context > > > > [ 9.008183] [W]: the wait blocked > > > > [ 9.008183] [E]: the event not reachable > > > > > > So what situation is your tool complaining about here? Can you perhaps show > > > it here in more common visualization like: > > > > Sure. > > > > > TASK1 TASK2 > > > does foo, grabs Z > > > does X, grabs lock Y > > > blocks on Z > > > blocks on Y > > > > > > or something like that? Because I was not able to decipher this from the > > > report even after trying for some time... > > > > KJOURNALD2(kthread) TASK1(ksys_write) TASK2(ksys_write) > > > > wait A > > --- stuck > > wait B > > --- stuck > > wait C > > --- stuck > > > > wake up B wake up C wake up A > > > > where: > > A is a wait_queue, j_wait_commit > > B is a wait_queue, j_wait_transaction_locked > > C is a rwsem, mapping.invalidate_lock > > I see. But a situation like this is not necessarily a guarantee of a > deadlock, is it? I mean there can be task D that will eventually call say > 'wake up B' and unblock everything and this is how things were designed to > work? Multiple sources of wakeups are quite common I'd say... What does Yes. At the very beginning when I desgined Dept, I was thinking whether to support multiple wakeup sources or not for a quite long time. Supporting it would be a better option to aovid non-critical reports. However, I thought anyway we'd better fix it - not urgent tho - if there's any single circle dependency. That's why I decided not to support it for now and wanted to gather the kernel guys' opinions. Thing is which policy we should go with. > Dept do to prevent false reports in cases like this? > > > The above is the simplest form. And it's worth noting that Dept focuses > > on wait and event itself rather than grabing and releasing things like > > lock. The following is the more descriptive form of it. > > > > KJOURNALD2(kthread) TASK1(ksys_write) TASK2(ksys_write) > > > > wait @j_wait_commit > > ext4_truncate_failed_write() > > down_write(mapping.invalidate_lock) > > > > ext4_truncate() > > ... > > wait @j_wait_transaction_locked > > > > ext_truncate_failed_write() > > down_write(mapping.invalidate_lock) > > > > ext4_should_retry_alloc() > > ... > > __jbd2_log_start_commit() > > wake_up(j_wait_commit) > > jbd2_journal_commit_transaction() > > wake_up(j_wait_transaction_locked) > > up_write(mapping.invalidate_lock) > > > > I hope this would help you understand the report. > > I see, thanks for explanation! So the above scenario is impossible because My pleasure. > for anyone to block on @j_wait_transaction_locked the transaction must be > committing, which is done only by kjournald2 kthread and so that thread > cannot be waiting at @j_wait_commit. Essentially blocking on > @j_wait_transaction_locked means @j_wait_commit wakeup was already done. kjournal2 repeatedly does the wait and the wake_up so the above scenario looks possible to me even based on what you explained. Maybe I should understand how the journal things work more for furhter discussion. Your explanation is so helpful. Thank you really. Thanks, Byungchul > I guess this shows there can be non-trivial dependencies between wait > queues which are difficult to track in an automated way and without such > tracking we are going to see false positives... > > Honza > > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR
On Thu 24-02-22 10:11:02, Byungchul Park wrote: > On Wed, Feb 23, 2022 at 03:48:59PM +0100, Jan Kara wrote: > > > KJOURNALD2(kthread) TASK1(ksys_write) TASK2(ksys_write) > > > > > > wait A > > > --- stuck > > > wait B > > > --- stuck > > > wait C > > > --- stuck > > > > > > wake up B wake up C wake up A > > > > > > where: > > > A is a wait_queue, j_wait_commit > > > B is a wait_queue, j_wait_transaction_locked > > > C is a rwsem, mapping.invalidate_lock > > > > I see. But a situation like this is not necessarily a guarantee of a > > deadlock, is it? I mean there can be task D that will eventually call say > > 'wake up B' and unblock everything and this is how things were designed to > > work? Multiple sources of wakeups are quite common I'd say... What does > > Yes. At the very beginning when I desgined Dept, I was thinking whether > to support multiple wakeup sources or not for a quite long time. > Supporting it would be a better option to aovid non-critical reports. > However, I thought anyway we'd better fix it - not urgent tho - if > there's any single circle dependency. That's why I decided not to > support it for now and wanted to gather the kernel guys' opinions. Thing > is which policy we should go with. I see. So supporting only a single wakeup source is fine for locks I guess. But for general wait queues or other synchronization mechanisms, I'm afraid it will lead to quite some false positive reports. Just my 2c. > > Dept do to prevent false reports in cases like this? > > > > > The above is the simplest form. And it's worth noting that Dept focuses > > > on wait and event itself rather than grabing and releasing things like > > > lock. The following is the more descriptive form of it. > > > > > > KJOURNALD2(kthread) TASK1(ksys_write) TASK2(ksys_write) > > > > > > wait @j_wait_commit > > > ext4_truncate_failed_write() > > > down_write(mapping.invalidate_lock) > > > > > > ext4_truncate() > > > ... > > > wait @j_wait_transaction_locked > > > > > > ext_truncate_failed_write() > > > down_write(mapping.invalidate_lock) > > > > > > ext4_should_retry_alloc() > > > ... > > > __jbd2_log_start_commit() > > > wake_up(j_wait_commit) > > > jbd2_journal_commit_transaction() > > > wake_up(j_wait_transaction_locked) > > > up_write(mapping.invalidate_lock) > > > > > > I hope this would help you understand the report. > > > > I see, thanks for explanation! So the above scenario is impossible because > > My pleasure. > > > for anyone to block on @j_wait_transaction_locked the transaction must be > > committing, which is done only by kjournald2 kthread and so that thread > > cannot be waiting at @j_wait_commit. Essentially blocking on > > @j_wait_transaction_locked means @j_wait_commit wakeup was already done. > > kjournal2 repeatedly does the wait and the wake_up so the above scenario > looks possible to me even based on what you explained. Maybe I should > understand how the journal things work more for furhter discussion. Your > explanation is so helpful. Thank you really. OK, let me provide you with more details for better understanding :) In jbd2 we have an object called 'transaction'. This object can go through many states but for our case is important that transaction is moved to T_LOCKED state and out of it only while jbd2_journal_commit_transaction() function is executing and waiting on j_wait_transaction_locked waitqueue is exactly waiting for a transaction to get out of T_LOCKED state. Function jbd2_journal_commit_transaction() is executed only by kjournald. Hence anyone can see transaction in T_LOCKED state only if kjournald is running inside jbd2_journal_commit_transaction() and thus kjournald cannot be sleeping on j_wait_commit at the same time. Does this explain things? Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR
On Thu, Feb 24, 2022 at 11:22:39AM +0100, Jan Kara wrote: > On Thu 24-02-22 10:11:02, Byungchul Park wrote: > > On Wed, Feb 23, 2022 at 03:48:59PM +0100, Jan Kara wrote: > > > > KJOURNALD2(kthread) TASK1(ksys_write) TASK2(ksys_write) > > > > > > > > wait A > > > > --- stuck > > > > wait B > > > > --- stuck > > > > wait C > > > > --- stuck > > > > > > > > wake up B wake up C wake up A > > > > > > > > where: > > > > A is a wait_queue, j_wait_commit > > > > B is a wait_queue, j_wait_transaction_locked > > > > C is a rwsem, mapping.invalidate_lock > > > > > > I see. But a situation like this is not necessarily a guarantee of a > > > deadlock, is it? I mean there can be task D that will eventually call say > > > 'wake up B' and unblock everything and this is how things were designed to > > > work? Multiple sources of wakeups are quite common I'd say... What does > > > > Yes. At the very beginning when I desgined Dept, I was thinking whether > > to support multiple wakeup sources or not for a quite long time. > > Supporting it would be a better option to aovid non-critical reports. > > However, I thought anyway we'd better fix it - not urgent tho - if > > there's any single circle dependency. That's why I decided not to > > support it for now and wanted to gather the kernel guys' opinions. Thing > > is which policy we should go with. > > I see. So supporting only a single wakeup source is fine for locks I guess. > But for general wait queues or other synchronization mechanisms, I'm afraid > it will lead to quite some false positive reports. Just my 2c. Thank you for your feedback. I realized we've been using "false positive" differently. There exist the three types of code in terms of dependency and deadlock. It's worth noting that dependencies are built from between waits and events in Dept. --- case 1. Code with an actual circular dependency, but not deadlock. A circular dependency can be broken by a rescue wakeup source e.g. timeout. It's not a deadlock. If it's okay that the contexts participating in the circular dependency and others waiting for the events in the circle are stuck until it gets broken. Otherwise, say, if it's not meant, then it's anyway problematic. 1-1. What if we judge this code is problematic? 1-2. What if we judge this code is good? case 2. Code with an actual circular dependency, and deadlock. There's no other wakeup source than those within the circular dependency. Literally deadlock. It's problematic and critical. 2-1. What if we judge this code is problematic? 2-2. What if we judge this code is good? case 3. Code with no actual circular dependency, and not deadlock. Must be good. 3-1. What if we judge this code is problematic? 3-2. What if we judge this code is good? --- I call only 3-1 "false positive" circular dependency. And you call 1-1 and 3-1 "false positive" deadlock. I've been wondering if the kernel guys esp. Linus considers code with any circular dependency is problematic or not, even if it won't lead to a deadlock, say, case 1. Even though I designed Dept based on what I believe is right, of course, I'm willing to change the design according to the majority opinion. However, I would never allow case 1 if I were the owner of the kernel for better stability, even though the code works anyway okay for now. Thanks, Byungchul > > > Dept do to prevent false reports in cases like this? > > > > > > > The above is the simplest form. And it's worth noting that Dept focuses > > > > on wait and event itself rather than grabing and releasing things like > > > > lock. The following is the more descriptive form of it. > > > > > > > > KJOURNALD2(kthread) TASK1(ksys_write) TASK2(ksys_write) > > > > > > > > wait @j_wait_commit > > > > ext4_truncate_failed_write() > > > > down_write(mapping.invalidate_lock) > > > > > > > > ext4_truncate() > > > > ... > > > > wait @j_wait_transaction_locked > > > > > > > > ext_truncate_failed_write() > > > > down_write(mapping.invalidate_lock) > > > > > > > > ext4_should_retry_alloc() > > > > ... > > > > __jbd2_log_start_commit() > > > > wake_up(j_wait_commit) > > > > jbd2_journal_commit_transaction() > > > > wake_up(j_wait_transaction_locked) > > > > up_write(mapping.invalidate_lock) > > > > > > > > I hope this would help you understand the report. > > > > > > I see, thanks for explanation! So the above scenario is impossible because > > > > My pleasure. > > > > > for anyone to block on @j_wait_transaction_locked the transaction must be > > > committing, which is done only by kjournald2 kthread and so that thread > > > cannot be waiting at @j_wait_commit. Essentially blocking on > > > @j_wait_transaction_locked means @j_wait_commit wakeup was already done. > > > > kjournal2 repeatedly does the wait and the wake_up so the above scenario > > looks possible to me even based on what you explained. Maybe I should > > understand how the journal things work more for furhter discussion. Your > > explanation is so helpful. Thank you really. > > OK, let me provide you with more details for better understanding :) In > jbd2 we have an object called 'transaction'. This object can go through > many states but for our case is important that transaction is moved to > T_LOCKED state and out of it only while jbd2_journal_commit_transaction() > function is executing and waiting on j_wait_transaction_locked waitqueue is > exactly waiting for a transaction to get out of T_LOCKED state. Function > jbd2_journal_commit_transaction() is executed only by kjournald. Hence > anyone can see transaction in T_LOCKED state only if kjournald is running > inside jbd2_journal_commit_transaction() and thus kjournald cannot be > sleeping on j_wait_commit at the same time. Does this explain things? > > Honza > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR
On Mon 28-02-22 18:28:26, Byungchul Park wrote:
> On Thu, Feb 24, 2022 at 11:22:39AM +0100, Jan Kara wrote:
> > On Thu 24-02-22 10:11:02, Byungchul Park wrote:
> > > On Wed, Feb 23, 2022 at 03:48:59PM +0100, Jan Kara wrote:
> > > > > KJOURNALD2(kthread) TASK1(ksys_write) TASK2(ksys_write)
> > > > >
> > > > > wait A
> > > > > --- stuck
> > > > > wait B
> > > > > --- stuck
> > > > > wait C
> > > > > --- stuck
> > > > >
> > > > > wake up B wake up C wake up A
> > > > >
> > > > > where:
> > > > > A is a wait_queue, j_wait_commit
> > > > > B is a wait_queue, j_wait_transaction_locked
> > > > > C is a rwsem, mapping.invalidate_lock
> > > >
> > > > I see. But a situation like this is not necessarily a guarantee of a
> > > > deadlock, is it? I mean there can be task D that will eventually call say
> > > > 'wake up B' and unblock everything and this is how things were designed to
> > > > work? Multiple sources of wakeups are quite common I'd say... What does
> > >
> > > Yes. At the very beginning when I desgined Dept, I was thinking whether
> > > to support multiple wakeup sources or not for a quite long time.
> > > Supporting it would be a better option to aovid non-critical reports.
> > > However, I thought anyway we'd better fix it - not urgent tho - if
> > > there's any single circle dependency. That's why I decided not to
> > > support it for now and wanted to gather the kernel guys' opinions. Thing
> > > is which policy we should go with.
> >
> > I see. So supporting only a single wakeup source is fine for locks I guess.
> > But for general wait queues or other synchronization mechanisms, I'm afraid
> > it will lead to quite some false positive reports. Just my 2c.
>
> Thank you for your feedback.
>
> I realized we've been using "false positive" differently. There exist
> the three types of code in terms of dependency and deadlock. It's worth
> noting that dependencies are built from between waits and events in Dept.
>
> ---
>
> case 1. Code with an actual circular dependency, but not deadlock.
>
> A circular dependency can be broken by a rescue wakeup source e.g.
> timeout. It's not a deadlock. If it's okay that the contexts
> participating in the circular dependency and others waiting for the
> events in the circle are stuck until it gets broken. Otherwise, say,
> if it's not meant, then it's anyway problematic.
>
> 1-1. What if we judge this code is problematic?
> 1-2. What if we judge this code is good?
>
> case 2. Code with an actual circular dependency, and deadlock.
>
> There's no other wakeup source than those within the circular
> dependency. Literally deadlock. It's problematic and critical.
>
> 2-1. What if we judge this code is problematic?
> 2-2. What if we judge this code is good?
>
> case 3. Code with no actual circular dependency, and not deadlock.
>
> Must be good.
>
> 3-1. What if we judge this code is problematic?
> 3-2. What if we judge this code is good?
>
> ---
>
> I call only 3-1 "false positive" circular dependency. And you call 1-1
> and 3-1 "false positive" deadlock.
>
> I've been wondering if the kernel guys esp. Linus considers code with
> any circular dependency is problematic or not, even if it won't lead to
> a deadlock, say, case 1. Even though I designed Dept based on what I
> believe is right, of course, I'm willing to change the design according
> to the majority opinion.
>
> However, I would never allow case 1 if I were the owner of the kernel
> for better stability, even though the code works anyway okay for now.
So yes, I call a report for the situation "There is circular dependency but
deadlock is not possible." a false positive. And that is because in my
opinion your definition of circular dependency includes schemes that are
useful and used in the kernel.
Your example in case 1 is kind of borderline (I personally would consider
that bug as well) but there are other more valid schemes with multiple
wakeup sources like:
We have a queue of work to do Q protected by lock L. Consumer process has
code like:
while (1) {
lock L
prepare_to_wait(work_queued);
if (no work) {
unlock L
sleep
} else {
unlock L
do work
wake_up(work_done)
}
}
AFAIU Dept will create dependency here that 'wakeup work_done' is after
'wait for work_queued'. Producer has code like:
while (1) {
lock L
prepare_to_wait(work_done)
if (too much work queued) {
unlock L
sleep
} else {
queue work
unlock L
wake_up(work_queued)
}
}
And Dept will create dependency here that 'wakeup work_queued' is after
'wait for work_done'. And thus we have a trivial cycle in the dependencies
despite the code being perfectly valid and safe.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Mon, Feb 28, 2022 at 11:14:44AM +0100, Jan Kara wrote:
> > case 1. Code with an actual circular dependency, but not deadlock.
> >
> > A circular dependency can be broken by a rescue wakeup source e.g.
> > timeout. It's not a deadlock. If it's okay that the contexts
> > participating in the circular dependency and others waiting for the
> > events in the circle are stuck until it gets broken. Otherwise, say,
> > if it's not meant, then it's anyway problematic.
> >
> > 1-1. What if we judge this code is problematic?
> > 1-2. What if we judge this code is good?
> >
> > I've been wondering if the kernel guys esp. Linus considers code with
> > any circular dependency is problematic or not, even if it won't lead to
> > a deadlock, say, case 1. Even though I designed Dept based on what I
> > believe is right, of course, I'm willing to change the design according
> > to the majority opinion.
> >
> > However, I would never allow case 1 if I were the owner of the kernel
> > for better stability, even though the code works anyway okay for now.
Note, I used the example of the timeout as the most obvious way of
explaining that a deadlock is not possible. There is also the much
more complex explanation which Jan was trying to give, which is what
leads to the circular dependency. It can happen that when trying to
start a handle, if either (a) there is not enough space in the journal
for new handles, or (b) the current transaction is so large that if we
don't close the transaction and start a new hone, we will end up
running out of space in the future, and so in that case,
start_this_handle() will block starting any more handles, and then
wake up the commit thread. The commit thread then waits for the
currently running threads to complete, before it allows new handles to
start, and then it will complete the commit. In the case of (a) we
then need to do a journal checkpoint, which is more work to release
space in the journal, and only then, can we allow new handles to start.
The botom line is (a) it works, (b) there aren't significant delays,
and for DEPT to complain that this is somehow wrong and we need to
completely rearchitect perfectly working code because it doesn't
confirm to DEPT's idea of what is "correct" is not acceptable.
> We have a queue of work to do Q protected by lock L. Consumer process has
> code like:
>
> while (1) {
> lock L
> prepare_to_wait(work_queued);
> if (no work) {
> unlock L
> sleep
> } else {
> unlock L
> do work
> wake_up(work_done)
> }
> }
>
> AFAIU Dept will create dependency here that 'wakeup work_done' is after
> 'wait for work_queued'. Producer has code like:
>
> while (1) {
> lock L
> prepare_to_wait(work_done)
> if (too much work queued) {
> unlock L
> sleep
> } else {
> queue work
> unlock L
> wake_up(work_queued)
> }
> }
>
> And Dept will create dependency here that 'wakeup work_queued' is after
> 'wait for work_done'. And thus we have a trivial cycle in the dependencies
> despite the code being perfectly valid and safe.
Cheers,
- Ted
On Mon, Feb 28, 2022 at 04:25:04PM -0500, Theodore Ts'o wrote:
> On Mon, Feb 28, 2022 at 11:14:44AM +0100, Jan Kara wrote:
> > > case 1. Code with an actual circular dependency, but not deadlock.
> > >
> > > A circular dependency can be broken by a rescue wakeup source e.g.
> > > timeout. It's not a deadlock. If it's okay that the contexts
> > > participating in the circular dependency and others waiting for the
> > > events in the circle are stuck until it gets broken. Otherwise, say,
> > > if it's not meant, then it's anyway problematic.
> > >
> > > 1-1. What if we judge this code is problematic?
> > > 1-2. What if we judge this code is good?
> > >
> > > I've been wondering if the kernel guys esp. Linus considers code with
> > > any circular dependency is problematic or not, even if it won't lead to
> > > a deadlock, say, case 1. Even though I designed Dept based on what I
> > > believe is right, of course, I'm willing to change the design according
> > > to the majority opinion.
> > >
> > > However, I would never allow case 1 if I were the owner of the kernel
> > > for better stability, even though the code works anyway okay for now.
>
> Note, I used the example of the timeout as the most obvious way of
> explaining that a deadlock is not possible. There is also the much
> more complex explanation which Jan was trying to give, which is what
> leads to the circular dependency. It can happen that when trying to
> start a handle, if either (a) there is not enough space in the journal
> for new handles, or (b) the current transaction is so large that if we
> don't close the transaction and start a new hone, we will end up
> running out of space in the future, and so in that case,
> start_this_handle() will block starting any more handles, and then
> wake up the commit thread. The commit thread then waits for the
> currently running threads to complete, before it allows new handles to
> start, and then it will complete the commit. In the case of (a) we
> then need to do a journal checkpoint, which is more work to release
> space in the journal, and only then, can we allow new handles to start.
Thank you for the full explanation of how journal things work.
> The botom line is (a) it works, (b) there aren't significant delays,
> and for DEPT to complain that this is somehow wrong and we need to
> completely rearchitect perfectly working code because it doesn't
> confirm to DEPT's idea of what is "correct" is not acceptable.
Thanks to you and Jan Kara, I realized it's not a real dependency in the
consumer and producer scenario but again *ONLY IF* there is a rescue
wakeup source. Dept should track the rescue wakeup source instead in the
case.
I won't ask you to rearchitect the working code. The code looks sane.
Thanks a lot.
Thanks,
Byungchul
> > We have a queue of work to do Q protected by lock L. Consumer process has
> > code like:
> >
> > while (1) {
> > lock L
> > prepare_to_wait(work_queued);
> > if (no work) {
> > unlock L
> > sleep
> > } else {
> > unlock L
> > do work
> > wake_up(work_done)
> > }
> > }
> >
> > AFAIU Dept will create dependency here that 'wakeup work_done' is after
> > 'wait for work_queued'. Producer has code like:
> >
> > while (1) {
> > lock L
> > prepare_to_wait(work_done)
> > if (too much work queued) {
> > unlock L
> > sleep
> > } else {
> > queue work
> > unlock L
> > wake_up(work_queued)
> > }
> > }
> >
> > And Dept will create dependency here that 'wakeup work_queued' is after
> > 'wait for work_done'. And thus we have a trivial cycle in the dependencies
> > despite the code being perfectly valid and safe.
>
> Cheers,
>
> - Ted
On Mon, Feb 28, 2022 at 11:14:44AM +0100, Jan Kara wrote:
> On Mon 28-02-22 18:28:26, Byungchul Park wrote:
> > case 1. Code with an actual circular dependency, but not deadlock.
> >
> > A circular dependency can be broken by a rescue wakeup source e.g.
> > timeout. It's not a deadlock. If it's okay that the contexts
> > participating in the circular dependency and others waiting for the
> > events in the circle are stuck until it gets broken. Otherwise, say,
> > if it's not meant, then it's anyway problematic.
> >
> > 1-1. What if we judge this code is problematic?
> > 1-2. What if we judge this code is good?
> >
> > case 2. Code with an actual circular dependency, and deadlock.
> >
> > There's no other wakeup source than those within the circular
> > dependency. Literally deadlock. It's problematic and critical.
> >
> > 2-1. What if we judge this code is problematic?
> > 2-2. What if we judge this code is good?
> >
> > case 3. Code with no actual circular dependency, and not deadlock.
> >
> > Must be good.
> >
> > 3-1. What if we judge this code is problematic?
> > 3-2. What if we judge this code is good?
> >
> > ---
> >
> > I call only 3-1 "false positive" circular dependency. And you call 1-1
> > and 3-1 "false positive" deadlock.
> >
> > I've been wondering if the kernel guys esp. Linus considers code with
> > any circular dependency is problematic or not, even if it won't lead to
> > a deadlock, say, case 1. Even though I designed Dept based on what I
> > believe is right, of course, I'm willing to change the design according
> > to the majority opinion.
> >
> > However, I would never allow case 1 if I were the owner of the kernel
> > for better stability, even though the code works anyway okay for now.
>
> So yes, I call a report for the situation "There is circular dependency but
> deadlock is not possible." a false positive. And that is because in my
> opinion your definition of circular dependency includes schemes that are
> useful and used in the kernel.
>
> Your example in case 1 is kind of borderline (I personally would consider
> that bug as well) but there are other more valid schemes with multiple
> wakeup sources like:
>
> We have a queue of work to do Q protected by lock L. Consumer process has
> code like:
>
> while (1) {
> lock L
> prepare_to_wait(work_queued);
> if (no work) {
> unlock L
> sleep
> } else {
> unlock L
> do work
> wake_up(work_done)
> }
> }
>
> AFAIU Dept will create dependency here that 'wakeup work_done' is after
> 'wait for work_queued'. Producer has code like:
First of all, thank you for this good example.
> while (1) {
> lock L
> prepare_to_wait(work_done)
> if (too much work queued) {
> unlock L
> sleep
> } else {
> queue work
> unlock L
> wake_up(work_queued)
> }
> }
>
> And Dept will create dependency here that 'wakeup work_queued' is after
> 'wait for work_done'. And thus we have a trivial cycle in the dependencies
> despite the code being perfectly valid and safe.
Unfortunately, it's neither perfect nor safe without another wakeup
source - rescue wakeup source.
consumer producer
lock L
(too much work queued == true)
unlock L
--- preempted
lock L
unlock L
do work
lock L
unlock L
do work
...
(no work == true)
sleep
--- scheduled in
sleep
This code leads a deadlock without another wakeup source, say, not safe.
But yes. I also think this code should be allowed if it anyway runs
alongside another wakeup source. For the case, Dept should track the
rescue wakeup source instead that leads a actual deadlock.
I will correct code to make Dept track its rescue wakeup source whenever
finding the case.
Lastly, just for your information, I need to explain how Dept works a
little more for you not to misunderstand Dept.
Assuming the consumer and producer guarantee not to lead a deadlock like
the following, Dept won't report it a problem:
consumer producer
sleep
wakeup work_done
queue work
sleep
wakeup work_queued
do work
sleep
wakeup work_done
queue work
sleep
wakeup work_queued
do work
sleep
... ...
Dept does not consider all waits preceeding an event but only waits that
might lead a deadlock. In this case, Dept works with each region
independently.
consumer producer
sleep <- initiates region 1
--- region 1 starts
... ...
--- region 1 ends
wakeup work_done
... ...
queue work
... ...
sleep <- initiates region 2
--- region 2 starts
... ...
--- region 2 ends
wakeup work_queued
... ...
do work
... ...
sleep <- initiates region 3
--- region 3 starts
... ...
--- region 3 ends
wakeup work_done
... ...
queue work
... ...
sleep <- initiates region 4
--- region 4 starts
... ...
--- region 4 ends
wakeup work_queued
... ...
do work
... ...
That is, Dept does not build dependencies across different regions. So
you don't have to worry about unreasonable false positives that much.
Thoughts?
Thanks,
Byungchul
> Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
On Thu, Mar 03, 2022 at 10:00:33AM +0900, Byungchul Park wrote:
>
> Unfortunately, it's neither perfect nor safe without another wakeup
> source - rescue wakeup source.
>
> consumer producer
>
> lock L
> (too much work queued == true)
> unlock L
> --- preempted
> lock L
> unlock L
> do work
> lock L
> unlock L
> do work
> ...
> (no work == true)
> sleep
> --- scheduled in
> sleep
That's not how things work in ext4. It's **way** more complicated
than that. We have multiple wait channels, one wake up the consumer
(read: the commit thread), and one which wakes up any processes
waiting for commit thread to have made forward progress. We also have
two spin-lock protected sequence number, one which indicates the
current commited transaction #, and one indicating the transaction #
that needs to be committed.
On the commit thread, it will sleep on j_wait_commit, and when it is
woken up, it will check to see if there is work to be done
(j_commit_sequence != j_commit_request), and if so, do the work, and
then wake up processes waiting on the wait_queue j_wait_done_commit.
(Again, all of this uses the pattern, "prepare to wait", then check to
see if we should sleep, if we do need to sleep, unlock j_state_lock,
then sleep. So this prevents any races leading to lost wakeups.
On the start_this_handle() thread, if we current transaction is too
full, we set j_commit_request to its transaction id to indicate that
we want the current transaction to be committed, and then we wake up
the j_wait_commit wait queue and then we enter a loop where do a
prepare_to_wait in j_wait_done_commit, check to see if
j_commit_sequence == the transaction id that we want to be completed,
and if it's not done yet, we unlock the j_state_lock spinlock, and go
to sleep. Again, because of the prepare_to_wait, there is no chance
of a lost wakeup.
So there really is no "consumer" and "producer" here. If you really
insist on using this model, which really doesn't apply, for one
thread, it's the consumer with respect to one wait queue, and the
producer with respect to the *other* wait queue. For the other
thread, the consumer and producer roles are reversed.
And of course, this is a highly simplified model, since we also have a
wait queue used by the commit thread to wait for the number of active
handles on a particular transaction to go to zero, and
stop_this_handle() will wake up commit thread via this wait queue when
the last active handle on a particular transaction is retired. (And
yes, that parameter is also protected by a different spin lock which
is per-transaction).
So it seems to me that a fundamental flaw in DEPT's model is assuming
that the only waiting paradigm that can be used is consumer/producer,
and that's simply not true. The fact that you use the term "lock" is
also going to lead a misleading line of reasoning, because properly
speaking, they aren't really locks. We are simply using wait channels
to wake up processes as necessary, and then they will check other
variables to decide whether or not they need to sleep or not, and we
have an invariant that when these variables change indicating forward
progress, the associated wait channel will be woken up.
Cheers,
- Ted
P.S. This model is also highly simplified since there are other
reasons why the commit thread can be woken up, some which might be via
a timeout, and some which is via the j_wait_commit wait channel but
not because j_commit_request has been changed, but because file system
is being unmounted, or the file system is being frozen in preparation
of a snapshot, etc. These are *not* necessary to prevent a deadlock,
because under normal circumstances the two wake channels are
sufficient of themselves. So please don't think of them as "rescue
wakeup sources"; again, that's highly misleading and the wrong way to
think of them.
And to make things even more complicated, we have more than 2 wait
channel --- we have *five*:
/**
* @j_wait_transaction_locked:
*
* Wait queue for waiting for a locked transaction to start committing,
* or for a barrier lock to be released.
*/
wait_queue_head_t j_wait_transaction_locked;
/**
* @j_wait_done_commit: Wait queue for waiting for commit to complete.
*/
wait_queue_head_t j_wait_done_commit;
/**
* @j_wait_commit: Wait queue to trigger commit.
*/
wait_queue_head_t j_wait_commit;
/**
* @j_wait_updates: Wait queue to wait for updates to complete.
*/
wait_queue_head_t j_wait_updates;
/**
* @j_wait_reserved:
*
* Wait queue to wait for reserved buffer credits to drop.
*/
wait_queue_head_t j_wait_reserved;
/**
* @j_fc_wait:
*
* Wait queue to wait for completion of async fast commits.
*/
wait_queue_head_t j_fc_wait;
"There are more things in heaven and Earth, Horatio,
Than are dreamt of in your philosophy."
- William Shakespeare, Hamlet
Ted wrote: > On Thu, Mar 03, 2022 at 10:00:33AM +0900, Byungchul Park wrote: > > > > Unfortunately, it's neither perfect nor safe without another wakeup > > source - rescue wakeup source. > > > > consumer producer > > > > lock L > > (too much work queued == true) > > unlock L > > --- preempted > > lock L > > unlock L > > do work > > lock L > > unlock L > > do work > > ... > > (no work == true) > > sleep > > --- scheduled in > > sleep > > That's not how things work in ext4. It's **way** more complicated You seem to get it wrong. This example is what Jan Kara gave me. I just tried to explain things based on Jan Kara's example so leaving all statements that Jan Kara wrote. Plus the example was so helpful. Thanks, Jan Kara. > than that. We have multiple wait channels, one wake up the consumer > (read: the commit thread), and one which wakes up any processes > waiting for commit thread to have made forward progress. We also have > two spin-lock protected sequence number, one which indicates the > current commited transaction #, and one indicating the transaction # > that needs to be committed. > > On the commit thread, it will sleep on j_wait_commit, and when it is > woken up, it will check to see if there is work to be done > (j_commit_sequence != j_commit_request), and if so, do the work, and > then wake up processes waiting on the wait_queue j_wait_done_commit. > (Again, all of this uses the pattern, "prepare to wait", then check to > see if we should sleep, if we do need to sleep, unlock j_state_lock, > then sleep. So this prevents any races leading to lost wakeups. > > On the start_this_handle() thread, if we current transaction is too > full, we set j_commit_request to its transaction id to indicate that > we want the current transaction to be committed, and then we wake up > the j_wait_commit wait queue and then we enter a loop where do a > prepare_to_wait in j_wait_done_commit, check to see if > j_commit_sequence == the transaction id that we want to be completed, > and if it's not done yet, we unlock the j_state_lock spinlock, and go > to sleep. Again, because of the prepare_to_wait, there is no chance > of a lost wakeup. The above explantion gives me a clear view about synchronization of journal things. I appreciate it. > So there really is no "consumer" and "producer" here. If you really > insist on using this model, which really doesn't apply, for one Dept does not assume "consumer" and "producer" model at all, but Dept works with general waits and events. *That model is just one of them.* > thread, it's the consumer with respect to one wait queue, and the > producer with respect to the *other* wait queue. For the other > thread, the consumer and producer roles are reversed. > > And of course, this is a highly simplified model, since we also have a > wait queue used by the commit thread to wait for the number of active > handles on a particular transaction to go to zero, and > stop_this_handle() will wake up commit thread via this wait queue when > the last active handle on a particular transaction is retired. (And > yes, that parameter is also protected by a different spin lock which > is per-transaction). This one also gives me a clear view. Thanks a lot. > So it seems to me that a fundamental flaw in DEPT's model is assuming > that the only waiting paradigm that can be used is consumer/producer, No, Dept does not. > and that's simply not true. The fact that you use the term "lock" is > also going to lead a misleading line of reasoning, because properly "lock/unlock L" comes from the Jan Kara's example. It has almost nothing to do with the explanation. I just left "lock/unlock L" as a statement that comes from the Jan Kara's example. > speaking, they aren't really locks. We are simply using wait channels I totally agree with you. *They aren't really locks but it's just waits and wakeups.* That's exactly why I decided to develop Dept. Dept is not interested in locks unlike Lockdep, but fouces on waits and wakeup sources itself. I think you get Dept wrong a lot. Please ask me more if you have things you doubt about Dept. Thanks, Byungchul
On Thu 03-03-22 10:00:33, Byungchul Park wrote:
> On Mon, Feb 28, 2022 at 11:14:44AM +0100, Jan Kara wrote:
> > On Mon 28-02-22 18:28:26, Byungchul Park wrote:
> > > case 1. Code with an actual circular dependency, but not deadlock.
> > >
> > > A circular dependency can be broken by a rescue wakeup source e.g.
> > > timeout. It's not a deadlock. If it's okay that the contexts
> > > participating in the circular dependency and others waiting for the
> > > events in the circle are stuck until it gets broken. Otherwise, say,
> > > if it's not meant, then it's anyway problematic.
> > >
> > > 1-1. What if we judge this code is problematic?
> > > 1-2. What if we judge this code is good?
> > >
> > > case 2. Code with an actual circular dependency, and deadlock.
> > >
> > > There's no other wakeup source than those within the circular
> > > dependency. Literally deadlock. It's problematic and critical.
> > >
> > > 2-1. What if we judge this code is problematic?
> > > 2-2. What if we judge this code is good?
> > >
> > > case 3. Code with no actual circular dependency, and not deadlock.
> > >
> > > Must be good.
> > >
> > > 3-1. What if we judge this code is problematic?
> > > 3-2. What if we judge this code is good?
> > >
> > > ---
> > >
> > > I call only 3-1 "false positive" circular dependency. And you call 1-1
> > > and 3-1 "false positive" deadlock.
> > >
> > > I've been wondering if the kernel guys esp. Linus considers code with
> > > any circular dependency is problematic or not, even if it won't lead to
> > > a deadlock, say, case 1. Even though I designed Dept based on what I
> > > believe is right, of course, I'm willing to change the design according
> > > to the majority opinion.
> > >
> > > However, I would never allow case 1 if I were the owner of the kernel
> > > for better stability, even though the code works anyway okay for now.
> >
> > So yes, I call a report for the situation "There is circular dependency but
> > deadlock is not possible." a false positive. And that is because in my
> > opinion your definition of circular dependency includes schemes that are
> > useful and used in the kernel.
> >
> > Your example in case 1 is kind of borderline (I personally would consider
> > that bug as well) but there are other more valid schemes with multiple
> > wakeup sources like:
> >
> > We have a queue of work to do Q protected by lock L. Consumer process has
> > code like:
> >
> > while (1) {
> > lock L
> > prepare_to_wait(work_queued);
> > if (no work) {
> > unlock L
> > sleep
> > } else {
> > unlock L
> > do work
> > wake_up(work_done)
> > }
> > }
> >
> > AFAIU Dept will create dependency here that 'wakeup work_done' is after
> > 'wait for work_queued'. Producer has code like:
>
> First of all, thank you for this good example.
>
> > while (1) {
> > lock L
> > prepare_to_wait(work_done)
> > if (too much work queued) {
> > unlock L
> > sleep
> > } else {
> > queue work
> > unlock L
> > wake_up(work_queued)
> > }
> > }
> >
> > And Dept will create dependency here that 'wakeup work_queued' is after
> > 'wait for work_done'. And thus we have a trivial cycle in the dependencies
> > despite the code being perfectly valid and safe.
>
> Unfortunately, it's neither perfect nor safe without another wakeup
> source - rescue wakeup source.
>
> consumer producer
>
> lock L
> (too much work queued == true)
> unlock L
> --- preempted
> lock L
> unlock L
> do work
> lock L
> unlock L
> do work
> ...
> (no work == true)
> sleep
> --- scheduled in
> sleep
>
> This code leads a deadlock without another wakeup source, say, not safe.
So the scenario you describe above is indeed possible. But the trick is
that the wakeup from 'consumer' as is doing work will remove 'producer'
from the wait queue and change the 'producer' process state to
'TASK_RUNNING'. So when 'producer' calls sleep (in fact schedule()), the
scheduler will just treat this as another preemption point and the
'producer' will immediately or soon continue to run. So indeed we can think
of this as "another wakeup source" but the source is in the CPU scheduler
itself. This is the standard way how waitqueues are used in the kernel...
> Lastly, just for your information, I need to explain how Dept works a
> little more for you not to misunderstand Dept.
>
> Assuming the consumer and producer guarantee not to lead a deadlock like
> the following, Dept won't report it a problem:
>
> consumer producer
>
> sleep
> wakeup work_done
> queue work
> sleep
> wakeup work_queued
> do work
> sleep
> wakeup work_done
> queue work
> sleep
> wakeup work_queued
> do work
> sleep
> ... ...
>
> Dept does not consider all waits preceeding an event but only waits that
> might lead a deadlock. In this case, Dept works with each region
> independently.
>
> consumer producer
>
> sleep <- initiates region 1
> --- region 1 starts
> ... ...
> --- region 1 ends
> wakeup work_done
> ... ...
> queue work
> ... ...
> sleep <- initiates region 2
> --- region 2 starts
> ... ...
> --- region 2 ends
> wakeup work_queued
> ... ...
> do work
> ... ...
> sleep <- initiates region 3
> --- region 3 starts
> ... ...
> --- region 3 ends
> wakeup work_done
> ... ...
> queue work
> ... ...
> sleep <- initiates region 4
> --- region 4 starts
> ... ...
> --- region 4 ends
> wakeup work_queued
> ... ...
> do work
> ... ...
>
> That is, Dept does not build dependencies across different regions. So
> you don't have to worry about unreasonable false positives that much.
>
> Thoughts?
Thanks for explanation! And what exactly defines the 'regions'? When some
process goes to sleep on some waitqueue, this defines a start of a region
at the place where all the other processes are at that moment and wakeup of
the waitqueue is an end of the region?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Thu, Mar 03, 2022 at 10:54:56AM +0100, Jan Kara wrote:
> On Thu 03-03-22 10:00:33, Byungchul Park wrote:
> > Unfortunately, it's neither perfect nor safe without another wakeup
> > source - rescue wakeup source.
> >
> > consumer producer
> >
> > lock L
> > (too much work queued == true)
> > unlock L
> > --- preempted
> > lock L
> > unlock L
> > do work
> > lock L
> > unlock L
> > do work
> > ...
> > (no work == true)
> > sleep
> > --- scheduled in
> > sleep
> >
> > This code leads a deadlock without another wakeup source, say, not safe.
>
> So the scenario you describe above is indeed possible. But the trick is
> that the wakeup from 'consumer' as is doing work will remove 'producer'
> from the wait queue and change the 'producer' process state to
> 'TASK_RUNNING'. So when 'producer' calls sleep (in fact schedule()), the
> scheduler will just treat this as another preemption point and the
> 'producer' will immediately or soon continue to run. So indeed we can think
> of this as "another wakeup source" but the source is in the CPU scheduler
> itself. This is the standard way how waitqueues are used in the kernel...
Nice! Thanks for the explanation. I will take it into account if needed.
> > Lastly, just for your information, I need to explain how Dept works a
> > little more for you not to misunderstand Dept.
> >
> > Assuming the consumer and producer guarantee not to lead a deadlock like
> > the following, Dept won't report it a problem:
> >
> > consumer producer
> >
> > sleep
> > wakeup work_done
> > queue work
> > sleep
> > wakeup work_queued
> > do work
> > sleep
> > wakeup work_done
> > queue work
> > sleep
> > wakeup work_queued
> > do work
> > sleep
> > ... ...
> >
> > Dept does not consider all waits preceeding an event but only waits that
> > might lead a deadlock. In this case, Dept works with each region
> > independently.
> >
> > consumer producer
> >
> > sleep <- initiates region 1
> > --- region 1 starts
> > ... ...
> > --- region 1 ends
> > wakeup work_done
> > ... ...
> > queue work
> > ... ...
> > sleep <- initiates region 2
> > --- region 2 starts
> > ... ...
> > --- region 2 ends
> > wakeup work_queued
> > ... ...
> > do work
> > ... ...
> > sleep <- initiates region 3
> > --- region 3 starts
> > ... ...
> > --- region 3 ends
> > wakeup work_done
> > ... ...
> > queue work
> > ... ...
> > sleep <- initiates region 4
> > --- region 4 starts
> > ... ...
> > --- region 4 ends
> > wakeup work_queued
> > ... ...
> > do work
> > ... ...
> >
> > That is, Dept does not build dependencies across different regions. So
> > you don't have to worry about unreasonable false positives that much.
> >
> > Thoughts?
>
> Thanks for explanation! And what exactly defines the 'regions'? When some
> process goes to sleep on some waitqueue, this defines a start of a region
> at the place where all the other processes are at that moment and wakeup of
> the waitqueue is an end of the region?
Yes. Let me explain it more for better understanding.
(I copied it from the talk I did with Matthew..)
ideal view
-----------
context X context Y
request event E ...
write REQUESTEVENT when (notice REQUESTEVENT written)
... notice the request from X [S]
--- ideally region 1 starts here
wait for the event ...
sleep if (can see REQUESTEVENT written)
it's on the way to the event
...
...
--- ideally region 1 ends here
finally the event [E]
Dept basically works with the above view with regard to wait and event.
But it's very hard to identify the ideal [S] point in practice. So Dept
instead identifies [S] point by checking WAITSTART with memory barriers
like the following, which would make Dept work conservatively.
Dept's view
------------
context X context Y
request event E ...
write REQUESTEVENT when (notice REQUESTEVENT written)
... notice the request from X
--- region 2 Dept gives up starts
wait for the event ...
write barrier
write WAITSTART read barrier
sleep when (notice WAITSTART written)
ensure the request has come [S]
--- region 2 Dept gives up ends
--- region 3 starts here
...
if (can see WAITSTART written)
it's on the way to the event
...
...
--- region 3 ends here
finally the event [E]
In short, Dept works with region 3.
Thanks,
Byungchul
On Thu, Feb 17, 2022 at 08:10:03PM +0900, Byungchul Park wrote: > [ 7.009608] =================================================== > [ 7.009613] DEPT: Circular dependency has been detected. > [ 7.009614] 5.17.0-rc1-00014-g8a599299c0cb-dirty #30 Tainted: G W > [ 7.009616] --------------------------------------------------- > [ 7.009617] summary > [ 7.009618] --------------------------------------------------- > [ 7.009618] *** DEADLOCK *** > [ 7.009618] > [ 7.009619] context A > [ 7.009619] [S] (unknown)(&(bit_wait_table + i)->dmap:0) Why is the context unknown here? I don't see a way to debug this without knowing where we acquired the bit wait lock. > [ 7.009621] [W] down_write(&ei->i_data_sem:0) > [ 7.009623] [E] event(&(bit_wait_table + i)->dmap:0) > [ 7.009624] > [ 7.009625] context B > [ 7.009625] [S] down_read(&ei->i_data_sem:0) > [ 7.009626] [W] wait(&(bit_wait_table + i)->dmap:0) > [ 7.009627] [E] up_read(&ei->i_data_sem:0) > [ 7.009628] > [ 7.009629] [S]: start of the event context > [ 7.009629] [W]: the wait blocked > [ 7.009630] [E]: the event not reachable > [ 7.009631] --------------------------------------------------- > [ 7.009631] context A's detail > [ 7.009632] --------------------------------------------------- > [ 7.009632] context A > [ 7.009633] [S] (unknown)(&(bit_wait_table + i)->dmap:0) > [ 7.009634] [W] down_write(&ei->i_data_sem:0) > [ 7.009635] [E] event(&(bit_wait_table + i)->dmap:0) > [ 7.009636] > [ 7.009636] [S] (unknown)(&(bit_wait_table + i)->dmap:0): > [ 7.009638] (N/A) > [ 7.009638] > [ 7.009639] [W] down_write(&ei->i_data_sem:0): > [ 7.009639] ext4_truncate (fs/ext4/inode.c:4187) > [ 7.009645] stacktrace: > [ 7.009646] down_write (kernel/locking/rwsem.c:1514) > [ 7.009648] ext4_truncate (fs/ext4/inode.c:4187) > [ 7.009650] ext4_da_write_begin (./include/linux/fs.h:827 fs/ext4/truncate.h:23 fs/ext4/inode.c:2963) > [ 7.009652] generic_perform_write (mm/filemap.c:3784) > [ 7.009654] ext4_buffered_write_iter (fs/ext4/file.c:269) > [ 7.009657] ext4_file_write_iter (fs/ext4/file.c:677) > [ 7.009659] new_sync_write (fs/read_write.c:504 (discriminator 1)) > [ 7.009662] vfs_write (fs/read_write.c:590) > [ 7.009663] ksys_write (fs/read_write.c:644) > [ 7.009664] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) > [ 7.009667] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) > [ 7.009669] > [ 7.009670] [E] event(&(bit_wait_table + i)->dmap:0): > [ 7.009671] __wake_up_common (kernel/sched/wait.c:108) > [ 7.009673] stacktrace: > [ 7.009674] dept_event (kernel/dependency/dept.c:2337) > [ 7.009677] __wake_up_common (kernel/sched/wait.c:109) > [ 7.009678] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1)) > [ 7.009679] __wake_up_bit (kernel/sched/wait_bit.c:127) > [ 7.009681] ext4_orphan_del (fs/ext4/orphan.c:282) > [ 7.009683] ext4_truncate (fs/ext4/inode.c:4212) > [ 7.009685] ext4_da_write_begin (./include/linux/fs.h:827 fs/ext4/truncate.h:23 fs/ext4/inode.c:2963) > [ 7.009687] generic_perform_write (mm/filemap.c:3784) > [ 7.009688] ext4_buffered_write_iter (fs/ext4/file.c:269) > [ 7.009690] ext4_file_write_iter (fs/ext4/file.c:677) > [ 7.009692] new_sync_write (fs/read_write.c:504 (discriminator 1)) > [ 7.009694] vfs_write (fs/read_write.c:590) > [ 7.009695] ksys_write (fs/read_write.c:644) > [ 7.009696] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) > [ 7.009698] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) > [ 7.009700] --------------------------------------------------- > [ 7.009700] context B's detail > [ 7.009701] --------------------------------------------------- > [ 7.009702] context B > [ 7.009702] [S] down_read(&ei->i_data_sem:0) > [ 7.009703] [W] wait(&(bit_wait_table + i)->dmap:0) > [ 7.009704] [E] up_read(&ei->i_data_sem:0) > [ 7.009705] > [ 7.009706] [S] down_read(&ei->i_data_sem:0): > [ 7.009707] ext4_map_blocks (./arch/x86/include/asm/bitops.h:207 ./include/asm-generic/bitops/instrumented-non-atomic.h:135 fs/ext4/ext4.h:1918 fs/ext4/inode.c:562) > [ 7.009709] stacktrace: > [ 7.009709] down_read (kernel/locking/rwsem.c:1461) > [ 7.009711] ext4_map_blocks (./arch/x86/include/asm/bitops.h:207 ./include/asm-generic/bitops/instrumented-non-atomic.h:135 fs/ext4/ext4.h:1918 fs/ext4/inode.c:562) > [ 7.009712] ext4_getblk (fs/ext4/inode.c:851) > [ 7.009714] ext4_bread (fs/ext4/inode.c:903) > [ 7.009715] __ext4_read_dirblock (fs/ext4/namei.c:117) > [ 7.009718] dx_probe (fs/ext4/namei.c:789) > [ 7.009720] ext4_dx_find_entry (fs/ext4/namei.c:1721) > [ 7.009722] __ext4_find_entry (fs/ext4/namei.c:1571) > [ 7.009723] ext4_lookup (fs/ext4/namei.c:1770) > [ 7.009725] lookup_open (./include/linux/dcache.h:361 fs/namei.c:3310) > [ 7.009727] path_openat (fs/namei.c:3401 fs/namei.c:3605) > [ 7.009729] do_filp_open (fs/namei.c:3637) > [ 7.009731] do_sys_openat2 (fs/open.c:1215) > [ 7.009732] do_sys_open (fs/open.c:1231) > [ 7.009734] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) > [ 7.009736] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) > [ 7.009738] > [ 7.009738] [W] wait(&(bit_wait_table + i)->dmap:0): > [ 7.009739] prepare_to_wait (kernel/sched/wait.c:275) > [ 7.009741] stacktrace: > [ 7.009741] __schedule (kernel/sched/sched.h:1318 kernel/sched/sched.h:1616 kernel/sched/core.c:6213) > [ 7.009743] schedule (kernel/sched/core.c:6373 (discriminator 1)) > [ 7.009744] io_schedule (./arch/x86/include/asm/current.h:15 kernel/sched/core.c:8392 kernel/sched/core.c:8418) > [ 7.009745] bit_wait_io (./arch/x86/include/asm/current.h:15 kernel/sched/wait_bit.c:210) > [ 7.009746] __wait_on_bit (kernel/sched/wait_bit.c:49) > [ 7.009748] out_of_line_wait_on_bit (kernel/sched/wait_bit.c:65) > [ 7.009749] ext4_read_bh (./arch/x86/include/asm/bitops.h:207 ./include/asm-generic/bitops/instrumented-non-atomic.h:135 ./include/linux/buffer_head.h:120 fs/ext4/super.c:201) > [ 7.009752] __read_extent_tree_block (fs/ext4/extents.c:545) > [ 7.009754] ext4_find_extent (fs/ext4/extents.c:928) > [ 7.009756] ext4_ext_map_blocks (fs/ext4/extents.c:4099) > [ 7.009757] ext4_map_blocks (fs/ext4/inode.c:563) > [ 7.009759] ext4_getblk (fs/ext4/inode.c:851) > [ 7.009760] ext4_bread (fs/ext4/inode.c:903) > [ 7.009762] __ext4_read_dirblock (fs/ext4/namei.c:117) > [ 7.009764] dx_probe (fs/ext4/namei.c:789) > [ 7.009765] ext4_dx_find_entry (fs/ext4/namei.c:1721) > [ 7.009767] > [ 7.009768] [E] up_read(&ei->i_data_sem:0): > [ 7.009769] ext4_map_blocks (fs/ext4/inode.c:593) > [ 7.009771] stacktrace: > [ 7.009771] up_read (kernel/locking/rwsem.c:1556) > [ 7.009774] ext4_map_blocks (fs/ext4/inode.c:593) > [ 7.009775] ext4_getblk (fs/ext4/inode.c:851) > [ 7.009777] ext4_bread (fs/ext4/inode.c:903) > [ 7.009778] __ext4_read_dirblock (fs/ext4/namei.c:117) > [ 7.009780] dx_probe (fs/ext4/namei.c:789) > [ 7.009782] ext4_dx_find_entry (fs/ext4/namei.c:1721) > [ 7.009784] __ext4_find_entry (fs/ext4/namei.c:1571) > [ 7.009786] ext4_lookup (fs/ext4/namei.c:1770) > [ 7.009788] lookup_open (./include/linux/dcache.h:361 fs/namei.c:3310) > [ 7.009789] path_openat (fs/namei.c:3401 fs/namei.c:3605) > [ 7.009791] do_filp_open (fs/namei.c:3637) > [ 7.009792] do_sys_openat2 (fs/open.c:1215) > [ 7.009794] do_sys_open (fs/open.c:1231) > [ 7.009795] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) > [ 7.009797] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) > [ 7.009799] --------------------------------------------------- > [ 7.009800] information that might be helpful > [ 7.009800] --------------------------------------------------- > [ 7.009801] CPU: 0 PID: 611 Comm: rs:main Q:Reg Tainted: G W 5.17.0-rc1-00014-g8a599299c0cb-dirty #30 > [ 7.009804] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > [ 7.009805] Call Trace: > [ 7.009806] <TASK> > [ 7.009807] dump_stack_lvl (lib/dump_stack.c:107) > [ 7.009809] print_circle (./arch/x86/include/asm/atomic.h:108 ./include/linux/atomic/atomic-instrumented.h:258 kernel/dependency/dept.c:157 kernel/dependency/dept.c:762) > [ 7.009812] ? print_circle (kernel/dependency/dept.c:1086) > [ 7.009814] cb_check_dl (kernel/dependency/dept.c:1104) > [ 7.009815] bfs (kernel/dependency/dept.c:860) > [ 7.009818] add_dep (kernel/dependency/dept.c:1423) > [ 7.009820] do_event.isra.25 (kernel/dependency/dept.c:1650) > [ 7.009822] ? __wake_up_common (kernel/sched/wait.c:108) > [ 7.009824] dept_event (kernel/dependency/dept.c:2337) > [ 7.009826] __wake_up_common (kernel/sched/wait.c:109) > [ 7.009828] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1)) > [ 7.009830] __wake_up_bit (kernel/sched/wait_bit.c:127) > [ 7.009832] ext4_orphan_del (fs/ext4/orphan.c:282) > [ 7.009835] ? dept_ecxt_exit (./arch/x86/include/asm/current.h:15 kernel/dependency/dept.c:241 kernel/dependency/dept.c:999 kernel/dependency/dept.c:1043 kernel/dependency/dept.c:2478) > [ 7.009837] ext4_truncate (fs/ext4/inode.c:4212) > [ 7.009839] ext4_da_write_begin (./include/linux/fs.h:827 fs/ext4/truncate.h:23 fs/ext4/inode.c:2963) > [ 7.009842] generic_perform_write (mm/filemap.c:3784) > [ 7.009845] ext4_buffered_write_iter (fs/ext4/file.c:269) > [ 7.009848] ext4_file_write_iter (fs/ext4/file.c:677) > [ 7.009851] new_sync_write (fs/read_write.c:504 (discriminator 1)) > [ 7.009854] vfs_write (fs/read_write.c:590) > [ 7.009856] ksys_write (fs/read_write.c:644) > [ 7.009857] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:65) > [ 7.009860] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) > [ 7.009862] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) > [ 7.009865] RIP: 0033:0x7f3b160b335d > [ 7.009867] Code: e1 20 00 00 75 10 b8 01 00 00 00 0f 05 48 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ce fa ff ff 48 89 04 24 b8 01 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 17 fb ff ff 48 89 d0 48 83 c4 08 48 3d 01 > All code > ======== > 0: e1 20 loope 0x22 > 2: 00 00 add %al,(%rax) > 4: 75 10 jne 0x16 > 6: b8 01 00 00 00 mov $0x1,%eax > b: 0f 05 syscall > d: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax > 13: 73 31 jae 0x46 > 15: c3 retq > 16: 48 83 ec 08 sub $0x8,%rsp > 1a: e8 ce fa ff ff callq 0xfffffffffffffaed > 1f: 48 89 04 24 mov %rax,(%rsp) > 23: b8 01 00 00 00 mov $0x1,%eax > 28: 0f 05 syscall > 2a:* 48 8b 3c 24 mov (%rsp),%rdi <-- trapping instruction > 2e: 48 89 c2 mov %rax,%rdx > 31: e8 17 fb ff ff callq 0xfffffffffffffb4d > 36: 48 89 d0 mov %rdx,%rax > 39: 48 83 c4 08 add $0x8,%rsp > 3d: 48 rex.W > 3e: 3d .byte 0x3d > 3f: 01 .byte 0x1 > > Code starting with the faulting instruction > =========================================== > 0: 48 8b 3c 24 mov (%rsp),%rdi > 4: 48 89 c2 mov %rax,%rdx > 7: e8 17 fb ff ff callq 0xfffffffffffffb23 > c: 48 89 d0 mov %rdx,%rax > f: 48 83 c4 08 add $0x8,%rsp > 13: 48 rex.W > 14: 3d .byte 0x3d > 15: 01 .byte 0x1 > [ 7.009869] RSP: 002b:00007f3b1340f180 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 > [ 7.009871] RAX: ffffffffffffffda RBX: 00007f3b040010a0 RCX: 00007f3b160b335d > [ 7.009873] RDX: 0000000000000300 RSI: 00007f3b040010a0 RDI: 0000000000000001 > [ 7.009874] RBP: 0000000000000000 R08: fffffffffffffa15 R09: fffffffffffffa05 > [ 7.009875] R10: 0000000000000000 R11: 0000000000000293 R12: 00007f3b04000df0 > [ 7.009876] R13: 00007f3b1340f1a0 R14: 0000000000000220 R15: 0000000000000300 > [ 7.009879] </TASK>
On Thu, Feb 17, 2022 at 01:27:42PM +0000, Matthew Wilcox wrote: > On Thu, Feb 17, 2022 at 08:10:03PM +0900, Byungchul Park wrote: > > [ 7.009608] =================================================== > > [ 7.009613] DEPT: Circular dependency has been detected. > > [ 7.009614] 5.17.0-rc1-00014-g8a599299c0cb-dirty #30 Tainted: G W > > [ 7.009616] --------------------------------------------------- > > [ 7.009617] summary > > [ 7.009618] --------------------------------------------------- > > [ 7.009618] *** DEADLOCK *** > > [ 7.009618] > > [ 7.009619] context A > > [ 7.009619] [S] (unknown)(&(bit_wait_table + i)->dmap:0) > > Why is the context unknown here? I don't see a way to debug this > without knowing where we acquired the bit wait lock. ideal view ------------------ context X context A request event E to context A ... write REQUESTEVENT if (can see REQUESTEVENT written) ... notice the request from X [S] wait for the event [W] ... write barrier write WAITSTART actual wait if (can see REQUESTEVENT written) consider it's on the way to the event ... ... finally the event [E] Dept works with the above view wrt. wait and event. [S] point varies depending on ways to make context A notice the request so that the event context A can start. Of course, by making more effort on identifying each way in the kernel, we can figure out the exact point. Here, we can also check WAITSTART with a read barrier instead of REQUESTEVENT in context A conservatively. That's how Dept works. Dept's view ------------------ context X context A request event E to context A ... write REQUESTEVENT if (can see REQUESTEVENT written) ... notice the request from X [S] wait for the event [W] ... write barrier write WAITSTART read barrier actual wait if (can see WAITSTART written) consider it's on the way to the event ... ... finally the event [E] consider all waits in context A so far that could see WAITSTART written Thanks, Byungchul > > [ 7.009621] [W] down_write(&ei->i_data_sem:0) > > [ 7.009623] [E] event(&(bit_wait_table + i)->dmap:0) > > [ 7.009624] > > [ 7.009625] context B > > [ 7.009625] [S] down_read(&ei->i_data_sem:0) > > [ 7.009626] [W] wait(&(bit_wait_table + i)->dmap:0) > > [ 7.009627] [E] up_read(&ei->i_data_sem:0) > > [ 7.009628] > > [ 7.009629] [S]: start of the event context > > [ 7.009629] [W]: the wait blocked > > [ 7.009630] [E]: the event not reachable > > [ 7.009631] --------------------------------------------------- > > [ 7.009631] context A's detail > > [ 7.009632] --------------------------------------------------- > > [ 7.009632] context A > > [ 7.009633] [S] (unknown)(&(bit_wait_table + i)->dmap:0) > > [ 7.009634] [W] down_write(&ei->i_data_sem:0) > > [ 7.009635] [E] event(&(bit_wait_table + i)->dmap:0) > > [ 7.009636] > > [ 7.009636] [S] (unknown)(&(bit_wait_table + i)->dmap:0): > > [ 7.009638] (N/A) > > [ 7.009638] > > [ 7.009639] [W] down_write(&ei->i_data_sem:0): > > [ 7.009639] ext4_truncate (fs/ext4/inode.c:4187) > > [ 7.009645] stacktrace: > > [ 7.009646] down_write (kernel/locking/rwsem.c:1514) > > [ 7.009648] ext4_truncate (fs/ext4/inode.c:4187) > > [ 7.009650] ext4_da_write_begin (./include/linux/fs.h:827 fs/ext4/truncate.h:23 fs/ext4/inode.c:2963) > > [ 7.009652] generic_perform_write (mm/filemap.c:3784) > > [ 7.009654] ext4_buffered_write_iter (fs/ext4/file.c:269) > > [ 7.009657] ext4_file_write_iter (fs/ext4/file.c:677) > > [ 7.009659] new_sync_write (fs/read_write.c:504 (discriminator 1)) > > [ 7.009662] vfs_write (fs/read_write.c:590) > > [ 7.009663] ksys_write (fs/read_write.c:644) > > [ 7.009664] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) > > [ 7.009667] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) > > [ 7.009669] > > [ 7.009670] [E] event(&(bit_wait_table + i)->dmap:0): > > [ 7.009671] __wake_up_common (kernel/sched/wait.c:108) > > [ 7.009673] stacktrace: > > [ 7.009674] dept_event (kernel/dependency/dept.c:2337) > > [ 7.009677] __wake_up_common (kernel/sched/wait.c:109) > > [ 7.009678] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1)) > > [ 7.009679] __wake_up_bit (kernel/sched/wait_bit.c:127) > > [ 7.009681] ext4_orphan_del (fs/ext4/orphan.c:282) > > [ 7.009683] ext4_truncate (fs/ext4/inode.c:4212) > > [ 7.009685] ext4_da_write_begin (./include/linux/fs.h:827 fs/ext4/truncate.h:23 fs/ext4/inode.c:2963) > > [ 7.009687] generic_perform_write (mm/filemap.c:3784) > > [ 7.009688] ext4_buffered_write_iter (fs/ext4/file.c:269) > > [ 7.009690] ext4_file_write_iter (fs/ext4/file.c:677) > > [ 7.009692] new_sync_write (fs/read_write.c:504 (discriminator 1)) > > [ 7.009694] vfs_write (fs/read_write.c:590) > > [ 7.009695] ksys_write (fs/read_write.c:644) > > [ 7.009696] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) > > [ 7.009698] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) > > [ 7.009700] --------------------------------------------------- > > [ 7.009700] context B's detail > > [ 7.009701] --------------------------------------------------- > > [ 7.009702] context B > > [ 7.009702] [S] down_read(&ei->i_data_sem:0) > > [ 7.009703] [W] wait(&(bit_wait_table + i)->dmap:0) > > [ 7.009704] [E] up_read(&ei->i_data_sem:0) > > [ 7.009705] > > [ 7.009706] [S] down_read(&ei->i_data_sem:0): > > [ 7.009707] ext4_map_blocks (./arch/x86/include/asm/bitops.h:207 ./include/asm-generic/bitops/instrumented-non-atomic.h:135 fs/ext4/ext4.h:1918 fs/ext4/inode.c:562) > > [ 7.009709] stacktrace: > > [ 7.009709] down_read (kernel/locking/rwsem.c:1461) > > [ 7.009711] ext4_map_blocks (./arch/x86/include/asm/bitops.h:207 ./include/asm-generic/bitops/instrumented-non-atomic.h:135 fs/ext4/ext4.h:1918 fs/ext4/inode.c:562) > > [ 7.009712] ext4_getblk (fs/ext4/inode.c:851) > > [ 7.009714] ext4_bread (fs/ext4/inode.c:903) > > [ 7.009715] __ext4_read_dirblock (fs/ext4/namei.c:117) > > [ 7.009718] dx_probe (fs/ext4/namei.c:789) > > [ 7.009720] ext4_dx_find_entry (fs/ext4/namei.c:1721) > > [ 7.009722] __ext4_find_entry (fs/ext4/namei.c:1571) > > [ 7.009723] ext4_lookup (fs/ext4/namei.c:1770) > > [ 7.009725] lookup_open (./include/linux/dcache.h:361 fs/namei.c:3310) > > [ 7.009727] path_openat (fs/namei.c:3401 fs/namei.c:3605) > > [ 7.009729] do_filp_open (fs/namei.c:3637) > > [ 7.009731] do_sys_openat2 (fs/open.c:1215) > > [ 7.009732] do_sys_open (fs/open.c:1231) > > [ 7.009734] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) > > [ 7.009736] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) > > [ 7.009738] > > [ 7.009738] [W] wait(&(bit_wait_table + i)->dmap:0): > > [ 7.009739] prepare_to_wait (kernel/sched/wait.c:275) > > [ 7.009741] stacktrace: > > [ 7.009741] __schedule (kernel/sched/sched.h:1318 kernel/sched/sched.h:1616 kernel/sched/core.c:6213) > > [ 7.009743] schedule (kernel/sched/core.c:6373 (discriminator 1)) > > [ 7.009744] io_schedule (./arch/x86/include/asm/current.h:15 kernel/sched/core.c:8392 kernel/sched/core.c:8418) > > [ 7.009745] bit_wait_io (./arch/x86/include/asm/current.h:15 kernel/sched/wait_bit.c:210) > > [ 7.009746] __wait_on_bit (kernel/sched/wait_bit.c:49) > > [ 7.009748] out_of_line_wait_on_bit (kernel/sched/wait_bit.c:65) > > [ 7.009749] ext4_read_bh (./arch/x86/include/asm/bitops.h:207 ./include/asm-generic/bitops/instrumented-non-atomic.h:135 ./include/linux/buffer_head.h:120 fs/ext4/super.c:201) > > [ 7.009752] __read_extent_tree_block (fs/ext4/extents.c:545) > > [ 7.009754] ext4_find_extent (fs/ext4/extents.c:928) > > [ 7.009756] ext4_ext_map_blocks (fs/ext4/extents.c:4099) > > [ 7.009757] ext4_map_blocks (fs/ext4/inode.c:563) > > [ 7.009759] ext4_getblk (fs/ext4/inode.c:851) > > [ 7.009760] ext4_bread (fs/ext4/inode.c:903) > > [ 7.009762] __ext4_read_dirblock (fs/ext4/namei.c:117) > > [ 7.009764] dx_probe (fs/ext4/namei.c:789) > > [ 7.009765] ext4_dx_find_entry (fs/ext4/namei.c:1721) > > [ 7.009767] > > [ 7.009768] [E] up_read(&ei->i_data_sem:0): > > [ 7.009769] ext4_map_blocks (fs/ext4/inode.c:593) > > [ 7.009771] stacktrace: > > [ 7.009771] up_read (kernel/locking/rwsem.c:1556) > > [ 7.009774] ext4_map_blocks (fs/ext4/inode.c:593) > > [ 7.009775] ext4_getblk (fs/ext4/inode.c:851) > > [ 7.009777] ext4_bread (fs/ext4/inode.c:903) > > [ 7.009778] __ext4_read_dirblock (fs/ext4/namei.c:117) > > [ 7.009780] dx_probe (fs/ext4/namei.c:789) > > [ 7.009782] ext4_dx_find_entry (fs/ext4/namei.c:1721) > > [ 7.009784] __ext4_find_entry (fs/ext4/namei.c:1571) > > [ 7.009786] ext4_lookup (fs/ext4/namei.c:1770) > > [ 7.009788] lookup_open (./include/linux/dcache.h:361 fs/namei.c:3310) > > [ 7.009789] path_openat (fs/namei.c:3401 fs/namei.c:3605) > > [ 7.009791] do_filp_open (fs/namei.c:3637) > > [ 7.009792] do_sys_openat2 (fs/open.c:1215) > > [ 7.009794] do_sys_open (fs/open.c:1231) > > [ 7.009795] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) > > [ 7.009797] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) > > [ 7.009799] --------------------------------------------------- > > [ 7.009800] information that might be helpful > > [ 7.009800] --------------------------------------------------- > > [ 7.009801] CPU: 0 PID: 611 Comm: rs:main Q:Reg Tainted: G W 5.17.0-rc1-00014-g8a599299c0cb-dirty #30 > > [ 7.009804] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > > [ 7.009805] Call Trace: > > [ 7.009806] <TASK> > > [ 7.009807] dump_stack_lvl (lib/dump_stack.c:107) > > [ 7.009809] print_circle (./arch/x86/include/asm/atomic.h:108 ./include/linux/atomic/atomic-instrumented.h:258 kernel/dependency/dept.c:157 kernel/dependency/dept.c:762) > > [ 7.009812] ? print_circle (kernel/dependency/dept.c:1086) > > [ 7.009814] cb_check_dl (kernel/dependency/dept.c:1104) > > [ 7.009815] bfs (kernel/dependency/dept.c:860) > > [ 7.009818] add_dep (kernel/dependency/dept.c:1423) > > [ 7.009820] do_event.isra.25 (kernel/dependency/dept.c:1650) > > [ 7.009822] ? __wake_up_common (kernel/sched/wait.c:108) > > [ 7.009824] dept_event (kernel/dependency/dept.c:2337) > > [ 7.009826] __wake_up_common (kernel/sched/wait.c:109) > > [ 7.009828] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1)) > > [ 7.009830] __wake_up_bit (kernel/sched/wait_bit.c:127) > > [ 7.009832] ext4_orphan_del (fs/ext4/orphan.c:282) > > [ 7.009835] ? dept_ecxt_exit (./arch/x86/include/asm/current.h:15 kernel/dependency/dept.c:241 kernel/dependency/dept.c:999 kernel/dependency/dept.c:1043 kernel/dependency/dept.c:2478) > > [ 7.009837] ext4_truncate (fs/ext4/inode.c:4212) > > [ 7.009839] ext4_da_write_begin (./include/linux/fs.h:827 fs/ext4/truncate.h:23 fs/ext4/inode.c:2963) > > [ 7.009842] generic_perform_write (mm/filemap.c:3784) > > [ 7.009845] ext4_buffered_write_iter (fs/ext4/file.c:269) > > [ 7.009848] ext4_file_write_iter (fs/ext4/file.c:677) > > [ 7.009851] new_sync_write (fs/read_write.c:504 (discriminator 1)) > > [ 7.009854] vfs_write (fs/read_write.c:590) > > [ 7.009856] ksys_write (fs/read_write.c:644) > > [ 7.009857] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:65) > > [ 7.009860] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) > > [ 7.009862] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) > > [ 7.009865] RIP: 0033:0x7f3b160b335d > > [ 7.009867] Code: e1 20 00 00 75 10 b8 01 00 00 00 0f 05 48 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ce fa ff ff 48 89 04 24 b8 01 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 17 fb ff ff 48 89 d0 48 83 c4 08 48 3d 01 > > All code > > ======== > > 0: e1 20 loope 0x22 > > 2: 00 00 add %al,(%rax) > > 4: 75 10 jne 0x16 > > 6: b8 01 00 00 00 mov $0x1,%eax > > b: 0f 05 syscall > > d: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax > > 13: 73 31 jae 0x46 > > 15: c3 retq > > 16: 48 83 ec 08 sub $0x8,%rsp > > 1a: e8 ce fa ff ff callq 0xfffffffffffffaed > > 1f: 48 89 04 24 mov %rax,(%rsp) > > 23: b8 01 00 00 00 mov $0x1,%eax > > 28: 0f 05 syscall > > 2a:* 48 8b 3c 24 mov (%rsp),%rdi <-- trapping instruction > > 2e: 48 89 c2 mov %rax,%rdx > > 31: e8 17 fb ff ff callq 0xfffffffffffffb4d > > 36: 48 89 d0 mov %rdx,%rax > > 39: 48 83 c4 08 add $0x8,%rsp > > 3d: 48 rex.W > > 3e: 3d .byte 0x3d > > 3f: 01 .byte 0x1 > > > > Code starting with the faulting instruction > > =========================================== > > 0: 48 8b 3c 24 mov (%rsp),%rdi > > 4: 48 89 c2 mov %rax,%rdx > > 7: e8 17 fb ff ff callq 0xfffffffffffffb23 > > c: 48 89 d0 mov %rdx,%rax > > f: 48 83 c4 08 add $0x8,%rsp > > 13: 48 rex.W > > 14: 3d .byte 0x3d > > 15: 01 .byte 0x1 > > [ 7.009869] RSP: 002b:00007f3b1340f180 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 > > [ 7.009871] RAX: ffffffffffffffda RBX: 00007f3b040010a0 RCX: 00007f3b160b335d > > [ 7.009873] RDX: 0000000000000300 RSI: 00007f3b040010a0 RDI: 0000000000000001 > > [ 7.009874] RBP: 0000000000000000 R08: fffffffffffffa15 R09: fffffffffffffa05 > > [ 7.009875] R10: 0000000000000000 R11: 0000000000000293 R12: 00007f3b04000df0 > > [ 7.009876] R13: 00007f3b1340f1a0 R14: 0000000000000220 R15: 0000000000000300 > > [ 7.009879] </TASK>
On Thu 17-02-22 20:10:03, Byungchul Park wrote: > [ 7.009608] =================================================== > [ 7.009613] DEPT: Circular dependency has been detected. > [ 7.009614] 5.17.0-rc1-00014-g8a599299c0cb-dirty #30 Tainted: G W > [ 7.009616] --------------------------------------------------- > [ 7.009617] summary > [ 7.009618] --------------------------------------------------- > [ 7.009618] *** DEADLOCK *** > [ 7.009618] > [ 7.009619] context A > [ 7.009619] [S] (unknown)(&(bit_wait_table + i)->dmap:0) > [ 7.009621] [W] down_write(&ei->i_data_sem:0) > [ 7.009623] [E] event(&(bit_wait_table + i)->dmap:0) > [ 7.009624] > [ 7.009625] context B > [ 7.009625] [S] down_read(&ei->i_data_sem:0) > [ 7.009626] [W] wait(&(bit_wait_table + i)->dmap:0) > [ 7.009627] [E] up_read(&ei->i_data_sem:0) > [ 7.009628] Looking into this I have noticed that Dept here tracks bitlocks (buffer locks in particular) but it apparently treats locks on all buffers as one locking class so it conflates lock on superblock buffer with a lock on extent tree block buffer. These are wastly different locks with different locking constraints. So to avoid false positives in filesystems we will need to add annotations to differentiate locks on different buffers (based on what the block is used for). Similarly how we e.g. annotate i_rwsem for different inodes. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR
On Tue, Feb 22, 2022 at 09:27:23AM +0100, Jan Kara wrote: > On Thu 17-02-22 20:10:03, Byungchul Park wrote: > > [ 7.009608] =================================================== > > [ 7.009613] DEPT: Circular dependency has been detected. > > [ 7.009614] 5.17.0-rc1-00014-g8a599299c0cb-dirty #30 Tainted: G W > > [ 7.009616] --------------------------------------------------- > > [ 7.009617] summary > > [ 7.009618] --------------------------------------------------- > > [ 7.009618] *** DEADLOCK *** > > [ 7.009618] > > [ 7.009619] context A > > [ 7.009619] [S] (unknown)(&(bit_wait_table + i)->dmap:0) > > [ 7.009621] [W] down_write(&ei->i_data_sem:0) > > [ 7.009623] [E] event(&(bit_wait_table + i)->dmap:0) > > [ 7.009624] > > [ 7.009625] context B > > [ 7.009625] [S] down_read(&ei->i_data_sem:0) > > [ 7.009626] [W] wait(&(bit_wait_table + i)->dmap:0) > > [ 7.009627] [E] up_read(&ei->i_data_sem:0) > > [ 7.009628] > > Looking into this I have noticed that Dept here tracks bitlocks (buffer > locks in particular) but it apparently treats locks on all buffers as one > locking class so it conflates lock on superblock buffer with a lock on > extent tree block buffer. These are wastly different locks with different > locking constraints. So to avoid false positives in filesystems we will > need to add annotations to differentiate locks on different buffers (based > on what the block is used for). Similarly how we e.g. annotate i_rwsem for Exactly yes. All synchronization objects should be classfied by what it is used for. Even though it's already classified by the location of the code initializing the object - roughly and normally saying we can expect those have the same constraint, we are actually assigning different constraints according to the subtle design esp. in file systems. It would also help the code have better documentation ;-) I'm willing to add annotations for that to fs. > different inodes. > > Honza > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR
On Tue, Feb 22, 2022 at 09:27:23AM +0100, Jan Kara wrote: > On Thu 17-02-22 20:10:03, Byungchul Park wrote: > > [ 7.009608] =================================================== > > [ 7.009613] DEPT: Circular dependency has been detected. > > [ 7.009614] 5.17.0-rc1-00014-g8a599299c0cb-dirty #30 Tainted: G W > > [ 7.009616] --------------------------------------------------- > > [ 7.009617] summary > > [ 7.009618] --------------------------------------------------- > > [ 7.009618] *** DEADLOCK *** > > [ 7.009618] > > [ 7.009619] context A > > [ 7.009619] [S] (unknown)(&(bit_wait_table + i)->dmap:0) > > [ 7.009621] [W] down_write(&ei->i_data_sem:0) > > [ 7.009623] [E] event(&(bit_wait_table + i)->dmap:0) > > [ 7.009624] > > [ 7.009625] context B > > [ 7.009625] [S] down_read(&ei->i_data_sem:0) > > [ 7.009626] [W] wait(&(bit_wait_table + i)->dmap:0) > > [ 7.009627] [E] up_read(&ei->i_data_sem:0) > > [ 7.009628] > > Looking into this I have noticed that Dept here tracks bitlocks (buffer > locks in particular) but it apparently treats locks on all buffers as one > locking class so it conflates lock on superblock buffer with a lock on > extent tree block buffer. These are wastly different locks with different > locking constraints. So to avoid false positives in filesystems we will > need to add annotations to differentiate locks on different buffers (based > on what the block is used for). Similarly how we e.g. annotate i_rwsem for > different inodes. Hi Jan Kara, I just understood why some guys in this space got mad at Dept reports. I barely got reports from the lock you mentioned with my system - precisely speaking only one, even though I've been rebooting my system many times. But another report that someone gave for me showed there were a lot of reports from the lock. Your comment and the report are so much helpful. I need to assign each's own class first for the buffer locks. Thank you very much. Thanks, Byungchul > > Honza > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR
© 2016 - 2026 Red Hat, Inc.