drivers/nvme/host/fc.c | 43 ++++++++++++++++++------------------ drivers/nvme/target/fcloop.c | 15 +++++-------- 2 files changed, 27 insertions(+), 31 deletions(-)
With the additinal 3 patches and the blktests changes [1] I am able to run the
tests in a loop for a while without failures. I expect there are more problems
hidden but with these changes the first milesone is reached.
The first new patch is actually one from a previous attempt. It addresses the
problem in unloading path where we dont make any progress more [2].
The next two patches change the existing initial connection attempt of FC to
a synchronous one. Without this the auth test fail [3].
Daniel
[1] https://lore.kernel.org/linux-nvme/20230620132703.20648-1-dwagner@suse.de/
[2] https://lore.kernel.org/linux-nvme/20230418130159.11075-1-dwagner@suse.de/
[3] https://lore.kernel.org/linux-nvme/j4w7724skjsapgtp6wtll5fgsb4adhpfdtrif52lhc7iql4inf@3mu2gbjrrtcs/
changes:
v2:
- added RBs
- added new patches
v1:
- https://lore.kernel.org/linux-nvme/20230615094356.14878-1-dwagner@suse.de/
Initial cover letter:
A couple more fixes to enable blktests for the fc transport.
1) Another module unloading hanger which can be triggered with
the nvme/048 tests:
run blktests nvme/048 at 2023-06-06 13:04:49
nvmet: adding nsid 1 to subsystem blktests-subsystem-1
nvme nvme2: NVME-FC{0}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "blktests-subsystem-1"
(NULL device *): {0:0} Association created
[478] nvmet: ctrl 1 start keep-alive timer for 1 secs
[478] nvmet: check nqn.2014-08.org.nvmexpress:uuid:3d0c3f5d-cb37-4bc4-af29-2168953bfc0a
[478] nvmet: nvmet_setup_dhgroup: ctrl 1 selecting dhgroup 0
[478] nvmet: No authentication provided
nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:3d0c3f5d-cb37-4bc4-af29-2168953bfc0a.
[478] nvmet: adding queue 1 to ctrl 1.
[407] nvmet: adding queue 2 to ctrl 1.
[6549] nvmet: adding queue 3 to ctrl 1.
[1760] nvmet: adding queue 4 to ctrl 1.
nvme nvme2: NVME-FC{0}: controller connect complete
nvme nvme2: NVME-FC{0}: new ctrl: NQN "blktests-subsystem-1"
[478] nvmet: ctrl 1 reschedule traffic based keep-alive timer
[407] nvmet: ctrl 1 update keep-alive timer for 1 secs
nvme nvme2: NVME-FC{0}: io failed due to lldd error 6
nvme nvme2: NVME-FC{0}: transport association event: transport detected io error
nvme nvme2: NVME-FC{0}: resetting controller
[478] nvmet: ctrl 1 stop keep-alive
(NULL device *): {0:0} Association deleted
nvme nvme2: NVME-FC{0}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "blktests-subsystem-1"
(NULL device *): {0:0} Association freed
(NULL device *): {0:0} Association created
(NULL device *): Disconnect LS failed: No Association
[407] nvmet: ctrl 1 start keep-alive timer for 1 secs
[407] nvmet: check nqn.2014-08.org.nvmexpress:uuid:3d0c3f5d-cb37-4bc4-af29-2168953bfc0a
[407] nvmet: nvmet_setup_dhgroup: ctrl 1 selecting dhgroup 0
[407] nvmet: No authentication provided
nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:3d0c3f5d-cb37-4bc4-af29-2168953bfc0a.
nvme nvme2: reconnect: revising io queue count from 4 to 1
[478] nvmet: adding queue 1 to ctrl 1.
nvme nvme2: NVME-FC{0}: controller connect complete
[478] nvmet: ctrl 1 reschedule traffic based keep-alive timer
[478] nvmet: ctrl 1 update keep-alive timer for 1 secs
[478] nvmet: ctrl 1 update keep-alive timer for 1 secs
[6549] nvmet: ctrl 1 update keep-alive timer for 1 secs
[6549] nvmet: ctrl 1 update keep-alive timer for 1 secs
[6549] nvmet: ctrl 1 update keep-alive timer for 1 secs
[6549] nvmet: ctrl 1 update keep-alive timer for 1 secs
[1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
[1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
[1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
[1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
[1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
[1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
[407] nvmet: ctrl 1 update keep-alive timer for 1 secs
nvme nvme2: NVME-FC{0}: io failed due to lldd error 6
nvme nvme2: NVME-FC{0}: transport association event: transport detected io error
nvme nvme2: NVME-FC{0}: resetting controller
[478] nvmet: ctrl 1 stop keep-alive
(NULL device *): {0:0} Association deleted
(NULL device *): {0:0} Association freed
(NULL device *): Disconnect LS failed: No Association
nvme nvme2: NVME-FC{0}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "blktests-subsystem-1"
(NULL device *): {0:0} Association created
[1760] nvmet: ctrl 1 start keep-alive timer for 1 secs
[1760] nvmet: check nqn.2014-08.org.nvmexpress:uuid:3d0c3f5d-cb37-4bc4-af29-2168953bfc0a
[1760] nvmet: nvmet_setup_dhgroup: ctrl 1 selecting dhgroup 0
[1760] nvmet: No authentication provided
nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:3d0c3f5d-cb37-4bc4-af29-2168953bfc0a.
nvme nvme2: reconnect: revising io queue count from 1 to 4
[478] nvmet: adding queue 1 to ctrl 1.
[407] nvmet: adding queue 2 to ctrl 1.
[6549] nvmet: adding queue 3 to ctrl 1.
[1760] nvmet: adding queue 4 to ctrl 1.
nvme nvme2: NVME-FC{0}: controller connect complete
[1760] nvmet: ctrl 1 reschedule traffic based keep-alive timer
[478] nvmet: ctrl 1 update keep-alive timer for 1 secs
[478] nvmet: ctrl 1 update keep-alive timer for 1 secs
[478] nvmet: ctrl 1 update keep-alive timer for 1 secs
[478] nvmet: ctrl 1 update keep-alive timer for 1 secs
[1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
[6549] nvmet: ctrl 1 update keep-alive timer for 1 secs
[1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
[6549] nvmet: ctrl 1 update keep-alive timer for 1 secs
[1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
[1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
[6549] nvmet: ctrl 1 update keep-alive timer for 1 secs
[1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
[6549] nvmet: ctrl 1 update keep-alive timer for 1 secs
nvme nvme2: Removing ctrl: NQN "blktests-subsystem-1"
[407] nvmet: ctrl 1 stop keep-alive
(NULL device *): {0:0} Association deleted
(NULL device *): {0:0} Association freed
(NULL device *): Disconnect LS failed: No Association
nvme_fc: nvme_fc_exit_module: waiting for ctlr deletes
BTRFS info (device vda2): qgroup scan completed (inconsistency flag cleared)
INFO: task modprobe:11758 blocked for more than 491 seconds.
Tainted: G W 6.4.0-rc2+ #3
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:modprobe state:D stack:0 pid:11758 ppid:11585 flags:0x00004002
Call Trace:
<TASK>
__schedule+0x17b5/0x4870
? finish_lock_switch+0x8f/0x120
? finish_task_switch+0x17f/0x5c0
? __sched_text_start+0x10/0x10
? __schedule+0x17bd/0x4870
? do_raw_spin_trylock+0xc9/0x1f0
schedule+0xe6/0x1e0
schedule_timeout+0x7e/0x1e0
? __cfi_schedule_timeout+0x10/0x10
? do_raw_spin_trylock+0xc9/0x1f0
? __cfi_lock_release+0x10/0x10
? do_raw_spin_unlock+0x116/0x8a0
wait_for_common+0x377/0x600
? wait_for_completion+0x30/0x30
cleanup_module+0x222/0x240 [nvme_fc bda1ef6f83d00208ba257c5d0ac34dd6bdb58bf9]
__se_sys_delete_module+0x388/0x550
? __x64_sys_delete_module+0x50/0x50
? task_work_run+0x236/0x290
? syscall_enter_from_user_mode+0x2e/0x210
do_syscall_64+0x6e/0xa0
? syscall_exit_to_user_mode+0x5e/0x220
? do_syscall_64+0x7d/0xa0
? syscall_exit_to_user_mode+0x5e/0x220
? do_syscall_64+0x7d/0xa0
? syscall_exit_to_user_mode+0x5e/0x220
? do_syscall_64+0x7d/0xa0
entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7fa80811aebb
RSP: 002b:00007fff80ea0a88 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 000055a0b5acf1f0 RCX: 00007fa80811aebb
RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055a0b5acf258
RBP: 0000000000000000 R08: 1999999999999999 R09: 0000000000000000
R10: 00007fa808193ac0 R11: 0000000000000206 R12: 00007fff80ea0ad0
R13: 000055a0b5acf1f0 R14: 0000000000000000 R15: 0000000000000000
</TASK>
2) When executing blktests nvme/030 in a tight loop, I was able to
reproduce a different hanger. In this case the ->done() function
was never executed on the host fc side. It turns out we didn't
enqueue work items correctly and thus would only process one
work item.
run blktests nvme/030 at 2023-06-13 14:03:52
nvmet: adding nsid 1 to subsystem blktests-subsystem-1
nvme nvme2: NVME-FC{0}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "nqn.2014-08.org.nvmexpress.discovery"
(NULL device *): {0:0} Association created
[23733] nvmet: ctrl 1 start keep-alive timer for 120 secs
nvmet: creating discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:242d4a24-2484-4a80-8234-d0169409c5e8.
nvme nvme2: NVME-FC{0}: controller connect complete
nvme nvme2: NVME-FC{0}: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
nvme nvme2: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[23771] nvmet: ctrl 1 stop keep-alive
nvmet: adding nsid 1 to subsystem blktests-subsystem-2
(NULL device *): {0:0} Association deleted
(NULL device *): {0:0} Association freed
(NULL device *): Disconnect LS failed: No Association
nvme nvme2: NVME-FC{0}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "nqn.2014-08.org.nvmexpress.discovery"
(NULL device *): {0:0} Association created
[27320] nvmet: ctrl 1 start keep-alive timer for 120 secs
nvmet: creating discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:242d4a24-2484-4a80-8234-d0169409c5e8.
nvme nvme2: NVME-FC{0}: controller connect complete
nvme nvme2: NVME-FC{0}: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
nvme nvme2: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[27320] nvmet: ctrl 1 stop keep-alive
nvme nvme2: NVME-FC{0}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "nqn.2014-08.org.nvmexpress.discovery"
(NULL device *): {0:1} Association created
(NULL device *): {0:0} Association deleted
(NULL device *): {0:0} Association freed
(NULL device *): Disconnect LS failed: No Association
INFO: task kworker/u8:4:102 blocked for more than 491 seconds.
Tainted: G W 6.4.0-rc2+ #3
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u8:4 state:D stack:0 pid:102 ppid:2 flags:0x00004000
Workqueue: nvme-wq nvme_fc_connect_ctrl_work [nvme_fc]
Call Trace:
<TASK>
__schedule+0x17b5/0x4870
? try_to_wake_up+0xb5/0xfc0
? __cfi_lock_release+0x10/0x10
? __sched_text_start+0x10/0x10
? _raw_spin_unlock_irqrestore+0x24/0x50
? lock_release+0x2aa/0xd10
? wq_worker_sleeping+0x1e/0x200
schedule+0xe6/0x1e0
schedule_timeout+0x7e/0x1e0
? __cfi_schedule_timeout+0x10/0x10
? mark_lock+0x94/0x340
? lockdep_hardirqs_on_prepare+0x2aa/0x5e0
wait_for_common+0x377/0x600
? queue_work_on+0x57/0xa0
? wait_for_completion+0x30/0x30
nvme_fc_connect_ctrl_work+0x7dc/0x1a80 [nvme_fc 39e2bf78272df3a610fb1f9884513e99038af746]
process_one_work+0x80f/0xfb0
? rescuer_thread+0x1130/0x1130
? do_raw_spin_trylock+0xc9/0x1f0
? lock_acquired+0x310/0x9a0
? worker_thread+0xd5e/0x1260
worker_thread+0xbde/0x1260
kthread+0x25d/0x2f0
? __cfi_worker_thread+0x10/0x10
? __cfi_kthread+0x10/0x10
ret_from_fork+0x29/0x50
</TASK>
Daniel Wagner (5):
nvme-fc: Do not wait in vain when unloading module
nvme-fcloop: queue work items correctly
nvmet-fcloop: Remove remote port from list when unlinking
nvme-fc: Make initial connect attempt synchronous
nvme-fc: do no free ctrl opts
drivers/nvme/host/fc.c | 43 ++++++++++++++++++------------------
drivers/nvme/target/fcloop.c | 15 +++++--------
2 files changed, 27 insertions(+), 31 deletions(-)
--
2.41.0
We ran this patch series through our regression and NVMe/FC boot from SAN tests.
(I think 4/5 will also resolve some instability / race condition with
kickstart, since the connect
is now synchronous, but we may need some more time to full validate
that. I was going
to pursue a userspace solution but this may be good enough.)
Notwithstanding the other comments re: error codes, etc. feel free to add:
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Tested-by: Marco Patalano (mpatalan@redhat.com>
On Tue, Jun 20, 2023 at 9:39 AM Daniel Wagner <dwagner@suse.de> wrote:
>
> With the additinal 3 patches and the blktests changes [1] I am able to run the
> tests in a loop for a while without failures. I expect there are more problems
> hidden but with these changes the first milesone is reached.
>
> The first new patch is actually one from a previous attempt. It addresses the
> problem in unloading path where we dont make any progress more [2].
>
> The next two patches change the existing initial connection attempt of FC to
> a synchronous one. Without this the auth test fail [3].
>
> Daniel
>
>
> [1] https://lore.kernel.org/linux-nvme/20230620132703.20648-1-dwagner@suse.de/
> [2] https://lore.kernel.org/linux-nvme/20230418130159.11075-1-dwagner@suse.de/
> [3] https://lore.kernel.org/linux-nvme/j4w7724skjsapgtp6wtll5fgsb4adhpfdtrif52lhc7iql4inf@3mu2gbjrrtcs/
>
> changes:
>
> v2:
> - added RBs
> - added new patches
>
> v1:
> - https://lore.kernel.org/linux-nvme/20230615094356.14878-1-dwagner@suse.de/
>
>
> Initial cover letter:
>
> A couple more fixes to enable blktests for the fc transport.
>
> 1) Another module unloading hanger which can be triggered with
> the nvme/048 tests:
>
> run blktests nvme/048 at 2023-06-06 13:04:49
> nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> nvme nvme2: NVME-FC{0}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "blktests-subsystem-1"
> (NULL device *): {0:0} Association created
> [478] nvmet: ctrl 1 start keep-alive timer for 1 secs
> [478] nvmet: check nqn.2014-08.org.nvmexpress:uuid:3d0c3f5d-cb37-4bc4-af29-2168953bfc0a
> [478] nvmet: nvmet_setup_dhgroup: ctrl 1 selecting dhgroup 0
> [478] nvmet: No authentication provided
> nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:3d0c3f5d-cb37-4bc4-af29-2168953bfc0a.
> [478] nvmet: adding queue 1 to ctrl 1.
> [407] nvmet: adding queue 2 to ctrl 1.
> [6549] nvmet: adding queue 3 to ctrl 1.
> [1760] nvmet: adding queue 4 to ctrl 1.
> nvme nvme2: NVME-FC{0}: controller connect complete
> nvme nvme2: NVME-FC{0}: new ctrl: NQN "blktests-subsystem-1"
> [478] nvmet: ctrl 1 reschedule traffic based keep-alive timer
> [407] nvmet: ctrl 1 update keep-alive timer for 1 secs
> nvme nvme2: NVME-FC{0}: io failed due to lldd error 6
> nvme nvme2: NVME-FC{0}: transport association event: transport detected io error
> nvme nvme2: NVME-FC{0}: resetting controller
> [478] nvmet: ctrl 1 stop keep-alive
> (NULL device *): {0:0} Association deleted
> nvme nvme2: NVME-FC{0}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "blktests-subsystem-1"
> (NULL device *): {0:0} Association freed
> (NULL device *): {0:0} Association created
> (NULL device *): Disconnect LS failed: No Association
> [407] nvmet: ctrl 1 start keep-alive timer for 1 secs
> [407] nvmet: check nqn.2014-08.org.nvmexpress:uuid:3d0c3f5d-cb37-4bc4-af29-2168953bfc0a
> [407] nvmet: nvmet_setup_dhgroup: ctrl 1 selecting dhgroup 0
> [407] nvmet: No authentication provided
> nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:3d0c3f5d-cb37-4bc4-af29-2168953bfc0a.
> nvme nvme2: reconnect: revising io queue count from 4 to 1
> [478] nvmet: adding queue 1 to ctrl 1.
> nvme nvme2: NVME-FC{0}: controller connect complete
> [478] nvmet: ctrl 1 reschedule traffic based keep-alive timer
> [478] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [478] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [6549] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [6549] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [6549] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [6549] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [407] nvmet: ctrl 1 update keep-alive timer for 1 secs
> nvme nvme2: NVME-FC{0}: io failed due to lldd error 6
> nvme nvme2: NVME-FC{0}: transport association event: transport detected io error
> nvme nvme2: NVME-FC{0}: resetting controller
> [478] nvmet: ctrl 1 stop keep-alive
> (NULL device *): {0:0} Association deleted
> (NULL device *): {0:0} Association freed
> (NULL device *): Disconnect LS failed: No Association
> nvme nvme2: NVME-FC{0}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "blktests-subsystem-1"
> (NULL device *): {0:0} Association created
> [1760] nvmet: ctrl 1 start keep-alive timer for 1 secs
> [1760] nvmet: check nqn.2014-08.org.nvmexpress:uuid:3d0c3f5d-cb37-4bc4-af29-2168953bfc0a
> [1760] nvmet: nvmet_setup_dhgroup: ctrl 1 selecting dhgroup 0
> [1760] nvmet: No authentication provided
> nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:3d0c3f5d-cb37-4bc4-af29-2168953bfc0a.
> nvme nvme2: reconnect: revising io queue count from 1 to 4
> [478] nvmet: adding queue 1 to ctrl 1.
> [407] nvmet: adding queue 2 to ctrl 1.
> [6549] nvmet: adding queue 3 to ctrl 1.
> [1760] nvmet: adding queue 4 to ctrl 1.
> nvme nvme2: NVME-FC{0}: controller connect complete
> [1760] nvmet: ctrl 1 reschedule traffic based keep-alive timer
> [478] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [478] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [478] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [478] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [6549] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [6549] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [6549] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [1760] nvmet: ctrl 1 update keep-alive timer for 1 secs
> [6549] nvmet: ctrl 1 update keep-alive timer for 1 secs
> nvme nvme2: Removing ctrl: NQN "blktests-subsystem-1"
> [407] nvmet: ctrl 1 stop keep-alive
> (NULL device *): {0:0} Association deleted
> (NULL device *): {0:0} Association freed
> (NULL device *): Disconnect LS failed: No Association
> nvme_fc: nvme_fc_exit_module: waiting for ctlr deletes
> BTRFS info (device vda2): qgroup scan completed (inconsistency flag cleared)
> INFO: task modprobe:11758 blocked for more than 491 seconds.
> Tainted: G W 6.4.0-rc2+ #3
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:modprobe state:D stack:0 pid:11758 ppid:11585 flags:0x00004002
> Call Trace:
> <TASK>
> __schedule+0x17b5/0x4870
> ? finish_lock_switch+0x8f/0x120
> ? finish_task_switch+0x17f/0x5c0
> ? __sched_text_start+0x10/0x10
> ? __schedule+0x17bd/0x4870
> ? do_raw_spin_trylock+0xc9/0x1f0
> schedule+0xe6/0x1e0
> schedule_timeout+0x7e/0x1e0
> ? __cfi_schedule_timeout+0x10/0x10
> ? do_raw_spin_trylock+0xc9/0x1f0
> ? __cfi_lock_release+0x10/0x10
> ? do_raw_spin_unlock+0x116/0x8a0
> wait_for_common+0x377/0x600
> ? wait_for_completion+0x30/0x30
> cleanup_module+0x222/0x240 [nvme_fc bda1ef6f83d00208ba257c5d0ac34dd6bdb58bf9]
> __se_sys_delete_module+0x388/0x550
> ? __x64_sys_delete_module+0x50/0x50
> ? task_work_run+0x236/0x290
> ? syscall_enter_from_user_mode+0x2e/0x210
> do_syscall_64+0x6e/0xa0
> ? syscall_exit_to_user_mode+0x5e/0x220
> ? do_syscall_64+0x7d/0xa0
> ? syscall_exit_to_user_mode+0x5e/0x220
> ? do_syscall_64+0x7d/0xa0
> ? syscall_exit_to_user_mode+0x5e/0x220
> ? do_syscall_64+0x7d/0xa0
> entry_SYSCALL_64_after_hwframe+0x72/0xdc
> RIP: 0033:0x7fa80811aebb
> RSP: 002b:00007fff80ea0a88 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
> RAX: ffffffffffffffda RBX: 000055a0b5acf1f0 RCX: 00007fa80811aebb
> RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055a0b5acf258
> RBP: 0000000000000000 R08: 1999999999999999 R09: 0000000000000000
> R10: 00007fa808193ac0 R11: 0000000000000206 R12: 00007fff80ea0ad0
> R13: 000055a0b5acf1f0 R14: 0000000000000000 R15: 0000000000000000
> </TASK>
>
>
> 2) When executing blktests nvme/030 in a tight loop, I was able to
> reproduce a different hanger. In this case the ->done() function
> was never executed on the host fc side. It turns out we didn't
> enqueue work items correctly and thus would only process one
> work item.
>
> run blktests nvme/030 at 2023-06-13 14:03:52
> nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> nvme nvme2: NVME-FC{0}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "nqn.2014-08.org.nvmexpress.discovery"
> (NULL device *): {0:0} Association created
> [23733] nvmet: ctrl 1 start keep-alive timer for 120 secs
> nvmet: creating discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:242d4a24-2484-4a80-8234-d0169409c5e8.
> nvme nvme2: NVME-FC{0}: controller connect complete
> nvme nvme2: NVME-FC{0}: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
> nvme nvme2: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
> [23771] nvmet: ctrl 1 stop keep-alive
> nvmet: adding nsid 1 to subsystem blktests-subsystem-2
> (NULL device *): {0:0} Association deleted
> (NULL device *): {0:0} Association freed
> (NULL device *): Disconnect LS failed: No Association
> nvme nvme2: NVME-FC{0}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "nqn.2014-08.org.nvmexpress.discovery"
> (NULL device *): {0:0} Association created
> [27320] nvmet: ctrl 1 start keep-alive timer for 120 secs
> nvmet: creating discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:242d4a24-2484-4a80-8234-d0169409c5e8.
> nvme nvme2: NVME-FC{0}: controller connect complete
> nvme nvme2: NVME-FC{0}: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
> nvme nvme2: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
> [27320] nvmet: ctrl 1 stop keep-alive
> nvme nvme2: NVME-FC{0}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "nqn.2014-08.org.nvmexpress.discovery"
> (NULL device *): {0:1} Association created
> (NULL device *): {0:0} Association deleted
> (NULL device *): {0:0} Association freed
> (NULL device *): Disconnect LS failed: No Association
> INFO: task kworker/u8:4:102 blocked for more than 491 seconds.
> Tainted: G W 6.4.0-rc2+ #3
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/u8:4 state:D stack:0 pid:102 ppid:2 flags:0x00004000
> Workqueue: nvme-wq nvme_fc_connect_ctrl_work [nvme_fc]
> Call Trace:
> <TASK>
> __schedule+0x17b5/0x4870
> ? try_to_wake_up+0xb5/0xfc0
> ? __cfi_lock_release+0x10/0x10
> ? __sched_text_start+0x10/0x10
> ? _raw_spin_unlock_irqrestore+0x24/0x50
> ? lock_release+0x2aa/0xd10
> ? wq_worker_sleeping+0x1e/0x200
> schedule+0xe6/0x1e0
> schedule_timeout+0x7e/0x1e0
> ? __cfi_schedule_timeout+0x10/0x10
> ? mark_lock+0x94/0x340
> ? lockdep_hardirqs_on_prepare+0x2aa/0x5e0
> wait_for_common+0x377/0x600
> ? queue_work_on+0x57/0xa0
> ? wait_for_completion+0x30/0x30
> nvme_fc_connect_ctrl_work+0x7dc/0x1a80 [nvme_fc 39e2bf78272df3a610fb1f9884513e99038af746]
> process_one_work+0x80f/0xfb0
> ? rescuer_thread+0x1130/0x1130
> ? do_raw_spin_trylock+0xc9/0x1f0
> ? lock_acquired+0x310/0x9a0
> ? worker_thread+0xd5e/0x1260
> worker_thread+0xbde/0x1260
> kthread+0x25d/0x2f0
> ? __cfi_worker_thread+0x10/0x10
> ? __cfi_kthread+0x10/0x10
> ret_from_fork+0x29/0x50
> </TASK>
>
> Daniel Wagner (5):
> nvme-fc: Do not wait in vain when unloading module
> nvme-fcloop: queue work items correctly
> nvmet-fcloop: Remove remote port from list when unlinking
> nvme-fc: Make initial connect attempt synchronous
> nvme-fc: do no free ctrl opts
>
> drivers/nvme/host/fc.c | 43 ++++++++++++++++++------------------
> drivers/nvme/target/fcloop.c | 15 +++++--------
> 2 files changed, 27 insertions(+), 31 deletions(-)
>
> --
> 2.41.0
>
>
© 2016 - 2026 Red Hat, Inc.