We got a bug report that a controller was stuck in the connected state
after an association dropped.
It turns out that nvme_fc_create_association can succeed even though some
operations fail. This is intentional, to handle the degraded controller
case where the admin queue is up and running but the I/O queues are not.
In this case the controller still reaches the LIVE state.
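For reference, the error filtering in nvme_set_queue_count currently
looks roughly like this (an abridged sketch, not a verbatim copy of the
upstream code):

	status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count,
				   NULL, 0, &result);
	if (status < 0)
		return status;

	/*
	 * Degraded controllers might return an error when setting the
	 * queue count. We still want to be able to bring them online
	 * and offer access to the admin queue, as that might be the
	 * only way to fix them up.
	 */
	if (status > 0) {
		dev_err(ctrl->device, "Could not set queue count (%d)\n",
			status);
		*count = 0;
	}
	return 0;

Any positive NVMe status is treated as the degraded case and reported as
success, which is why a connectivity loss surfacing as a host path error
status gets swallowed as well.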
Unfortunately, this also ignores a complete connectivity loss for fabric
controllers. Let's address this by no longer filtering out all errors in
nvme_set_queue_count.
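A minimal sketch of the idea, assuming the connectivity loss surfaces as
NVME_SC_HOST_PATH_ERROR (the exact set of status codes to bail out on is
part of what I'd like feedback on):

	/*
	 * It's either a kernel error or the host observed a connection
	 * loss. In both cases it's impossible to communicate with the
	 * controller, so take the error path instead of treating it as
	 * a degraded controller.
	 */
	if (status < 0 || status == NVME_SC_HOST_PATH_ERROR)
		return status;

This keeps the degraded-controller handling for genuine controller
errors while letting connectivity loss propagate to the caller.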
I haven't tested this version yet, as it needs a bit of tinkering in my
setup. So the question is whether this is a better approach. It would
also be great to hear from Paul whether this works for him.
In theory the nvme_set_queue_count call could still pass and the
connectivity loss could happen later, just before entering the LIVE
state. In that case the only thing that observes the connectivity loss
is the keep-alive handler, which currently does nothing about it. I
think we should also trigger a reset in this case. What do you think?
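Something along these lines in the error branch of
nvme_keep_alive_end_io, for example (only a sketch; whether
nvme_reset_ctrl is safe to call from this context and how it interacts
with the current controller state still needs checking):

	if (status) {
		dev_err(ctrl->device,
			"keep-alive failed, resetting controller\n");
		/*
		 * Assume connectivity is gone and schedule a controller
		 * reset instead of only logging the failure.
		 */
		nvme_reset_ctrl(ctrl);
		return RQ_END_IO_NONE;
	}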
---
Changes in v2:
- handle connection loss in nvme_set_queue_count directly
- collected Reviewed-by tags
- Link to v1: https://lore.kernel.org/r/20240611190647.11856-1-dwagner@suse.de
---
Daniel Wagner (2):
nvme-fc: go straight to connecting state when initializing
nvme: handle connectivity loss in nvme_set_queue_count
drivers/nvme/host/core.c | 7 ++++++-
drivers/nvme/host/fc.c | 3 +--
2 files changed, 7 insertions(+), 3 deletions(-)
---
base-commit: 5e52f71f858eaff252a47530a5ad5e79309bd415
change-id: 20241029-nvme-fc-handle-com-lost-9b241936809a
Best regards,
--
Daniel Wagner <wagi@kernel.org>