>I fully agree that SCSI EH is in need of reworking. But adding
>another layer of complexity on top of the existing one ... not sure.

Perhaps it would have been better to use only the error handler on the
device from the start. Users might wonder why a single disk failure
could cause other disks to become blocked.

>Additionally: TARGET RESET TMF is dead, and has been removed from SAM
>for several years. It really is not worthwhile implementing.

Hmm.

>Can't we take a simple step, and just try to have a non-blocking version
>of device reset?
>I think that should cover quite some issues already.

Do you think it's necessary to escalate the issue after the device reset
fails? Should we reset the bus or the host?
Moreover, a failed device reset does not necessarily indicate a fault
with the target or host.
And what does "non-blocking" mean?
On 9/2/25 07:56, JiangJianJun wrote:
>> I fully agree that SCSI EH is in need of reworking. But adding
>> another layer of complexity on top of the existing one ... not sure.
>
> Perhaps it would have been better to use only the error handler on the
> device from the start. Users might wonder why a single disk failure
> could cause other disks to become blocked.
>
>> Additionally: TARGET RESET TMF is dead, and has been removed from SAM
>> for several years. It really is not worthwhile implementing.
>
> Hmm.
>
>> Can't we take a simple step, and just try to have a non-blocking version
>> of device reset?
>> I think that should cover quite some issues already.
>
> Do you think it's necessary to escalate the issue after the device reset
> fails? Should we reset the bus or the host?
> Moreover, a failed device reset does not necessarily indicate a fault
> with the target or host.
> And what does "non-blocking" mean?
>
On the contrary, a failed device reset _always_ needs to be escalated.

The problem is that all EH issues start with a failed command
(ignoring the sg_reset case for now). And a command is typically
associated with data buffers / memory areas. So when a command has
failed we need to know when these buffers can be released.
If the device reset fails, the command could not be reset, and the
buffers cannot be released. And without further escalation the
buffers remain locked until the next reboot.

That's why host reset is so important: it typically resets the
entire HBA (via a PCI-level reset or similar), so we can be sure that
afterwards all buffers are released and the command can be completed.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
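
The escalation argument in the reply can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the kernel's SCSI EH code: `toy_cmd`, `eh_level`, `reset_fn`, `toy_eh_escalate()` and the two stub handlers are hypothetical names invented for illustration. It shows why stopping after a failed device reset leaves the command's buffers pinned, and why a wider reset is tried next.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical escalation levels, ordered from narrowest to widest scope. */
enum eh_level { EH_DEVICE_RESET, EH_BUS_RESET, EH_HOST_RESET, EH_LEVEL_COUNT };

/* Hypothetical failed command: its buffers stay pinned until a reset succeeds. */
struct toy_cmd {
    bool buffers_pinned;
};

/* A reset handler reports whether the reset at its level succeeded. */
typedef bool (*reset_fn)(void);

/*
 * Try each level in order, escalating whenever a level fails.
 * Returns the level that succeeded, or EH_LEVEL_COUNT if every level
 * failed, in which case the buffers remain pinned.
 */
enum eh_level toy_eh_escalate(struct toy_cmd *cmd,
                              const reset_fn handlers[EH_LEVEL_COUNT])
{
    for (int level = EH_DEVICE_RESET; level < EH_LEVEL_COUNT; level++) {
        if (handlers[level] != NULL && handlers[level]()) {
            /* Only after a successful reset is it safe to release buffers. */
            cmd->buffers_pinned = false;
            return (enum eh_level)level;
        }
    }
    return EH_LEVEL_COUNT;
}

/* Stub handlers standing in for real hardware outcomes. */
bool toy_reset_fails(void) { return false; }
bool toy_reset_succeeds(void) { return true; }
```

With handlers `{ toy_reset_fails, toy_reset_fails, toy_reset_succeeds }` the function returns `EH_HOST_RESET` and clears `buffers_pinned`. If the loop stopped after the first failure, the flag would stay set, which corresponds to the "locked until the next reboot" case described in the reply.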