[v1] scsi: scsi_error: Introduce new error handle mechanism

[PATCH 00/14] scsi: scsi_error: Introduce new error handle mechanism

Posted by JiangJianJun 1 month ago

>I fully agree that SCSI EH is in need of reworking. But adding 
>another layer of complexity on top of the existing one ... not sure.

Perhaps it would have been better to do error handling only at the
device level from the start. Users might wonder why a single disk failure
can cause other disks to become blocked.

>Additionally: TARGET RESET TMF is dead, and was removed from SAM
>several years ago. It really is not worthwhile implementing.

Hmm.

>Can't we take a simple step, and just try to have a non-blocking version
>of device reset?
>I think that should cover quite some issues already.

Do you think it's necessary to escalate the issue after the device reset
fails? Should we reset the bus or the host? 
Moreover, a failed device reset does not necessarily indicate a fault
with the target or host. 
And what does "non-blocking" mean here?

Re: [PATCH 00/14] scsi: scsi_error: Introduce new error handle mechanism

Posted by Hannes Reinecke 1 month ago

On 9/2/25 07:56, JiangJianJun wrote:
>> I fully agree that SCSI EH is in need of reworking. But adding
>> another layer of complexity on top of the existing one ... not sure.
> 
> Perhaps it would have been better to do error handling only at the
> device level from the start. Users might wonder why a single disk failure
> can cause other disks to become blocked.
> 
>> Additionally: TARGET RESET TMF is dead, and was removed from SAM
>> several years ago. It really is not worthwhile implementing.
> 
> Hmm.
> 
>> Can't we take a simple step, and just try to have a non-blocking version
>> of device reset?
>> I think that should cover quite some issues already.
> 
> Do you think it's necessary to escalate the issue after the device reset
> fails? Should we reset the bus or the host?
> Moreover, a failed device reset does not necessarily indicate a fault
> with the target or host.
> And what does "non-blocking" mean here?
> 
On the contrary, a failed device reset _always_ needs to be escalated.
The problem is that all EH issues start with a failed command (ignoring
the sg_reset case for now).
And a command typically is associated with data buffers / memory areas.
So when a command fails we need to know when these buffers can be
released. If the device reset fails, the command has not been reset
and the buffers cannot be released. And without further escalation the
buffers remain locked until the next reboot.
That's why host reset is so important: that typically resets the entire
HBA (via a PCI-level reset or similar), so we can be sure that
afterwards all buffers are released and the command can be completed.
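
The escalation argument above can be sketched in plain C. This is only
an illustration under stated assumptions, not the kernel's scsi_error.c
code: struct eh_cmd, enum eh_level, eh_step_fn and eh_escalate() are
hypothetical names, and the ladder is simplified to abort, device reset,
bus reset and host reset. A step returns true once the hardware has
given up the command; only then may the buffers be released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical escalation levels, ordered from narrowest to widest scope. */
enum eh_level {
	EH_ABORT,
	EH_DEVICE_RESET,
	EH_BUS_RESET,
	EH_HOST_RESET,
	EH_LEVEL_COUNT	/* also used as the "nothing worked" return value */
};

/* Hypothetical failed command; the flag stands in for its data buffers. */
struct eh_cmd {
	bool buffers_owned_by_hw;	/* true: hardware may still touch them */
};

/* A recovery step returns true once the hardware has given up the command. */
typedef bool (*eh_step_fn)(struct eh_cmd *cmd);

/* Example steps for experimenting: one always fails, one always succeeds. */
static bool eh_step_fail(struct eh_cmd *cmd) { (void)cmd; return false; }
static bool eh_step_ok(struct eh_cmd *cmd)   { (void)cmd; return true; }

/*
 * Try each step in order and stop at the first one that succeeds.
 * Returns the level that succeeded, or EH_LEVEL_COUNT if every step
 * failed, in which case the buffers stay owned by the hardware.
 */
static enum eh_level eh_escalate(struct eh_cmd *cmd,
				 const eh_step_fn steps[EH_LEVEL_COUNT])
{
	assert(cmd != NULL && steps != NULL);

	for (int level = EH_ABORT; level < EH_LEVEL_COUNT; level++) {
		if (steps[level] != NULL && steps[level](cmd)) {
			/* Only now is it safe to release the buffers. */
			cmd->buffers_owned_by_hw = false;
			return (enum eh_level)level;
		}
	}
	return EH_LEVEL_COUNT;
}
```

If the ladder stopped after a failed device reset, buffers_owned_by_hw
would stay true forever, which is the "locked until the next reboot"
case. With a working host reset as the last step, eh_escalate() always
ends with the buffers released.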

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich