[v1] scsi: fcoe: Fix potential deadlock on &fip->ctlr_lock

[PATCH RESEND] scsi: fcoe: Fix potential deadlock on &fip->ctlr_lock

Posted by Chengfeng Ye 2 years, 5 months ago

There is a long call chain that &fip->ctlr_lock is acquired by isr
fnic_isr_msix_wq_copy() under hard irq context. Thus other process
context code acquiring the lock should disable irq, otherwise
deadlock could happen if the irq preempt the execution while the
lock is held in process context on the same CPU.

[ISR]
fnic_isr_msix_wq_copy()
 -> fnic_wq_copy_cmpl_handler()
 -> fnic_fcpio_cmpl_handler()
 -> fnic_fcpio_flogi_reg_cmpl_handler()
 -> fnic_flush_tx()
 -> fnic_send_frame()
 -> fcoe_ctlr_els_send()
 -> spin_lock_bh(&fip->ctlr_lock)

[Process Context]
1. fcoe_ctlr_timer_work()
 -> fcoe_ctlr_flogi_send()
 -> spin_lock_bh(&fip->ctlr_lock)

2. fcoe_ctlr_recv_work()
 -> fcoe_ctlr_recv_handler()
 -> fcoe_ctlr_recv_els()
 -> fcoe_ctlr_announce()
 -> spin_lock_bh(&fip->ctlr_lock)

3. fcoe_ctlr_recv_work()
 -> fcoe_ctlr_recv_handler()
 -> fcoe_ctlr_recv_els()
 -> fcoe_ctlr_flogi_retry()
 -> spin_lock_bh(&fip->ctlr_lock)

4. -> fcoe_xmit()
 -> fcoe_ctlr_els_send()
 -> spin_lock_bh(&fip->ctlr_lock)

spin_lock_bh() is not enough since fnic_isr_msix_wq_copy() is a
hardirq.

These flaws were found by an experimental static analysis tool I am
developing for irq-related deadlock.

The patch fix the potential deadlocks by spin_lock_irqsave() to
disable hard irq.

Signed-off-by: Chengfeng Ye <dg573847474@gmail.com>
---
 drivers/scsi/fcoe/fcoe_ctlr.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/scsi/fcoe/fcoe_ctlr.c b/drivers/scsi/fcoe/fcoe_ctlr.c
index 5c8d1ba3f8f3..19eee108db02 100644
--- a/drivers/scsi/fcoe/fcoe_ctlr.c
+++ b/drivers/scsi/fcoe/fcoe_ctlr.c
@@ -319,16 +319,17 @@ static void fcoe_ctlr_announce(struct fcoe_ctlr *fip)
 {
 	struct fcoe_fcf *sel;
 	struct fcoe_fcf *fcf;
+	unsigned long flags;
 
 	mutex_lock(&fip->ctlr_mutex);
-	spin_lock_bh(&fip->ctlr_lock);
+	spin_lock_irqsave(&fip->ctlr_lock, flags);
 
 	kfree_skb(fip->flogi_req);
 	fip->flogi_req = NULL;
 	list_for_each_entry(fcf, &fip->fcfs, list)
 		fcf->flogi_sent = 0;
 
-	spin_unlock_bh(&fip->ctlr_lock);
+	spin_unlock_irqrestore(&fip->ctlr_lock, flags);
 	sel = fip->sel_fcf;
 
 	if (sel && ether_addr_equal(sel->fcf_mac, fip->dest_addr))
@@ -699,6 +700,7 @@ int fcoe_ctlr_els_send(struct fcoe_ctlr *fip, struct fc_lport *lport,
 {
 	struct fc_frame *fp;
 	struct fc_frame_header *fh;
+	unsigned long flags;
 	u16 old_xid;
 	u8 op;
 	u8 mac[ETH_ALEN];
@@ -732,11 +734,11 @@ int fcoe_ctlr_els_send(struct fcoe_ctlr *fip, struct fc_lport *lport,
 		op = FIP_DT_FLOGI;
 		if (fip->mode == FIP_MODE_VN2VN)
 			break;
-		spin_lock_bh(&fip->ctlr_lock);
+		spin_lock_irqsave(&fip->ctlr_lock, flags);
 		kfree_skb(fip->flogi_req);
 		fip->flogi_req = skb;
 		fip->flogi_req_send = 1;
-		spin_unlock_bh(&fip->ctlr_lock);
+		spin_unlock_irqrestore(&fip->ctlr_lock, flags);
 		schedule_work(&fip->timer_work);
 		return -EINPROGRESS;
 	case ELS_FDISC:
@@ -1705,10 +1707,11 @@ static int fcoe_ctlr_flogi_send_locked(struct fcoe_ctlr *fip)
 static int fcoe_ctlr_flogi_retry(struct fcoe_ctlr *fip)
 {
 	struct fcoe_fcf *fcf;
+	unsigned long flags;
 	int error;
 
 	mutex_lock(&fip->ctlr_mutex);
-	spin_lock_bh(&fip->ctlr_lock);
+	spin_lock_irqsave(&fip->ctlr_lock, flags);
 	LIBFCOE_FIP_DBG(fip, "re-sending FLOGI - reselect\n");
 	fcf = fcoe_ctlr_select(fip);
 	if (!fcf || fcf->flogi_sent) {
@@ -1719,7 +1722,7 @@ static int fcoe_ctlr_flogi_retry(struct fcoe_ctlr *fip)
 		fcoe_ctlr_solicit(fip, NULL);
 		error = fcoe_ctlr_flogi_send_locked(fip);
 	}
-	spin_unlock_bh(&fip->ctlr_lock);
+	spin_unlock_irqrestore(&fip->ctlr_lock, flags);
 	mutex_unlock(&fip->ctlr_mutex);
 	return error;
 }
@@ -1736,8 +1739,9 @@ static int fcoe_ctlr_flogi_retry(struct fcoe_ctlr *fip)
 static void fcoe_ctlr_flogi_send(struct fcoe_ctlr *fip)
 {
 	struct fcoe_fcf *fcf;
+	unsigned long flags;
 
-	spin_lock_bh(&fip->ctlr_lock);
+	spin_lock_irqsave(&fip->ctlr_lock, flags);
 	fcf = fip->sel_fcf;
 	if (!fcf || !fip->flogi_req_send)
 		goto unlock;
@@ -1764,7 +1768,7 @@ static void fcoe_ctlr_flogi_send(struct fcoe_ctlr *fip)
 	} else /* XXX */
 		LIBFCOE_FIP_DBG(fip, "No FCF selected - defer send\n");
 unlock:
-	spin_unlock_bh(&fip->ctlr_lock);
+	spin_unlock_irqrestore(&fip->ctlr_lock, flags);
 }
 
 /**
-- 
2.17.1

Re: [PATCH RESEND] scsi: fcoe: Fix potential deadlock on &fip->ctlr_lock

Posted by Davidlohr Bueso 2 years, 5 months ago

On Wed, 16 Aug 2023, Chengfeng Ye wrote:

>There is a long call chain that &fip->ctlr_lock is acquired by isr
>fnic_isr_msix_wq_copy() under hard irq context. Thus other process
>context code acquiring the lock should disable irq, otherwise
>deadlock could happen if the irq preempt the execution while the
>lock is held in process context on the same CPU.
>
>[ISR]
>fnic_isr_msix_wq_copy()
> -> fnic_wq_copy_cmpl_handler()
> -> fnic_fcpio_cmpl_handler()
> -> fnic_fcpio_flogi_reg_cmpl_handler()
> -> fnic_flush_tx()
> -> fnic_send_frame()
> -> fcoe_ctlr_els_send()
> -> spin_lock_bh(&fip->ctlr_lock)
>
>[Process Context]
>1. fcoe_ctlr_timer_work()
> -> fcoe_ctlr_flogi_send()
> -> spin_lock_bh(&fip->ctlr_lock)
>
>2. fcoe_ctlr_recv_work()
> -> fcoe_ctlr_recv_handler()
> -> fcoe_ctlr_recv_els()
> -> fcoe_ctlr_announce()
> -> spin_lock_bh(&fip->ctlr_lock)
>
>3. fcoe_ctlr_recv_work()
> -> fcoe_ctlr_recv_handler()
> -> fcoe_ctlr_recv_els()
> -> fcoe_ctlr_flogi_retry()
> -> spin_lock_bh(&fip->ctlr_lock)
>
>4. -> fcoe_xmit()
> -> fcoe_ctlr_els_send()
> -> spin_lock_bh(&fip->ctlr_lock)
>
>spin_lock_bh() is not enough since fnic_isr_msix_wq_copy() is a
>hardirq.
>
>These flaws were found by an experimental static analysis tool I am
>developing for irq-related deadlock.
>
>The patch fix the potential deadlocks by spin_lock_irqsave() to
>disable hard irq.

Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>

Re: [PATCH RESEND] scsi: fcoe: Fix potential deadlock on &fip->ctlr_lock

Posted by Sebastian Andrzej Siewior 2 years, 5 months ago

On 2023-08-16 15:55:24 [+0000], Chengfeng Ye wrote:
> These flaws were found by an experimental static analysis tool I am
> developing for irq-related deadlock.

So you did not use lockdep or is the code path so unlikely that nobody
stumbled upon it before?

> The patch fix the potential deadlocks by spin_lock_irqsave() to
> disable hard irq.

Shouldn't this have

Fixes: 794d98e77f590 ("[SCSI] libfcoe: retry rejected FLOGI to another FCF if possible")

?

> Signed-off-by: Chengfeng Ye <dg573847474@gmail.com>

Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Sebastian

Re: [PATCH RESEND] scsi: fcoe: Fix potential deadlock on &fip->ctlr_lock

Posted by Chengfeng Ye 2 years, 5 months ago

> So you did not use lockdep or is the code path so unlikely that nobody
> stumbled upon it before?

I think so, the lock acquisition inside the interrupt is guarded by an
"unlikely"
if-branch condition inside fnic_send_frame(), so it happens that no one has
encountered the problem so far by using lockdep.


> Fixes: 794d98e77f590 ("[SCSI] libfcoe: retry rejected FLOGI to another FCF if possible")

Just sent v2 patch to add the fixes tag


Thanks for the review,
Chengfeng

Re: [PATCH RESEND] scsi: fcoe: Fix potential deadlock on &fip->ctlr_lock

Posted by Hannes Reinecke 2 years, 5 months ago

On 8/16/23 17:55, Chengfeng Ye wrote:
> There is a long call chain that &fip->ctlr_lock is acquired by isr
> fnic_isr_msix_wq_copy() under hard irq context. Thus other process
> context code acquiring the lock should disable irq, otherwise
> deadlock could happen if the irq preempt the execution while the
> lock is held in process context on the same CPU.
> 
> [ISR]
> fnic_isr_msix_wq_copy()
>   -> fnic_wq_copy_cmpl_handler()
>   -> fnic_fcpio_cmpl_handler()
>   -> fnic_fcpio_flogi_reg_cmpl_handler()
>   -> fnic_flush_tx()
>   -> fnic_send_frame()
>   -> fcoe_ctlr_els_send()
>   -> spin_lock_bh(&fip->ctlr_lock)
> 
> [Process Context]
> 1. fcoe_ctlr_timer_work()
>   -> fcoe_ctlr_flogi_send()
>   -> spin_lock_bh(&fip->ctlr_lock)
> 
> 2. fcoe_ctlr_recv_work()
>   -> fcoe_ctlr_recv_handler()
>   -> fcoe_ctlr_recv_els()
>   -> fcoe_ctlr_announce()
>   -> spin_lock_bh(&fip->ctlr_lock)
> 
> 3. fcoe_ctlr_recv_work()
>   -> fcoe_ctlr_recv_handler()
>   -> fcoe_ctlr_recv_els()
>   -> fcoe_ctlr_flogi_retry()
>   -> spin_lock_bh(&fip->ctlr_lock)
> 
> 4. -> fcoe_xmit()
>   -> fcoe_ctlr_els_send()
>   -> spin_lock_bh(&fip->ctlr_lock)
> 
> spin_lock_bh() is not enough since fnic_isr_msix_wq_copy() is a
> hardirq.
> 
> These flaws were found by an experimental static analysis tool I am
> developing for irq-related deadlock.
> 
> The patch fix the potential deadlocks by spin_lock_irqsave() to
> disable hard irq.
> 
> Signed-off-by: Chengfeng Ye <dg573847474@gmail.com>
> ---
>   drivers/scsi/fcoe/fcoe_ctlr.c | 20 ++++++++++++--------
>   1 file changed, 12 insertions(+), 8 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman

Re: [PATCH RESEND] scsi: fcoe: Fix potential deadlock on &fip->ctlr_lock

Posted by Chengfeng Ye 2 years, 5 months ago

Thanks for the review!

Best,
Chengfeng

Re: [PATCH RESEND] scsi: fcoe: Fix potential deadlock on &fip->ctlr_lock

Posted by Chengfeng Ye 2 years, 5 months ago

Dear maintainers,

May I ask if anyone could help review the patch?

Thanks,
Chengfeng