[v2] af_packet: fix soft lockup issue caused by tpacket_snd()

[PATCH v2] af_packet: fix soft lockup issue caused by tpacket_snd()

Posted by Yun Lu 7 months ago

From: Yun Lu <luyun@kylinos.cn>

When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for
pending_refcnt to decrement to zero before returning. The pending_refcnt
is decremented by 1 when the skb->destructor function is called,
indicating that the skb has been successfully sent and needs to be
destroyed.

If an error occurs during this process, the tpacket_snd() function will
exit and return error, but pending_refcnt may not yet have decremented to
zero. Assuming the next send operation is executed immediately, but there
are no available frames to be sent in tx_ring (i.e., packet_current_frame
returns NULL), and skb is also NULL, the function will not execute
wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it
will enter a do-while loop, waiting for pending_refcnt to be zero. Even
if the previous skb has completed transmission, the skb->destructor
function can only be invoked in the ksoftirqd thread (assuming NAPI
threading is enabled). When both the ksoftirqd thread and the tpacket_snd
operation happen to run on the same CPU, and the CPU trapped in the
do-while loop without yielding, the ksoftirqd thread will not get
scheduled to run. As a result, pending_refcnt will never be reduced to
zero, and the do-while loop cannot exit, eventually leading to a CPU soft
lockup issue.

In fact, as long as pending_refcnt is not zero, even if skb is NULL,
wait_for_completion_interruptible_timeout() should be executed to yield
the CPU, allowing the ksoftirqd thread to be scheduled. Therefore, the
execution condition of this function should be modified to check if
pending_refcnt is not zero.

Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET")
Suggested-by: LongJun Tang <tanglongjun@kylinos.cn>
Signed-off-by: Yun Lu <luyun@kylinos.cn>

---
Changes in v2:
- Add a Fixes tag.
- Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/
---
 net/packet/af_packet.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 3d43f3eae759..7df96311adb8 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2845,7 +2845,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		ph = packet_current_frame(po, &po->tx_ring,
 					  TP_STATUS_SEND_REQUEST);
 		if (unlikely(ph == NULL)) {
-			if (need_wait && skb) {
+			if (need_wait && packet_read_pending(&po->tx_ring)) {
 				timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
 				timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
 				if (timeo <= 0) {
-- 
2.43.0

Re: [PATCH v2] af_packet: fix soft lockup issue caused by tpacket_snd()

Posted by Eric Dumazet 7 months ago

On Mon, Jul 7, 2025 at 7:06 PM Yun Lu <luyun_611@163.com> wrote:
>
> From: Yun Lu <luyun@kylinos.cn>
>
> When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for
> pending_refcnt to decrement to zero before returning. The pending_refcnt
> is decremented by 1 when the skb->destructor function is called,
> indicating that the skb has been successfully sent and needs to be
> destroyed.
>
> If an error occurs during this process, the tpacket_snd() function will
> exit and return error, but pending_refcnt may not yet have decremented to
> zero. Assuming the next send operation is executed immediately, but there
> are no available frames to be sent in tx_ring (i.e., packet_current_frame
> returns NULL), and skb is also NULL, the function will not execute
> wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it
> will enter a do-while loop, waiting for pending_refcnt to be zero. Even
> if the previous skb has completed transmission, the skb->destructor
> function can only be invoked in the ksoftirqd thread (assuming NAPI
> threading is enabled). When both the ksoftirqd thread and the tpacket_snd
> operation happen to run on the same CPU, and the CPU trapped in the
> do-while loop without yielding, the ksoftirqd thread will not get
> scheduled to run. As a result, pending_refcnt will never be reduced to
> zero, and the do-while loop cannot exit, eventually leading to a CPU soft
> lockup issue.
>
> In fact, as long as pending_refcnt is not zero, even if skb is NULL,
> wait_for_completion_interruptible_timeout() should be executed to yield
> the CPU, allowing the ksoftirqd thread to be scheduled. Therefore, the
> execution condition of this function should be modified to check if
> pending_refcnt is not zero.
>
> Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET")
> Suggested-by: LongJun Tang <tanglongjun@kylinos.cn>
> Signed-off-by: Yun Lu <luyun@kylinos.cn>
>
> ---
> Changes in v2:
> - Add a Fixes tag.
> - Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/
> ---
>  net/packet/af_packet.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 3d43f3eae759..7df96311adb8 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -2845,7 +2845,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
>                 ph = packet_current_frame(po, &po->tx_ring,
>                                           TP_STATUS_SEND_REQUEST);
>                 if (unlikely(ph == NULL)) {
> -                       if (need_wait && skb) {
> +                       if (need_wait && packet_read_pending(&po->tx_ring)) {
>                                 timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
>                                 timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
>                                 if (timeo <= 0) {

packet_read_pending() is super expensive on hosts with 256 cpus (or more)

We are going to call it a second time at the end of the block:

do { ...
} while (ph != NULL || (need_wait && packet_read_pending(&po->tx_ring)...

Perhaps we can remove the second one ?

Also I think there is another problem with the code.

We should call sock_sndtimeo() only once, otherwise SO_SNDTIMEO
constraint could be way off.

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index f6b1ff883c9318facdcb9c3112b94f0b6e40d504..486ade64bddfddb1af91968dbdf70015cfb93eb5
100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2785,8 +2785,9 @@ static int tpacket_snd(struct packet_sock *po,
struct msghdr *msg)
        int len_sum = 0;
        int status = TP_STATUS_AVAILABLE;
        int hlen, tlen, copylen = 0;
-       long timeo = 0;
+       long timeo;

+       timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
        mutex_lock(&po->pg_vec_lock);

        /* packet_sendmsg() check on tx_ring.pg_vec was lockless,
@@ -2846,7 +2847,6 @@ static int tpacket_snd(struct packet_sock *po,
struct msghdr *msg)
                                          TP_STATUS_SEND_REQUEST);
                if (unlikely(ph == NULL)) {
                        if (need_wait && skb) {
-                               timeo = sock_sndtimeo(&po->sk,
msg->msg_flags & MSG_DONTWAIT);
                                timeo =
wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
                                if (timeo <= 0) {
                                        err = !timeo ? -ETIMEDOUT :
-ERESTARTSYS;

Re: [PATCH v2] af_packet: fix soft lockup issue caused by tpacket_snd()

Posted by luyun 7 months ago

在 2025/7/8 15:12, Eric Dumazet 写道:
> On Mon, Jul 7, 2025 at 7:06 PM Yun Lu <luyun_611@163.com> wrote:
>> From: Yun Lu <luyun@kylinos.cn>
>>
>> When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for
>> pending_refcnt to decrement to zero before returning. The pending_refcnt
>> is decremented by 1 when the skb->destructor function is called,
>> indicating that the skb has been successfully sent and needs to be
>> destroyed.
>>
>> If an error occurs during this process, the tpacket_snd() function will
>> exit and return error, but pending_refcnt may not yet have decremented to
>> zero. Assuming the next send operation is executed immediately, but there
>> are no available frames to be sent in tx_ring (i.e., packet_current_frame
>> returns NULL), and skb is also NULL, the function will not execute
>> wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it
>> will enter a do-while loop, waiting for pending_refcnt to be zero. Even
>> if the previous skb has completed transmission, the skb->destructor
>> function can only be invoked in the ksoftirqd thread (assuming NAPI
>> threading is enabled). When both the ksoftirqd thread and the tpacket_snd
>> operation happen to run on the same CPU, and the CPU trapped in the
>> do-while loop without yielding, the ksoftirqd thread will not get
>> scheduled to run. As a result, pending_refcnt will never be reduced to
>> zero, and the do-while loop cannot exit, eventually leading to a CPU soft
>> lockup issue.
>>
>> In fact, as long as pending_refcnt is not zero, even if skb is NULL,
>> wait_for_completion_interruptible_timeout() should be executed to yield
>> the CPU, allowing the ksoftirqd thread to be scheduled. Therefore, the
>> execution condition of this function should be modified to check if
>> pending_refcnt is not zero.
>>
>> Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET")
>> Suggested-by: LongJun Tang <tanglongjun@kylinos.cn>
>> Signed-off-by: Yun Lu <luyun@kylinos.cn>
>>
>> ---
>> Changes in v2:
>> - Add a Fixes tag.
>> - Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/
>> ---
>>   net/packet/af_packet.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
>> index 3d43f3eae759..7df96311adb8 100644
>> --- a/net/packet/af_packet.c
>> +++ b/net/packet/af_packet.c
>> @@ -2845,7 +2845,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
>>                  ph = packet_current_frame(po, &po->tx_ring,
>>                                            TP_STATUS_SEND_REQUEST);
>>                  if (unlikely(ph == NULL)) {
>> -                       if (need_wait && skb) {
>> +                       if (need_wait && packet_read_pending(&po->tx_ring)) {
>>                                  timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
>>                                  timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
>>                                  if (timeo <= 0) {
> packet_read_pending() is super expensive on hosts with 256 cpus (or more)
>
> We are going to call it a second time at the end of the block:
>
> do { ...
> } while (ph != NULL || (need_wait && packet_read_pending(&po->tx_ring)...
>
> Perhaps we can remove the second one ?

As mentioned in the commit message, the soft lockup issue only occurs 
when tpacket_snd() is called to send,  with the pending_refcnt is 
non-zero, and there are no available packets in the tx_ring.

Therefore, at the first start of the loop, packet_read_pending() only 
needs to be called once. If the return result is already 0, the loop can 
exit directly. Otherwise, wait_for_completion_interruptible_timeout() 
needs to be executed for waiting. Later, this function should only be 
called at the end to check whether the loop can exit.


>
> Also I think there is another problem with the code.
>
> We should call sock_sndtimeo() only once, otherwise SO_SNDTIMEO
> constraint could be way off.

Yes, due to the changes in commit 581073f626e3 ("af_packet: do not call 
packet_read_pending() from tpacket_destruct_skb()"), every time 
tpacket_destruct_skb is executed, the skb_completion is marked as 
completed. When wait_for_completion_interruptible_timeout returns 
completed, the pending_refcnt has not yet been reduced to 0. Therefore, 
when ph is NULL, the wait function may need to be called multiple times 
untill packet_read_pending finally returns 0.


I have revised the code (as shown below), and it seems to be more 
reasonable. I also look forward to any better suggestions you may have. 
Thank you!


diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 7df96311adb8..401ae8f6481b 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2785,7 +2785,9 @@ static int tpacket_snd(struct packet_sock *po, 
struct msghdr *msg)
         int len_sum = 0;
         int status = TP_STATUS_AVAILABLE;
         int hlen, tlen, copylen = 0;
-       long timeo = 0;
+       long timeo;
+       unsigned int pending;
+       bool first = true;

         mutex_lock(&po->pg_vec_lock);

@@ -2839,18 +2841,27 @@ static int tpacket_snd(struct packet_sock *po, 
struct msghdr *msg)
         if ((size_max > dev->mtu + reserve + VLAN_HLEN) && !vnet_hdr_sz)
                 size_max = dev->mtu + reserve + VLAN_HLEN;

+       timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
         reinit_completion(&po->skb_completion);

         do {
                 ph = packet_current_frame(po, &po->tx_ring,
                                           TP_STATUS_SEND_REQUEST);
                 if (unlikely(ph == NULL)) {
-                       if (need_wait && 
packet_read_pending(&po->tx_ring)) {
-                               timeo = sock_sndtimeo(&po->sk, 
msg->msg_flags & MSG_DONTWAIT);
-                               timeo = 
wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
-                               if (timeo <= 0) {
-                                       err = !timeo ? -ETIMEDOUT : 
-ERESTARTSYS;
-                                       goto out_put;
+                       if (need_wait) {
+                               if (skb == NULL && fisrt) {
+                                       pending = 
packet_read_pending(&po->tx_ring);
+                                       if (!pending)
+                                               goto out_put;
+                                       else
+                                               first = false;
+                               }
+                               if (skb || pending) {
+                                       timeo = 
wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
+                                       if (timeo <= 0) {
+                                               err = !timeo ? 
-ETIMEDOUT : -ERESTARTSYS;
+                                               goto out_put;
+                                       }
                                 }
                         }
                         /* check for additional frames */


>
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index f6b1ff883c9318facdcb9c3112b94f0b6e40d504..486ade64bddfddb1af91968dbdf70015cfb93eb5
> 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -2785,8 +2785,9 @@ static int tpacket_snd(struct packet_sock *po,
> struct msghdr *msg)
>          int len_sum = 0;
>          int status = TP_STATUS_AVAILABLE;
>          int hlen, tlen, copylen = 0;
> -       long timeo = 0;
> +       long timeo;
>
> +       timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
>          mutex_lock(&po->pg_vec_lock);
>
>          /* packet_sendmsg() check on tx_ring.pg_vec was lockless,
> @@ -2846,7 +2847,6 @@ static int tpacket_snd(struct packet_sock *po,
> struct msghdr *msg)
>                                            TP_STATUS_SEND_REQUEST);
>                  if (unlikely(ph == NULL)) {
>                          if (need_wait && skb) {
> -                               timeo = sock_sndtimeo(&po->sk,
> msg->msg_flags & MSG_DONTWAIT);
>                                  timeo =
> wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
>                                  if (timeo <= 0) {
>                                          err = !timeo ? -ETIMEDOUT :
> -ERESTARTSYS;

Re: [PATCH v2] af_packet: fix soft lockup issue caused by tpacket_snd()

Posted by Eric Dumazet 7 months ago

On Tue, Jul 8, 2025 at 11:31 PM luyun <luyun_611@163.com> wrote:
>
>
> 在 2025/7/8 15:12, Eric Dumazet 写道:
> > On Mon, Jul 7, 2025 at 7:06 PM Yun Lu <luyun_611@163.com> wrote:
> >> From: Yun Lu <luyun@kylinos.cn>
> >>
> >> When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for
> >> pending_refcnt to decrement to zero before returning. The pending_refcnt
> >> is decremented by 1 when the skb->destructor function is called,
> >> indicating that the skb has been successfully sent and needs to be
> >> destroyed.
> >>
> >> If an error occurs during this process, the tpacket_snd() function will
> >> exit and return error, but pending_refcnt may not yet have decremented to
> >> zero. Assuming the next send operation is executed immediately, but there
> >> are no available frames to be sent in tx_ring (i.e., packet_current_frame
> >> returns NULL), and skb is also NULL, the function will not execute
> >> wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it
> >> will enter a do-while loop, waiting for pending_refcnt to be zero. Even
> >> if the previous skb has completed transmission, the skb->destructor
> >> function can only be invoked in the ksoftirqd thread (assuming NAPI
> >> threading is enabled). When both the ksoftirqd thread and the tpacket_snd
> >> operation happen to run on the same CPU, and the CPU trapped in the
> >> do-while loop without yielding, the ksoftirqd thread will not get
> >> scheduled to run. As a result, pending_refcnt will never be reduced to
> >> zero, and the do-while loop cannot exit, eventually leading to a CPU soft
> >> lockup issue.
> >>
> >> In fact, as long as pending_refcnt is not zero, even if skb is NULL,
> >> wait_for_completion_interruptible_timeout() should be executed to yield
> >> the CPU, allowing the ksoftirqd thread to be scheduled. Therefore, the
> >> execution condition of this function should be modified to check if
> >> pending_refcnt is not zero.
> >>
> >> Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET")
> >> Suggested-by: LongJun Tang <tanglongjun@kylinos.cn>
> >> Signed-off-by: Yun Lu <luyun@kylinos.cn>
> >>
> >> ---
> >> Changes in v2:
> >> - Add a Fixes tag.
> >> - Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/
> >> ---
> >>   net/packet/af_packet.c | 2 +-
> >>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> >> index 3d43f3eae759..7df96311adb8 100644
> >> --- a/net/packet/af_packet.c
> >> +++ b/net/packet/af_packet.c
> >> @@ -2845,7 +2845,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
> >>                  ph = packet_current_frame(po, &po->tx_ring,
> >>                                            TP_STATUS_SEND_REQUEST);
> >>                  if (unlikely(ph == NULL)) {
> >> -                       if (need_wait && skb) {
> >> +                       if (need_wait && packet_read_pending(&po->tx_ring)) {
> >>                                  timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
> >>                                  timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
> >>                                  if (timeo <= 0) {
> > packet_read_pending() is super expensive on hosts with 256 cpus (or more)
> >
> > We are going to call it a second time at the end of the block:
> >
> > do { ...
> > } while (ph != NULL || (need_wait && packet_read_pending(&po->tx_ring)...
> >
> > Perhaps we can remove the second one ?
>
> As mentioned in the commit message, the soft lockup issue only occurs
> when tpacket_snd() is called to send,  with the pending_refcnt is
> non-zero, and there are no available packets in the tx_ring.
>
> Therefore, at the first start of the loop, packet_read_pending() only
> needs to be called once. If the return result is already 0, the loop can
> exit directly. Otherwise, wait_for_completion_interruptible_timeout()
> needs to be executed for waiting. Later, this function should only be
> called at the end to check whether the loop can exit.
>
>
> >
> > Also I think there is another problem with the code.
> >
> > We should call sock_sndtimeo() only once, otherwise SO_SNDTIMEO
> > constraint could be way off.
>
> Yes, due to the changes in commit 581073f626e3 ("af_packet: do not call
> packet_read_pending() from tpacket_destruct_skb()"), every time
> tpacket_destruct_skb is executed, the skb_completion is marked as
> completed. When wait_for_completion_interruptible_timeout returns
> completed, the pending_refcnt has not yet been reduced to 0. Therefore,
> when ph is NULL, the wait function may need to be called multiple times
> untill packet_read_pending finally returns 0.
>
>
> I have revised the code (as shown below), and it seems to be more
> reasonable. I also look forward to any better suggestions you may have.
> Thank you!
>
>
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 7df96311adb8..401ae8f6481b 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -2785,7 +2785,9 @@ static int tpacket_snd(struct packet_sock *po,
> struct msghdr *msg)
>          int len_sum = 0;
>          int status = TP_STATUS_AVAILABLE;
>          int hlen, tlen, copylen = 0;
> -       long timeo = 0;
> +       long timeo;
> +       unsigned int pending;
> +       bool first = true;
>
>          mutex_lock(&po->pg_vec_lock);
>
> @@ -2839,18 +2841,27 @@ static int tpacket_snd(struct packet_sock *po,
> struct msghdr *msg)
>          if ((size_max > dev->mtu + reserve + VLAN_HLEN) && !vnet_hdr_sz)
>                  size_max = dev->mtu + reserve + VLAN_HLEN;
>
> +       timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
>          reinit_completion(&po->skb_completion);
>
>          do {
>                  ph = packet_current_frame(po, &po->tx_ring,
>                                            TP_STATUS_SEND_REQUEST);
>                  if (unlikely(ph == NULL)) {
> -                       if (need_wait &&
> packet_read_pending(&po->tx_ring)) {
> -                               timeo = sock_sndtimeo(&po->sk,
> msg->msg_flags & MSG_DONTWAIT);
> -                               timeo =
> wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
> -                               if (timeo <= 0) {
> -                                       err = !timeo ? -ETIMEDOUT :
> -ERESTARTSYS;
> -                                       goto out_put;
> +                       if (need_wait) {
> +                               if (skb == NULL && fisrt) {
> +                                       pending =
> packet_read_pending(&po->tx_ring);
> +                                       if (!pending)
> +                                               goto out_put;
> +                                       else
> +                                               first = false;
> +                               }
> +                               if (skb || pending) {
> +                                       timeo =
> wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
> +                                       if (timeo <= 0) {
> +                                               err = !timeo ?
> -ETIMEDOUT : -ERESTARTSYS;
> +                                               goto out_put;
> +                                       }
>                                  }
>                          }
>                          /* check for additional frames */
>
>
> >
> > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> > index f6b1ff883c9318facdcb9c3112b94f0b6e40d504..486ade64bddfddb1af91968dbdf70015cfb93eb5
> > 100644
> > --- a/net/packet/af_packet.c
> > +++ b/net/packet/af_packet.c
> > @@ -2785,8 +2785,9 @@ static int tpacket_snd(struct packet_sock *po,
> > struct msghdr *msg)
> >          int len_sum = 0;
> >          int status = TP_STATUS_AVAILABLE;
> >          int hlen, tlen, copylen = 0;
> > -       long timeo = 0;
> > +       long timeo;
> >
> > +       timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
> >          mutex_lock(&po->pg_vec_lock);
> >
> >          /* packet_sendmsg() check on tx_ring.pg_vec was lockless,
> > @@ -2846,7 +2847,6 @@ static int tpacket_snd(struct packet_sock *po,
> > struct msghdr *msg)
> >                                            TP_STATUS_SEND_REQUEST);
> >                  if (unlikely(ph == NULL)) {
> >                          if (need_wait && skb) {
> > -                               timeo = sock_sndtimeo(&po->sk,
> > msg->msg_flags & MSG_DONTWAIT);
> >                                  timeo =
> > wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
> >                                  if (timeo <= 0) {
> >                                          err = !timeo ? -ETIMEDOUT :
> > -ERESTARTSYS;
>

Instead of adding two extra variables, you also could reuse ph (set it
to not zero)

Also please split in two different patches, one with the timeo fix alone.

Thanks !

Re: [PATCH v2] af_packet: fix soft lockup issue caused by tpacket_snd()

Posted by luyun 7 months ago

在 2025/7/8 15:12, Eric Dumazet 写道:
> On Mon, Jul 7, 2025 at 7:06 PM Yun Lu <luyun_611@163.com> wrote:
>> From: Yun Lu <luyun@kylinos.cn>
>>
>> When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for
>> pending_refcnt to decrement to zero before returning. The pending_refcnt
>> is decremented by 1 when the skb->destructor function is called,
>> indicating that the skb has been successfully sent and needs to be
>> destroyed.
>>
>> If an error occurs during this process, the tpacket_snd() function will
>> exit and return error, but pending_refcnt may not yet have decremented to
>> zero. Assuming the next send operation is executed immediately, but there
>> are no available frames to be sent in tx_ring (i.e., packet_current_frame
>> returns NULL), and skb is also NULL, the function will not execute
>> wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it
>> will enter a do-while loop, waiting for pending_refcnt to be zero. Even
>> if the previous skb has completed transmission, the skb->destructor
>> function can only be invoked in the ksoftirqd thread (assuming NAPI
>> threading is enabled). When both the ksoftirqd thread and the tpacket_snd
>> operation happen to run on the same CPU, and the CPU trapped in the
>> do-while loop without yielding, the ksoftirqd thread will not get
>> scheduled to run. As a result, pending_refcnt will never be reduced to
>> zero, and the do-while loop cannot exit, eventually leading to a CPU soft
>> lockup issue.
>>
>> In fact, as long as pending_refcnt is not zero, even if skb is NULL,
>> wait_for_completion_interruptible_timeout() should be executed to yield
>> the CPU, allowing the ksoftirqd thread to be scheduled. Therefore, the
>> execution condition of this function should be modified to check if
>> pending_refcnt is not zero.
>>
>> Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET")
>> Suggested-by: LongJun Tang <tanglongjun@kylinos.cn>
>> Signed-off-by: Yun Lu <luyun@kylinos.cn>
>>
>> ---
>> Changes in v2:
>> - Add a Fixes tag.
>> - Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/
>> ---
>>   net/packet/af_packet.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
>> index 3d43f3eae759..7df96311adb8 100644
>> --- a/net/packet/af_packet.c
>> +++ b/net/packet/af_packet.c
>> @@ -2845,7 +2845,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
>>                  ph = packet_current_frame(po, &po->tx_ring,
>>                                            TP_STATUS_SEND_REQUEST);
>>                  if (unlikely(ph == NULL)) {
>> -                       if (need_wait && skb) {
>> +                       if (need_wait && packet_read_pending(&po->tx_ring)) {
>>                                  timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
>>                                  timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
>>                                  if (timeo <= 0) {
> packet_read_pending() is super expensive on hosts with 256 cpus (or more)

Yeah, the CPU is exactly stuck on packet_read_pending() when soft lockup 
occurs.


>
> We are going to call it a second time at the end of the block:
>
> do { ...
> } while (ph != NULL || (need_wait && packet_read_pending(&po->tx_ring)...
>
> Perhaps we can remove the second one ?

The first call to packet_read_pending() is only needed when skb is NULL 
(i.e., at the start of the loop),  to determine whether 
wait_for_completion_interruptible_timeout() should be executed for waiting.

If the first call to packet_read_pending() has already returned 0, then 
the second call at the end of the block can also be omitted.

So, the code might be modified as shown below:


diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 7df96311adb8..15a37209f872 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2785,7 +2785,8 @@ static int tpacket_snd(struct packet_sock *po, 
struct msghdr *msg)
         int len_sum = 0;
         int status = TP_STATUS_AVAILABLE;
         int hlen, tlen, copylen = 0;
-       long timeo = 0;
+       long timeo;
+       bool pending = true;

         mutex_lock(&po->pg_vec_lock);

@@ -2839,18 +2840,22 @@ static int tpacket_snd(struct packet_sock *po, 
struct msghdr *msg)
         if ((size_max > dev->mtu + reserve + VLAN_HLEN) && !vnet_hdr_sz)
                 size_max = dev->mtu + reserve + VLAN_HLEN;

+       timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
         reinit_completion(&po->skb_completion);

         do {
                 ph = packet_current_frame(po, &po->tx_ring,
                                           TP_STATUS_SEND_REQUEST);
                 if (unlikely(ph == NULL)) {
-                       if (need_wait && 
packet_read_pending(&po->tx_ring)) {
-                               timeo = sock_sndtimeo(&po->sk, 
msg->msg_flags & MSG_DONTWAIT);
-                               timeo = 
wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
-                               if (timeo <= 0) {
-                                       err = !timeo ? -ETIMEDOUT : 
-ERESTARTSYS;
-                                       goto out_put;
+                       if (need_wait) {
+                               if (skb == NULL)
+                                       pending = 
!!packet_read_pending(&po->tx_ring);
+                               if (skb || pending) {
+                                       timeo = 
wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
+                                       if (timeo <= 0) {
+                                               err = !timeo ? 
-ETIMEDOUT : -ERESTARTSYS;
+                                               goto out_put;
+                                       }
                                 }
                         }
                         /* check for additional frames */
@@ -2950,7 +2955,7 @@ static int tpacket_snd(struct packet_sock *po, 
struct msghdr *msg)
                  * condition, and luckily don't have to go that path
                  * anyway.
                  */
-                (need_wait && packet_read_pending(&po->tx_ring))));
+                (need_wait && pending && 
packet_read_pending(&po->tx_ring))));

         err = len_sum;
         goto out_put;

> Also I think there is another problem with the code.
>
> We should call sock_sndtimeo() only once, otherwise SO_SNDTIMEO
> constraint could be way off.
>
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index f6b1ff883c9318facdcb9c3112b94f0b6e40d504..486ade64bddfddb1af91968dbdf70015cfb93eb5
> 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -2785,8 +2785,9 @@ static int tpacket_snd(struct packet_sock *po,
> struct msghdr *msg)
>          int len_sum = 0;
>          int status = TP_STATUS_AVAILABLE;
>          int hlen, tlen, copylen = 0;
> -       long timeo = 0;
> +       long timeo;
>
> +       timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
>          mutex_lock(&po->pg_vec_lock);
>
>          /* packet_sendmsg() check on tx_ring.pg_vec was lockless,
> @@ -2846,7 +2847,6 @@ static int tpacket_snd(struct packet_sock *po,
> struct msghdr *msg)
>                                            TP_STATUS_SEND_REQUEST);
>                  if (unlikely(ph == NULL)) {
>                          if (need_wait && skb) {
> -                               timeo = sock_sndtimeo(&po->sk,
> msg->msg_flags & MSG_DONTWAIT);
>                                  timeo =
> wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
>                                  if (timeo <= 0) {
>                                          err = !timeo ? -ETIMEDOUT :
> -ERESTARTSYS;