net/packet/af_packet.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
From: Yun Lu <luyun@kylinos.cn>
When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for
pending_refcnt to decrement to zero before returning. The pending_refcnt
is decremented by 1 when the skb->destructor function is called,
indicating that the skb has been successfully sent and needs to be
destroyed.
If an error occurs during this process, the tpacket_snd() function will
exit and return error, but pending_refcnt may not yet have decremented to
zero. Assuming the next send operation is executed immediately, but there
are no available frames to be sent in tx_ring (i.e., packet_current_frame
returns NULL), and skb is also NULL, the function will not execute
wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it
will enter a do-while loop, waiting for pending_refcnt to be zero. Even
if the previous skb has completed transmission, the skb->destructor
function can only be invoked in the ksoftirqd thread (assuming NAPI
threading is enabled). When both the ksoftirqd thread and the tpacket_snd
operation happen to run on the same CPU, and the CPU trapped in the
do-while loop without yielding, the ksoftirqd thread will not get
scheduled to run. As a result, pending_refcnt will never be reduced to
zero, and the do-while loop cannot exit, eventually leading to a CPU soft
lockup issue.
In fact, as long as pending_refcnt is not zero, even if skb is NULL,
wait_for_completion_interruptible_timeout() should be executed to yield
the CPU, allowing the ksoftirqd thread to be scheduled. Therefore, the
execution condition of this function should be modified to check if
pending_refcnt is not zero.
Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET")
Suggested-by: LongJun Tang <tanglongjun@kylinos.cn>
Signed-off-by: Yun Lu <luyun@kylinos.cn>
---
Changes in v2:
- Add a Fixes tag.
- Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/
---
net/packet/af_packet.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 3d43f3eae759..7df96311adb8 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2845,7 +2845,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
ph = packet_current_frame(po, &po->tx_ring,
TP_STATUS_SEND_REQUEST);
if (unlikely(ph == NULL)) {
- if (need_wait && skb) {
+ if (need_wait && packet_read_pending(&po->tx_ring)) {
timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
if (timeo <= 0) {
--
2.43.0
On Mon, Jul 7, 2025 at 7:06 PM Yun Lu <luyun_611@163.com> wrote: > > From: Yun Lu <luyun@kylinos.cn> > > When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for > pending_refcnt to decrement to zero before returning. The pending_refcnt > is decremented by 1 when the skb->destructor function is called, > indicating that the skb has been successfully sent and needs to be > destroyed. > > If an error occurs during this process, the tpacket_snd() function will > exit and return error, but pending_refcnt may not yet have decremented to > zero. Assuming the next send operation is executed immediately, but there > are no available frames to be sent in tx_ring (i.e., packet_current_frame > returns NULL), and skb is also NULL, the function will not execute > wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it > will enter a do-while loop, waiting for pending_refcnt to be zero. Even > if the previous skb has completed transmission, the skb->destructor > function can only be invoked in the ksoftirqd thread (assuming NAPI > threading is enabled). When both the ksoftirqd thread and the tpacket_snd > operation happen to run on the same CPU, and the CPU trapped in the > do-while loop without yielding, the ksoftirqd thread will not get > scheduled to run. As a result, pending_refcnt will never be reduced to > zero, and the do-while loop cannot exit, eventually leading to a CPU soft > lockup issue. > > In fact, as long as pending_refcnt is not zero, even if skb is NULL, > wait_for_completion_interruptible_timeout() should be executed to yield > the CPU, allowing the ksoftirqd thread to be scheduled. Therefore, the > execution condition of this function should be modified to check if > pending_refcnt is not zero. > > Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET") > Suggested-by: LongJun Tang <tanglongjun@kylinos.cn> > Signed-off-by: Yun Lu <luyun@kylinos.cn> > > --- > Changes in v2: > - Add a Fixes tag. > - Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/ > --- > net/packet/af_packet.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c > index 3d43f3eae759..7df96311adb8 100644 > --- a/net/packet/af_packet.c > +++ b/net/packet/af_packet.c > @@ -2845,7 +2845,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) > ph = packet_current_frame(po, &po->tx_ring, > TP_STATUS_SEND_REQUEST); > if (unlikely(ph == NULL)) { > - if (need_wait && skb) { > + if (need_wait && packet_read_pending(&po->tx_ring)) { > timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT); > timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo); > if (timeo <= 0) { packet_read_pending() is super expensive on hosts with 256 cpus (or more) We are going to call it a second time at the end of the block: do { ... } while (ph != NULL || (need_wait && packet_read_pending(&po->tx_ring)... Perhaps we can remove the second one ? Also I think there is another problem with the code. We should call sock_sndtimeo() only once, otherwise SO_SNDTIMEO constraint could be way off. diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index f6b1ff883c9318facdcb9c3112b94f0b6e40d504..486ade64bddfddb1af91968dbdf70015cfb93eb5 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -2785,8 +2785,9 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) int len_sum = 0; int status = TP_STATUS_AVAILABLE; int hlen, tlen, copylen = 0; - long timeo = 0; + long timeo; + timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT); mutex_lock(&po->pg_vec_lock); /* packet_sendmsg() check on tx_ring.pg_vec was lockless, @@ -2846,7 +2847,6 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) TP_STATUS_SEND_REQUEST); if (unlikely(ph == NULL)) { if (need_wait && skb) { - timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT); timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo); if (timeo <= 0) { err = !timeo ? -ETIMEDOUT : -ERESTARTSYS;
在 2025/7/8 15:12, Eric Dumazet 写道: > On Mon, Jul 7, 2025 at 7:06 PM Yun Lu <luyun_611@163.com> wrote: >> From: Yun Lu <luyun@kylinos.cn> >> >> When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for >> pending_refcnt to decrement to zero before returning. The pending_refcnt >> is decremented by 1 when the skb->destructor function is called, >> indicating that the skb has been successfully sent and needs to be >> destroyed. >> >> If an error occurs during this process, the tpacket_snd() function will >> exit and return error, but pending_refcnt may not yet have decremented to >> zero. Assuming the next send operation is executed immediately, but there >> are no available frames to be sent in tx_ring (i.e., packet_current_frame >> returns NULL), and skb is also NULL, the function will not execute >> wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it >> will enter a do-while loop, waiting for pending_refcnt to be zero. Even >> if the previous skb has completed transmission, the skb->destructor >> function can only be invoked in the ksoftirqd thread (assuming NAPI >> threading is enabled). When both the ksoftirqd thread and the tpacket_snd >> operation happen to run on the same CPU, and the CPU trapped in the >> do-while loop without yielding, the ksoftirqd thread will not get >> scheduled to run. As a result, pending_refcnt will never be reduced to >> zero, and the do-while loop cannot exit, eventually leading to a CPU soft >> lockup issue. >> >> In fact, as long as pending_refcnt is not zero, even if skb is NULL, >> wait_for_completion_interruptible_timeout() should be executed to yield >> the CPU, allowing the ksoftirqd thread to be scheduled. Therefore, the >> execution condition of this function should be modified to check if >> pending_refcnt is not zero. >> >> Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET") >> Suggested-by: LongJun Tang <tanglongjun@kylinos.cn> >> Signed-off-by: Yun Lu <luyun@kylinos.cn> >> >> --- >> Changes in v2: >> - Add a Fixes tag. >> - Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/ >> --- >> net/packet/af_packet.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c >> index 3d43f3eae759..7df96311adb8 100644 >> --- a/net/packet/af_packet.c >> +++ b/net/packet/af_packet.c >> @@ -2845,7 +2845,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) >> ph = packet_current_frame(po, &po->tx_ring, >> TP_STATUS_SEND_REQUEST); >> if (unlikely(ph == NULL)) { >> - if (need_wait && skb) { >> + if (need_wait && packet_read_pending(&po->tx_ring)) { >> timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT); >> timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo); >> if (timeo <= 0) { > packet_read_pending() is super expensive on hosts with 256 cpus (or more) > > We are going to call it a second time at the end of the block: > > do { ... > } while (ph != NULL || (need_wait && packet_read_pending(&po->tx_ring)... > > Perhaps we can remove the second one ? As mentioned in the commit message, the soft lockup issue only occurs when tpacket_snd() is called to send, with the pending_refcnt is non-zero, and there are no available packets in the tx_ring. Therefore, at the first start of the loop, packet_read_pending() only needs to be called once. If the return result is already 0, the loop can exit directly. Otherwise, wait_for_completion_interruptible_timeout() needs to be executed for waiting. Later, this function should only be called at the end to check whether the loop can exit. > > Also I think there is another problem with the code. > > We should call sock_sndtimeo() only once, otherwise SO_SNDTIMEO > constraint could be way off. Yes, due to the changes in commit 581073f626e3 ("af_packet: do not call packet_read_pending() from tpacket_destruct_skb()"), every time tpacket_destruct_skb is executed, the skb_completion is marked as completed. When wait_for_completion_interruptible_timeout returns completed, the pending_refcnt has not yet been reduced to 0. Therefore, when ph is NULL, the wait function may need to be called multiple times untill packet_read_pending finally returns 0. I have revised the code (as shown below), and it seems to be more reasonable. I also look forward to any better suggestions you may have. Thank you! diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 7df96311adb8..401ae8f6481b 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -2785,7 +2785,9 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) int len_sum = 0; int status = TP_STATUS_AVAILABLE; int hlen, tlen, copylen = 0; - long timeo = 0; + long timeo; + unsigned int pending; + bool first = true; mutex_lock(&po->pg_vec_lock); @@ -2839,18 +2841,27 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) if ((size_max > dev->mtu + reserve + VLAN_HLEN) && !vnet_hdr_sz) size_max = dev->mtu + reserve + VLAN_HLEN; + timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT); reinit_completion(&po->skb_completion); do { ph = packet_current_frame(po, &po->tx_ring, TP_STATUS_SEND_REQUEST); if (unlikely(ph == NULL)) { - if (need_wait && packet_read_pending(&po->tx_ring)) { - timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT); - timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo); - if (timeo <= 0) { - err = !timeo ? -ETIMEDOUT : -ERESTARTSYS; - goto out_put; + if (need_wait) { + if (skb == NULL && fisrt) { + pending = packet_read_pending(&po->tx_ring); + if (!pending) + goto out_put; + else + first = false; + } + if (skb || pending) { + timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo); + if (timeo <= 0) { + err = !timeo ? -ETIMEDOUT : -ERESTARTSYS; + goto out_put; + } } } /* check for additional frames */ > > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c > index f6b1ff883c9318facdcb9c3112b94f0b6e40d504..486ade64bddfddb1af91968dbdf70015cfb93eb5 > 100644 > --- a/net/packet/af_packet.c > +++ b/net/packet/af_packet.c > @@ -2785,8 +2785,9 @@ static int tpacket_snd(struct packet_sock *po, > struct msghdr *msg) > int len_sum = 0; > int status = TP_STATUS_AVAILABLE; > int hlen, tlen, copylen = 0; > - long timeo = 0; > + long timeo; > > + timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT); > mutex_lock(&po->pg_vec_lock); > > /* packet_sendmsg() check on tx_ring.pg_vec was lockless, > @@ -2846,7 +2847,6 @@ static int tpacket_snd(struct packet_sock *po, > struct msghdr *msg) > TP_STATUS_SEND_REQUEST); > if (unlikely(ph == NULL)) { > if (need_wait && skb) { > - timeo = sock_sndtimeo(&po->sk, > msg->msg_flags & MSG_DONTWAIT); > timeo = > wait_for_completion_interruptible_timeout(&po->skb_completion, timeo); > if (timeo <= 0) { > err = !timeo ? -ETIMEDOUT : > -ERESTARTSYS;
On Tue, Jul 8, 2025 at 11:31 PM luyun <luyun_611@163.com> wrote: > > > 在 2025/7/8 15:12, Eric Dumazet 写道: > > On Mon, Jul 7, 2025 at 7:06 PM Yun Lu <luyun_611@163.com> wrote: > >> From: Yun Lu <luyun@kylinos.cn> > >> > >> When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for > >> pending_refcnt to decrement to zero before returning. The pending_refcnt > >> is decremented by 1 when the skb->destructor function is called, > >> indicating that the skb has been successfully sent and needs to be > >> destroyed. > >> > >> If an error occurs during this process, the tpacket_snd() function will > >> exit and return error, but pending_refcnt may not yet have decremented to > >> zero. Assuming the next send operation is executed immediately, but there > >> are no available frames to be sent in tx_ring (i.e., packet_current_frame > >> returns NULL), and skb is also NULL, the function will not execute > >> wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it > >> will enter a do-while loop, waiting for pending_refcnt to be zero. Even > >> if the previous skb has completed transmission, the skb->destructor > >> function can only be invoked in the ksoftirqd thread (assuming NAPI > >> threading is enabled). When both the ksoftirqd thread and the tpacket_snd > >> operation happen to run on the same CPU, and the CPU trapped in the > >> do-while loop without yielding, the ksoftirqd thread will not get > >> scheduled to run. As a result, pending_refcnt will never be reduced to > >> zero, and the do-while loop cannot exit, eventually leading to a CPU soft > >> lockup issue. > >> > >> In fact, as long as pending_refcnt is not zero, even if skb is NULL, > >> wait_for_completion_interruptible_timeout() should be executed to yield > >> the CPU, allowing the ksoftirqd thread to be scheduled. Therefore, the > >> execution condition of this function should be modified to check if > >> pending_refcnt is not zero. > >> > >> Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET") > >> Suggested-by: LongJun Tang <tanglongjun@kylinos.cn> > >> Signed-off-by: Yun Lu <luyun@kylinos.cn> > >> > >> --- > >> Changes in v2: > >> - Add a Fixes tag. > >> - Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/ > >> --- > >> net/packet/af_packet.c | 2 +- > >> 1 file changed, 1 insertion(+), 1 deletion(-) > >> > >> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c > >> index 3d43f3eae759..7df96311adb8 100644 > >> --- a/net/packet/af_packet.c > >> +++ b/net/packet/af_packet.c > >> @@ -2845,7 +2845,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) > >> ph = packet_current_frame(po, &po->tx_ring, > >> TP_STATUS_SEND_REQUEST); > >> if (unlikely(ph == NULL)) { > >> - if (need_wait && skb) { > >> + if (need_wait && packet_read_pending(&po->tx_ring)) { > >> timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT); > >> timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo); > >> if (timeo <= 0) { > > packet_read_pending() is super expensive on hosts with 256 cpus (or more) > > > > We are going to call it a second time at the end of the block: > > > > do { ... > > } while (ph != NULL || (need_wait && packet_read_pending(&po->tx_ring)... > > > > Perhaps we can remove the second one ? > > As mentioned in the commit message, the soft lockup issue only occurs > when tpacket_snd() is called to send, with the pending_refcnt is > non-zero, and there are no available packets in the tx_ring. > > Therefore, at the first start of the loop, packet_read_pending() only > needs to be called once. If the return result is already 0, the loop can > exit directly. Otherwise, wait_for_completion_interruptible_timeout() > needs to be executed for waiting. Later, this function should only be > called at the end to check whether the loop can exit. > > > > > > Also I think there is another problem with the code. > > > > We should call sock_sndtimeo() only once, otherwise SO_SNDTIMEO > > constraint could be way off. > > Yes, due to the changes in commit 581073f626e3 ("af_packet: do not call > packet_read_pending() from tpacket_destruct_skb()"), every time > tpacket_destruct_skb is executed, the skb_completion is marked as > completed. When wait_for_completion_interruptible_timeout returns > completed, the pending_refcnt has not yet been reduced to 0. Therefore, > when ph is NULL, the wait function may need to be called multiple times > untill packet_read_pending finally returns 0. > > > I have revised the code (as shown below), and it seems to be more > reasonable. I also look forward to any better suggestions you may have. > Thank you! > > > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c > index 7df96311adb8..401ae8f6481b 100644 > --- a/net/packet/af_packet.c > +++ b/net/packet/af_packet.c > @@ -2785,7 +2785,9 @@ static int tpacket_snd(struct packet_sock *po, > struct msghdr *msg) > int len_sum = 0; > int status = TP_STATUS_AVAILABLE; > int hlen, tlen, copylen = 0; > - long timeo = 0; > + long timeo; > + unsigned int pending; > + bool first = true; > > mutex_lock(&po->pg_vec_lock); > > @@ -2839,18 +2841,27 @@ static int tpacket_snd(struct packet_sock *po, > struct msghdr *msg) > if ((size_max > dev->mtu + reserve + VLAN_HLEN) && !vnet_hdr_sz) > size_max = dev->mtu + reserve + VLAN_HLEN; > > + timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT); > reinit_completion(&po->skb_completion); > > do { > ph = packet_current_frame(po, &po->tx_ring, > TP_STATUS_SEND_REQUEST); > if (unlikely(ph == NULL)) { > - if (need_wait && > packet_read_pending(&po->tx_ring)) { > - timeo = sock_sndtimeo(&po->sk, > msg->msg_flags & MSG_DONTWAIT); > - timeo = > wait_for_completion_interruptible_timeout(&po->skb_completion, timeo); > - if (timeo <= 0) { > - err = !timeo ? -ETIMEDOUT : > -ERESTARTSYS; > - goto out_put; > + if (need_wait) { > + if (skb == NULL && fisrt) { > + pending = > packet_read_pending(&po->tx_ring); > + if (!pending) > + goto out_put; > + else > + first = false; > + } > + if (skb || pending) { > + timeo = > wait_for_completion_interruptible_timeout(&po->skb_completion, timeo); > + if (timeo <= 0) { > + err = !timeo ? > -ETIMEDOUT : -ERESTARTSYS; > + goto out_put; > + } > } > } > /* check for additional frames */ > > > > > > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c > > index f6b1ff883c9318facdcb9c3112b94f0b6e40d504..486ade64bddfddb1af91968dbdf70015cfb93eb5 > > 100644 > > --- a/net/packet/af_packet.c > > +++ b/net/packet/af_packet.c > > @@ -2785,8 +2785,9 @@ static int tpacket_snd(struct packet_sock *po, > > struct msghdr *msg) > > int len_sum = 0; > > int status = TP_STATUS_AVAILABLE; > > int hlen, tlen, copylen = 0; > > - long timeo = 0; > > + long timeo; > > > > + timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT); > > mutex_lock(&po->pg_vec_lock); > > > > /* packet_sendmsg() check on tx_ring.pg_vec was lockless, > > @@ -2846,7 +2847,6 @@ static int tpacket_snd(struct packet_sock *po, > > struct msghdr *msg) > > TP_STATUS_SEND_REQUEST); > > if (unlikely(ph == NULL)) { > > if (need_wait && skb) { > > - timeo = sock_sndtimeo(&po->sk, > > msg->msg_flags & MSG_DONTWAIT); > > timeo = > > wait_for_completion_interruptible_timeout(&po->skb_completion, timeo); > > if (timeo <= 0) { > > err = !timeo ? -ETIMEDOUT : > > -ERESTARTSYS; > Instead of adding two extra variables, you also could reuse ph (set it to not zero) Also please split in two different patches, one with the timeo fix alone. Thanks !
在 2025/7/8 15:12, Eric Dumazet 写道: > On Mon, Jul 7, 2025 at 7:06 PM Yun Lu <luyun_611@163.com> wrote: >> From: Yun Lu <luyun@kylinos.cn> >> >> When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for >> pending_refcnt to decrement to zero before returning. The pending_refcnt >> is decremented by 1 when the skb->destructor function is called, >> indicating that the skb has been successfully sent and needs to be >> destroyed. >> >> If an error occurs during this process, the tpacket_snd() function will >> exit and return error, but pending_refcnt may not yet have decremented to >> zero. Assuming the next send operation is executed immediately, but there >> are no available frames to be sent in tx_ring (i.e., packet_current_frame >> returns NULL), and skb is also NULL, the function will not execute >> wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it >> will enter a do-while loop, waiting for pending_refcnt to be zero. Even >> if the previous skb has completed transmission, the skb->destructor >> function can only be invoked in the ksoftirqd thread (assuming NAPI >> threading is enabled). When both the ksoftirqd thread and the tpacket_snd >> operation happen to run on the same CPU, and the CPU trapped in the >> do-while loop without yielding, the ksoftirqd thread will not get >> scheduled to run. As a result, pending_refcnt will never be reduced to >> zero, and the do-while loop cannot exit, eventually leading to a CPU soft >> lockup issue. >> >> In fact, as long as pending_refcnt is not zero, even if skb is NULL, >> wait_for_completion_interruptible_timeout() should be executed to yield >> the CPU, allowing the ksoftirqd thread to be scheduled. Therefore, the >> execution condition of this function should be modified to check if >> pending_refcnt is not zero. >> >> Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET") >> Suggested-by: LongJun Tang <tanglongjun@kylinos.cn> >> Signed-off-by: Yun Lu <luyun@kylinos.cn> >> >> --- >> Changes in v2: >> - Add a Fixes tag. >> - Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/ >> --- >> net/packet/af_packet.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c >> index 3d43f3eae759..7df96311adb8 100644 >> --- a/net/packet/af_packet.c >> +++ b/net/packet/af_packet.c >> @@ -2845,7 +2845,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) >> ph = packet_current_frame(po, &po->tx_ring, >> TP_STATUS_SEND_REQUEST); >> if (unlikely(ph == NULL)) { >> - if (need_wait && skb) { >> + if (need_wait && packet_read_pending(&po->tx_ring)) { >> timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT); >> timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo); >> if (timeo <= 0) { > packet_read_pending() is super expensive on hosts with 256 cpus (or more) Yeah, the CPU is exactly stuck on packet_read_pending() when soft lockup occurs. > > We are going to call it a second time at the end of the block: > > do { ... > } while (ph != NULL || (need_wait && packet_read_pending(&po->tx_ring)... > > Perhaps we can remove the second one ? The first call to packet_read_pending() is only needed when skb is NULL (i.e., at the start of the loop), to determine whether wait_for_completion_interruptible_timeout() should be executed for waiting. If the first call to packet_read_pending() has already returned 0, then the second call at the end of the block can also be omitted. So, the code might be modified as shown below: diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 7df96311adb8..15a37209f872 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -2785,7 +2785,8 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) int len_sum = 0; int status = TP_STATUS_AVAILABLE; int hlen, tlen, copylen = 0; - long timeo = 0; + long timeo; + bool pending = true; mutex_lock(&po->pg_vec_lock); @@ -2839,18 +2840,22 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) if ((size_max > dev->mtu + reserve + VLAN_HLEN) && !vnet_hdr_sz) size_max = dev->mtu + reserve + VLAN_HLEN; + timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT); reinit_completion(&po->skb_completion); do { ph = packet_current_frame(po, &po->tx_ring, TP_STATUS_SEND_REQUEST); if (unlikely(ph == NULL)) { - if (need_wait && packet_read_pending(&po->tx_ring)) { - timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT); - timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo); - if (timeo <= 0) { - err = !timeo ? -ETIMEDOUT : -ERESTARTSYS; - goto out_put; + if (need_wait) { + if (skb == NULL) + pending = !!packet_read_pending(&po->tx_ring); + if (skb || pending) { + timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo); + if (timeo <= 0) { + err = !timeo ? -ETIMEDOUT : -ERESTARTSYS; + goto out_put; + } } } /* check for additional frames */ @@ -2950,7 +2955,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) * condition, and luckily don't have to go that path * anyway. */ - (need_wait && packet_read_pending(&po->tx_ring)))); + (need_wait && pending && packet_read_pending(&po->tx_ring)))); err = len_sum; goto out_put; > Also I think there is another problem with the code. > > We should call sock_sndtimeo() only once, otherwise SO_SNDTIMEO > constraint could be way off. > > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c > index f6b1ff883c9318facdcb9c3112b94f0b6e40d504..486ade64bddfddb1af91968dbdf70015cfb93eb5 > 100644 > --- a/net/packet/af_packet.c > +++ b/net/packet/af_packet.c > @@ -2785,8 +2785,9 @@ static int tpacket_snd(struct packet_sock *po, > struct msghdr *msg) > int len_sum = 0; > int status = TP_STATUS_AVAILABLE; > int hlen, tlen, copylen = 0; > - long timeo = 0; > + long timeo; > > + timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT); > mutex_lock(&po->pg_vec_lock); > > /* packet_sendmsg() check on tx_ring.pg_vec was lockless, > @@ -2846,7 +2847,6 @@ static int tpacket_snd(struct packet_sock *po, > struct msghdr *msg) > TP_STATUS_SEND_REQUEST); > if (unlikely(ph == NULL)) { > if (need_wait && skb) { > - timeo = sock_sndtimeo(&po->sk, > msg->msg_flags & MSG_DONTWAIT); > timeo = > wait_for_completion_interruptible_timeout(&po->skb_completion, timeo); > if (timeo <= 0) { > err = !timeo ? -ETIMEDOUT : > -ERESTARTSYS;
© 2016 - 2025 Red Hat, Inc.