From: Yun Lu <luyun@kylinos.cn>
When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for
pending_refcnt to decrement to zero before returning. The pending_refcnt
is decremented by 1 when the skb->destructor function is called,
indicating that the skb has been successfully sent and needs to be
destroyed.
If an error occurs during this process, the tpacket_snd() function will
exit and return error, but pending_refcnt may not yet have decremented to
zero. Assuming the next send operation is executed immediately, but there
are no available frames to be sent in tx_ring (i.e., packet_current_frame
returns NULL), and skb is also NULL, the function will not execute
wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it
will enter a do-while loop, waiting for pending_refcnt to be zero. Even
if the previous skb has completed transmission, the skb->destructor
function can only be invoked in the ksoftirqd thread (assuming NAPI
threading is enabled). When both the ksoftirqd thread and the tpacket_snd
operation happen to run on the same CPU, and the CPU trapped in the
do-while loop without yielding, the ksoftirqd thread will not get
scheduled to run. As a result, pending_refcnt will never be reduced to
zero, and the do-while loop cannot exit, eventually leading to a CPU soft
lockup issue.
In fact, skb is true for all but the first iterations of that loop, and
as long as pending_refcnt is not zero, even if incremented by a previous
call, wait_for_completion_interruptible_timeout() should be executed to
yield the CPU, allowing the ksoftirqd thread to be scheduled. Therefore,
the execution condition of this function should be modified to check if
pending_refcnt is not zero, instead of check skb.
As a result, packet_read_pending() may be called twice in the loop. This
will be optimized in the following patch.
Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET")
Cc: stable@kernel.org
Suggested-by: LongJun Tang <tanglongjun@kylinos.cn>
Signed-off-by: Yun Lu <luyun@kylinos.cn>
---
Changes in v4:
- Split to the fix alone. Thanks: Willem de Bruijn.
- Link to v3: https://lore.kernel.org/all/20250709095653.62469-3-luyun_611@163.com/
Changes in v3:
- Simplify the code and reuse ph to continue. Thanks: Eric Dumazet.
- Link to v2: https://lore.kernel.org/all/20250708020642.27838-1-luyun_611@163.com/
Changes in v2:
- Add a Fixes tag.
- Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/
---
---
net/packet/af_packet.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 7089b8c2a655..581a96ec8e1a 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2846,7 +2846,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
ph = packet_current_frame(po, &po->tx_ring,
TP_STATUS_SEND_REQUEST);
if (unlikely(ph == NULL)) {
- if (need_wait && skb) {
+ if (need_wait && packet_read_pending(&po->tx_ring)) {
timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
if (timeo <= 0) {
err = !timeo ? -ETIMEDOUT : -ERESTARTSYS;
--
2.43.0
Yun Lu wrote: > From: Yun Lu <luyun@kylinos.cn> > > When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for > pending_refcnt to decrement to zero before returning. The pending_refcnt > is decremented by 1 when the skb->destructor function is called, > indicating that the skb has been successfully sent and needs to be > destroyed. > > If an error occurs during this process, the tpacket_snd() function will > exit and return error, but pending_refcnt may not yet have decremented to > zero. Assuming the next send operation is executed immediately, but there > are no available frames to be sent in tx_ring (i.e., packet_current_frame > returns NULL), and skb is also NULL, the function will not execute > wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it > will enter a do-while loop, waiting for pending_refcnt to be zero. Even > if the previous skb has completed transmission, the skb->destructor > function can only be invoked in the ksoftirqd thread (assuming NAPI > threading is enabled). When both the ksoftirqd thread and the tpacket_snd > operation happen to run on the same CPU, and the CPU trapped in the > do-while loop without yielding, the ksoftirqd thread will not get > scheduled to run. As a result, pending_refcnt will never be reduced to > zero, and the do-while loop cannot exit, eventually leading to a CPU soft > lockup issue. > > In fact, skb is true for all but the first iterations of that loop, and > as long as pending_refcnt is not zero, even if incremented by a previous > call, wait_for_completion_interruptible_timeout() should be executed to > yield the CPU, allowing the ksoftirqd thread to be scheduled. Therefore, > the execution condition of this function should be modified to check if > pending_refcnt is not zero, instead of check skb. > > As a result, packet_read_pending() may be called twice in the loop. This > will be optimized in the following patch. > > Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET") > Cc: stable@kernel.org > Suggested-by: LongJun Tang <tanglongjun@kylinos.cn> > Signed-off-by: Yun Lu <luyun@kylinos.cn> > > --- > Changes in v4: > - Split to the fix alone. Thanks: Willem de Bruijn. > - Link to v3: https://lore.kernel.org/all/20250709095653.62469-3-luyun_611@163.com/ > > Changes in v3: > - Simplify the code and reuse ph to continue. Thanks: Eric Dumazet. > - Link to v2: https://lore.kernel.org/all/20250708020642.27838-1-luyun_611@163.com/ > > Changes in v2: > - Add a Fixes tag. > - Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/ > --- > --- > net/packet/af_packet.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c > index 7089b8c2a655..581a96ec8e1a 100644 > --- a/net/packet/af_packet.c > +++ b/net/packet/af_packet.c > @@ -2846,7 +2846,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) > ph = packet_current_frame(po, &po->tx_ring, > TP_STATUS_SEND_REQUEST); > if (unlikely(ph == NULL)) { > - if (need_wait && skb) { > + if (need_wait && packet_read_pending(&po->tx_ring)) { Unfortunately I did not immediately fully appreciate Eric's suggestion. My comments was If [..] the extra packet_read_pending() is already present, not newly introduced with the fix But of course that expensive call is newly introduced, so my suggestion was invalid. It's btw also not possible to mix net and net-next patches in a single series like this (see Documentation/process/maintainer-netdev.rst). But, instead of going back entirely to v2, perhaps we can make the logic a bit more obvious by just having a while (1) at the end to show that the only way to exit the loop (except errors) is in the ph == NULL branch. And break in that loop directly. There are two other ways to reach that while statement. A continue on PACKET_SOCK_TP_LOSS, or by regular control flow. In both cases, ph is non-zero, so the condition is true anyway. > timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo); > if (timeo <= 0) { > err = !timeo ? -ETIMEDOUT : -ERESTARTSYS; > -- > 2.43.0 >
在 2025/7/10 21:49, Willem de Bruijn 写道: > Yun Lu wrote: >> From: Yun Lu <luyun@kylinos.cn> >> >> When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for >> pending_refcnt to decrement to zero before returning. The pending_refcnt >> is decremented by 1 when the skb->destructor function is called, >> indicating that the skb has been successfully sent and needs to be >> destroyed. >> >> If an error occurs during this process, the tpacket_snd() function will >> exit and return error, but pending_refcnt may not yet have decremented to >> zero. Assuming the next send operation is executed immediately, but there >> are no available frames to be sent in tx_ring (i.e., packet_current_frame >> returns NULL), and skb is also NULL, the function will not execute >> wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it >> will enter a do-while loop, waiting for pending_refcnt to be zero. Even >> if the previous skb has completed transmission, the skb->destructor >> function can only be invoked in the ksoftirqd thread (assuming NAPI >> threading is enabled). When both the ksoftirqd thread and the tpacket_snd >> operation happen to run on the same CPU, and the CPU trapped in the >> do-while loop without yielding, the ksoftirqd thread will not get >> scheduled to run. As a result, pending_refcnt will never be reduced to >> zero, and the do-while loop cannot exit, eventually leading to a CPU soft >> lockup issue. >> >> In fact, skb is true for all but the first iterations of that loop, and >> as long as pending_refcnt is not zero, even if incremented by a previous >> call, wait_for_completion_interruptible_timeout() should be executed to >> yield the CPU, allowing the ksoftirqd thread to be scheduled. Therefore, >> the execution condition of this function should be modified to check if >> pending_refcnt is not zero, instead of check skb. >> >> As a result, packet_read_pending() may be called twice in the loop. This >> will be optimized in the following patch. >> >> Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET") >> Cc: stable@kernel.org >> Suggested-by: LongJun Tang <tanglongjun@kylinos.cn> >> Signed-off-by: Yun Lu <luyun@kylinos.cn> >> >> --- >> Changes in v4: >> - Split to the fix alone. Thanks: Willem de Bruijn. >> - Link to v3: https://lore.kernel.org/all/20250709095653.62469-3-luyun_611@163.com/ >> >> Changes in v3: >> - Simplify the code and reuse ph to continue. Thanks: Eric Dumazet. >> - Link to v2: https://lore.kernel.org/all/20250708020642.27838-1-luyun_611@163.com/ >> >> Changes in v2: >> - Add a Fixes tag. >> - Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/ >> --- >> --- >> net/packet/af_packet.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c >> index 7089b8c2a655..581a96ec8e1a 100644 >> --- a/net/packet/af_packet.c >> +++ b/net/packet/af_packet.c >> @@ -2846,7 +2846,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) >> ph = packet_current_frame(po, &po->tx_ring, >> TP_STATUS_SEND_REQUEST); >> if (unlikely(ph == NULL)) { >> - if (need_wait && skb) { >> + if (need_wait && packet_read_pending(&po->tx_ring)) { > Unfortunately I did not immediately fully appreciate Eric's > suggestion. > > My comments was > > If [..] the extra packet_read_pending() is already present, not > newly introduced with the fix > > But of course that expensive call is newly introduced, so my > suggestion was invalid. > > It's btw also not possible to mix net and net-next patches in a single > series like this (see Documentation/process/maintainer-netdev.rst). Sorry, I misunderstood your comments. In the next version, I will combine the second and third patches together. > > But, instead of going back entirely to v2, perhaps we can make the > logic a bit more obvious by just having a while (1) at the end to show > that the only way to exit the loop (except errors) is in the ph == NULL > branch. And break in that loop directly. > > There are two other ways to reach that while statement. A continue > on PACKET_SOCK_TP_LOSS, or by regular control flow. In both cases, ph > is non-zero, so the condition is true anyway. Following your suggestion, I tried modifying the code (as shown below), now the loop condition is still the same as origin, but the logic is now clearer and more obvious. Thanks. diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 7089b8c2a655..be608f07441f 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -2846,15 +2846,21 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) ph = packet_current_frame(po, &po->tx_ring, TP_STATUS_SEND_REQUEST); if (unlikely(ph == NULL)) { - if (need_wait && skb) { + /* Note: packet_read_pending() might be slow if we + * have to call it as it's per_cpu variable, but in + * fast-path we don't have to call it, only when ph + * is NULL, we need to check the pending_refcnt. + */ + if (need_wait && packet_read_pending(&po->tx_ring)) { timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo); if (timeo <= 0) { err = !timeo ? -ETIMEDOUT : -ERESTARTSYS; goto out_put; } - } - /* check for additional frames */ - continue; + /* check for additional frames */ + continue; + } else + break; } skb = NULL; @@ -2943,14 +2949,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) } packet_increment_head(&po->tx_ring); len_sum += tp_len; - } while (likely((ph != NULL) || - /* Note: packet_read_pending() might be slow if we have - * to call it as it's per_cpu variable, but in fast-path - * we already short-circuit the loop with the first - * condition, and luckily don't have to go that path - * anyway. - */ - (need_wait && packet_read_pending(&po->tx_ring)))); + } while (1); err = len_sum; goto out_put; > >> timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo); >> if (timeo <= 0) { >> err = !timeo ? -ETIMEDOUT : -ERESTARTSYS; >> -- >> 2.43.0 >>
luyun wrote: > > 在 2025/7/10 21:49, Willem de Bruijn 写道: > > Yun Lu wrote: > >> From: Yun Lu <luyun@kylinos.cn> > >> > >> When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for > >> pending_refcnt to decrement to zero before returning. The pending_refcnt > >> is decremented by 1 when the skb->destructor function is called, > >> indicating that the skb has been successfully sent and needs to be > >> destroyed. > >> > >> If an error occurs during this process, the tpacket_snd() function will > >> exit and return error, but pending_refcnt may not yet have decremented to > >> zero. Assuming the next send operation is executed immediately, but there > >> are no available frames to be sent in tx_ring (i.e., packet_current_frame > >> returns NULL), and skb is also NULL, the function will not execute > >> wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it > >> will enter a do-while loop, waiting for pending_refcnt to be zero. Even > >> if the previous skb has completed transmission, the skb->destructor > >> function can only be invoked in the ksoftirqd thread (assuming NAPI > >> threading is enabled). When both the ksoftirqd thread and the tpacket_snd > >> operation happen to run on the same CPU, and the CPU trapped in the > >> do-while loop without yielding, the ksoftirqd thread will not get > >> scheduled to run. As a result, pending_refcnt will never be reduced to > >> zero, and the do-while loop cannot exit, eventually leading to a CPU soft > >> lockup issue. > >> > >> In fact, skb is true for all but the first iterations of that loop, and > >> as long as pending_refcnt is not zero, even if incremented by a previous > >> call, wait_for_completion_interruptible_timeout() should be executed to > >> yield the CPU, allowing the ksoftirqd thread to be scheduled. Therefore, > >> the execution condition of this function should be modified to check if > >> pending_refcnt is not zero, instead of check skb. > >> > >> As a result, packet_read_pending() may be called twice in the loop. This > >> will be optimized in the following patch. > >> > >> Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET") > >> Cc: stable@kernel.org > >> Suggested-by: LongJun Tang <tanglongjun@kylinos.cn> > >> Signed-off-by: Yun Lu <luyun@kylinos.cn> > >> > >> --- > >> Changes in v4: > >> - Split to the fix alone. Thanks: Willem de Bruijn. > >> - Link to v3: https://lore.kernel.org/all/20250709095653.62469-3-luyun_611@163.com/ > >> > >> Changes in v3: > >> - Simplify the code and reuse ph to continue. Thanks: Eric Dumazet. > >> - Link to v2: https://lore.kernel.org/all/20250708020642.27838-1-luyun_611@163.com/ > >> > >> Changes in v2: > >> - Add a Fixes tag. > >> - Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/ > >> --- > >> --- > >> net/packet/af_packet.c | 2 +- > >> 1 file changed, 1 insertion(+), 1 deletion(-) > >> > >> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c > >> index 7089b8c2a655..581a96ec8e1a 100644 > >> --- a/net/packet/af_packet.c > >> +++ b/net/packet/af_packet.c > >> @@ -2846,7 +2846,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) > >> ph = packet_current_frame(po, &po->tx_ring, > >> TP_STATUS_SEND_REQUEST); > >> if (unlikely(ph == NULL)) { > >> - if (need_wait && skb) { > >> + if (need_wait && packet_read_pending(&po->tx_ring)) { > > Unfortunately I did not immediately fully appreciate Eric's > > suggestion. > > > > My comments was > > > > If [..] the extra packet_read_pending() is already present, not > > newly introduced with the fix > > > > But of course that expensive call is newly introduced, so my > > suggestion was invalid. > > > > It's btw also not possible to mix net and net-next patches in a single > > series like this (see Documentation/process/maintainer-netdev.rst). > > Sorry, I misunderstood your comments. In the next version, I will > combine the second and third patches together. My original suggestion was just wrong, sorry. Thanks for revising again. > > > > But, instead of going back entirely to v2, perhaps we can make the > > logic a bit more obvious by just having a while (1) at the end to show > > that the only way to exit the loop (except errors) is in the ph == NULL > > branch. And break in that loop directly. > > > > There are two other ways to reach that while statement. A continue > > on PACKET_SOCK_TP_LOSS, or by regular control flow. In both cases, ph > > is non-zero, so the condition is true anyway. > > Following your suggestion, I tried modifying the code (as shown below), > now the loop condition is still the same as origin, but the logic is now > clearer and more obvious.
© 2016 - 2025 Red Hat, Inc.