From: Yun Lu <luyun@kylinos.cn>
When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for
pending_refcnt to decrement to zero before returning. The pending_refcnt
is decremented by 1 when the skb->destructor function is called,
indicating that the skb has been successfully sent and needs to be
destroyed.
If an error occurs during this process, the tpacket_snd() function will
exit and return error, but pending_refcnt may not yet have decremented to
zero. Assuming the next send operation is executed immediately, but there
are no available frames to be sent in tx_ring (i.e., packet_current_frame
returns NULL), and skb is also NULL, the function will not execute
wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it
will enter a do-while loop, waiting for pending_refcnt to be zero. Even
if the previous skb has completed transmission, the skb->destructor
function can only be invoked in the ksoftirqd thread (assuming NAPI
threading is enabled). When both the ksoftirqd thread and the tpacket_snd
operation happen to run on the same CPU, and the CPU trapped in the
do-while loop without yielding, the ksoftirqd thread will not get
scheduled to run. As a result, pending_refcnt will never be reduced to
zero, and the do-while loop cannot exit, eventually leading to a CPU soft
lockup issue.
In fact, as long as pending_refcnt is not zero, even if skb is NULL,
wait_for_completion_interruptible_timeout() should be executed to yield
the CPU, allowing the ksoftirqd thread to be scheduled. Therefore, move
the penging_refcnt check to the start of the do-while loop, and reuse ph
to continue for the next iteration.
Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET")
Cc: stable@kernel.org
Suggested-by: LongJun Tang <tanglongjun@kylinos.cn>
Signed-off-by: Yun Lu <luyun@kylinos.cn>
---
Changes in v3:
- Simplify the code and reuse ph to continue. Thanks: Eric Dumazet.
- Link to v2: https://lore.kernel.org/all/20250708020642.27838-1-luyun_611@163.com/
Changes in v2:
- Add a Fixes tag.
- Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/
---
net/packet/af_packet.c | 21 ++++++++++++---------
1 file changed, 12 insertions(+), 9 deletions(-)
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 7089b8c2a655..89a5d2a3a720 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2846,11 +2846,21 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
ph = packet_current_frame(po, &po->tx_ring,
TP_STATUS_SEND_REQUEST);
if (unlikely(ph == NULL)) {
- if (need_wait && skb) {
+ /* Note: packet_read_pending() might be slow if we
+ * have to call it as it's per_cpu variable, but in
+ * fast-path we don't have to call it, only when ph
+ * is NULL, we need to check pending_refcnt.
+ */
+ if (need_wait && packet_read_pending(&po->tx_ring)) {
timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
if (timeo <= 0) {
err = !timeo ? -ETIMEDOUT : -ERESTARTSYS;
goto out_put;
+ } else {
+ /* Just reuse ph to continue for the next iteration, and
+ * ph will be reassigned at the start of the next iteration.
+ */
+ ph = (void *)1;
}
}
/* check for additional frames */
@@ -2943,14 +2953,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
}
packet_increment_head(&po->tx_ring);
len_sum += tp_len;
- } while (likely((ph != NULL) ||
- /* Note: packet_read_pending() might be slow if we have
- * to call it as it's per_cpu variable, but in fast-path
- * we already short-circuit the loop with the first
- * condition, and luckily don't have to go that path
- * anyway.
- */
- (need_wait && packet_read_pending(&po->tx_ring))));
+ } while (likely(ph != NULL))
err = len_sum;
goto out_put;
--
2.43.0
Hi Yun, kernel test robot noticed the following build errors: [auto build test ERROR on net-next/main] [also build test ERROR on net/main linus/master v6.16-rc5 next-20250709] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Yun-Lu/af_packet-fix-the-SO_SNDTIMEO-constraint-not-effective-on-tpacked_snd/20250709-175915 base: net-next/main patch link: https://lore.kernel.org/r/20250709095653.62469-3-luyun_611%40163.com patch subject: [PATCH v3 2/2] af_packet: fix soft lockup issue caused by tpacket_snd() config: i386-buildonly-randconfig-001-20250710 (https://download.01.org/0day-ci/archive/20250710/202507101547.Li8m6iCU-lkp@intel.com/config) compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250710/202507101547.Li8m6iCU-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202507101547.Li8m6iCU-lkp@intel.com/ All errors (new ones prefixed by >>): net/packet/af_packet.c: In function 'tpacket_snd': >> net/packet/af_packet.c:2956:37: error: expected ';' before 'err' 2956 | } while (likely(ph != NULL)) | ^ | ; 2957 | 2958 | err = len_sum; | ~~~ vim +2956 net/packet/af_packet.c 2769 2770 static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) 2771 { 2772 struct sk_buff *skb = NULL; 2773 struct net_device *dev; 2774 struct virtio_net_hdr *vnet_hdr = NULL; 2775 struct sockcm_cookie sockc; 2776 __be16 proto; 2777 int err, reserve = 0; 2778 void *ph; 2779 DECLARE_SOCKADDR(struct sockaddr_ll *, saddr, msg->msg_name); 2780 bool need_wait = !(msg->msg_flags & MSG_DONTWAIT); 2781 int vnet_hdr_sz = READ_ONCE(po->vnet_hdr_sz); 2782 unsigned char *addr = NULL; 2783 int tp_len, size_max; 2784 void *data; 2785 int len_sum = 0; 2786 int status = TP_STATUS_AVAILABLE; 2787 int hlen, tlen, copylen = 0; 2788 long timeo; 2789 2790 mutex_lock(&po->pg_vec_lock); 2791 2792 /* packet_sendmsg() check on tx_ring.pg_vec was lockless, 2793 * we need to confirm it under protection of pg_vec_lock. 2794 */ 2795 if (unlikely(!po->tx_ring.pg_vec)) { 2796 err = -EBUSY; 2797 goto out; 2798 } 2799 if (likely(saddr == NULL)) { 2800 dev = packet_cached_dev_get(po); 2801 proto = READ_ONCE(po->num); 2802 } else { 2803 err = -EINVAL; 2804 if (msg->msg_namelen < sizeof(struct sockaddr_ll)) 2805 goto out; 2806 if (msg->msg_namelen < (saddr->sll_halen 2807 + offsetof(struct sockaddr_ll, 2808 sll_addr))) 2809 goto out; 2810 proto = saddr->sll_protocol; 2811 dev = dev_get_by_index(sock_net(&po->sk), saddr->sll_ifindex); 2812 if (po->sk.sk_socket->type == SOCK_DGRAM) { 2813 if (dev && msg->msg_namelen < dev->addr_len + 2814 offsetof(struct sockaddr_ll, sll_addr)) 2815 goto out_put; 2816 addr = saddr->sll_addr; 2817 } 2818 } 2819 2820 err = -ENXIO; 2821 if (unlikely(dev == NULL)) 2822 goto out; 2823 err = -ENETDOWN; 2824 if (unlikely(!(dev->flags & IFF_UP))) 2825 goto out_put; 2826 2827 sockcm_init(&sockc, &po->sk); 2828 if (msg->msg_controllen) { 2829 err = sock_cmsg_send(&po->sk, msg, &sockc); 2830 if (unlikely(err)) 2831 goto out_put; 2832 } 2833 2834 if (po->sk.sk_socket->type == SOCK_RAW) 2835 reserve = dev->hard_header_len; 2836 size_max = po->tx_ring.frame_size 2837 - (po->tp_hdrlen - sizeof(struct sockaddr_ll)); 2838 2839 if ((size_max > dev->mtu + reserve + VLAN_HLEN) && !vnet_hdr_sz) 2840 size_max = dev->mtu + reserve + VLAN_HLEN; 2841 2842 timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT); 2843 reinit_completion(&po->skb_completion); 2844 2845 do { 2846 ph = packet_current_frame(po, &po->tx_ring, 2847 TP_STATUS_SEND_REQUEST); 2848 if (unlikely(ph == NULL)) { 2849 /* Note: packet_read_pending() might be slow if we 2850 * have to call it as it's per_cpu variable, but in 2851 * fast-path we don't have to call it, only when ph 2852 * is NULL, we need to check pending_refcnt. 2853 */ 2854 if (need_wait && packet_read_pending(&po->tx_ring)) { 2855 timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo); 2856 if (timeo <= 0) { 2857 err = !timeo ? -ETIMEDOUT : -ERESTARTSYS; 2858 goto out_put; 2859 } else { 2860 /* Just reuse ph to continue for the next iteration, and 2861 * ph will be reassigned at the start of the next iteration. 2862 */ 2863 ph = (void *)1; 2864 } 2865 } 2866 /* check for additional frames */ 2867 continue; 2868 } 2869 2870 skb = NULL; 2871 tp_len = tpacket_parse_header(po, ph, size_max, &data); 2872 if (tp_len < 0) 2873 goto tpacket_error; 2874 2875 status = TP_STATUS_SEND_REQUEST; 2876 hlen = LL_RESERVED_SPACE(dev); 2877 tlen = dev->needed_tailroom; 2878 if (vnet_hdr_sz) { 2879 vnet_hdr = data; 2880 data += vnet_hdr_sz; 2881 tp_len -= vnet_hdr_sz; 2882 if (tp_len < 0 || 2883 __packet_snd_vnet_parse(vnet_hdr, tp_len)) { 2884 tp_len = -EINVAL; 2885 goto tpacket_error; 2886 } 2887 copylen = __virtio16_to_cpu(vio_le(), 2888 vnet_hdr->hdr_len); 2889 } 2890 copylen = max_t(int, copylen, dev->hard_header_len); 2891 skb = sock_alloc_send_skb(&po->sk, 2892 hlen + tlen + sizeof(struct sockaddr_ll) + 2893 (copylen - dev->hard_header_len), 2894 !need_wait, &err); 2895 2896 if (unlikely(skb == NULL)) { 2897 /* we assume the socket was initially writeable ... */ 2898 if (likely(len_sum > 0)) 2899 err = len_sum; 2900 goto out_status; 2901 } 2902 tp_len = tpacket_fill_skb(po, skb, ph, dev, data, tp_len, proto, 2903 addr, hlen, copylen, &sockc); 2904 if (likely(tp_len >= 0) && 2905 tp_len > dev->mtu + reserve && 2906 !vnet_hdr_sz && 2907 !packet_extra_vlan_len_allowed(dev, skb)) 2908 tp_len = -EMSGSIZE; 2909 2910 if (unlikely(tp_len < 0)) { 2911 tpacket_error: 2912 if (packet_sock_flag(po, PACKET_SOCK_TP_LOSS)) { 2913 __packet_set_status(po, ph, 2914 TP_STATUS_AVAILABLE); 2915 packet_increment_head(&po->tx_ring); 2916 kfree_skb(skb); 2917 continue; 2918 } else { 2919 status = TP_STATUS_WRONG_FORMAT; 2920 err = tp_len; 2921 goto out_status; 2922 } 2923 } 2924 2925 if (vnet_hdr_sz) { 2926 if (virtio_net_hdr_to_skb(skb, vnet_hdr, vio_le())) { 2927 tp_len = -EINVAL; 2928 goto tpacket_error; 2929 } 2930 virtio_net_hdr_set_proto(skb, vnet_hdr); 2931 } 2932 2933 skb->destructor = tpacket_destruct_skb; 2934 __packet_set_status(po, ph, TP_STATUS_SENDING); 2935 packet_inc_pending(&po->tx_ring); 2936 2937 status = TP_STATUS_SEND_REQUEST; 2938 err = packet_xmit(po, skb); 2939 if (unlikely(err != 0)) { 2940 if (err > 0) 2941 err = net_xmit_errno(err); 2942 if (err && __packet_get_status(po, ph) == 2943 TP_STATUS_AVAILABLE) { 2944 /* skb was destructed already */ 2945 skb = NULL; 2946 goto out_status; 2947 } 2948 /* 2949 * skb was dropped but not destructed yet; 2950 * let's treat it like congestion or err < 0 2951 */ 2952 err = 0; 2953 } 2954 packet_increment_head(&po->tx_ring); 2955 len_sum += tp_len; > 2956 } while (likely(ph != NULL)) 2957 2958 err = len_sum; 2959 goto out_put; 2960 2961 out_status: 2962 __packet_set_status(po, ph, status); 2963 kfree_skb(skb); 2964 out_put: 2965 dev_put(dev); 2966 out: 2967 mutex_unlock(&po->pg_vec_lock); 2968 return err; 2969 } 2970 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki
Yun Lu wrote: > From: Yun Lu <luyun@kylinos.cn> > > When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for > pending_refcnt to decrement to zero before returning. The pending_refcnt > is decremented by 1 when the skb->destructor function is called, > indicating that the skb has been successfully sent and needs to be > destroyed. > > If an error occurs during this process, the tpacket_snd() function will > exit and return error, but pending_refcnt may not yet have decremented to > zero. Assuming the next send operation is executed immediately, but there > are no available frames to be sent in tx_ring (i.e., packet_current_frame > returns NULL), and skb is also NULL This is a very specific edge case. And arguably the goal is to wait for any pending skbs still, even if from a previous call. skb is true for all but the first iterations of that loop. So your earlier patch - if (need_wait && skb) { + if (need_wait && packet_read_pending(&po->tx_ring)) { Is more concise and more obviously correct. >, the function will not execute > wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it > will enter a do-while loop, waiting for pending_refcnt to be zero. Even > if the previous skb has completed transmission, the skb->destructor > function can only be invoked in the ksoftirqd thread (assuming NAPI > threading is enabled). When both the ksoftirqd thread and the tpacket_snd > operation happen to run on the same CPU, and the CPU trapped in the > do-while loop without yielding, the ksoftirqd thread will not get > scheduled to run. Interestingly, this is quite similar to the issue that caused adding the completion in the first place. Commit 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET") added the completion because a SCHED_FIFO task could delay ksoftirqd indefinitely. > As a result, pending_refcnt will never be reduced to > zero, and the do-while loop cannot exit, eventually leading to a CPU soft > lockup issue. > > In fact, as long as pending_refcnt is not zero, even if skb is NULL, > wait_for_completion_interruptible_timeout() should be executed to yield > the CPU, allowing the ksoftirqd thread to be scheduled. Therefore, move > the penging_refcnt check to the start of the do-while loop, and reuse ph > to continue for the next iteration. > > Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET") > Cc: stable@kernel.org > Suggested-by: LongJun Tang <tanglongjun@kylinos.cn> > Signed-off-by: Yun Lu <luyun@kylinos.cn> > > --- > Changes in v3: > - Simplify the code and reuse ph to continue. Thanks: Eric Dumazet. > - Link to v2: https://lore.kernel.org/all/20250708020642.27838-1-luyun_611@163.com/ If the fix alone is more obvious without this optimization, and the extra packet_read_pending() is already present, not newly introduced with the fix, then I would prefer to split the fix (to net, and stable) from the optimization (to net-next). > Changes in v2: > - Add a Fixes tag. > - Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/ > --- > net/packet/af_packet.c | 21 ++++++++++++--------- > 1 file changed, 12 insertions(+), 9 deletions(-) > > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c > index 7089b8c2a655..89a5d2a3a720 100644 > --- a/net/packet/af_packet.c > +++ b/net/packet/af_packet.c > @@ -2846,11 +2846,21 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) > ph = packet_current_frame(po, &po->tx_ring, > TP_STATUS_SEND_REQUEST); > if (unlikely(ph == NULL)) { > - if (need_wait && skb) { > + /* Note: packet_read_pending() might be slow if we > + * have to call it as it's per_cpu variable, but in > + * fast-path we don't have to call it, only when ph > + * is NULL, we need to check pending_refcnt. > + */ > + if (need_wait && packet_read_pending(&po->tx_ring)) { > timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo); > if (timeo <= 0) { > err = !timeo ? -ETIMEDOUT : -ERESTARTSYS; > goto out_put; > + } else { > + /* Just reuse ph to continue for the next iteration, and > + * ph will be reassigned at the start of the next iteration. > + */ > + ph = (void *)1; > } > } > /* check for additional frames */ > @@ -2943,14 +2953,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) > } > packet_increment_head(&po->tx_ring); > len_sum += tp_len; > - } while (likely((ph != NULL) || > - /* Note: packet_read_pending() might be slow if we have > - * to call it as it's per_cpu variable, but in fast-path > - * we already short-circuit the loop with the first > - * condition, and luckily don't have to go that path > - * anyway. > - */ > - (need_wait && packet_read_pending(&po->tx_ring)))); > + } while (likely(ph != NULL)) > > err = len_sum; > goto out_put; > -- > 2.43.0 >
在 2025/7/10 05:14, Willem de Bruijn 写道: > Yun Lu wrote: >> From: Yun Lu <luyun@kylinos.cn> >> >> When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for >> pending_refcnt to decrement to zero before returning. The pending_refcnt >> is decremented by 1 when the skb->destructor function is called, >> indicating that the skb has been successfully sent and needs to be >> destroyed. >> >> If an error occurs during this process, the tpacket_snd() function will >> exit and return error, but pending_refcnt may not yet have decremented to >> zero. Assuming the next send operation is executed immediately, but there >> are no available frames to be sent in tx_ring (i.e., packet_current_frame >> returns NULL), and skb is also NULL > This is a very specific edge case. And arguably the goal is to wait > for any pending skbs still, even if from a previous call. > > skb is true for all but the first iterations of that loop. So your > earlier patch > > - if (need_wait && skb) { > + if (need_wait && packet_read_pending(&po->tx_ring)) { > > Is more concise and more obviously correct. > >> , the function will not execute >> wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it >> will enter a do-while loop, waiting for pending_refcnt to be zero. Even >> if the previous skb has completed transmission, the skb->destructor >> function can only be invoked in the ksoftirqd thread (assuming NAPI >> threading is enabled). When both the ksoftirqd thread and the tpacket_snd >> operation happen to run on the same CPU, and the CPU trapped in the >> do-while loop without yielding, the ksoftirqd thread will not get >> scheduled to run. > Interestingly, this is quite similar to the issue that caused adding > the completion in the first place. Commit 89ed5b519004 ("af_packet: > Block execution of tasks waiting for transmit to complete in > AF_PACKET") added the completion because a SCHED_FIFO task could delay > ksoftirqd indefinitely. > >> As a result, pending_refcnt will never be reduced to >> zero, and the do-while loop cannot exit, eventually leading to a CPU soft >> lockup issue. >> >> In fact, as long as pending_refcnt is not zero, even if skb is NULL, >> wait_for_completion_interruptible_timeout() should be executed to yield >> the CPU, allowing the ksoftirqd thread to be scheduled. Therefore, move >> the penging_refcnt check to the start of the do-while loop, and reuse ph >> to continue for the next iteration. >> >> Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET") >> Cc: stable@kernel.org >> Suggested-by: LongJun Tang <tanglongjun@kylinos.cn> >> Signed-off-by: Yun Lu <luyun@kylinos.cn> >> >> --- >> Changes in v3: >> - Simplify the code and reuse ph to continue. Thanks: Eric Dumazet. >> - Link to v2: https://lore.kernel.org/all/20250708020642.27838-1-luyun_611@163.com/ > If the fix alone is more obvious without this optimization, and > the extra packet_read_pending() is already present, not newly > introduced with the fix, then I would prefer to split the fix (to net, > and stable) from the optimization (to net-next). Alright, referring to your suggestion, I will split this patch into two for the next version: one to fix the issue (as the first version, to net, and stable), and the other to optimize the code (to net-next). Thanks for your review and suggestion. > >> Changes in v2: >> - Add a Fixes tag. >> - Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/ >> --- >> net/packet/af_packet.c | 21 ++++++++++++--------- >> 1 file changed, 12 insertions(+), 9 deletions(-) >> >> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c >> index 7089b8c2a655..89a5d2a3a720 100644 >> --- a/net/packet/af_packet.c >> +++ b/net/packet/af_packet.c >> @@ -2846,11 +2846,21 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) >> ph = packet_current_frame(po, &po->tx_ring, >> TP_STATUS_SEND_REQUEST); >> if (unlikely(ph == NULL)) { >> - if (need_wait && skb) { >> + /* Note: packet_read_pending() might be slow if we >> + * have to call it as it's per_cpu variable, but in >> + * fast-path we don't have to call it, only when ph >> + * is NULL, we need to check pending_refcnt. >> + */ >> + if (need_wait && packet_read_pending(&po->tx_ring)) { >> timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo); >> if (timeo <= 0) { >> err = !timeo ? -ETIMEDOUT : -ERESTARTSYS; >> goto out_put; >> + } else { >> + /* Just reuse ph to continue for the next iteration, and >> + * ph will be reassigned at the start of the next iteration. >> + */ >> + ph = (void *)1; >> } >> } >> /* check for additional frames */ >> @@ -2943,14 +2953,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) >> } >> packet_increment_head(&po->tx_ring); >> len_sum += tp_len; >> - } while (likely((ph != NULL) || >> - /* Note: packet_read_pending() might be slow if we have >> - * to call it as it's per_cpu variable, but in fast-path >> - * we already short-circuit the loop with the first >> - * condition, and luckily don't have to go that path >> - * anyway. >> - */ >> - (need_wait && packet_read_pending(&po->tx_ring)))); >> + } while (likely(ph != NULL)) >> >> err = len_sum; >> goto out_put; >> -- >> 2.43.0 >>
On Wed, Jul 09, 2025 at 05:56:53PM +0800, Yun Lu wrote: ... > @@ -2943,14 +2953,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) > } > packet_increment_head(&po->tx_ring); > len_sum += tp_len; > - } while (likely((ph != NULL) || > - /* Note: packet_read_pending() might be slow if we have > - * to call it as it's per_cpu variable, but in fast-path > - * we already short-circuit the loop with the first > - * condition, and luckily don't have to go that path > - * anyway. > - */ > - (need_wait && packet_read_pending(&po->tx_ring)))); > + } while (likely(ph != NULL)) A semicolon is needed at the end of the line above. > > err = len_sum; > goto out_put; -- pw-bot: changes-requested
在 2025/7/10 02:14, Simon Horman 写道: > On Wed, Jul 09, 2025 at 05:56:53PM +0800, Yun Lu wrote: > > ... > >> @@ -2943,14 +2953,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) >> } >> packet_increment_head(&po->tx_ring); >> len_sum += tp_len; >> - } while (likely((ph != NULL) || >> - /* Note: packet_read_pending() might be slow if we have >> - * to call it as it's per_cpu variable, but in fast-path >> - * we already short-circuit the loop with the first >> - * condition, and luckily don't have to go that path >> - * anyway. >> - */ >> - (need_wait && packet_read_pending(&po->tx_ring)))); >> + } while (likely(ph != NULL)) > A semicolon is needed at the end of the line above. Sorry, this was my mistake. I will fix it in the next version. Thank you for pointing it out. > >> >> err = len_sum; >> goto out_put;
On Wed, Jul 9, 2025 at 2:57 AM Yun Lu <luyun_611@163.com> wrote: > > From: Yun Lu <luyun@kylinos.cn> > > When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for > pending_refcnt to decrement to zero before returning. The pending_refcnt > is decremented by 1 when the skb->destructor function is called, > indicating that the skb has been successfully sent and needs to be > destroyed. > > If an error occurs during this process, the tpacket_snd() function will > exit and return error, but pending_refcnt may not yet have decremented to > zero. Assuming the next send operation is executed immediately, but there > are no available frames to be sent in tx_ring (i.e., packet_current_frame > returns NULL), and skb is also NULL, the function will not execute > wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it > will enter a do-while loop, waiting for pending_refcnt to be zero. Even > if the previous skb has completed transmission, the skb->destructor > function can only be invoked in the ksoftirqd thread (assuming NAPI > threading is enabled). When both the ksoftirqd thread and the tpacket_snd > operation happen to run on the same CPU, and the CPU trapped in the > do-while loop without yielding, the ksoftirqd thread will not get > scheduled to run. As a result, pending_refcnt will never be reduced to > zero, and the do-while loop cannot exit, eventually leading to a CPU soft > lockup issue. > > In fact, as long as pending_refcnt is not zero, even if skb is NULL, > wait_for_completion_interruptible_timeout() should be executed to yield > the CPU, allowing the ksoftirqd thread to be scheduled. Therefore, move > the penging_refcnt check to the start of the do-while loop, and reuse ph > to continue for the next iteration. > > Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET") > Cc: stable@kernel.org > Suggested-by: LongJun Tang <tanglongjun@kylinos.cn> > Signed-off-by: Yun Lu <luyun@kylinos.cn> > > --- > Changes in v3: > - Simplify the code and reuse ph to continue. Thanks: Eric Dumazet. > - Link to v2: https://lore.kernel.org/all/20250708020642.27838-1-luyun_611@163.com/ > > Changes in v2: > - Add a Fixes tag. > - Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/ > --- > net/packet/af_packet.c | 21 ++++++++++++--------- > 1 file changed, 12 insertions(+), 9 deletions(-) > > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c > index 7089b8c2a655..89a5d2a3a720 100644 > --- a/net/packet/af_packet.c > +++ b/net/packet/af_packet.c > @@ -2846,11 +2846,21 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) > ph = packet_current_frame(po, &po->tx_ring, > TP_STATUS_SEND_REQUEST); > if (unlikely(ph == NULL)) { > - if (need_wait && skb) { > + /* Note: packet_read_pending() might be slow if we > + * have to call it as it's per_cpu variable, but in > + * fast-path we don't have to call it, only when ph > + * is NULL, we need to check pending_refcnt. > + */ > + if (need_wait && packet_read_pending(&po->tx_ring)) { > timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo); > if (timeo <= 0) { > err = !timeo ? -ETIMEDOUT : -ERESTARTSYS; > goto out_put; > + } else { nit (in case a new version is sent) : No need for an else {} after a "goto XXXXX;" if (....) { ..... goto out_put; } /* Just reuse ph to continue for the next iteration, and... * ..... */ ph = (void *)1; Reviewed-by: Eric Dumazet <edumazet@google.com>
在 2025/7/9 20:44, Eric Dumazet 写道: > On Wed, Jul 9, 2025 at 2:57 AM Yun Lu <luyun_611@163.com> wrote: >> From: Yun Lu <luyun@kylinos.cn> >> >> When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for >> pending_refcnt to decrement to zero before returning. The pending_refcnt >> is decremented by 1 when the skb->destructor function is called, >> indicating that the skb has been successfully sent and needs to be >> destroyed. >> >> If an error occurs during this process, the tpacket_snd() function will >> exit and return error, but pending_refcnt may not yet have decremented to >> zero. Assuming the next send operation is executed immediately, but there >> are no available frames to be sent in tx_ring (i.e., packet_current_frame >> returns NULL), and skb is also NULL, the function will not execute >> wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it >> will enter a do-while loop, waiting for pending_refcnt to be zero. Even >> if the previous skb has completed transmission, the skb->destructor >> function can only be invoked in the ksoftirqd thread (assuming NAPI >> threading is enabled). When both the ksoftirqd thread and the tpacket_snd >> operation happen to run on the same CPU, and the CPU trapped in the >> do-while loop without yielding, the ksoftirqd thread will not get >> scheduled to run. As a result, pending_refcnt will never be reduced to >> zero, and the do-while loop cannot exit, eventually leading to a CPU soft >> lockup issue. >> >> In fact, as long as pending_refcnt is not zero, even if skb is NULL, >> wait_for_completion_interruptible_timeout() should be executed to yield >> the CPU, allowing the ksoftirqd thread to be scheduled. Therefore, move >> the penging_refcnt check to the start of the do-while loop, and reuse ph >> to continue for the next iteration. >> >> Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET") >> Cc: stable@kernel.org >> Suggested-by: LongJun Tang <tanglongjun@kylinos.cn> >> Signed-off-by: Yun Lu <luyun@kylinos.cn> >> >> --- >> Changes in v3: >> - Simplify the code and reuse ph to continue. Thanks: Eric Dumazet. >> - Link to v2: https://lore.kernel.org/all/20250708020642.27838-1-luyun_611@163.com/ >> >> Changes in v2: >> - Add a Fixes tag. >> - Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/ >> --- >> net/packet/af_packet.c | 21 ++++++++++++--------- >> 1 file changed, 12 insertions(+), 9 deletions(-) >> >> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c >> index 7089b8c2a655..89a5d2a3a720 100644 >> --- a/net/packet/af_packet.c >> +++ b/net/packet/af_packet.c >> @@ -2846,11 +2846,21 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) >> ph = packet_current_frame(po, &po->tx_ring, >> TP_STATUS_SEND_REQUEST); >> if (unlikely(ph == NULL)) { >> - if (need_wait && skb) { >> + /* Note: packet_read_pending() might be slow if we >> + * have to call it as it's per_cpu variable, but in >> + * fast-path we don't have to call it, only when ph >> + * is NULL, we need to check pending_refcnt. >> + */ >> + if (need_wait && packet_read_pending(&po->tx_ring)) { >> timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo); >> if (timeo <= 0) { >> err = !timeo ? -ETIMEDOUT : -ERESTARTSYS; >> goto out_put; >> + } else { > nit (in case a new version is sent) : No need for an else {} after a > "goto XXXXX;" > > if (....) { > ..... > goto out_put; > } > /* Just reuse ph to continue for the next iteration, and... > * ..... > */ > ph = (void *)1; Yes, the code of "else {} " is redundant. I will remove it in the next version. Thanks. > > > Reviewed-by: Eric Dumazet <edumazet@google.com>
© 2016 - 2025 Red Hat, Inc.