[PATCH] net: set is_cwnd_limited when the small queue check fails

Peng Yu posted 1 patch 3 months, 2 weeks ago
net/ipv4/tcp_output.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
The limit of the small queue check is calculated from the pacing rate,
and the pacing rate is calculated from the cwnd. If the cwnd is small,
the small queue check may fail.
When the small queue check fails, the TCP layer sends fewer packets,
so tcp_is_cwnd_limited() always returns false and the cwnd gets no
chance to be updated.
Since the cwnd is never updated, it stays small, so the pacing rate
stays small, the limit of the small queue check stays small, and the
small queue check keeps failing.
This is a kind of deadlock: when a TCP flow gets into this situation,
its throughput becomes very small, obviously less than the throughput
it should achieve.
Set is_cwnd_limited to true when the small queue check fails, so the
cwnd gets a chance to be updated and the deadlock is broken.

Below ss output shows this issue:

skmem:(r0,rb131072,
t7712, <------------------------------ wmem_alloc = 7712
tb243712,f2128,w219056,o0,bl0,d0)
ts sack cubic wscale:7,10 rto:224 rtt:23.364/0.019 ato:40 mss:1448
pmtu:8500 rcvmss:536 advmss:8448
cwnd:28 <------------------------------ cwnd=28
bytes_sent:2166208 bytes_acked:2148832 bytes_received:37
segs_out:1497 segs_in:751 data_segs_out:1496 data_segs_in:1
send 13882554bps lastsnd:7 lastrcv:2992 lastack:7
pacing_rate 27764216bps <--------------------- pacing_rate=27764216bps
delivery_rate 5786688bps delivered:1485 busy:2991ms unacked:12
rcv_space:57088 rcv_ssthresh:57088 notsent:188240
minrtt:23.319 snd_wnd:57088

limit=(27764216 / 8) / 1024 = 3389 < 7712
So the small queue check fails. When this happens, the throughput is
obviously lower than in the normal situation.

By setting is_cwnd_limited to true when the small queue check fails,
we can avoid this issue: the cwnd can increase to a reasonable size
(about 4000 in my test environment), and the small queue check no
longer fails.

Signed-off-by: Peng Yu <peng.yu@alibaba-inc.com>
---
 net/ipv4/tcp_output.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index b94efb3050d2..8c70acf3a060 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2985,8 +2985,10 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
 			break;
 
-		if (tcp_small_queue_check(sk, skb, 0))
+		if (tcp_small_queue_check(sk, skb, 0)) {
+			is_cwnd_limited = true;
 			break;
+		}
 
 		/* Argh, we hit an empty skb(), presumably a thread
 		 * is sleeping in sendmsg()/sk_stream_wait_memory().
-- 
2.47.3
Re: [PATCH] net: set is_cwnd_limited when the small queue check fails
Posted by Eric Dumazet 3 months, 2 weeks ago
On Sun, Oct 19, 2025 at 10:00 AM Peng Yu <yupeng0921@gmail.com> wrote:
>
> The limit of the small queue check is calculated from the pacing rate,
> and the pacing rate is calculated from the cwnd. If the cwnd is small,
> the small queue check may fail.
> When the small queue check fails, the TCP layer sends fewer packets,
> so tcp_is_cwnd_limited() always returns false and the cwnd gets no
> chance to be updated.
> Since the cwnd is never updated, it stays small, so the pacing rate
> stays small, the limit of the small queue check stays small, and the
> small queue check keeps failing.
> This is a kind of deadlock: when a TCP flow gets into this situation,
> its throughput becomes very small, obviously less than the throughput
> it should achieve.
> Set is_cwnd_limited to true when the small queue check fails, so the
> cwnd gets a chance to be updated and the deadlock is broken.
>
> Below ss output shows this issue:
>
> skmem:(r0,rb131072,
> t7712, <------------------------------ wmem_alloc = 7712
> tb243712,f2128,w219056,o0,bl0,d0)
> ts sack cubic wscale:7,10 rto:224 rtt:23.364/0.019 ato:40 mss:1448
> pmtu:8500 rcvmss:536 advmss:8448
> cwnd:28 <------------------------------ cwnd=28
> bytes_sent:2166208 bytes_acked:2148832 bytes_received:37
> segs_out:1497 segs_in:751 data_segs_out:1496 data_segs_in:1
> send 13882554bps lastsnd:7 lastrcv:2992 lastack:7
> pacing_rate 27764216bps <--------------------- pacing_rate=27764216bps
> delivery_rate 5786688bps delivered:1485 busy:2991ms unacked:12
> rcv_space:57088 rcv_ssthresh:57088 notsent:188240
> minrtt:23.319 snd_wnd:57088
>
> limit=(27764216 / 8) / 1024 = 3389 < 7712
> So the small queue check fails. When this happens, the throughput is
> obviously lower than in the normal situation.
>
> By setting is_cwnd_limited to true when the small queue check fails,
> we can avoid this issue: the cwnd can increase to a reasonable size
> (about 4000 in my test environment), and the small queue check no
> longer fails.


>
> Signed-off-by: Peng Yu <peng.yu@alibaba-inc.com>
> ---
>  net/ipv4/tcp_output.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index b94efb3050d2..8c70acf3a060 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2985,8 +2985,10 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>                     unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
>                         break;
>
> -               if (tcp_small_queue_check(sk, skb, 0))
> +               if (tcp_small_queue_check(sk, skb, 0)) {
> +                       is_cwnd_limited = true;
>                         break;
> +               }
>
>                 /* Argh, we hit an empty skb(), presumably a thread
>                  * is sleeping in sendmsg()/sk_stream_wait_memory().
> --
> 2.47.3

Sorry, this makes no sense to me.  CWND_LIMITED should not be hijacked.

Something else is preventing your flows from reaching nominal speed,
because we have not seen anything like that.

It is probably a driver issue or a receive-side issue: instead of
trying to work around the issue, please root-cause it.
Re: [PATCH] net: set is_cwnd_limited when the small queue check fails
Posted by YU, Peng 3 months, 2 weeks ago
I think we know the root cause in the driver. We are using the
virtio_net driver. We found that the issue happens after this driver
commit:

b92f1e6751a6 virtio-net: transmit napi

According to our tests, the issue happens if we apply the change below:


 static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
 {
        struct virtio_net_hdr_mrg_rxbuf *hdr;
@@ -1130,6 +1174,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
        int err;
        struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
        bool kick = !skb->xmit_more;
+       bool use_napi = sq->napi.weight;

        /* Free up any pending old buffers before queueing new ones. */
        free_old_xmit_skbs(sq);
@@ -1152,8 +1197,10 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
        }

        /* Don't wait up for transmitted skbs to be freed. */
-       skb_orphan(skb);
-       nf_reset(skb);
+       if (!use_napi) {
+               skb_orphan(skb);
+               nf_reset(skb);
+       }


Before this change, the driver invoked skb_orphan immediately when it
received an skb, so the TCP layer decreased wmem_alloc right away and
the small queue check would not fail. After this change, the
virtio_net driver tells the TCP layer to decrease wmem_alloc only when
the skb is really sent out.
If we set use_napi to false, the virtio_net driver invokes skb_orphan
immediately as before, and the issue does not happen.
But invoking skb_orphan in start_xmit looks like a workaround to me,
and I'm not sure if we should roll back that change. The small queue
check and the cwnd get into a kind of "dead lock" situation, so I
suppose we should fix that "dead lock". If you believe we shouldn't
change the TCP layer for this issue, may I know the correct direction
to resolve it? Should we modify the virtio_net driver to always invoke
skb_orphan as before?
As a workaround, we set the virtio_net module parameter napi_tx to
false, so use_napi is false as well and the issue does not happen. But
we do want to enable napi_tx, so may I know your suggestion for this
issue?


------------------------------------------------------------------
From:Eric Dumazet <edumazet@google.com>
Send Time:2025 Oct. 20 (Mon.) 01:43
To:Peng Yu<yupeng0921@gmail.com>
CC:ncardwell<ncardwell@google.com>; kuniyu<kuniyu@google.com>; netdev<netdev@vger.kernel.org>; "linux-kernel"<linux-kernel@vger.kernel.org>; Peng YU<peng.yu@alibaba-inc.com>
Subject:Re: [PATCH] net: set is_cwnd_limited when the small queue check fails


On Sun, Oct 19, 2025 at 10:00 AM Peng Yu <yupeng0921@gmail.com> wrote:
>
> The limit of the small queue check is calculated from the pacing rate,
> and the pacing rate is calculated from the cwnd. If the cwnd is small,
> the small queue check may fail.
> When the small queue check fails, the TCP layer sends fewer packets,
> so tcp_is_cwnd_limited() always returns false and the cwnd gets no
> chance to be updated.
> Since the cwnd is never updated, it stays small, so the pacing rate
> stays small, the limit of the small queue check stays small, and the
> small queue check keeps failing.
> This is a kind of deadlock: when a TCP flow gets into this situation,
> its throughput becomes very small, obviously less than the throughput
> it should achieve.
> Set is_cwnd_limited to true when the small queue check fails, so the
> cwnd gets a chance to be updated and the deadlock is broken.
>
> Below ss output shows this issue:
>
> skmem:(r0,rb131072,
> t7712, <------------------------------ wmem_alloc = 7712
> tb243712,f2128,w219056,o0,bl0,d0)
> ts sack cubic wscale:7,10 rto:224 rtt:23.364/0.019 ato:40 mss:1448
> pmtu:8500 rcvmss:536 advmss:8448
> cwnd:28 <------------------------------ cwnd=28
> bytes_sent:2166208 bytes_acked:2148832 bytes_received:37
> segs_out:1497 segs_in:751 data_segs_out:1496 data_segs_in:1
> send 13882554bps lastsnd:7 lastrcv:2992 lastack:7
> pacing_rate 27764216bps <--------------------- pacing_rate=27764216bps
> delivery_rate 5786688bps delivered:1485 busy:2991ms unacked:12
> rcv_space:57088 rcv_ssthresh:57088 notsent:188240
> minrtt:23.319 snd_wnd:57088
>
> limit=(27764216 / 8) / 1024 = 3389 < 7712
> So the small queue check fails. When this happens, the throughput is
> obviously lower than in the normal situation.
>
> By setting is_cwnd_limited to true when the small queue check fails,
> we can avoid this issue: the cwnd can increase to a reasonable size
> (about 4000 in my test environment), and the small queue check no
> longer fails.


>
> Signed-off-by: Peng Yu <peng.yu@alibaba-inc.com>
> ---
>  net/ipv4/tcp_output.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index b94efb3050d2..8c70acf3a060 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2985,8 +2985,10 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>                     unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
>                         break;
>
> -               if (tcp_small_queue_check(sk, skb, 0))
> +               if (tcp_small_queue_check(sk, skb, 0)) {
> +                       is_cwnd_limited = true;
>                         break;
> +               }
>
>                 /* Argh, we hit an empty skb(), presumably a thread
>                  * is sleeping in sendmsg()/sk_stream_wait_memory().
> --
> 2.47.3

Sorry, this makes no sense to me.  CWND_LIMITED should not be hijacked.

Something else is preventing your flows from reaching nominal speed,
because we have not seen anything like that.

It is probably a driver issue or a receive-side issue: instead of
trying to work around the issue, please root-cause it.
Re: [PATCH] net: set is_cwnd_limited when the small queue check fails
Posted by Jason Xing 3 months, 2 weeks ago
On Mon, Oct 20, 2025 at 7:00 AM YU, Peng <peng.yu@alibaba-inc.com> wrote:
>
> I think we know the root cause in the driver. We are using the
> virtio_net driver. We found that the issue happens after this driver
> commit:
>
> b92f1e6751a6 virtio-net: transmit napi
>
> According to our tests, the issue happens if we apply the change below:
>
>
>  static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
>  {
>         struct virtio_net_hdr_mrg_rxbuf *hdr;
> @@ -1130,6 +1174,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>         int err;
>         struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
>         bool kick = !skb->xmit_more;
> +       bool use_napi = sq->napi.weight;
>
>         /* Free up any pending old buffers before queueing new ones. */
>         free_old_xmit_skbs(sq);
> @@ -1152,8 +1197,10 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>         }
>
>         /* Don't wait up for transmitted skbs to be freed. */
> -       skb_orphan(skb);
> -       nf_reset(skb);
> +       if (!use_napi) {
> +               skb_orphan(skb);
> +               nf_reset(skb);
> +       }
>
>
> Before this change, the driver invoked skb_orphan immediately when it
> received an skb, so the TCP layer decreased wmem_alloc right away and
> the small queue check would not fail. After this change, the
> virtio_net driver tells the TCP layer to decrease wmem_alloc only
> when the skb is really sent out.
> If we set use_napi to false, the virtio_net driver invokes skb_orphan
> immediately as before, and the issue does not happen.

A very classic and annoying issue that I believe quite a few people
have encountered... We eventually resorted to orphan mode, just as you
did, FYI.

But as to fixing it thoroughly, I agree with Eric's thoughts.

> But invoking skb_orphan in start_xmit looks like a workaround to me,
> and I'm not sure if we should roll back that change. The small queue
> check and the cwnd get into a kind of "dead lock" situation, so I
> suppose we should fix that "dead lock". If you believe we shouldn't
> change the TCP layer for this issue, may I know the correct direction
> to resolve it? Should we modify the virtio_net driver to always
> invoke skb_orphan as before?

Have you noticed that the issue cannot be reliably reproduced? It
happens only randomly, on some VMs, for no apparent reason. I once
suspected the policy/logic on the host side might have problems, such
as IRQs being triggered too infrequently.

Thanks,
Jason

> As a workaround, we set the virtio_net module parameter napi_tx to
> false, so use_napi is false as well and the issue does not happen.
> But we do want to enable napi_tx, so may I know your suggestion for
> this issue?
>
>
> ------------------------------------------------------------------
> From:Eric Dumazet <edumazet@google.com>
> Send Time:2025 Oct. 20 (Mon.) 01:43
> To:Peng Yu<yupeng0921@gmail.com>
> CC:ncardwell<ncardwell@google.com>; kuniyu<kuniyu@google.com>; netdev<netdev@vger.kernel.org>; "linux-kernel"<linux-kernel@vger.kernel.org>; Peng YU<peng.yu@alibaba-inc.com>
> Subject:Re: [PATCH] net: set is_cwnd_limited when the small queue check fails
>
>
> On Sun, Oct 19, 2025 at 10:00 AM Peng Yu <yupeng0921@gmail.com> wrote:
> >
> > The limit of the small queue check is calculated from the pacing rate,
> > and the pacing rate is calculated from the cwnd. If the cwnd is small,
> > the small queue check may fail.
> > When the small queue check fails, the TCP layer sends fewer packets,
> > so tcp_is_cwnd_limited() always returns false and the cwnd gets no
> > chance to be updated.
> > Since the cwnd is never updated, it stays small, so the pacing rate
> > stays small, the limit of the small queue check stays small, and the
> > small queue check keeps failing.
> > This is a kind of deadlock: when a TCP flow gets into this situation,
> > its throughput becomes very small, obviously less than the throughput
> > it should achieve.
> > Set is_cwnd_limited to true when the small queue check fails, so the
> > cwnd gets a chance to be updated and the deadlock is broken.
> >
> > Below ss output shows this issue:
> >
> > skmem:(r0,rb131072,
> > t7712, <------------------------------ wmem_alloc = 7712
> > tb243712,f2128,w219056,o0,bl0,d0)
> > ts sack cubic wscale:7,10 rto:224 rtt:23.364/0.019 ato:40 mss:1448
> > pmtu:8500 rcvmss:536 advmss:8448
> > cwnd:28 <------------------------------ cwnd=28
> > bytes_sent:2166208 bytes_acked:2148832 bytes_received:37
> > segs_out:1497 segs_in:751 data_segs_out:1496 data_segs_in:1
> > send 13882554bps lastsnd:7 lastrcv:2992 lastack:7
> > pacing_rate 27764216bps <--------------------- pacing_rate=27764216bps
> > delivery_rate 5786688bps delivered:1485 busy:2991ms unacked:12
> > rcv_space:57088 rcv_ssthresh:57088 notsent:188240
> > minrtt:23.319 snd_wnd:57088
> >
> > limit=(27764216 / 8) / 1024 = 3389 < 7712
> > So the small queue check fails. When this happens, the throughput is
> > obviously lower than in the normal situation.
> >
> > By setting is_cwnd_limited to true when the small queue check fails,
> > we can avoid this issue: the cwnd can increase to a reasonable size
> > (about 4000 in my test environment), and the small queue check no
> > longer fails.
>
>
> >
> > Signed-off-by: Peng Yu <peng.yu@alibaba-inc.com>
> > ---
> >  net/ipv4/tcp_output.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index b94efb3050d2..8c70acf3a060 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -2985,8 +2985,10 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> >                     unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
> >                         break;
> >
> > -               if (tcp_small_queue_check(sk, skb, 0))
> > +               if (tcp_small_queue_check(sk, skb, 0)) {
> > +                       is_cwnd_limited = true;
> >                         break;
> > +               }
> >
> >                 /* Argh, we hit an empty skb(), presumably a thread
> >                  * is sleeping in sendmsg()/sk_stream_wait_memory().
> > --
> > 2.47.3
>
> Sorry, this makes no sense to me.  CWND_LIMITED should not be hijacked.
>
> Something else is preventing your flows from reaching nominal speed,
> because we have not seen anything like that.
>
> It is probably a driver issue or a receive-side issue: instead of
> trying to work around the issue, please root-cause it.
Re: [PATCH] net: set is_cwnd_limited when the small queue check fails
Posted by Eric Dumazet 3 months, 2 weeks ago
On Sun, Oct 19, 2025 at 4:00 PM YU, Peng <peng.yu@alibaba-inc.com> wrote:
>
> I think we know the root cause in the driver. We are using the
> virtio_net driver. We found that the issue happens after this driver
> commit:
>
> b92f1e6751a6 virtio-net: transmit napi
>
> According to our tests, the issue happens if we apply the change below:
>
>
>  static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
>  {
>         struct virtio_net_hdr_mrg_rxbuf *hdr;
> @@ -1130,6 +1174,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>         int err;
>         struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
>         bool kick = !skb->xmit_more;
> +       bool use_napi = sq->napi.weight;
>
>         /* Free up any pending old buffers before queueing new ones. */
>         free_old_xmit_skbs(sq);
> @@ -1152,8 +1197,10 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>         }
>
>         /* Don't wait up for transmitted skbs to be freed. */
> -       skb_orphan(skb);
> -       nf_reset(skb);
> +       if (!use_napi) {
> +               skb_orphan(skb);
> +               nf_reset(skb);
> +       }
>
>
> Before this change, the driver invoked skb_orphan immediately when it
> received an skb, so the TCP layer decreased wmem_alloc right away and
> the small queue check would not fail. After this change, the
> virtio_net driver tells the TCP layer to decrease wmem_alloc only
> when the skb is really sent out.
> If we set use_napi to false, the virtio_net driver invokes skb_orphan
> immediately as before, and the issue does not happen.
> But invoking skb_orphan in start_xmit looks like a workaround to me,
> and I'm not sure if we should roll back that change. The small queue
> check and the cwnd get into a kind of "dead lock" situation, so I
> suppose we should fix that "dead lock". If you believe we shouldn't
> change the TCP layer for this issue, may I know the correct direction
> to resolve it? Should we modify the virtio_net driver to always
> invoke skb_orphan as before?
> As a workaround, we set the virtio_net module parameter napi_tx to
> false, so use_napi is false as well and the issue does not happen.
> But we do want to enable napi_tx, so may I know your suggestion for
> this issue?
>

I think you should start a conversation with virtio_net experts,
instead of making TCP bufferbloated again.

TX completion dynamics are important, and we are not going to penalize
all drivers just because of one.

You are claiming deadlocks, but the mechanisms in place are proven to
work damn well.

>
> ------------------------------------------------------------------
> From:Eric Dumazet <edumazet@google.com>
> Send Time:2025 Oct. 20 (Mon.) 01:43
> To:Peng Yu<yupeng0921@gmail.com>
> CC:ncardwell<ncardwell@google.com>; kuniyu<kuniyu@google.com>; netdev<netdev@vger.kernel.org>; "linux-kernel"<linux-kernel@vger.kernel.org>; Peng YU<peng.yu@alibaba-inc.com>
> Subject:Re: [PATCH] net: set is_cwnd_limited when the small queue check fails
>
>
> On Sun, Oct 19, 2025 at 10:00 AM Peng Yu <yupeng0921@gmail.com> wrote:
> >
> > The limit of the small queue check is calculated from the pacing rate,
> > and the pacing rate is calculated from the cwnd. If the cwnd is small,
> > the small queue check may fail.
> > When the small queue check fails, the TCP layer sends fewer packets,
> > so tcp_is_cwnd_limited() always returns false and the cwnd gets no
> > chance to be updated.
> > Since the cwnd is never updated, it stays small, so the pacing rate
> > stays small, the limit of the small queue check stays small, and the
> > small queue check keeps failing.
> > This is a kind of deadlock: when a TCP flow gets into this situation,
> > its throughput becomes very small, obviously less than the throughput
> > it should achieve.
> > Set is_cwnd_limited to true when the small queue check fails, so the
> > cwnd gets a chance to be updated and the deadlock is broken.
> >
> > Below ss output shows this issue:
> >
> > skmem:(r0,rb131072,
> > t7712, <------------------------------ wmem_alloc = 7712
> > tb243712,f2128,w219056,o0,bl0,d0)
> > ts sack cubic wscale:7,10 rto:224 rtt:23.364/0.019 ato:40 mss:1448
> > pmtu:8500 rcvmss:536 advmss:8448
> > cwnd:28 <------------------------------ cwnd=28
> > bytes_sent:2166208 bytes_acked:2148832 bytes_received:37
> > segs_out:1497 segs_in:751 data_segs_out:1496 data_segs_in:1
> > send 13882554bps lastsnd:7 lastrcv:2992 lastack:7
> > pacing_rate 27764216bps <--------------------- pacing_rate=27764216bps
> > delivery_rate 5786688bps delivered:1485 busy:2991ms unacked:12
> > rcv_space:57088 rcv_ssthresh:57088 notsent:188240
> > minrtt:23.319 snd_wnd:57088
> >
> > limit=(27764216 / 8) / 1024 = 3389 < 7712
> > So the small queue check fails. When this happens, the throughput is
> > obviously lower than in the normal situation.
> >
> > By setting is_cwnd_limited to true when the small queue check fails,
> > we can avoid this issue: the cwnd can increase to a reasonable size
> > (about 4000 in my test environment), and the small queue check no
> > longer fails.
>
>
> >
> > Signed-off-by: Peng Yu <peng.yu@alibaba-inc.com>
> > ---
> >  net/ipv4/tcp_output.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index b94efb3050d2..8c70acf3a060 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -2985,8 +2985,10 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> >                     unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
> >                         break;
> >
> > -               if (tcp_small_queue_check(sk, skb, 0))
> > +               if (tcp_small_queue_check(sk, skb, 0)) {
> > +                       is_cwnd_limited = true;
> >                         break;
> > +               }
> >
> >                 /* Argh, we hit an empty skb(), presumably a thread
> >                  * is sleeping in sendmsg()/sk_stream_wait_memory().
> > --
> > 2.47.3
>
> Sorry, this makes no sense to me.  CWND_LIMITED should not be hijacked.
>
> Something else is preventing your flows from reaching nominal speed,
> because we have not seen anything like that.
>
> It is probably a driver issue or a receive-side issue: instead of
> trying to work around the issue, please root-cause it.
Re: [PATCH] net: set is_cwnd_limited when the small queue check fails
Posted by Jason Xing 3 months, 2 weeks ago
On Mon, Oct 20, 2025 at 12:00 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Sun, Oct 19, 2025 at 4:00 PM YU, Peng <peng.yu@alibaba-inc.com> wrote:
> >
> > I think we know the root cause in the driver. We are using the
> > virtio_net driver. We found that the issue happens after this driver
> > commit:
> >
> > b92f1e6751a6 virtio-net: transmit napi
> >
> > According to our tests, the issue happens if we apply the change below:
> >
> >
> >  static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
> >  {
> >         struct virtio_net_hdr_mrg_rxbuf *hdr;
> > @@ -1130,6 +1174,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> >         int err;
> >         struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
> >         bool kick = !skb->xmit_more;
> > +       bool use_napi = sq->napi.weight;
> >
> >         /* Free up any pending old buffers before queueing new ones. */
> >         free_old_xmit_skbs(sq);
> > @@ -1152,8 +1197,10 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> >         }
> >
> >         /* Don't wait up for transmitted skbs to be freed. */
> > -       skb_orphan(skb);
> > -       nf_reset(skb);
> > +       if (!use_napi) {
> > +               skb_orphan(skb);
> > +               nf_reset(skb);
> > +       }
> >
> >
> > Before this change, the driver invoked skb_orphan immediately when it
> > received an skb, so the TCP layer decreased wmem_alloc right away and
> > the small queue check would not fail. After this change, the
> > virtio_net driver tells the TCP layer to decrease wmem_alloc only
> > when the skb is really sent out.
> > If we set use_napi to false, the virtio_net driver invokes skb_orphan
> > immediately as before, and the issue does not happen.
> > But invoking skb_orphan in start_xmit looks like a workaround to me,
> > and I'm not sure if we should roll back that change. The small queue
> > check and the cwnd get into a kind of "dead lock" situation, so I
> > suppose we should fix that "dead lock". If you believe we shouldn't
> > change the TCP layer for this issue, may I know the correct direction
> > to resolve it? Should we modify the virtio_net driver to always
> > invoke skb_orphan as before?
> > As a workaround, we set the virtio_net module parameter napi_tx to
> > false, so use_napi is false as well and the issue does not happen.
> > But we do want to enable napi_tx, so may I know your suggestion for
> > this issue?
> >
>
> I think you should start a conversation with virtio_net experts,
> instead of making TCP bufferbloated again.
>
> TX completion dynamics are important, and we are not going to
> penalize all drivers just because of one.
>
> You are claiming deadlocks, but the mechanisms in place are proven to
> work damn well.

Oh, we're almost at the same time to reply to the thread :)

Right, I'm totally with you.

Thanks,
Jason

>
> >
> > ------------------------------------------------------------------
> > From:Eric Dumazet <edumazet@google.com>
> > Send Time:2025 Oct. 20 (Mon.) 01:43
> > To:Peng Yu<yupeng0921@gmail.com>
> > CC:ncardwell<ncardwell@google.com>; kuniyu<kuniyu@google.com>; netdev<netdev@vger.kernel.org>; "linux-kernel"<linux-kernel@vger.kernel.org>; Peng YU<peng.yu@alibaba-inc.com>
> > Subject:Re: [PATCH] net: set is_cwnd_limited when the small queue check fails
> >
> >
> > On Sun, Oct 19, 2025 at 10:00 AM Peng Yu <yupeng0921@gmail.com> wrote:
> > >
> > > The limit of the small queue check is calculated from the pacing rate,
> > > and the pacing rate is calculated from the cwnd. If the cwnd is small,
> > > the small queue check may fail.
> > > When the small queue check fails, the TCP layer sends fewer packets,
> > > so tcp_is_cwnd_limited() always returns false and the cwnd gets no
> > > chance to be updated.
> > > Since the cwnd is never updated, it stays small, so the pacing rate
> > > stays small, the limit of the small queue check stays small, and the
> > > small queue check keeps failing.
> > > This is a kind of deadlock: when a TCP flow gets into this situation,
> > > its throughput becomes very small, obviously less than the throughput
> > > it should achieve.
> > > Set is_cwnd_limited to true when the small queue check fails, so the
> > > cwnd gets a chance to be updated and the deadlock is broken.
> > >
> > > Below ss output shows this issue:
> > >
> > > skmem:(r0,rb131072,
> > > t7712, <------------------------------ wmem_alloc = 7712
> > > tb243712,f2128,w219056,o0,bl0,d0)
> > > ts sack cubic wscale:7,10 rto:224 rtt:23.364/0.019 ato:40 mss:1448
> > > pmtu:8500 rcvmss:536 advmss:8448
> > > cwnd:28 <------------------------------ cwnd=28
> > > bytes_sent:2166208 bytes_acked:2148832 bytes_received:37
> > > segs_out:1497 segs_in:751 data_segs_out:1496 data_segs_in:1
> > > send 13882554bps lastsnd:7 lastrcv:2992 lastack:7
> > > pacing_rate 27764216bps <--------------------- pacing_rate=27764216bps
> > > delivery_rate 5786688bps delivered:1485 busy:2991ms unacked:12
> > > rcv_space:57088 rcv_ssthresh:57088 notsent:188240
> > > minrtt:23.319 snd_wnd:57088
> > >
> > > limit=(27764216 / 8) / 1024 = 3389 < 7712
> > > So the small queue check fails. When this happens, the throughput is
> > > obviously lower than in the normal situation.
> > >
> > > By setting is_cwnd_limited to true when the small queue check fails,
> > > we can avoid this issue: the cwnd can increase to a reasonable size
> > > (about 4000 in my test environment), and the small queue check no
> > > longer fails.
> >
> >
> > >
> > > Signed-off-by: Peng Yu <peng.yu@alibaba-inc.com>
> > > ---
> > >  net/ipv4/tcp_output.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > index b94efb3050d2..8c70acf3a060 100644
> > > --- a/net/ipv4/tcp_output.c
> > > +++ b/net/ipv4/tcp_output.c
> > > @@ -2985,8 +2985,10 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> > >                     unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
> > >                         break;
> > >
> > > -               if (tcp_small_queue_check(sk, skb, 0))
> > > +               if (tcp_small_queue_check(sk, skb, 0)) {
> > > +                       is_cwnd_limited = true;
> > >                         break;
> > > +               }
> > >
> > >                 /* Argh, we hit an empty skb(), presumably a thread
> > >                  * is sleeping in sendmsg()/sk_stream_wait_memory().
> > > --
> > > 2.47.3
> >
> > Sorry this makes no sense to me.  CWND_LIMITED should not be hijacked.
> >
> > Something else is preventing your flows to get to nominal speed,
> > because we have not seen anything like that.
> >
> > It is probably a driver issue or a receive side issue : Instead of
> > trying to work around the issue, please root cause it.
>
Re: [PATCH] net: set is_cwnd_limited when the small queue check fails
Posted by Eric Dumazet 3 months, 2 weeks ago
On Sun, Oct 19, 2025 at 10:43 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Sun, Oct 19, 2025 at 10:00 AM Peng Yu <yupeng0921@gmail.com> wrote:
> >
> > The limit used by the small queue check is calculated from the pacing
> > rate, and the pacing rate is calculated from the cwnd. If the cwnd is
> > small, the small queue check may fail.
> > When the small queue check fails, the TCP layer sends fewer packets,
> > so tcp_is_cwnd_limited() always returns false and the cwnd never gets
> > a chance to grow.
> > Since the cwnd stays small, the pacing rate stays small, the limit of
> > the small queue check stays small, and so the check keeps failing.
> > This is a kind of deadlock: when a TCP flow falls into this state,
> > its throughput is much lower than the throughput it should achieve.
> > Setting is_cwnd_limited to true when the small queue check fails
> > gives the cwnd a chance to grow again, which breaks the deadlock.
> >
> > Below ss output shows this issue:
> >
> > skmem:(r0,rb131072,
> > t7712, <------------------------------ wmem_alloc = 7712
> > tb243712,f2128,w219056,o0,bl0,d0)
> > ts sack cubic wscale:7,10 rto:224 rtt:23.364/0.019 ato:40 mss:1448
> > pmtu:8500 rcvmss:536 advmss:8448
> > cwnd:28 <------------------------------ cwnd=28
> > bytes_sent:2166208 bytes_acked:2148832 bytes_received:37
> > segs_out:1497 segs_in:751 data_segs_out:1496 data_segs_in:1
> > send 13882554bps lastsnd:7 lastrcv:2992 lastack:7
> > pacing_rate 27764216bps <--------------------- pacing_rate=27764216bps
> > delivery_rate 5786688bps delivered:1485 busy:2991ms unacked:12
> > rcv_space:57088 rcv_ssthresh:57088 notsent:188240
> > minrtt:23.319 snd_wnd:57088
> >
> > limit = (27764216 / 8) / 1024 = 3389 < 7712
> > So the small queue check fails. When this happens, the throughput is
> > clearly lower than it would be in the normal situation.
> >
> > By setting is_cwnd_limited to true when the small queue check fails,
> > we can avoid this issue: the cwnd can increase to a reasonable size
> > (about 4000 in my test environment), after which the small queue
> > check no longer fails.
>
>
> >
> > Signed-off-by: Peng Yu <peng.yu@alibaba-inc.com>
> > ---
> >  net/ipv4/tcp_output.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index b94efb3050d2..8c70acf3a060 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -2985,8 +2985,10 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> >                     unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
> >                         break;
> >
> > -               if (tcp_small_queue_check(sk, skb, 0))
> > +               if (tcp_small_queue_check(sk, skb, 0)) {
> > +                       is_cwnd_limited = true;
> >                         break;
> > +               }
> >
> >                 /* Argh, we hit an empty skb(), presumably a thread
> >                  * is sleeping in sendmsg()/sk_stream_wait_memory().
> > --
> > 2.47.3
>
> Sorry this makes no sense to me.  CWND_LIMITED should not be hijacked.
>
> Something else is preventing your flows to get to nominal speed,
> because we have not seen anything like that.
>
> It is probably a driver issue or a receive side issue : Instead of
> trying to work around the issue, please root cause it.


 BTW we recently fixed a bug in tcp_tso_should_defer()

Make sure to try a kernel with this fix?

295ce1eb36ae47dc862d6c8a1012618a25516208 tcp: fix
tcp_tso_should_defer() vs large RTT