Hi,
this series ensures that rcv_wnd and window_clamp do not exceed the
maximum window size representable for the connection's window scale
factor.
This is most visible when TCP window scaling is not used for a
connection. In that case, the advertised window is limited to 65535
bytes, but rcv_wnd or window_clamp can still grow beyond 65535 when
large receive buffers are used. The resulting mismatch breaks
calculations that depend on the advertised window, such as the ACK
decision in __tcp_ack_snd_check(), and can prevent immediate ACKs.
Similar effects may also occur when window scaling is in use, e.g. if
the application dynamically adjusts SO_RCVBUF in unusual ways or when
the rmem sysctl parameters change during a connection’s lifetime.
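To illustrate the limit in question, here is a minimal userspace sketch, not the kernel code. The helper names max_representable_window() and clamp_to_representable() are hypothetical and exist only for this example. It assumes the 16-bit window field and the maximum shift of 14 from RFC 7323.

```c
#include <assert.h>
#include <stdint.h>

/* Largest window that the 16-bit TCP window field can express for a
 * given receive window scale. RFC 7323 limits the shift to 14.
 * Hypothetical helper, for illustration only. */
static inline uint32_t max_representable_window(unsigned int rcv_wscale)
{
	assert(rcv_wscale <= 14);
	return (uint32_t)65535 << rcv_wscale;
}

/* Hypothetical helper: limit a window value (rcv_wnd or window_clamp)
 * to what can actually be advertised on the wire. */
static inline uint32_t clamp_to_representable(uint32_t wnd,
					      unsigned int rcv_wscale)
{
	uint32_t max = max_representable_window(rcv_wscale);

	return wnd < max ? wnd : max;
}
```

With rcv_wscale == 0 (no window scaling), a 256 KiB value is limited to 65535, whereas a value that is already below the limit passes through unchanged.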
Summary:
- Patch 1 keeps rcv_wnd capped by the (window scale-limited)
window_clamp at connection start.
- Patch 3 and 6 ensure that window_clamp is limited to the
representable window when it is updated.
- The other patches add packetdrill tests to verify the new behavior.
A simple iperf test on a virtme-ng VM (Intel i5-7500, 4 cores,
loopback) shows a noticeable improvement with window scaling disabled:
Fixed receive buffer:
- sysctl net.ipv4.tcp_window_scaling=0
- Server: iperf -l 256K -w 256K -s
- Client: iperf -l 256K -w 256K -c 127.0.0.1 -t 30
- net-next: ~47 Gbit/sec
- with this series: ~62 Gbit/sec
Receive buffer autotuning (net.ipv4.tcp_rmem = 4096 131072 7813888):
- sysctl net.ipv4.tcp_window_scaling=0
- Server: iperf -s
- Client: iperf -c 127.0.0.1 -t 30
- net-next: ~48 Gbit/sec
- with this series: ~60 Gbit/sec
Signed-off-by: Simon Baatz <gmbnomis@gmail.com>
---
Simon Baatz (7):
tcp: keep rcv_wnd/rcv_ssthresh clamped by window_clamp if no scaling in use
selftests/net: packetdrill: verify non-scaled rcv_wnd initialization
tcp: Ensure window_clamp is limited to representable window
selftests/net: packetdrill: add tcp_rcv_wnd_snd_ack_no_scaling.pkt
selftests/net: packetdrill: add TCP_WINDOW_CLAMP test
tcp: use tcp_set_window_clamp() for SO_RCVLOWAT
selftests/net: packetdrill: add test for SO_RCVLOWAT window clamp
net/ipv4/tcp.c | 6 +++-
net/ipv4/tcp_input.c | 13 ++++++--
net/ipv4/tcp_minisocks.c | 8 +++--
net/ipv4/tcp_output.c | 5 +--
.../net/packetdrill/tcp_rcv_sockopt_lowat.pkt | 24 ++++++++++++++
.../net/packetdrill/tcp_rcv_sockopt_wnd_clamp.pkt | 28 ++++++++++++++++
.../packetdrill/tcp_rcv_wnd_active_no_scaling.pkt | 27 ++++++++++++++++
.../tcp_rcv_wnd_active_peer_no_scaling.pkt | 26 +++++++++++++++
.../packetdrill/tcp_rcv_wnd_passive_no_scaling.pkt | 30 ++++++++++++++++++
.../tcp_rcv_wnd_passive_peer_no_scaling.pkt | 29 +++++++++++++++++
.../packetdrill/tcp_rcv_wnd_snd_ack_no_scaling.pkt | 37 ++++++++++++++++++++++
11 files changed, 225 insertions(+), 8 deletions(-)
---
base-commit: b3e69fc3196fc421e26196e7792f17b0463edc6f
change-id: 20260402-tcp_rcv_exact_clamp_and_wnd-427d853e7491
Best regards,
--
Simon Baatz <gmbnomis@gmail.com>
On Wed, Apr 8, 2026 at 2:50 PM Simon Baatz via B4 Relay
<devnull+gmbnomis.gmail.com@kernel.org> wrote:
>
> Hi,
>
> this series ensures that rcv_wnd and window_clamp do not exceed the
> maximum window size representable for the connection's window scale
> factor.
>
> This is most visible when TCP window scaling is not used for a
> connection. In that case, the advertised window is limited to 65535
> bytes, but rcv_wnd or window_clamp can still grow beyond 65535 when
> large receive buffers are used. The resulting mismatch breaks
> calculations that depend on the advertised window, such as the ACK
> decision in __tcp_ack_snd_check(), and can prevent immediate ACKs.
>
> Similar effects may also occur when window scaling is in use, e.g. if
> the application dynamically adjusts SO_RCVBUF in unusual ways or when
> the rmem sysctl parameters change during a connection’s lifetime.
>
> Summary:
>
> - Patch 1 keeps rcv_wnd capped by the (window scale-limited)
> window_clamp at connection start.
> - Patch 3 and 6 ensure that window_clamp is limited to the
> representable window when it is updated.
> - The other patches add packetdrill tests to verify the new behavior.
>
> A simple iperf test on a virtme-ng VM (Intel i5-7500, 4 cores,
> loopback) shows a noticeable improvement with window scaling disabled:

Explain why we should spend time reviewing patches trying to help
stacks from 2 decades ago, risking breaking other usages.

Almost every time we change the rcvbuf logic, we introduce bugs.

Not using window scaling in 2026 and expecting "iperf improvement" is
quite something!

Out of curiosity, which legacy product is stuck in the 20th century?
Hi Eric,
On Thu, Apr 09, 2026 at 07:52:03AM -0700, Eric Dumazet wrote:
> On Wed, Apr 8, 2026 at 2:50 PM Simon Baatz via B4 Relay
> <devnull+gmbnomis.gmail.com@kernel.org> wrote:
> >
> > Hi,
> >
> > this series ensures that rcv_wnd and window_clamp do not exceed the
> > maximum window size representable for the connection's window scale
> > factor.
> >
> > This is most visible when TCP window scaling is not used for a
> > connection. In that case, the advertised window is limited to 65535
> > bytes, but rcv_wnd or window_clamp can still grow beyond 65535 when
> > large receive buffers are used. The resulting mismatch breaks
> > calculations that depend on the advertised window, such as the ACK
> > decision in __tcp_ack_snd_check(), and can prevent immediate ACKs.
> >
> > Similar effects may also occur when window scaling is in use, e.g. if
> > the application dynamically adjusts SO_RCVBUF in unusual ways or when
> > the rmem sysctl parameters change during a connection’s lifetime.
> >
> > Summary:
> >
> > - Patch 1 keeps rcv_wnd capped by the (window scale-limited)
> > window_clamp at connection start.
> > - Patch 3 and 6 ensure that window_clamp is limited to the
> > representable window when it is updated.
> > - The other patches add packetdrill tests to verify the new behavior.
> >
> > A simple iperf test on a virtme-ng VM (Intel i5-7500, 4 cores,
> > loopback) shows a noticeable improvement with window scaling disabled:
>
> Explain why we should spend time reviewing patches trying to help
> stacks from 2 decades ago,
> risking breaking other usages.
>
> Almost every time we change the rcvbuf logic, we introduce bugs.
As soon as someone gives me access to a link with a bandwidth delay
product of probably > 500 MB I am happy to provide another set of
benchmarks results:
`./defaults.sh
sysctl -q net.ipv4.tcp_rmem="4096 2147483647 2147483647"`
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0 > S. 0:0(0) ack 1 win 65535 <mss 1460,nop,nop,sackOK,nop,wscale 14>
+0 < . 1:1(0) ack 1 win 32792
+0 accept(3, ..., ...) = 4
+0 getsockopt(4, IPPROTO_TCP, 10, [1073725440], [4]) = 0
+0 < P. 1:65001(65000) ack 1 win 32792
+0 > . 1:1(0) ack 65001 win 65535
+0 < P. 65001:130001(65000) ack 1 win 32792
+0 > . 1:1(0) ack 130001 win 65535
+0 < P. 130001:195001(65000) ack 1 win 32792
+0 > . 1:1(0) ack 195001 win 65535
+0 < P. 195001:260001(65000) ack 1 win 32792
+0 > . 1:1(0) ack 260001 win 65535
+0 < P. 260001:325001(65000) ack 1 win 32792
+0 > . 1:1(0) ack 325001 win 65535
+0 < P. 325001:390001(65000) ack 1 win 32792
+0 > . 1:1(0) ack 390001 win 65535
+0 getsockopt(4, IPPROTO_TCP, 10, [2113929215], [4]) = 0
+.1 %{ assert tcpi_rcv_wnd <= 1073725440, tcpi_rcv_wnd }%
Fails with:
AssertionError: 1074511872
on a current kernel.
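The numbers in the script can be checked by hand. The sketch below is only the arithmetic and assumes that socket option 10 is TCP_WINDOW_CLAMP and that the limit is 65535 shifted by the wscale of 14 from the SYN-ACK. The helper names trace_window_limit() and excess_over_limit() are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Representable window for a given window scale (shift of at most 14).
 * Hypothetical helper, for illustration only. */
static inline uint32_t trace_window_limit(unsigned int wscale)
{
	assert(wscale <= 14);
	return (uint32_t)65535 << wscale;
}

/* Number of bytes by which a window value exceeds the representable
 * limit, or 0 if it is within the limit. */
static inline uint32_t excess_over_limit(uint32_t wnd, unsigned int wscale)
{
	uint32_t limit = trace_window_limit(wscale);

	return wnd > limit ? wnd - limit : 0;
}
```

For wscale 14 the limit is 65535 * 16384 = 1073725440, which is the value returned by the first getsockopt() and the bound used in the assertion. The reported tcpi_rcv_wnd of 1074511872 is 786432 bytes (48 * 16384) above it, and the window clamp of 2113929215 read by the second getsockopt() is above the limit as well.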
So, I think we should spend time reviewing this because currently we
just pretend to clamp the window at its limits.
> Not using window scaling in 2026 and expecting "iperf improvement" is
> quite something!
I wondered if providing these numbers was a good idea and apparently
it wasn't. I just found the difference to be striking. The only
thing I wanted to demonstrate is that basing our calculations on
bogus window sizes can have real effects.
> Out of curiosity, which legacy product is stuck in the 20th century?
I have half a dozen of these products "stuck in the 20th century" at
home. They are called IoT devices and I find saying that TCP
connections to such devices need not have proper sequence number
acceptability tests according to RFC 9293 quite something. ;-)
- Simon
--
Simon Baatz <gmbnomis@gmail.com>
On Thu, Apr 9, 2026 at 2:24 PM Simon Baatz <gmbnomis@gmail.com> wrote:
>
> Hi Eric,
>
> On Thu, Apr 09, 2026 at 07:52:03AM -0700, Eric Dumazet wrote:
> > On Wed, Apr 8, 2026 at 2:50 PM Simon Baatz via B4 Relay
> > <devnull+gmbnomis.gmail.com@kernel.org> wrote:
> > >
> > > Hi,
> > >
> > > this series ensures that rcv_wnd and window_clamp do not exceed the
> > > maximum window size representable for the connection's window scale
> > > factor.
> > >
> > > This is most visible when TCP window scaling is not used for a
> > > connection. In that case, the advertised window is limited to 65535
> > > bytes, but rcv_wnd or window_clamp can still grow beyond 65535 when
> > > large receive buffers are used. The resulting mismatch breaks
> > > calculations that depend on the advertised window, such as the ACK
> > > decision in __tcp_ack_snd_check(), and can prevent immediate ACKs.
> > >
> > > Similar effects may also occur when window scaling is in use, e.g. if
> > > the application dynamically adjusts SO_RCVBUF in unusual ways or when
> > > the rmem sysctl parameters change during a connection’s lifetime.
> > >
> > > Summary:
> > >
> > > - Patch 1 keeps rcv_wnd capped by the (window scale-limited)
> > > window_clamp at connection start.
> > > - Patch 3 and 6 ensure that window_clamp is limited to the
> > > representable window when it is updated.
> > > - The other patches add packetdrill tests to verify the new behavior.
> > >
> > > A simple iperf test on a virtme-ng VM (Intel i5-7500, 4 cores,
> > > loopback) shows a noticeable improvement with window scaling disabled:
> >
> > Explain why we should spend time reviewing patches trying to help
> > stacks from 2 decades ago,
> > risking breaking other usages.
> >
> > Almost every time we change the rcvbuf logic, we introduce bugs.
>
> As soon as someone gives me access to a link with a bandwidth delay
> product of probably > 500 MB I am happy to provide another set of
> benchmarks results:
>
> `./defaults.sh
> sysctl -q net.ipv4.tcp_rmem="4096 2147483647 2147483647"`

Please do not do this. Stick to reasonable limits.

You might have missed that we are flooded with bug reports (and buggy patches).

We have very limited time for bugs not proven by real-world conditions.